Conversation

@chatton chatton commented Jan 7, 2026

Overview

part of #2956

This PR adds the initial scaffolding for enabling tracing within ev-node.

To keep the PR as small as possible (ignoring go.mod/go.sum changes), I have added only the core configuration settings for telemetry and enabled tracing for the executor.

claude bot commented Jan 7, 2026

Claude finished @chatton's task in 2m 56s


Code Review Complete

I've conducted a comprehensive code review of the tracing implementation. Overall, this is a well-structured initial implementation that establishes a solid foundation for observability in ev-node. The code quality is high, tests are thorough, and the decorator pattern keeps tracing concerns properly separated from core logic.


Summary

Strengths:

  • Clean decorator pattern for executor tracing
  • Comprehensive test coverage for tracing functionality
  • Proper error handling and status recording in spans
  • Good configuration validation
  • Zero-dependency core maintained
  • Resource cleanup properly handled

Critical Issues: 1
High Priority Issues: 2
Medium Priority Issues: 3
Minor Issues/Suggestions: 4


Critical Issues

1. Port Mismatch Between Default Config and Exporter Protocol

Location: pkg/config/instrumentation.go:70

The default endpoint uses port 4318 (correct for HTTP), but as noted by the Gemini review, this was recently changed. The implementation correctly uses otlptracehttp which expects port 4318. However, there's a critical integration testing gap: the code has never been tested against a real OTLP collector.

Issue: While the port is now correct, without integration testing:

  • We don't know if the endpoint URL construction works properly
  • We haven't verified the exporter successfully connects
  • We haven't validated that spans are actually exported

Recommendation:

// Consider adding integration test or at minimum a smoke test
func TestTracingIntegration(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping integration test")
    }
    // Start local OTLP collector, verify spans exported
}



High Priority Issues

2. Missing Context Cancellation Handling in InitTracing

Location: pkg/telemetry/tracing.go:40-45

The OTLP exporter creation uses the provided context directly, but if that context is cancelled before the function returns, the exporter may fail to initialize properly or may start in a degraded state.

Current Code:

exp, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpointURL(endpoint),
)

Issue: If ctx is cancelled during StartNode, the exporter may be partially initialized but non-functional.

Recommendation:

// Use background context for exporter lifecycle (it has its own shutdown)
exp, err := otlptracehttp.New(context.Background(),
    otlptracehttp.WithEndpointURL(endpoint),
)

The exporter's lifecycle should be independent of the startup context since it has its own shutdown mechanism. The startup context should only control the initialization timeout if needed.


3. No Timeout on Tracer Provider Shutdown During Initialization

Location: pkg/cmd/run_node.go:94-107

When tracing initialization succeeds, a defer is registered to shut down the tracer provider. However, if node initialization fails after tracing is set up but before the node starts running, the deferred shutdown may not execute properly because the outer context is already cancelled.

Current Code:

if nodeConfig.Instrumentation.IsTracingEnabled() {
    shutdownTracing, err := telemetry.InitTracing(ctx, nodeConfig.Instrumentation, logger)
    if err != nil {
        return fmt.Errorf("failed to initialize tracing: %w", err)
    }
    defer func() {
        c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := shutdownTracing(c); err != nil {
            logger.Error().Err(err).Msg("failed to shutdown tracing")
        }
    }()
}

Issue: While the shutdown uses context.Background() (good!), it's buried in a defer that fires after the function returns. If node creation fails at lines 167-182, the defer runs, but any spans created during the failed initialization may not be flushed properly because the shutdown happens too quickly after the error.

Recommendation: Consider adding a small grace period or ensuring shutdown is truly blocking:

defer func() {
    logger.Debug().Msg("shutting down tracing...")
    c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := shutdownTracing(c); err != nil {
        logger.Error().Err(err).Msg("failed to shutdown tracing")
    } else {
        logger.Debug().Msg("tracing shutdown complete")
    }
}()

Medium Priority Issues

4. TracingSampleRate Not Validated at Parse Time

Location: pkg/config/instrumentation.go:78-90

The sample rate validation only occurs when tracing is enabled, but invalid values could be caught earlier during config parsing.

Current Code:

func (cfg *InstrumentationConfig) ValidateBasic() error {
    if cfg.MaxOpenConnections < 0 {
        return errors.New("max_open_connections can't be negative")
    }
    if cfg.IsTracingEnabled() {  // Only validates if enabled
        if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
            return errors.New("tracing_sample_rate must be between 0 and 1")
        }
        ...
    }
}

Issue: Users can set invalid sample rates in config files, but won't get an error until they enable tracing.

Recommendation:

func (cfg *InstrumentationConfig) ValidateBasic() error {
    if cfg.MaxOpenConnections < 0 {
        return errors.New("max_open_connections can't be negative")
    }
    // Always validate sample rate if set (even when tracing disabled)
    if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
        return errors.New("tracing_sample_rate must be between 0 and 1")
    }
    if cfg.IsTracingEnabled() {
        if strings.TrimSpace(cfg.TracingEndpoint) == "" {
            return errors.New("tracing_endpoint cannot be empty when tracing enabled")
        }
    }
    return nil
}


5. Missing Span Attributes for State Roots

Location: pkg/telemetry/executor_tracing.go:60-76

The ExecuteTxs method records transaction count, block height, and timestamp as span attributes, but doesn't record whether execution succeeded (beyond error status) or add the resulting state root hash as an attribute.

Recommendation:

stateRoot, maxBytes, err := t.inner.ExecuteTxs(ctx, txs, blockHeight, timestamp, prevStateRoot)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
} else {
    span.SetAttributes(
        attribute.Int("max_bytes", int(maxBytes)),
        attribute.String("state_root", hex.EncodeToString(stateRoot)),
    )
}
return stateRoot, maxBytes, err

This would provide better observability for state transitions and make it easier to correlate spans with blockchain state.

6. Potential Exporter Connection Retry Missing

Location: pkg/telemetry/tracing.go:40-45

The OTLP HTTP exporter is created synchronously. If the collector is temporarily unavailable during node startup, initialization will fail and tracing will be completely disabled for the entire node run.

Issue: This makes the system fragile during startup. If the OTLP collector is deploying or restarting, the entire node fails to start with tracing.

Recommendation: Consider one of these approaches:

  1. Make exporter creation non-blocking: The exporter will buffer spans and retry connection in background
  2. Add retry logic with backoff: Retry exporter creation 3-5 times with exponential backoff
  3. Degrade gracefully: Log warning and disable tracing if exporter creation fails, rather than failing startup

Option 3 is most pragmatic for production:

exp, err := otlptracehttp.New(context.Background(),
    otlptracehttp.WithEndpointURL(endpoint),
)
if err != nil {
    logger.Warn().Err(err).Msg("failed to create otlp exporter, tracing will be disabled")
    return func(context.Context) error { return nil }, nil
}

Minor Issues & Suggestions

7. Service Name Could Default to Include Node Type

Location: pkg/config/instrumentation.go:71

The default service name is "ev-node" for all nodes. In a distributed system with multiple node types (aggregator, light node, sequencer), it would be helpful to distinguish them in traces.

Suggestion:

TracingServiceName: "ev-node",  // Keep as base default

// Then in run_node.go, enhance if not explicitly set:
if nodeConfig.Instrumentation.TracingServiceName == "ev-node" {
    serviceName := "ev-node"
    if nodeConfig.Node.Aggregator {
        serviceName += "-aggregator"
    } else if nodeConfig.Node.Light {
        serviceName += "-light"
    }
    nodeConfig.Instrumentation.TracingServiceName = serviceName
}

8. Missing Documentation for Tracing Configuration

Location: pkg/config/instrumentation.go:46-56

The configuration fields have basic comments but lack details about:

  • What OTLP endpoint format is expected
  • Example configurations for common OTLP collectors (Jaeger, Tempo, etc.)
  • How sampling works and performance implications

Suggestion: Add doc comment:

// Tracing configuration for OpenTelemetry OTLP export.
//
// Example endpoint configurations:
//   - Jaeger: "localhost:4318" (OTLP HTTP) or "localhost:4317" (OTLP gRPC)
//   - Grafana Tempo: "tempo:4318"
//   - OTEL Collector: "otelcol:4318"
//
// Note: This implementation uses OTLP over HTTP. The endpoint should accept
// HTTP requests. Scheme (http://) is added automatically if not provided.
//
// TracingSampleRate controls what fraction of traces are exported (0.0-1.0).
// For high-throughput systems, use lower rates (0.01-0.1) to reduce overhead.

9. Test Coverage for URL Scheme Handling

Location: pkg/telemetry/tracing.go:36-39

The code adds http:// prefix if not present, but there's no test coverage for this logic.

Suggestion:

func TestInitTracing_URLSchemeHandling(t *testing.T) {
    testCases := []struct {
        input    string
        expected string
    }{
        {"localhost:4318", "http://localhost:4318"},
        {"http://localhost:4318", "http://localhost:4318"},
        {"https://collector.example.com:4318", "https://collector.example.com:4318"},
    }
    // Test that endpoint URL is constructed correctly
}

10. Consider Adding Resource Attributes

Location: pkg/telemetry/tracing.go:26-30

Currently only service.name is set as a resource attribute. Consider adding more standard attributes for better observability:

res, err := resource.New(ctx,
    resource.WithAttributes(
        semconv.ServiceName(cfg.TracingServiceName),
        semconv.ServiceVersion(version.Version), // Add version info
        semconv.DeploymentEnvironment(cfg.Environment), // Add env if available
    ),
)

This helps with filtering and grouping traces in observability platforms.


Architecture & Code Quality

Excellent aspects:

  • ✅ Decorator pattern properly separates concerns
  • ✅ Zero-dependency core maintained (tracing is in pkg/)
  • ✅ Context properly threaded through all operations
  • ✅ Tracer obtained once and reused (not per-span)
  • ✅ Span lifecycle correctly managed (defer span.End())
  • ✅ Error recording includes both RecordError and SetStatus

Pattern consistency:
All five traced methods follow identical error handling:

if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
}

This is excellent consistency.


Security Considerations

✅ Passed:

  • No sensitive data (private keys, auth tokens) in span attributes
  • State roots are not logged (but consider adding as hex string, see issue 5 above)
  • Configuration validation prevents injection attacks
  • No user input directly used in span names or attributes

⚠️ Minor concern:

  • Chain IDs and timestamps are logged. Ensure these don't leak sensitive info in your use case.

Performance Considerations

✅ Good:

  • Sampling rate configured (default 0.1 = 10%)
  • Batching enabled for span export
  • Parent-based sampler respects upstream sampling decisions
  • No blocking operations in hot path (spans buffered)

Suggestions:

  • Consider making batch timeout/size configurable for high-throughput scenarios
  • Document performance impact of tracing in production

Testing Assessment

Coverage: 61.29% patch coverage (Codecov report)

  • 48 lines missing coverage, primarily in tracing.go (32 lines)

Quality: ⭐⭐⭐⭐⭐ Excellent

  • Comprehensive executor tracing tests (success and error paths)
  • Proper use of in-memory span recorder for testing
  • Tests verify span names, status codes, attributes, and error events
  • Good test organization with helper functions
  • HeightProvider interface properly tested

Missing coverage:

  • tracing.go:40-45 - Exporter creation (integration test needed)
  • run_node.go:94-107 - Tracing initialization in real node context (e2e coverage)

Recommendation: The unit tests are excellent. The missing coverage is primarily in integration scenarios, which is acceptable for this initial PR. Consider adding an e2e test in a follow-up that verifies:

  1. Node starts with tracing enabled
  2. Spans are exported to a test collector
  3. Distributed tracing context propagation works

Documentation & Developer Experience

✅ Good:

  • Public functions have doc comments
  • Test helper functions documented
  • Configuration fields have comments

Could improve:

  • Add example configuration to CLAUDE.md or README showing how to enable tracing
  • Document which operations are traced (currently only Executor)
  • Add troubleshooting guide for common OTLP connection issues

Suggestion for CLAUDE.md:

## Observability

### Tracing

ev-node supports distributed tracing via OpenTelemetry OTLP export.

Enable in config:
```yaml
instrumentation:
  tracing: true
  tracing_endpoint: "localhost:4318"
  tracing_service_name: "ev-node-aggregator"
  tracing_sample_rate: 0.1
```

Currently traces:

  • Executor operations (InitChain, GetTxs, ExecuteTxs, SetFinal)

Future: Sequencer, DA layer, P2P operations


---

Summary & Next Steps

This PR establishes a solid foundation for observability. The implementation is clean, well-tested, and properly architected. The remaining issues are mostly around error scenarios and integration testing and don't block merging.

Recommended action:

  1. Address the context cancellation issue (#2) - quick fix
  2. Consider graceful degradation for exporter failures (#6) - improves production robustness
  3. Add integration/e2e tests in a follow-up PR
  4. Merge and iterate

Future work to track:

  • Add tracing to other components (Sequencer, DA client, P2P layer)
  • Add metrics correlation (span IDs in metrics)
  • Add per-operation sampling configuration
  • Add distributed tracing context propagation in P2P messages

Great work on this initial implementation! 🎉

---

github-actions bot commented Jan 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Jan 7, 2026, 10:02 AM

@gemini-code-assist

Summary of Changes

Hello @chatton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational elements for observability in ev-node by integrating OpenTelemetry tracing. It focuses on providing initial tracing capabilities for the core Executor component, enabling developers to gain insights into its operations. The changes include adding necessary configuration parameters and updating dependencies, setting the stage for broader tracing adoption throughout the system.

Highlights

  • OpenTelemetry Tracing Integration: Introduced initial scaffolding for OpenTelemetry tracing within ev-node, allowing for distributed tracing of application components.
  • Executor Tracing: Implemented a tracing decorator for the Executor interface, automatically creating spans for key operations like InitChain, GetTxs, ExecuteTxs, SetFinal, and GetLatestHeight.
  • Configuration Options: Added new configuration flags and settings for tracing, including enabling/disabling tracing, specifying the OTLP endpoint, defining the service name, and setting a trace sampling rate.
  • Dependency Updates: Updated various go.mod and go.sum files across the project to incorporate new OpenTelemetry dependencies and update existing ones to support the tracing functionality.




Comment on lines +212 to +216
replace (
google.golang.org/genproto => google.golang.org/genproto v0.0.0-20240213162025-012b6fc9bca9
google.golang.org/genproto/googleapis/api => google.golang.org/genproto/googleapis/api v0.0.0-20240213162025-012b6fc9bca9
google.golang.org/genproto/googleapis/rpc => google.golang.org/genproto/googleapis/rpc v0.0.0-20240213162025-012b6fc9bca9
)
chatton (PR author) replied:
these needed to be pinned since the otlp lib depends on a specific version.

gemini-code-assist bot left a comment

Code Review

This pull request introduces OpenTelemetry tracing to ev-node, which is a great step for observability. The changes include adding configuration options for tracing, initializing the OTLP/HTTP exporter, and wrapping the Executor with a tracing decorator. The implementation is clean, well-structured, and includes thorough tests for the new functionality.

I've found one issue related to the default configuration for the tracing endpoint, which uses a port typically associated with gRPC while the implementation uses an HTTP exporter. I've left a specific comment with a suggestion to align them.

Overall, this is a solid contribution that lays a good foundation for tracing.

Pprof: false,
PprofListenAddr: ":6060",
Tracing: false,
TracingEndpoint: "localhost:4317",
Severity: high

There's an inconsistency between the default OTLP endpoint and the exporter protocol being used.

The default endpoint is set to localhost:4317, which is the standard port for OTLP over gRPC.
However, the tracing implementation in pkg/telemetry/tracing.go uses an OTLP/HTTP exporter (otlptracehttp), which typically sends data to port 4318.

This will likely cause connection errors for users running an OTLP collector with default settings. To fix this, the default endpoint port should be changed to 4318 to match the HTTP exporter.

As a future improvement, you might consider making the OTLP protocol (gRPC vs HTTP) configurable.

Suggested change:

- TracingEndpoint: "localhost:4317",
+ TracingEndpoint: "localhost:4318",


codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 61.29032% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.78%. Comparing base (f14c6a7) to head (7f0bc7e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines:

  • pkg/telemetry/tracing.go: 8.57% patch coverage (32 lines missing)
  • pkg/cmd/run_node.go: 0.00% patch coverage (10 missing, 1 partial)
  • pkg/config/instrumentation.go: 54.54% patch coverage (4 missing, 1 partial)
@@            Coverage Diff             @@
##             main    #2957      +/-   ##
==========================================
+ Coverage   58.74%   58.78%   +0.03%     
==========================================
  Files          90       92       +2     
  Lines        8722     8846     +124     
==========================================
+ Hits         5124     5200      +76     
- Misses       3011     3057      +46     
- Partials      587      589       +2     
Flag Coverage Δ
combined 58.78% <61.29%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.


@chatton chatton marked this pull request as ready for review January 7, 2026 10:22
@tac0turtle left a comment

looks good, can we modify the config docs to encapsulate the new flags

@chatton chatton added this pull request to the merge queue Jan 8, 2026
Merged via the queue into main with commit 4ce10cd Jan 8, 2026
46 of 49 checks passed
@chatton chatton deleted the cian/add-tracing branch January 8, 2026 10:33