Conversation

@chatton chatton commented Jan 7, 2026

Overview

part of #2956

This PR adds the initial scaffolding for enabling tracing within ev-node.

To keep the PR as small as possible (ignoring go.mod/go.sum changes), I have added only the core configuration settings for telemetry and enabled tracing for the executor.

claude bot commented Jan 7, 2026

Claude finished @chatton's task in 2m 56s


Code Review Complete

I've conducted a comprehensive code review of the tracing implementation. Overall, this is a well-structured initial implementation that establishes a solid foundation for observability in ev-node. The code quality is high, tests are thorough, and the decorator pattern keeps tracing concerns properly separated from core logic.


Summary

Strengths:

  • Clean decorator pattern for executor tracing
  • Comprehensive test coverage for tracing functionality
  • Proper error handling and status recording in spans
  • Good configuration validation
  • Zero-dependency core maintained
  • Resource cleanup properly handled

Critical Issues: 1
High Priority Issues: 2
Medium Priority Issues: 3
Minor Issues/Suggestions: 4


Critical Issues

1. Port Mismatch Between Default Config and Exporter Protocol

Location: pkg/config/instrumentation.go:70

The default endpoint uses port 4318 (correct for HTTP), but as noted by the Gemini review, this was recently changed. The implementation correctly uses otlptracehttp which expects port 4318. However, there's a critical integration testing gap: the code has never been tested against a real OTLP collector.

Issue: While the port is now correct, without integration testing:

  • We don't know if the endpoint URL construction works properly
  • We haven't verified the exporter successfully connects
  • We haven't validated that spans are actually exported

Recommendation:

// Consider adding integration test or at minimum a smoke test
func TestTracingIntegration(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping integration test")
    }
    // Start local OTLP collector, verify spans exported
}



High Priority Issues

2. Missing Context Cancellation Handling in InitTracing

Location: pkg/telemetry/tracing.go:40-45

The OTLP exporter creation uses the provided context directly, but if that context is cancelled before the function returns, the exporter may fail to initialize properly or may start in a degraded state.

Current Code:

exp, err := otlptracehttp.New(ctx,
    otlptracehttp.WithEndpointURL(endpoint),
)

Issue: If ctx is cancelled during StartNode, the exporter may be partially initialized but non-functional.

Recommendation:

// Use background context for exporter lifecycle (it has its own shutdown)
exp, err := otlptracehttp.New(context.Background(),
    otlptracehttp.WithEndpointURL(endpoint),
)

The exporter's lifecycle should be independent of the startup context since it has its own shutdown mechanism. The startup context should only control the initialization timeout if needed.


3. No Timeout on Tracer Provider Shutdown During Initialization

Location: pkg/cmd/run_node.go:94-107

When tracing initialization succeeds, a defer is registered to shut down the tracer provider. However, if node initialization fails after tracing is set up but before the node starts running, the deferred shutdown may not execute properly because the outer context is already cancelled.

Current Code:

if nodeConfig.Instrumentation.IsTracingEnabled() {
    shutdownTracing, err := telemetry.InitTracing(ctx, nodeConfig.Instrumentation, logger)
    if err != nil {
        return fmt.Errorf("failed to initialize tracing: %w", err)
    }
    defer func() {
        c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := shutdownTracing(c); err != nil {
            logger.Error().Err(err).Msg("failed to shutdown tracing")
        }
    }()
}

Issue: While the shutdown uses context.Background() (good!), it's buried in a defer that fires after the function returns. If node creation fails at lines 167-182, the defer runs, but any spans created during the failed initialization may not be flushed properly because the shutdown happens too quickly after the error.

Recommendation: Consider adding a small grace period or ensuring shutdown is truly blocking:

defer func() {
    logger.Debug().Msg("shutting down tracing...")
    c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    if err := shutdownTracing(c); err != nil {
        logger.Error().Err(err).Msg("failed to shutdown tracing")
    } else {
        logger.Debug().Msg("tracing shutdown complete")
    }
}()

Medium Priority Issues

4. TracingSampleRate Not Validated at Parse Time

Location: pkg/config/instrumentation.go:78-90

The sample rate validation only occurs when tracing is enabled, but invalid values could be caught earlier during config parsing.

Current Code:

func (cfg *InstrumentationConfig) ValidateBasic() error {
    if cfg.MaxOpenConnections < 0 {
        return errors.New("max_open_connections can't be negative")
    }
    if cfg.IsTracingEnabled() {  // Only validates if enabled
        if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
            return errors.New("tracing_sample_rate must be between 0 and 1")
        }
        ...
    }
}

Issue: Users can set invalid sample rates in config files, but won't get an error until they enable tracing.

Recommendation:

func (cfg *InstrumentationConfig) ValidateBasic() error {
    if cfg.MaxOpenConnections < 0 {
        return errors.New("max_open_connections can't be negative")
    }
    // Always validate sample rate if set (even when tracing disabled)
    if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
        return errors.New("tracing_sample_rate must be between 0 and 1")
    }
    if cfg.IsTracingEnabled() {
        if strings.TrimSpace(cfg.TracingEndpoint) == "" {
            return errors.New("tracing_endpoint cannot be empty when tracing enabled")
        }
    }
    return nil
}


5. Missing Span Attributes for State Roots

Location: pkg/telemetry/executor_tracing.go:60-76

The ExecuteTxs method records transaction count, block height, and timestamp as span attributes, but doesn't record whether execution succeeded (beyond error status) or add the resulting state root hash as an attribute.

Recommendation:

stateRoot, maxBytes, err := t.inner.ExecuteTxs(ctx, txs, blockHeight, timestamp, prevStateRoot)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
} else {
    span.SetAttributes(
        attribute.Int("max_bytes", int(maxBytes)),
        attribute.String("state_root", hex.EncodeToString(stateRoot)),
    )
}
return stateRoot, maxBytes, err

This would provide better observability for state transitions and make it easier to correlate spans with blockchain state.

6. Potential Exporter Connection Retry Missing

Location: pkg/telemetry/tracing.go:40-45

The OTLP HTTP exporter is created synchronously. If the collector is temporarily unavailable during node startup, initialization will fail and tracing will be completely disabled for the entire node run.

Issue: This makes the system fragile during startup. If the OTLP collector is deploying or restarting, the entire node fails to start with tracing.

Recommendation: Consider one of these approaches:

  1. Make exporter creation non-blocking: The exporter will buffer spans and retry connection in background
  2. Add retry logic with backoff: Retry exporter creation 3-5 times with exponential backoff
  3. Degrade gracefully: Log warning and disable tracing if exporter creation fails, rather than failing startup

Option 3 is most pragmatic for production:

exp, err := otlptracehttp.New(context.Background(),
    otlptracehttp.WithEndpointURL(endpoint),
)
if err != nil {
    logger.Warn().Err(err).Msg("failed to create otlp exporter, tracing will be disabled")
    return func(context.Context) error { return nil }, nil
}

Minor Issues & Suggestions

7. Service Name Could Default to Include Node Type

Location: pkg/config/instrumentation.go:71

The default service name is "ev-node" for all nodes. In a distributed system with multiple node types (aggregator, light node, sequencer), it would be helpful to distinguish them in traces.

Suggestion:

TracingServiceName: "ev-node",  // Keep as base default

// Then in run_node.go, enhance if not explicitly set:
if nodeConfig.Instrumentation.TracingServiceName == "ev-node" {
    serviceName := "ev-node"
    if nodeConfig.Node.Aggregator {
        serviceName += "-aggregator"
    } else if nodeConfig.Node.Light {
        serviceName += "-light"
    }
    nodeConfig.Instrumentation.TracingServiceName = serviceName
}

8. Missing Documentation for Tracing Configuration

Location: pkg/config/instrumentation.go:46-56

The configuration fields have basic comments but lack details about:

  • What OTLP endpoint format is expected
  • Example configurations for common OTLP collectors (Jaeger, Tempo, etc.)
  • How sampling works and performance implications

Suggestion: Add doc comment:

// Tracing configuration for OpenTelemetry OTLP export.
//
// Example endpoint configurations:
//   - Jaeger: "localhost:4318" (OTLP HTTP) or "localhost:4317" (OTLP gRPC)
//   - Grafana Tempo: "tempo:4318"
//   - OTEL Collector: "otelcol:4318"
//
// Note: This implementation uses OTLP over HTTP. The endpoint should accept
// HTTP requests. Scheme (http://) is added automatically if not provided.
//
// TracingSampleRate controls what fraction of traces are exported (0.0-1.0).
// For high-throughput systems, use lower rates (0.01-0.1) to reduce overhead.

9. Test Coverage for URL Scheme Handling

Location: pkg/telemetry/tracing.go:36-39

The code adds http:// prefix if not present, but there's no test coverage for this logic.

Suggestion:

func TestInitTracing_URLSchemeHandling(t *testing.T) {
    testCases := []struct {
        input    string
        expected string
    }{
        {"localhost:4318", "http://localhost:4318"},
        {"http://localhost:4318", "http://localhost:4318"},
        {"https://collector.example.com:4318", "https://collector.example.com:4318"},
    }
    // Test that endpoint URL is constructed correctly
}

10. Consider Adding Resource Attributes

Location: pkg/telemetry/tracing.go:26-30

Currently only service.name is set as a resource attribute. Consider adding more standard attributes for better observability:

res, err := resource.New(ctx,
    resource.WithAttributes(
        semconv.ServiceName(cfg.TracingServiceName),
        semconv.ServiceVersion(version.Version), // Add version info
        semconv.DeploymentEnvironment(cfg.Environment), // Add env if available
    ),
)

This helps with filtering and grouping traces in observability platforms.


Architecture & Code Quality

Excellent aspects:

  • ✅ Decorator pattern properly separates concerns
  • ✅ Zero-dependency core maintained (tracing is in pkg/)
  • ✅ Context properly threaded through all operations
  • ✅ Tracer obtained once and reused (not per-span)
  • ✅ Span lifecycle correctly managed (defer span.End())
  • ✅ Error recording includes both RecordError and SetStatus

Pattern consistency:
All five traced methods follow identical error handling:

if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, err.Error())
}

This is excellent consistency.


Security Considerations

✅ Passed:

  • No sensitive data (private keys, auth tokens) in span attributes
  • State roots are not logged (but consider adding as hex string, see issue 5 above)
  • Configuration validation prevents injection attacks
  • No user input directly used in span names or attributes

⚠️ Minor concern:

  • Chain IDs and timestamps are logged. Ensure these don't leak sensitive info in your use case.

Performance Considerations

✅ Good:

  • Sampling rate configured (default 0.1 = 10%)
  • Batching enabled for span export
  • Parent-based sampler respects upstream sampling decisions
  • No blocking operations in hot path (spans buffered)

Suggestions:

  • Consider making batch timeout/size configurable for high-throughput scenarios
  • Document performance impact of tracing in production

Testing Assessment

Coverage: 61.29% patch coverage (Codecov report)

  • 48 lines missing coverage, primarily in tracing.go (32 lines)

Quality: ⭐⭐⭐⭐⭐ Excellent

  • Comprehensive executor tracing tests (success and error paths)
  • Proper use of in-memory span recorder for testing
  • Tests verify span names, status codes, attributes, and error events
  • Good test organization with helper functions
  • HeightProvider interface properly tested

Missing coverage:

  • tracing.go:40-45 - Exporter creation (integration test needed)
  • run_node.go:94-107 - Tracing initialization in real node context (e2e coverage)

Recommendation: The unit tests are excellent. The missing coverage is primarily in integration scenarios, which is acceptable for this initial PR. Consider adding an e2e test in a follow-up that verifies:

  1. Node starts with tracing enabled
  2. Spans are exported to a test collector
  3. Distributed tracing context propagation works

Documentation & Developer Experience

✅ Good:

  • Public functions have doc comments
  • Test helper functions documented
  • Configuration fields have comments

Could improve:

  • Add example configuration to CLAUDE.md or README showing how to enable tracing
  • Document which operations are traced (currently only Executor)
  • Add troubleshooting guide for common OTLP connection issues

Suggestion for CLAUDE.md:

## Observability

### Tracing

ev-node supports distributed tracing via OpenTelemetry OTLP export.

Enable in config:
```yaml
instrumentation:
  tracing: true
  tracing_endpoint: "localhost:4318"
  tracing_service_name: "ev-node-aggregator"
  tracing_sample_rate: 0.1
```

Currently traces:

  • Executor operations (InitChain, GetTxs, ExecuteTxs, SetFinal)

Future: Sequencer, DA layer, P2P operations


---

Summary & Next Steps

This PR establishes a solid foundation for observability. The implementation is clean, well-tested, and properly architected. The remaining issues are mostly around error scenarios and integration testing and don't block merging.

Recommended action:

  1. Address the context cancellation issue (#2) - quick fix
  2. Consider graceful degradation for exporter failures (#6) - improves production robustness
  3. Add integration/e2e tests in a follow-up PR
  4. Merge and iterate

Future work to track:

  • Add tracing to other components (Sequencer, DA client, P2P layer)
  • Add metrics correlation (span IDs in metrics)
  • Add per-operation sampling configuration
  • Add distributed tracing context propagation in P2P messages

Great work on this initial implementation! 🎉

---

github-actions bot commented Jan 7, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Jan 7, 2026, 10:02 AM

@gemini-code-assist

Summary of Changes

Hello @chatton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes the foundational elements for observability in ev-node by integrating OpenTelemetry tracing. It focuses on providing initial tracing capabilities for the core Executor component, enabling developers to gain insights into its operations. The changes include adding necessary configuration parameters and updating dependencies, setting the stage for broader tracing adoption throughout the system.

Highlights

  • OpenTelemetry Tracing Integration: Introduced initial scaffolding for OpenTelemetry tracing within ev-node, allowing for distributed tracing of application components.
  • Executor Tracing: Implemented a tracing decorator for the Executor interface, automatically creating spans for key operations like InitChain, GetTxs, ExecuteTxs, SetFinal, and GetLatestHeight.
  • Configuration Options: Added new configuration flags and settings for tracing, including enabling/disabling tracing, specifying the OTLP endpoint, defining the service name, and setting a trace sampling rate.
  • Dependency Updates: Updated various go.mod and go.sum files across the project to incorporate new OpenTelemetry dependencies and update existing ones to support the tracing functionality.




Comment on lines +212 to +216
replace (
google.golang.org/genproto => google.golang.org/genproto v0.0.0-20240213162025-012b6fc9bca9
google.golang.org/genproto/googleapis/api => google.golang.org/genproto/googleapis/api v0.0.0-20240213162025-012b6fc9bca9
google.golang.org/genproto/googleapis/rpc => google.golang.org/genproto/googleapis/rpc v0.0.0-20240213162025-012b6fc9bca9
)
chatton (PR author) replied:
these needed to be pinned since the otlp lib depends on a specific version.

gemini-code-assist bot left a comment

Code Review

This pull request introduces OpenTelemetry tracing to ev-node, which is a great step for observability. The changes include adding configuration options for tracing, initializing the OTLP/HTTP exporter, and wrapping the Executor with a tracing decorator. The implementation is clean, well-structured, and includes thorough tests for the new functionality.

I've found one issue related to the default configuration for the tracing endpoint, which uses a port typically associated with gRPC while the implementation uses an HTTP exporter. I've left a specific comment with a suggestion to align them.

Overall, this is a solid contribution that lays a good foundation for tracing.

Pprof: false,
PprofListenAddr: ":6060",
Tracing: false,
TracingEndpoint: "localhost:4317",
Severity: high

There's an inconsistency between the default OTLP endpoint and the exporter protocol being used.

The default endpoint is set to localhost:4317, which is the standard port for OTLP over gRPC.
However, the tracing implementation in pkg/telemetry/tracing.go uses an OTLP/HTTP exporter (otlptracehttp), which typically sends data to port 4318.

This will likely cause connection errors for users running an OTLP collector with default settings. To fix this, the default endpoint port should be changed to 4318 to match the HTTP exporter.

As a future improvement, you might consider making the OTLP protocol (gRPC vs HTTP) configurable.

Suggested change:

- TracingEndpoint: "localhost:4317",
+ TracingEndpoint: "localhost:4318",


codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 61.29032% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.78%. Comparing base (f14c6a7) to head (7f0bc7e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines:

  • pkg/telemetry/tracing.go: 8.57% patch coverage (32 lines missing)
  • pkg/cmd/run_node.go: 0.00% patch coverage (10 missing, 1 partial)
  • pkg/config/instrumentation.go: 54.54% patch coverage (4 missing, 1 partial)
@@            Coverage Diff             @@
##             main    #2957      +/-   ##
==========================================
+ Coverage   58.74%   58.78%   +0.03%     
==========================================
  Files          90       92       +2     
  Lines        8722     8846     +124     
==========================================
+ Hits         5124     5200      +76     
- Misses       3011     3057      +46     
- Partials      587      589       +2     
Flag Coverage Δ
combined 58.78% <61.29%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown.


@chatton chatton marked this pull request as ready for review January 7, 2026 10:22
@tac0turtle left a comment

looks good, can we modify the config docs to encapsulate the new flags

@chatton chatton added this pull request to the merge queue Jan 8, 2026
Merged via the queue into main with commit 4ce10cd Jan 8, 2026
46 of 49 checks passed
@chatton chatton deleted the cian/add-tracing branch January 8, 2026 10:33