-
Notifications
You must be signed in to change notification settings - Fork 245
feat: adding tracing for Executor and added initial configuration #2957
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Claude finished @chatton's task in 2m 56s —— View job Code Review CompleteI've conducted a comprehensive code review of the tracing implementation. Overall, this is a well-structured initial implementation that establishes a solid foundation for observability in ev-node. The code quality is high, tests are thorough, and the decorator pattern keeps tracing concerns properly separated from core logic. SummaryStrengths:
Critical Issues: 1 Critical Issues1. Port Mismatch Between Default Config and Exporter ProtocolLocation: The default endpoint uses port Issue: While the port is now correct, without integration testing:
Recommendation: // Consider adding integration test or at minimum a smoke test
func TestTracingIntegration(t *testing.T) {
if testing.Short() {
t.Skip("skipping integration test")
}
// Start local OTLP collector, verify spans exported
}Fix by adding integration tests → High Priority Issues2. Missing Context Cancellation Handling in InitTracingLocation: The OTLP exporter creation uses the provided context directly, but if that context is cancelled before the function returns, the exporter may fail to initialize properly or may start in a degraded state. Current Code: exp, err := otlptracehttp.New(ctx,
otlptracehttp.WithEndpointURL(endpoint),
)Issue: If Recommendation: // Use background context for exporter lifecycle (it has its own shutdown)
exp, err := otlptracehttp.New(context.Background(),
otlptracehttp.WithEndpointURL(endpoint),
)The exporter's lifecycle should be independent of the startup context since it has its own shutdown mechanism. The startup context should only control the initialization timeout if needed. 3. No Timeout on Tracer Provider Shutdown During InitializationLocation: When tracing initialization succeeds, a defer is registered to shut down the tracer provider. However, if node initialization fails after tracing is set up but before the node starts running, the deferred shutdown may not execute properly because the outer context is already cancelled. Current Code: if nodeConfig.Instrumentation.IsTracingEnabled() {
shutdownTracing, err := telemetry.InitTracing(ctx, nodeConfig.Instrumentation, logger)
if err != nil {
return fmt.Errorf("failed to initialize tracing: %w", err)
}
defer func() {
c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := shutdownTracing(c); err != nil {
logger.Error().Err(err).Msg("failed to shutdown tracing")
}
}()
}Issue: While the shutdown uses Recommendation: Consider adding a small grace period or ensuring shutdown is truly blocking: defer func() {
logger.Debug().Msg("shutting down tracing...")
c, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := shutdownTracing(c); err != nil {
logger.Error().Err(err).Msg("failed to shutdown tracing")
} else {
logger.Debug().Msg("tracing shutdown complete")
}
}()Medium Priority Issues4. TracingSampleRate Not Validated at Parse TimeLocation: The sample rate validation only occurs when tracing is enabled, but invalid values could be caught earlier during config parsing. Current Code: func (cfg *InstrumentationConfig) ValidateBasic() error {
if cfg.MaxOpenConnections < 0 {
return errors.New("max_open_connections can't be negative")
}
if cfg.IsTracingEnabled() { // Only validates if enabled
if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
return errors.New("tracing_sample_rate must be between 0 and 1")
}
...
}
}Issue: Users can set invalid sample rates in config files, but won't get an error until they enable tracing. Recommendation: func (cfg *InstrumentationConfig) ValidateBasic() error {
if cfg.MaxOpenConnections < 0 {
return errors.New("max_open_connections can't be negative")
}
// Always validate sample rate if set (even when tracing disabled)
if cfg.TracingSampleRate < 0 || cfg.TracingSampleRate > 1 {
return errors.New("tracing_sample_rate must be between 0 and 1")
}
if cfg.IsTracingEnabled() {
if strings.TrimSpace(cfg.TracingEndpoint) == "" {
return errors.New("tracing_endpoint cannot be empty when tracing enabled")
}
}
return nil
}5. Missing Span Attributes for State RootsLocation: The Recommendation: stateRoot, maxBytes, err := t.inner.ExecuteTxs(ctx, txs, blockHeight, timestamp, prevStateRoot)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
} else {
span.SetAttributes(
attribute.Int("max_bytes", int(maxBytes)),
attribute.String("state_root", hex.EncodeToString(stateRoot)),
)
}
return stateRoot, maxBytes, errThis would provide better observability for state transitions and make it easier to correlate spans with blockchain state. 6. Potential Exporter Connection Retry MissingLocation: The OTLP HTTP exporter is created synchronously. If the collector is temporarily unavailable during node startup, initialization will fail and tracing will be completely disabled for the entire node run. Issue: This makes the system fragile during startup. If the OTLP collector is deploying or restarting, the entire node fails to start with tracing. Recommendation: Consider one of these approaches:
Option 3 is most pragmatic for production: exp, err := otlptracehttp.New(context.Background(),
otlptracehttp.WithEndpointURL(endpoint),
)
if err != nil {
logger.Warn().Err(err).Msg("failed to create otlp exporter, tracing will be disabled")
return func(context.Context) error { return nil }, nil
}Minor Issues & Suggestions7. Service Name Could Default to Include Node TypeLocation: The default service name is Suggestion: TracingServiceName: "ev-node", // Keep as base default
// Then in run_node.go, enhance if not explicitly set:
if nodeConfig.Instrumentation.TracingServiceName == "ev-node" {
serviceName := "ev-node"
if nodeConfig.Node.Aggregator {
serviceName += "-aggregator"
} else if nodeConfig.Node.Light {
serviceName += "-light"
}
nodeConfig.Instrumentation.TracingServiceName = serviceName
}8. Missing Documentation for Tracing ConfigurationLocation: The configuration fields have basic comments but lack details about:
Suggestion: Add doc comment: // Tracing configuration for OpenTelemetry OTLP export.
//
// Example endpoint configurations:
// - Jaeger: "localhost:4318" (OTLP HTTP) or "localhost:4317" (OTLP gRPC)
// - Grafana Tempo: "tempo:4318"
// - OTEL Collector: "otelcol:4318"
//
// Note: This implementation uses OTLP over HTTP. The endpoint should accept
// HTTP requests. Scheme (http://) is added automatically if not provided.
//
// TracingSampleRate controls what fraction of traces are exported (0.0-1.0).
// For high-throughput systems, use lower rates (0.01-0.1) to reduce overhead.9. Test Coverage for URL Scheme HandlingLocation: The code adds Suggestion: func TestInitTracing_URLSchemeHandling(t *testing.T) {
testCases := []struct {
input string
expected string
}{
{"localhost:4318", "http://localhost:4318"},
{"http://localhost:4318", "http://localhost:4318"},
{"https://collector.example.com:4318", "https://collector.example.com:4318"},
}
// Test that endpoint URL is constructed correctly
}10. Consider Adding Resource AttributesLocation: Currently only res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceName(cfg.TracingServiceName),
semconv.ServiceVersion(version.Version), // Add version info
semconv.DeploymentEnvironment(cfg.Environment), // Add env if available
),
)This helps with filtering and grouping traces in observability platforms. Architecture & Code QualityExcellent aspects:
Pattern consistency: if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
}This is excellent consistency. Security Considerations✅ Passed:
Performance Considerations✅ Good:
Suggestions:
Testing AssessmentCoverage: 61.29% patch coverage (Codecov report)
Quality: ⭐⭐⭐⭐⭐ Excellent
Missing coverage:
Recommendation: The unit tests are excellent. The missing coverage is primarily in integration scenarios, which is acceptable for this initial PR. Consider adding an e2e test in a follow-up that verifies:
Documentation & Developer Experience✅ Good:
Could improve:
Suggestion for CLAUDE.md: ## Observability
### Tracing
ev-node supports distributed tracing via OpenTelemetry OTLP export.
Enable in config:
```yaml
instrumentation:
tracing: true
tracing_endpoint: "localhost:4318"
tracing_service_name: "ev-node-aggregator"
tracing_sample_rate: 0.1Currently traces:
Future: Sequencer, DA layer, P2P operations |
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
Summary of ChangesHello @chatton, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes the foundational elements for observability in ev-node by integrating OpenTelemetry tracing. It focuses on providing initial tracing capabilities for the core Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
| replace ( | ||
| google.golang.org/genproto => google.golang.org/genproto v0.0.0-20240213162025-012b6fc9bca9 | ||
| google.golang.org/genproto/googleapis/api => google.golang.org/genproto/googleapis/api v0.0.0-20240213162025-012b6fc9bca9 | ||
| google.golang.org/genproto/googleapis/rpc => google.golang.org/genproto/googleapis/rpc v0.0.0-20240213162025-012b6fc9bca9 | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these needed to be pinned since the otlp lib depends on a specific version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces OpenTelemetry tracing to ev-node, which is a great step for observability. The changes include adding configuration options for tracing, initializing the OTLP/HTTP exporter, and wrapping the Executor with a tracing decorator. The implementation is clean, well-structured, and includes thorough tests for the new functionality.
I've found one issue related to the default configuration for the tracing endpoint, which uses a port typically associated with gRPC while the implementation uses an HTTP exporter. I've left a specific comment with a suggestion to align them.
Overall, this is a solid contribution that lays a good foundation for tracing.
pkg/config/instrumentation.go
Outdated
| Pprof: false, | ||
| PprofListenAddr: ":6060", | ||
| Tracing: false, | ||
| TracingEndpoint: "localhost:4317", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an inconsistency between the default OTLP endpoint and the exporter protocol being used.
The default endpoint is set to localhost:4317, which is the standard port for OTLP over gRPC.
However, the tracing implementation in pkg/telemetry/tracing.go uses an OTLP/HTTP exporter (otlptracehttp), which typically sends data to port 4318.
This will likely cause connection errors for users running an OTLP collector with default settings. To fix this, the default endpoint port should be changed to 4318 to match the HTTP exporter.
As a future improvement, you might consider making the OTLP protocol (gRPC vs HTTP) configurable.
| TracingEndpoint: "localhost:4317", | |
| TracingEndpoint: "localhost:4318", |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2957 +/- ##
==========================================
+ Coverage 58.74% 58.78% +0.03%
==========================================
Files 90 92 +2
Lines 8722 8846 +124
==========================================
+ Hits 5124 5200 +76
- Misses 3011 3057 +46
- Partials 587 589 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, can we modify the config docs to encapsulate the new flags
Overview
part of #2956
This PR adds the initial scaffolding for enabling tracing within ev-node.
To keep the PR as small as possible (ignoring go.mod/sum changes ) I have just added the core configuration setting for telemetry and also enables tracing for the executor.