
Conversation

@alpe
Contributor

@alpe alpe commented Jan 6, 2026

Replaces #2836

alpe added 28 commits November 12, 2025 15:16
* main:
  fix: remove duplicate error logging in light node shutdown (#2841)
  chore: fix incorrect function name in comment (#2840)
  chore: remove sequencer go.mod (#2837)
* main:
  build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
* main:
  chore: reduce log noise (#2864)
  fix: sync service for non zero height starts with empty store (#2834)
  build(deps): Bump golang.org/x/crypto from 0.43.0 to 0.45.0 in /execution/evm (#2861)
  chore: minor improvement for docs (#2862)
* main:
  chore: bump da (#2866)
  chore: bump  core (#2865)
* main:
  chore: fix some comments (#2874)
  chore: bump node in evm-single (#2875)
  refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  refactor: use state da height as well (#2872)
  refactor: retrieve highest da height in cache (#2870)
  chore: change from event count to start and end height (#2871)
* main:
  chore: remove extra github action yml file (#2882)
  fix(execution/evm): verify payload status (#2863)
  feat: fetch included da height from store (#2880)
  chore: better output on errors (#2879)
  refactor!: create da client and split cache interface (#2878)
  chore!: rename `evm-single` and `grpc-single` (#2839)
  build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  chore: parallel cache de/serialization (#2868)
  chore: bump blob size (#2877)
* main:
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
  build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  build(deps): Bump actions/checkout from 5 to 6 (#2886)
* main: (34 commits)
  feat: make reaper poll duration configurable (#2951)
  chore!: move sequencers to pkg (#2931)
  feat: Ensure Header integrity on DA (#2948)
  feat(testda): add header support with GetHeaderByHeight method (#2946)
  chore: improve code comments clarity (#2947)
  chore(sequencers): optimize store check (#2945)
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
  ...
@github-actions
Contributor

github-actions bot commented Jan 6, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build      Format      Lint       Breaking   Updated (UTC)
✅ passed   ⏩ skipped   ✅ passed   ✅ passed   Jan 8, 2026, 9:55 AM

@claude
Contributor

claude bot commented Jan 6, 2026

Claude finished @alpe's task in 2m 49s.


Code Review: High Availability via RAFT

I've completed a comprehensive code review of this Raft high availability implementation. This is a substantial architectural change that introduces distributed consensus for leader election and state replication. Below are my findings organized by priority.


Critical Issues

1. Error Handling in node/full.go (Lines 113-119) - BLOCKING

Issue: Naked return statements after error checks will swallow errors and return (nil, nil) instead of propagating the error.

Location: node/full.go:113-119

case nodeConfig.Node.Aggregator && !nodeConfig.Raft.Enable:
    if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
        return  // ❌ Wrong: swallows error, returns (nil, nil)
    }
case !nodeConfig.Node.Aggregator && !nodeConfig.Raft.Enable:
    if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
        return  // ❌ Wrong: swallows error, returns (nil, nil)
    }

Fix: Return the error explicitly:

if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
    return nil, err
}

Impact: This will cause silent failures during node startup when Raft is disabled. Critical for production.



2. Restrictive Bootstrap Check in pkg/raft/node.go (Lines 112-115) - BLOCKING

Issue: The Start() method fails fast if Bootstrap is false, preventing non-bootstrap nodes from joining an existing cluster.

Location: pkg/raft/node.go:112-115

if !n.config.Bootstrap {
    // it is intended to fail fast here. at this stage only bootstrap mode is supported.
    return fmt.Errorf("raft cluster requires bootstrap mode")
}

Problem: The comment says "at this stage only bootstrap mode is supported," but this prevents horizontal scaling. New nodes joining an existing cluster will fail to start.

Fix: Only execute bootstrap logic when Bootstrap is true, otherwise allow the node to start and join via AddPeer:

if n.config.Bootstrap {
    // Bootstrap logic here...
} 
// Allow non-bootstrap nodes to continue
return nil

Impact: Prevents cluster expansion. This is a fundamental limitation for a HA system.
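
A slightly fuller sketch of that fix, assuming the node keeps its hashicorp/raft handle and that peers are configured as ID→address pairs (both assumptions; the actual pkg/raft field names may differ):

import "github.com/hashicorp/raft"

// maybeBootstrap bootstraps a brand-new cluster only when requested.
// Non-bootstrap nodes return nil and are expected to be joined later by the
// current leader, e.g. via r.AddVoter(id, addr, 0, 0) or the node's AddPeer.
func maybeBootstrap(r *raft.Raft, bootstrap bool, selfID, selfAddr string, peers map[string]string) error {
	if !bootstrap {
		return nil
	}
	servers := []raft.Server{{
		ID:      raft.ServerID(selfID),
		Address: raft.ServerAddress(selfAddr),
	}}
	for id, addr := range peers { // flat ID→address map is a simplification
		servers = append(servers, raft.Server{
			ID:      raft.ServerID(id),
			Address: raft.ServerAddress(addr),
		})
	}
	return r.BootstrapCluster(raft.Configuration{Servers: servers}).Error()
}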



High Priority Issues

3. Race Condition in DynamicLeaderElection (pkg/raft/election.go:96-108)

Issue: Potential race between leadership check and state sync that could lead to double signing.

Location: pkg/raft/election.go:96-108

The code attempts to prevent double signing by waiting for raft messages to land and checking sync status. However, there's a TOCTOU (time-of-check-time-of-use) vulnerability:

raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if d.node.leaderID() != d.node.NodeID() {  // ❌ Leadership could change here
    d.logger.Info().Msg("lost leadership during sync wait")
    continue
}
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
    // Transfer leadership...
}

Recommendation:

  1. Use atomic operations or a mutex to ensure leadership state doesn't change during this critical section
  2. Add explicit verification that leadership hasn't changed between the sync check and starting leader operations
  3. Consider adding a comment explaining the double-sign prevention mechanism

Impact: Low probability but high severity - could result in a chain fork if two nodes sign different blocks at the same height.
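
One way to narrow that window, sketched below: after the sync checks pass, confirm leadership through hashicorp/raft's VerifyLeader (which round-trips to a quorum) immediately before starting leader-only work. The startLeaderWork callback is a placeholder, not a name from the PR:

import (
	"fmt"

	"github.com/hashicorp/raft"
)

// becomeLeaderSafely re-verifies leadership with a quorum right before leader
// work starts, shrinking the gap between "checked leadership" and "used it".
func becomeLeaderSafely(r *raft.Raft, startLeaderWork func() error) error {
	if err := r.VerifyLeader().Error(); err != nil {
		return fmt.Errorf("leadership lost before starting leader work: %w", err)
	}
	return startLeaderWork()
}

This does not eliminate the race entirely (leadership can still change after VerifyLeader returns), so double-sign protection ultimately has to come from the DA or fencing layer; it only makes the check-to-use window much smaller.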


4. Unbuffered Channel Risk in FSM.Apply (pkg/raft/node.go:293-299)

Issue: The FSM drops messages on slow consumers with only a warning.

Location: pkg/raft/node.go:293-299

if f.applyCh != nil {
    select {
    case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
    default:
        f.logger.Warn().Msg("apply channel full, dropping message")
    }
}

Problem: While the comment says "Followers can sync from DA or other peers," silently dropping state updates can lead to:

  • Stale state on followers
  • Increased load on DA and P2P layers
  • Difficulty debugging sync issues

Recommendations:

  1. Add metrics to track dropped messages
  2. Include more context in the warning (e.g., current buffer size, consecutive drops)
  3. Consider implementing backpressure or circuit breaker pattern
  4. Document the recovery mechanism more clearly
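
A sketch of recommendations 1 and 2, layered on top of the quoted snippet; the counter wiring (Go 1.19 sync/atomic.Int64) and the extra log fields are assumptions, not the PR's actual code:

// droppedApplyMsgs counts state updates dropped on a slow consumer; it can be
// surfaced through the node's existing metrics layer.
var droppedApplyMsgs atomic.Int64 // import "sync/atomic"

if f.applyCh != nil {
	select {
	case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
	default:
		n := droppedApplyMsgs.Add(1)
		f.logger.Warn().
			Int64("dropped_total", n).
			Int("buffer_cap", cap(f.applyCh)).
			Uint64("log_index", log.Index).
			Msg("apply channel full, dropping message")
	}
}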

5. Typo in Log Message (pkg/raft/node.go:122)

Location: pkg/raft/node.go:122

n.logger.Info().Msg("Boostrap raft cluster")  // ❌ "Boostrap" should be "Bootstrap"

Minor but should be fixed for professionalism.


Medium Priority Issues

6. Debug Log at Info Level (block/internal/syncing/syncer.go:184)

Location: block/internal/syncing/syncer.go:184

s.logger.Debug().Uint64("header_height", state.LastSubmittedDaHeaderHeight)...Msg("received raft block state")

The code appears correct (using Debug level), but Gemini's review mentioned this has a +++ prefix suggesting it was temporary debug code. Verify this is intentional.


7. Inconsistent Naming Convention (pkg/rpc/server/http.go:143-146)

Location: pkg/rpc/server/http.go:143-146

NodeId   string `json:"node_id"`  // ❌ Should be NodeID per Go conventions

Go convention is to use ID not Id for initialisms. Should be:

NodeID   string `json:"node_id"`

Architecture & Design Observations

Strengths

  1. Clean Separation: The leaderElection interface and DynamicLeaderElection implementation provide good abstraction
  2. Failover Design: The failoverState pattern cleanly encapsulates mode-switching logic
  3. Test Coverage: Good test coverage for critical paths (election tests, node tests)
  4. State Validation: assertValid() function provides basic state consistency checks

Concerns

  1. Complexity: The dynamic mode switching adds significant complexity. Ensure operational runbooks document:

    • How to safely add/remove nodes
    • Recovery procedures for split-brain scenarios
    • Monitoring and alerting requirements
  2. Test Coverage Gap: Code coverage dropped from 58.74% to 56.77%. Key areas needing more tests:

    • pkg/raft/node.go (12.12% coverage)
    • block/internal/syncing/raft_retriever.go (0% coverage)
    • node/failover.go (74.45% - needs edge case coverage)
  3. Error Context: Some error messages lack context (e.g., "not leader" in Broadcast). Consider adding node ID and current leader ID.


Security Considerations

1. Leadership Transfer Security

The leadership transfer mechanism in election.go correctly tries to prevent double-signing, but:

  • Ensure DA submission is idempotent
  • Document what happens if leadership transfer fails during the grace period
  • Consider adding explicit fencing tokens
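
To make the fencing-token suggestion concrete, a purely illustrative sketch (none of these names exist in the PR): capture a token such as the raft term when leadership is acquired, attach it to every DA submission, and refuse to submit once the observed token has moved on.

import "fmt"

// fencedSubmitter attaches a fencing token captured at leadership acquisition
// to each DA submission and re-checks it just before sending.
type fencedSubmitter struct {
	token        uint64                                // captured when the node became leader
	currentToken func() uint64                         // latest observed token (e.g. raft term)
	submit       func(blob []byte, token uint64) error // underlying DA submit
}

func (s *fencedSubmitter) Submit(blob []byte) error {
	if s.currentToken() != s.token {
		return fmt.Errorf("stale fencing token %d: leadership changed, refusing to submit", s.token)
	}
	return s.submit(blob, s.token)
}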

2. Raft Transport Security

Observation: The code uses plain TCP for Raft communication:

transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)

Recommendation: For production deployments:

  • Document that Raft communication should happen over a trusted network
  • Consider adding TLS support for Raft transport
  • Add configuration validation to warn if Raft is exposed on public interfaces
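
A sketch of that configuration warning, assuming the zerolog-style logger used elsewhere in the PR (field names are stand-ins): flag a Raft bind address that is neither loopback nor a private range, including the catch-all 0.0.0.0.

import (
	"net"

	"github.com/rs/zerolog"
)

// warnIfPublicRaftAddr warns when the Raft listener does not bind a loopback
// or RFC 1918/4193 private address (0.0.0.0 also triggers the warning, since
// it listens on every interface).
func warnIfPublicRaftAddr(raftAddr string, logger zerolog.Logger) {
	host, _, err := net.SplitHostPort(raftAddr)
	if err != nil {
		return // malformed addresses are handled by the main validation path
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return // hostnames are not classified here
	}
	if !ip.IsLoopback() && !ip.IsPrivate() {
		logger.Warn().Str("raft_addr", raftAddr).
			Msg("raft transport appears reachable beyond a private network; use a trusted network or add TLS")
	}
}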

3. Credential Management

No obvious credential leakage issues found. Good practices observed:

  • No hardcoded secrets
  • Proper use of JWT secrets from configuration
  • Auth tokens properly marked with #nosec comment

Performance & Resource Efficiency

1. Goroutine Management

Good practices observed:

  • Proper use of sync.WaitGroup for goroutine lifecycle
  • Context-based cancellation throughout
  • Deferred cleanup functions

2. Memory Management

Concern: The RaftBlockState is cloned frequently:

func (n *Node) GetState() *RaftBlockState {
    return proto.Clone(n.fsm.state.Load()).(*RaftBlockState)  // ❌ Clone on every read
}

This is safe but could be expensive under high read load. Consider:

  • Adding a read-heavy cache if this becomes a bottleneck
  • Documenting why the clone is necessary (prevents mutation)

Concurrency Patterns

Strong Points

  1. Atomic Pointers: Good use of atomic.Pointer for state management
  2. Channel Patterns: Proper buffering and select statements
  3. Context Propagation: Consistent use of context for cancellation

Areas for Improvement

  1. Lock Granularity: raftRetriever uses a mutex for start/stop. Consider if this could be simplified with atomics.

  2. Error Channel Buffering: In DynamicLeaderElection.Run, the error channel has buffer size 1:

errCh := make(chan error, 1)

This means if both leader and follower workers error simultaneously, one error is lost. Consider:

  • Increasing buffer size
  • Using errgroup for error aggregation
  • Or documenting why losing the second error is acceptable
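
A sketch of the errgroup option with placeholder worker signatures: both workers share one derived context, the first error cancels the sibling, and Wait returns that first error without an explicit error channel.

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runLeaderFollower runs both workers until the first error or until ctx is
// cancelled; the first non-nil error cancels the sibling via gctx and is
// returned from Wait.
func runLeaderFollower(ctx context.Context, leader, follower func(context.Context) error) error {
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error { return leader(gctx) })
	g.Go(func() error { return follower(gctx) })
	return g.Wait()
}

Note that errgroup also keeps only the first error, but it does so deterministically and guarantees the sibling is cancelled and waited for, which addresses the main concern here.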

Testing Recommendations

  1. Add Chaos Tests: Test leadership changes during:

    • Block production
    • DA submission
    • Network partitions
  2. Integration Tests: Add tests for:

    • Node restart scenarios
    • Rolling upgrades
    • State synchronization after extended downtime
  3. Load Tests: Verify performance with:

    • High transaction throughput
    • Frequent leadership changes
    • Multiple nodes (3, 5, 7)

Configuration Validation

The Raft configuration validation looks good:

  • Proper bounds checking on timeouts
  • Peer address validation
  • Duplicate detection

Minor Enhancement: Consider validating the relationship between LeaderLeaseTimeout and HeartbeatTimeout up front; hashicorp/raft's ValidateConfig rejects configurations where the leader lease timeout exceeds the heartbeat timeout, and catching that early gives a clearer error than a startup failure deep inside the raft library.
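
A minimal sketch of that validation, mirroring the rule hashicorp/raft itself enforces in ValidateConfig; the parameter names are stand-ins for the pkg/raft config fields:

import (
	"fmt"
	"time"
)

// validateTimeouts mirrors hashicorp/raft's ValidateConfig rule: the leader
// lease must not outlast the heartbeat timeout.
func validateTimeouts(heartbeat, leaderLease time.Duration) error {
	if leaderLease > heartbeat {
		return fmt.Errorf("leader lease timeout (%s) must not exceed heartbeat timeout (%s)", leaderLease, heartbeat)
	}
	return nil
}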


Documentation Gaps

  1. ADR Reference: The PR title mentions ADR023 but the ADR isn't linked. Add a reference in the PR description.

  2. Operational Guide: Critical need for documentation on:

    • Minimum cluster size (appears to be 3 based on tests)
    • Quorum requirements
    • Recovery procedures
    • Monitoring dashboards
  3. Migration Path: How do existing single-node deployments migrate to HA?


Pre-Merge Checklist

Before merging, ensure:

  • Fix critical error handling issues (items 1 and 2 above)
  • Run make lint and make test-all to ensure all checks pass
  • Verify E2E tests pass consistently (not just once)
  • Update documentation with:
    • Configuration examples for 3-node cluster
    • Migration guide from single-node
    • Troubleshooting guide
  • Add monitoring/metrics for:
    • Leadership changes
    • Raft apply latency
    • Dropped messages count
    • Sync status per node
  • Security review of Raft transport (TLS recommendation)

Summary

This is a well-structured implementation of Raft-based high availability. The architecture is sound, with good separation of concerns and proper use of Go concurrency patterns. However, there are two critical bugs that must be fixed before merge:

  1. Error handling in node/full.go (swallowed errors)
  2. Bootstrap restriction in pkg/raft/node.go (prevents cluster expansion)

The code demonstrates good engineering practices overall, but would benefit from:

  • Additional test coverage (especially integration tests)
  • Enhanced documentation
  • More defensive error handling
  • Production-readiness improvements (monitoring, TLS)

Recommendation: Fix the two critical issues, then this can proceed to merge with a plan to address medium-priority items in follow-up PRs.


Tasks Completed

  • Examine critical error handling issues identified by Gemini
  • Review core Raft implementation (pkg/raft/*)
  • Review failover mechanism (node/failover.go)
  • Review block processing integration
  • Review configuration and API changes
  • Assess test coverage and quality
  • Check security considerations
  • Review concurrency patterns
  • Compile final feedback

@gemini-code-assist
Contributor

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Rollkit network's resilience and fault tolerance by integrating the Raft consensus algorithm. The changes enable a robust failover mechanism, allowing nodes to seamlessly transition between leader and follower roles. This ensures continuous operation and consistent state across the network, even if individual nodes experience failures. The update touches various core components, from dependency management and block processing to RPC services and node lifecycle management, all aimed at achieving high availability.

Highlights

  • Raft Consensus Integration: Introduced Raft consensus for high availability, enabling dynamic leader election and state replication. This involves new dependencies on github.com/hashicorp/raft and github.com/hashicorp/raft-boltdb.
  • Node Architecture Refactoring: The node's core logic has been refactored to support a failover mechanism, allowing nodes to dynamically switch between aggregator (leader) and sync (follower) modes based on Raft's leadership status. This includes changes to how P2P clients are initialized and passed to node startup commands.
  • Block Processing with Raft: The block executor now checks Raft leadership before producing new blocks and proposes the block state to the Raft cluster after execution, ensuring state consistency across the cluster.
  • Cache and State Management Enhancements: Added methods to retrieve and persist the last submitted DA header and data heights from the cache. Logic was introduced to skip in-flight (temporary) headers and data during cache retrieval to prevent processing incomplete blocks.
  • New RPC Endpoints and Configuration: Added new RPC endpoints for querying Raft node status (e.g., /raft/node) and introduced new configuration flags for Raft-specific settings like node_id, raft_addr, bootstrap, and peers.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces high availability via RAFT consensus, which is a significant architectural change. The implementation is extensive, touching many parts of the system from node startup to block processing and syncing. Overall, the changes are well-structured, introducing clear interfaces for RAFT integration and refactoring the node components to support dynamic role switching between leader and follower. I've identified a few critical issues related to error handling and startup logic that need to be addressed, along with some medium-severity issues for improving code clarity and maintainability. Great work on this complex feature.

I am having trouble creating individual review comments, so my feedback is collected below.

node/full.go (113-115)

high

The naked return here will not propagate the error from newSingleRoleElector. This will cause the function to return (nil, nil), masking the underlying error. The error should be returned to the caller.

		if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
			return nil, err
		}

node/full.go (117-119)

high

Similar to the previous case, the naked return here will swallow the error from newSingleRoleElector. The error should be propagated up to the caller.

		if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
			return nil, err
		}

pkg/raft/node.go (111-113)

high

This check prevents a node from starting if Bootstrap is false, which is problematic for nodes joining an existing cluster. A new node attempting to join will fail to start. The bootstrap logic should only execute if n.config.Bootstrap is true, and the function should return nil otherwise, allowing non-bootstrap nodes to start and join a cluster.

block/internal/cache/pending_headers.go (69-71)

medium

The method name GetLastSubmittedDataHeight is misleading as it's part of the PendingHeaders struct. For clarity and consistency, it should be renamed to GetLastSubmittedHeaderHeight.

This change will also require updating the call site in block/internal/cache/manager.go.

func (ph *PendingHeaders) GetLastSubmittedHeaderHeight() uint64 {
	return ph.base.getLastSubmittedHeight()
}

block/internal/executing/executor.go (570-572)

medium

The explicit type conversion types.Tx(tx) is redundant since types.Tx is an alias for []byte, and tx is already of type []byte. The change to a direct assignment is good, but it seems this loop could be replaced with a single, more efficient append call.

	data.Txs = append(data.Txs, batchData.Transactions...)

block/internal/syncing/syncer.go (184)

medium

This log message seems to be for debugging purposes, indicated by the +++ prefix. It should be logged at the Debug level instead of Info to avoid cluttering the logs in a production environment.

			s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDADataHeight).Msg("received raft block state")

pkg/rpc/server/http.go (143-146)

medium

To adhere to Go's naming conventions for initialisms, the struct field NodeId should be renamed to NodeID.

				NodeID   string `json:"node_id"`
			}{
				IsLeader: raftNode.IsLeader(),
				NodeID:   raftNode.NodeID(),

@codecov

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 40.06342% with 567 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.80%. Comparing base (f14c6a7) to head (adc05ed).

Files with missing lines Patch % Lines
pkg/raft/node.go 12.12% 174 Missing ⚠️
pkg/raft/node_mock.go 45.40% 74 Missing and 21 partials ⚠️
block/internal/syncing/raft_retriever.go 0.00% 63 Missing ⚠️
node/full.go 32.30% 37 Missing and 7 partials ⚠️
block/internal/syncing/syncer.go 25.45% 39 Missing and 2 partials ⚠️
node/failover.go 74.45% 22 Missing and 13 partials ⚠️
block/internal/executing/executor.go 5.55% 30 Missing and 4 partials ⚠️
pkg/raft/election.go 80.00% 12 Missing and 5 partials ⚠️
pkg/rpc/server/http.go 6.66% 13 Missing and 1 partial ⚠️
block/components.go 27.27% 7 Missing and 1 partial ⚠️
... and 13 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2954      +/-   ##
==========================================
- Coverage   58.74%   56.80%   -1.95%     
==========================================
  Files          90       97       +7     
  Lines        8722     9492     +770     
==========================================
+ Hits         5124     5392     +268     
- Misses       3011     3476     +465     
- Partials      587      624      +37     
Flag Coverage Δ
combined 56.80% <40.06%> (-1.95%) ⬇️

Flags with carried forward coverage won't be shown.


