feat: High availability via RAFT #2954
Conversation
* main:
  - build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  - build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  - build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  - build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  - build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)

(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)

* main:
  - chore: fix some comments (#2874)
  - chore: bump node in evm-single (#2875)
  - refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  - refactor: use state da height as well (#2872)
  - refactor: retrieve highest da height in cache (#2870)
  - chore: change from event count to start and end height (#2871)

* main:
  - chore: remove extra github action yml file (#2882)
  - fix(execution/evm): verify payload status (#2863)
  - feat: fetch included da height from store (#2880)
  - chore: better output on errors (#2879)
  - refactor!: create da client and split cache interface (#2878)
  - chore!: rename `evm-single` and `grpc-single` (#2839)
  - build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  - chore: parallel cache de/serialization (#2868)
  - chore: bump blob size (#2877)

* main:
  - build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  - refactor(block): centralize timeout in client (#2903)
  - build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  - chore: bump default timeout (#2902)
  - fix: revert default db (#2897)
  - refactor: remove obsolete // +build tag (#2899)
  - fix:da visualiser namespace (#2895)
  - refactor: omit unnecessary reassignment (#2892)
  - build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  - chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  - fix: ensure consistent network ID usage in P2P subscriber (#2884)
  - build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  - build(deps): Bump actions/checkout from 5 to 6 (#2886)

* main: (34 commits)
  - feat: make reaper poll duration configurable (#2951)
  - chore!: move sequencers to pkg (#2931)
  - feat: Ensure Header integrity on DA (#2948)
  - feat(testda): add header support with GetHeaderByHeight method (#2946)
  - chore: improve code comments clarity (#2947)
  - chore(sequencers): optimize store check (#2945)
  - fix: make evm_execution more robust (#2942)
  - fix(sequencers/single): deterministic queue (#2938)
  - fix(block): fix init logic sequencer for da epoch fetching (#2926)
  - feat: use DA timestamp (#2939)
  - chore: improve code comments clarity (#2943)
  - build(deps): bump libp2p (#2937)
  - build(deps): Bump actions/cache from 4 to 5 (#2934)
  - build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  - build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  - feat: DA Client remove interface part 3, replace types with new code (#2910)
  - DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
  - fix(syncing): skip forced txs checks for p2p blocks (#2922)
  - build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  - chore: loosen syncer state check (#2927)
  - ...
Claude finished @alpe's task in 2m 49s.

## Code Review: High Availability via RAFT

I've completed a comprehensive code review of this Raft high availability implementation. This is a substantial architectural change that introduces distributed consensus for leader election and state replication. Below are my findings, organized by priority.

### Critical Issues

#### 1. Error Handling in node/full.go (Lines 113-119) - BLOCKING

**Issue:** Naked `return` statements swallow the error from `newSingleRoleElector`.

**Location:**

```go
case nodeConfig.Node.Aggregator && !nodeConfig.Raft.Enable:
	if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
		return // ❌ Wrong: swallows error, returns (nil, nil)
	}
case !nodeConfig.Node.Aggregator && !nodeConfig.Raft.Enable:
	if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
		return // ❌ Wrong: swallows error, returns (nil, nil)
	}
```

**Fix:** Return the error explicitly:

```go
if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
	return nil, err
}
```

**Impact:** This will cause silent failures during node startup when Raft is disabled. Critical for production.

#### 2. Restrictive Bootstrap Check in pkg/raft/node.go (Lines 112-115) - BLOCKING

**Issue:** The startup path rejects any node whose `Bootstrap` flag is false.

**Location:**

```go
if !n.config.Bootstrap {
	// it is intended to fail fast here. at this stage only bootstrap mode is supported.
	return fmt.Errorf("raft cluster requires bootstrap mode")
}
```

**Problem:** The comment says "at this stage only bootstrap mode is supported," but this prevents horizontal scaling. New nodes joining an existing cluster will fail to start.

**Fix:** Only execute the bootstrap logic when `n.config.Bootstrap` is true:

```go
if n.config.Bootstrap {
	// Bootstrap logic here...
}
// Allow non-bootstrap nodes to continue
return nil
```

**Impact:** Prevents cluster expansion. This is a fundamental limitation for an HA system.

### High Priority Issues

#### 3. Race Condition in DynamicLeaderElection (pkg/raft/election.go:96-108)

**Issue:** A potential race between the leadership check and the state-sync check could lead to double signing.

**Location:** The code attempts to prevent double signing by waiting for raft messages to land and checking sync status. However, there is a TOCTOU (time-of-check-to-time-of-use) vulnerability:

```go
raftSynced := d.node.waitForMsgsLanded(d.node.Config().SendTimeout) == nil
if d.node.leaderID() != d.node.NodeID() { // ❌ Leadership could change here
	d.logger.Info().Msg("lost leadership during sync wait")
	continue
}
if !raftSynced || !runnable.IsSynced(d.node.GetState()) {
	// Transfer leadership...
}
```

**Impact:** Low probability but high severity - this could result in a chain fork if two nodes sign different blocks at the same height.
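One possible mitigation, sketched here under the assumption that the node wraps a hashicorp/raft instance (the `verifyBeforeSigning` helper and the `raftha` package are hypothetical): shrink the TOCTOU window by re-confirming leadership with a quorum round-trip immediately before signing.

```go
package raftha

import (
	"fmt"
	"time"

	"github.com/hashicorp/raft"
)

// verifyBeforeSigning re-checks leadership immediately before the caller
// signs. Barrier waits until all preceding log entries have been applied,
// and VerifyLeader performs a quorum round-trip that fails if another
// node has been elected in the meantime, shrinking the unguarded window
// to a single RPC.
func verifyBeforeSigning(r *raft.Raft, timeout time.Duration) error {
	if err := r.Barrier(timeout).Error(); err != nil {
		return fmt.Errorf("barrier: %w", err)
	}
	if err := r.VerifyLeader().Error(); err != nil {
		return fmt.Errorf("leadership no longer verified: %w", err)
	}
	return nil
}
```

This narrows but does not fully remove the race; complete protection would require the signing decision itself to be serialized through the raft log.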
#### 4. Unbuffered Channel Risk in FSM.Apply (pkg/raft/node.go:293-299)

**Issue:** The FSM drops messages on slow consumers with only a warning.

**Location:**

```go
if f.applyCh != nil {
	select {
	case f.applyCh <- RaftApplyMsg{Index: log.Index, State: &state}:
	default:
		f.logger.Warn().Msg("apply channel full, dropping message")
	}
}
```

**Problem:** While the comment says "Followers can sync from DA or other peers," silently dropping state updates leaves a follower on stale state until it catches up through the slower DA or P2P path.
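One standard alternative, sketched here against the `RaftApplyMsg` type shown in the snippet above (the helper name is hypothetical): coalesce to the newest state on a capacity-1 channel instead of dropping the fresh update, since followers only need the latest replicated state.

```go
// publishLatest delivers msg on a capacity-1 channel, evicting a stale,
// unconsumed message if necessary. A slow consumer thus always observes
// the newest state instead of the oldest, and nothing is lost silently.
func publishLatest(ch chan RaftApplyMsg, msg RaftApplyMsg) {
	for {
		select {
		case ch <- msg:
			return
		default:
			// Channel full: drop the stale entry and retry with the new one.
			select {
			case <-ch:
			default:
			}
		}
	}
}
```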
#### 5. Typo in Log Message (pkg/raft/node.go:122)

**Location:**

```go
n.logger.Info().Msg("Boostrap raft cluster") // ❌ "Boostrap" should be "Bootstrap"
```

Minor, but should be fixed for professionalism.

### Medium Priority Issues

#### 6. Debug Log at Info Level (block/internal/syncing/syncer.go:184)

**Location:**

```go
s.logger.Debug().Uint64("header_height", state.LastSubmittedDaHeaderHeight)...Msg("received raft block state")
```

The code appears correct (using Debug level), but Gemini's review mentioned this has a `+++` prefix, so the line is worth double-checking.

#### 7. Inconsistent Naming Convention (pkg/rpc/server/http.go:143-146)

**Location:**

```go
NodeId string `json:"node_id"` // ❌ Should be NodeID per Go conventions
```

Go convention for initialisms is:

```go
NodeID string `json:"node_id"`
```

### Architecture & Design Observations

#### Strengths

#### Concerns

### Security Considerations

#### 1. Leadership Transfer Security

The leadership transfer mechanism in `DynamicLeaderElection` …
#### 2. Raft Transport Security

**Observation:** The code uses plain TCP for Raft communication:

```go
transport, err := raft.NewTCPTransport(cfg.RaftAddr, addr, 3, 10*time.Second, os.Stderr)
```

**Recommendation:** For production deployments, the inter-node transport should be encrypted and authenticated; one option is sketched below.
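A sketch of one hardening option, assuming hashicorp/raft (the `tlsStreamLayer` type, the `newTLSTransport` constructor, and the `raftha` package are illustrative, and the TLS config is expected to carry mutual-auth certificates): replace the plain `NewTCPTransport` with a `NetworkTransport` built on a TLS stream layer.

```go
package raftha

import (
	"crypto/tls"
	"net"
	"time"

	"github.com/hashicorp/raft"
)

// tlsStreamLayer satisfies raft.StreamLayer (net.Listener plus Dial) so
// inter-node Raft traffic is encrypted and, with client certificates,
// mutually authenticated.
type tlsStreamLayer struct {
	net.Listener
	config *tls.Config
}

// Dial opens an outbound TLS connection to a peer.
func (t *tlsStreamLayer) Dial(address raft.ServerAddress, timeout time.Duration) (net.Conn, error) {
	dialer := &net.Dialer{Timeout: timeout}
	return tls.DialWithDialer(dialer, "tcp", string(address), t.config)
}

// newTLSTransport mirrors the plain-TCP transport above (same pool size
// and timeout), but every stream is TLS-wrapped.
func newTLSTransport(bindAddr string, cfg *tls.Config) (*raft.NetworkTransport, error) {
	ln, err := tls.Listen("tcp", bindAddr, cfg)
	if err != nil {
		return nil, err
	}
	return raft.NewNetworkTransport(&tlsStreamLayer{Listener: ln, config: cfg}, 3, 10*time.Second, nil), nil
}
```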
#### 3. Credential Management

No obvious credential leakage issues found; good practices observed.
### Performance & Resource Efficiency

#### 1. Goroutine Management

Good practices observed.

#### 2. Memory Management

**Concern:** `GetState()` clones the protobuf state on every read:

```go
func (n *Node) GetState() *RaftBlockState {
	return proto.Clone(n.fsm.state.Load()).(*RaftBlockState) // ❌ Clone on every read
}
```

This is safe but could be expensive under high read load; a cheaper pattern is sketched below.
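A cheaper pattern, sketched under the assumption that the FSM swaps in a freshly built snapshot on every apply and never mutates a published one (`fsmSketch` and `LoadShared` are illustrative names):

```go
package raftha

import "sync/atomic"

// fsmSketch models the relevant part of the FSM: an atomically swapped
// snapshot pointer, where Apply always stores a freshly built
// *RaftBlockState (the generated proto type) and never mutates a
// published one.
type fsmSketch struct {
	state atomic.Pointer[RaftBlockState]
}

// LoadShared returns the current snapshot without cloning. Safe only
// under the stated immutability invariant; callers that need to modify
// the state must proto.Clone it themselves.
func (f *fsmSketch) LoadShared() *RaftBlockState {
	return f.state.Load()
}
```

Cloning then becomes an explicit, opt-in cost for the rare mutating caller instead of a tax on every read.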
### Concurrency Patterns

#### Strong Points

#### Areas for Improvement

The error channel is buffered with capacity 1:

```go
errCh := make(chan error, 1)
```

This means that if both the leader and follower workers error simultaneously, one error is lost; see the sketch below for one alternative.
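One way to avoid losing an error, sketched with golang.org/x/sync/errgroup (the `runWorkers` helper and the leader/follower function parameters are hypothetical stand-ins for the existing workers):

```go
package raftha

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// runWorkers runs the leader and follower workers under one errgroup:
// the first error is retained, the shared context is cancelled so the
// sibling worker shuts down, and no error is silently lost to a full
// channel.
func runWorkers(ctx context.Context, leader, follower func(context.Context) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return leader(ctx) })
	g.Go(func() error { return follower(ctx) })
	return g.Wait()
}
```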
### Testing Recommendations

### Configuration Validation

The Raft configuration validation looks good.
**Minor Enhancement:** Consider validating that …; one plausible candidate check is sketched below.
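One plausible candidate, sketched with hypothetical type and field names, is the timeout ordering that hashicorp/raft itself enforces (the election timeout must be at least the heartbeat timeout):

```go
package raftha

import (
	"fmt"
	"time"
)

// RaftTimeouts is a hypothetical stand-in for the node's Raft settings.
type RaftTimeouts struct {
	HeartbeatTimeout time.Duration
	ElectionTimeout  time.Duration
}

// Validate mirrors hashicorp/raft's own constraint that the election
// timeout must not be smaller than the heartbeat timeout.
func (c RaftTimeouts) Validate() error {
	if c.ElectionTimeout < c.HeartbeatTimeout {
		return fmt.Errorf("election timeout %s must be >= heartbeat timeout %s",
			c.ElectionTimeout, c.HeartbeatTimeout)
	}
	return nil
}
```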
### Documentation Gaps

### Pre-Merge Checklist

Before merging, ensure the blocking issues above are resolved.
### Summary

This is a well-structured implementation of Raft-based high availability. The architecture is sound, with good separation of concerns and proper use of Go concurrency patterns. However, there are two critical bugs that must be fixed before merge: the swallowed startup errors in node/full.go and the restrictive bootstrap check in pkg/raft/node.go.

The code demonstrates good engineering practices overall, but would benefit from the follow-ups noted above.

**Recommendation:** Fix the two critical issues; this can then proceed to merge, with a plan to address the medium-priority items in follow-up PRs.
## Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Rollkit network's resilience and fault tolerance by integrating the Raft consensus algorithm. The changes enable a robust failover mechanism, allowing nodes to seamlessly transition between leader and follower roles. This ensures continuous operation and consistent state across the network, even if individual nodes experience failures. The update touches various core components, from dependency management and block processing to RPC services and node lifecycle management, all aimed at achieving high availability.
## Code Review
This pull request introduces high availability via RAFT consensus, which is a significant architectural change. The implementation is extensive, touching many parts of the system from node startup to block processing and syncing. Overall, the changes are well-structured, introducing clear interfaces for RAFT integration and refactoring the node components to support dynamic role switching between leader and follower. I've identified a few critical issues related to error handling and startup logic that need to be addressed, along with some medium-severity issues for improving code clarity and maintainability. Great work on this complex feature.
I am having trouble creating individual review comments, so my feedback is collected below.
**node/full.go (113-115)**

The naked return here will not propagate the error from `newSingleRoleElector`. This will cause the function to return `(nil, nil)`, masking the underlying error. The error should be returned to the caller.

```go
if leaderElection, err = newSingleRoleElector(leaderFactory); err != nil {
	return nil, err
}
```

**node/full.go (117-119)**

Similar to the previous case, the naked return here will swallow the error from `newSingleRoleElector`. The error should be propagated up to the caller.

```go
if leaderElection, err = newSingleRoleElector(followerFactory); err != nil {
	return nil, err
}
```
**pkg/raft/node.go (111-113)**

This check prevents a node from starting if `Bootstrap` is false, which is problematic for nodes joining an existing cluster. A new node attempting to join will fail to start. The bootstrap logic should only execute if `n.config.Bootstrap` is true, and the function should return `nil` otherwise, allowing non-bootstrap nodes to start and join a cluster. A sketch follows below.
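A minimal sketch of the suggested shape, assuming the node embeds a hashicorp/raft instance (the `n.config.NodeID` and `n.config.RaftAddr` fields are illustrative; `RaftAddr` appears in the transport snippet elsewhere in this review):

```go
// Only bootstrap when explicitly configured; otherwise start normally so
// the node can later be joined to an existing cluster by the leader.
if n.config.Bootstrap {
	bootCfg := raft.Configuration{
		Servers: []raft.Server{{
			ID:      raft.ServerID(n.config.NodeID),
			Address: raft.ServerAddress(n.config.RaftAddr),
		}},
	}
	if err := n.raft.BootstrapCluster(bootCfg).Error(); err != nil {
		return fmt.Errorf("bootstrap raft cluster: %w", err)
	}
}
return nil
```

An existing member can then add the new node from the leader with `AddVoter(id, addr, 0, 0)`.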
**block/internal/cache/pending_headers.go (69-71)**

The method name `GetLastSubmittedDataHeight` is misleading, as it's part of the `PendingHeaders` struct. For clarity and consistency, it should be renamed to `GetLastSubmittedHeaderHeight`. This change will also require updating the call site in `block/internal/cache/manager.go`.

```go
func (ph *PendingHeaders) GetLastSubmittedHeaderHeight() uint64 {
	return ph.base.getLastSubmittedHeight()
}
```
**block/internal/executing/executor.go (570-572)**

The explicit type conversion `types.Tx(tx)` is redundant since `types.Tx` is an alias for `[]byte`, and `tx` is already of type `[]byte`. The change to a direct assignment is good, but it seems this loop could be replaced with a single, more efficient `append` call:

```go
data.Txs = append(data.Txs, batchData.Transactions...)
```

**block/internal/syncing/syncer.go (184)**

This log message seems to be for debugging purposes, indicated by the `+++` prefix. It should be logged at the Debug level instead of Info to avoid cluttering the logs in a production environment.

```go
s.logger.Debug().Uint64("header_height", state.LastSubmittedDAHeaderHeight).Uint64("data_height", state.LastSubmittedDADataHeight).Msg("received raft block state")
```
**pkg/rpc/server/http.go (143-146)**

To adhere to Go's naming conventions for initialisms, the struct field `NodeId` should be renamed to `NodeID`:

```go
NodeID string `json:"node_id"`
}{
	IsLeader: raftNode.IsLeader(),
	NodeID:   raftNode.NodeID(),
```
## Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2954      +/-   ##
==========================================
- Coverage   58.74%   56.80%    -1.95%
==========================================
  Files          90       97        +7
  Lines        8722     9492      +770
==========================================
+ Hits         5124     5392      +268
- Misses       3011     3476      +465
- Partials      587      624       +37
```
Replaces #2836