feat: F3 e2e lifecycle #1469
Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
    ///
    /// This method should only be called from consensus code path which
    /// contains the lightclient verifier. No additional validation is
    /// performed here as it's expected to be done by the verifier.
Missing validation in F3 Light Client update_state
High Severity
The update_state function in state.rs has no validation logic—it unconditionally replaces light_client_state and returns Ok(()). However, multiple tests expect it to reject invalid updates with USR_ILLEGAL_ARGUMENT: test_update_state_non_advancing_height expects rejection when height doesn't advance, test_instance_id_skip_rejected expects rejection when instance_id skips values, and test_empty_epochs_rejected also expects an error. These tests will fail because the validation they expect does not exist in the implementation.
Additional Locations (2)
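For concreteness, here is a minimal sketch of the kind of checks these tests appear to expect, assuming hypothetical LightClientState/State shapes inferred from the diff (light_client_state, instance_id, latest_finalized_height) and a stand-in ActorError wrapper; the real actor code will differ:

```rust
use fvm_shared::error::ExitCode;

// Hypothetical shapes inferred from the diff; the real types live in state.rs.
pub struct LightClientState {
    pub instance_id: u64,
    pub latest_finalized_height: Option<i64>,
}

pub struct State {
    pub light_client_state: LightClientState,
}

// Stand-in for the actor error type the tests call `.exit_code()` on.
pub struct ActorError(ExitCode);

impl ActorError {
    pub fn exit_code(&self) -> ExitCode {
        self.0
    }
}

impl State {
    /// Sketch of the checks the tests appear to expect before the state is replaced.
    pub fn update_state(&mut self, new: LightClientState) -> Result<(), ActorError> {
        let cur = &self.light_client_state;

        // The instance id must not regress and must not skip values.
        if new.instance_id < cur.instance_id || new.instance_id > cur.instance_id + 1 {
            return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT));
        }

        // The finalized height must be present and strictly advancing.
        match (cur.latest_finalized_height, new.latest_finalized_height) {
            (_, None) => return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT)),
            (Some(prev), Some(next)) if next <= prev => {
                return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT));
            }
            _ => {}
        }

        self.light_client_state = new;
        Ok(())
    }
}
```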
    assert!(result.is_err());
    let err = result.unwrap_err();
    assert_eq!(err.exit_code(), ExitCode::USR_ILLEGAL_ARGUMENT);
}
Test name doesn't match test behavior
Low Severity
The test test_empty_epochs_rejected claims to test rejection of "empty finalized_epochs" per its comment on line 399, but it creates a state with Some(10) for latest_finalized_height rather than None. This makes it identical to test_update_state_non_advancing_height instead of testing the distinct case of a missing/empty finalized height. If the intent was to test rejection of None, the test should use create_test_state(1, None, ...).
if !proof_config.enabled {
    tracing::info!("F3 proof service disabled in configuration");
Do I understand correctly that this case is for full subnet nodes that want to follow the subnet with F3-based top-down, but don't follow the parent chain because they are not active validators?
let cached = self
    .proof_cache
    .get_epoch_proof_with_certificate(msg.height)
    .ok_or_else(|| {
        anyhow::anyhow!(
            "proof bundle not found in local cache for height {}",
            msg.height
        )
    })?;
We'd better comply with CometBFT's requirements and ensure that ProcessProposal is fully deterministic. To do that, we have to verify the certificate deterministically against the F3 power table stored in the F3 light client actor's current state. If the certificate is valid, we should not reject the proposal.
To avoid voting for a proposal for which we don't yet have the corresponding data locally, I think we could wait (e.g. by polling) for the corresponding entry to appear in the proof cache before accepting the proposal. This should be pretty safe, but we absolutely have to deterministically verify the validity of the proposed certificate first and immediately reject the proposal if the certificate turns out to be invalid; otherwise we might end up waiting for a non-existent certificate made up by a Byzantine proposer.
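A rough sketch of that ordering, with stand-in traits for the verifier and the proof cache (only the cache lookup mirrors get_epoch_proof_with_certificate from the diff; everything else here is hypothetical):

```rust
use std::time::Duration;

use anyhow::Result;

/// Stand-ins for the real verifier and cache; the trait and method names are hypothetical.
trait CertVerifier {
    /// Deterministically verify the certificate against the F3 power table
    /// held in the light client actor's current state.
    fn verify_against_power_table(&self, cert_bytes: &[u8]) -> Result<bool>;
}

trait ProofCache {
    fn has_epoch_proof(&self, height: u64) -> bool;
}

const POLL_ATTEMPTS: u32 = 20;
const POLL_INTERVAL: Duration = Duration::from_millis(250);

/// Returns Ok(true) to accept the proposal and Ok(false) to reject it.
async fn check_proposed_certificate(
    verifier: &impl CertVerifier,
    cache: &impl ProofCache,
    height: u64,
    cert_bytes: &[u8],
) -> Result<bool> {
    // 1. Deterministic check first: reject immediately if the certificate is
    //    invalid, so a Byzantine proposer cannot make us wait for data that
    //    will never appear.
    if !verifier.verify_against_power_table(cert_bytes)? {
        return Ok(false);
    }

    // 2. The certificate is valid; give the local proof cache a bounded amount
    //    of time to catch up instead of rejecting outright.
    for _ in 0..POLL_ATTEMPTS {
        if cache.has_epoch_proof(height) {
            return Ok(true);
        }
        tokio::time::sleep(POLL_INTERVAL).await;
    }

    // Valid but not yet available locally: whether to reject here or keep
    // waiting is exactly the trade-off discussed in this thread.
    Ok(false)
}
```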
I've looked into this matter more deeply. From the consensus protocol (Tendermint/CometBFT) perspective, this intermittent proposal rejection should be acceptable. I think the PrepareProposal-ProcessProposal coherence requirement (see Requirement 3) is excessively strong and could be relaxed to eventual coherence while still preserving liveness. Alternatively, we could consider validators lagging behind the proposer w.r.t. the parent chain as temporarily (benignly) faulty (since they are not fully operational, even though otherwise correct) and allow them to intermittently reject such proposals for that reason.
There might be another, more subtle, concern related to accountability. In principle, this could affect misbehavior detection mechanisms, e.g. intermittently rejecting validators might accuse accepting validators of apparent misbehavior, or the other way around. Currently, CometBFT detects only two types of misbehavior: duplicate votes and light client attacks, so this concern does not apply. Though this may change in the future, however unlikely.
As of the current CometBFT source code, rejecting a proposal has the following immediate consequences:
- updating the ProposalReceiveCount metric with 'rejected' status;
- logging a diagnostic warning that the proposer may be misbehaving;
- casting a nil prevote in the round, which is indistinguishable from a proposal timeout.
The first two could be slightly misleading if we decide to keep the intermittent proposal rejection.
In any case, we have to ensure that, at each height, all correct validators eventually accept any correct proposer's proposal. This should actually hold in the current implementation even with the intermittent proposal rejection, because correct proposers propose either no parent chain extension or exactly the same one (the next finalized single tipset). However, if we decide to keep the intermittent proposal rejection, we should add comments to the block proposal generation code explaining the additional assumptions the code has to ensure.
So I would still recommend following the approach I suggested in the comment above. I would even hesitate to wait for the entry to appear in the proof cache here, in the ProcessProposal path, because it could interfere with consensus timeouts in subtle ways, and I'm not sure how well CometBFT is tested in such scenarios. On the other hand, we have to do the waiting in BeginBlock/DeliverTx/EndBlock (superseded by FinalizeBlock in v0.38); at that step the block is already decided and there's no consensus timer. BTW, I checked the source code and couldn't find any timeout for the ABCI RPC, so handling those methods can take as long as necessary.
There is a hypothetical scenario where a subnet suddenly gets partitioned from the parent chain in such a way that only a few subnet validators can observe the latest finalized tipsets. In that case, if we don't do the intermittent proposal rejection, the whole subnet could halt until connectivity is restored, trying to execute a proposal for which too few validators have the data in their local cache. Your current approach would tolerate this scenario better, in some sense. However, even then Byzantine validators could theoretically compromise liveness (half of the votes in a super-majority quorum can be Byzantine). I think a better solution would be to always use vote-extension-based certs in proposals (which requires a super-majority quorum and also works for fallback mode), while taking advantage of F3 only locally.
// Epoch must advance by exactly 1 relative to the latest finalized epoch in state.
//
// At genesis this is `None` (no finality yet). In that case we skip the check here; the
// cache lookup (and later execution) will still enforce that we only process epochs we
// have proofs for.
if let Some(prev_finalized) = f3_state.latest_finalized_height {
I'm afraid this would allow the proposer of the first parent chain update since genesis to skip arbitrarily many epochs at the beginning of the certified chain extension. I think we have to set latest_finalized_height in genesis.
Is this what genesis_epoch in the legacy top-down is for?
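If so, a rough sketch of the shape of the fix under that assumption; the state shape and constructor are hypothetical, and the genesis epoch would come from wherever the legacy genesis_epoch is configured:

```rust
// Hypothetical shape, reusing the field names seen in the diff.
pub struct LightClientState {
    pub instance_id: u64,
    pub latest_finalized_height: Option<i64>,
}

/// Seeding latest_finalized_height from the subnet's genesis epoch closes the
/// "None at genesis" gap, so the first parent chain update cannot skip
/// arbitrarily many epochs.
pub fn genesis_light_client_state(genesis_epoch: i64, instance_id: u64) -> LightClientState {
    LightClientState {
        instance_id,
        latest_finalized_height: Some(genesis_epoch),
    }
}
```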
tracing::debug!(instance = instance_id, "updated F3LightClientActor state");

// Mark epoch as committed in cache.
if let Err(e) = self.mark_committed(epoch, instance_id) {
If epoch is not the last one finalized by this certificate then we probably should not mark instance_id yet.
// Convert BigInt -> u64 (saturating if too large).
// Power should be non-negative; we ignore the sign here and keep the magnitude.
let (_sign, digits) = pe.power.to_u64_digits();
let power = if digits.is_empty() {
    0
} else if digits.len() == 1 {
    digits[0]
} else {
    u64::MAX
};
64 bits may not be enough. I haven't checked this thoroughly myself, but here's what Claude thinks:
Storage power values on Filecoin can be extremely large (representing Quality-Adjusted Power across the entire network), so they need arbitrary-precision integers.
Evidence supporting the statement:
- Historical network size exceeded uint64:
  - The Filecoin network peaked at 17 EiB in Q3 2022
  - 17 EiB = 19,599,665,578,316,398,592 bytes
  - uint64 max = 18,446,744,073,709,551,615 bytes
  - The peak exceeded uint64 by ~1 EiB!
- Quality-Adjusted Power multipliers:
  - Verified deals in Filecoin receive a 10x power multiplier
  - Even current capacity (3 EiB) → 30 EiB with 10x QAP
  - 30 EiB vastly exceeds the uint64 range
- Current capacity is still large:
  - Current network: ~3.0 EiB (Q3 2025)
  - While this fits in uint64, it's close enough that:
    - Calculations involving sums and products could overflow
    - Future growth requires headroom
    - Safety margins are essential
So, at the very least, we should raise an error on overflow rather than silently saturating to u64.
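For example, the stricter conversion could go through num_traits::ToPrimitive, which returns None for negative or oversized values (a sketch, assuming pe.power is a num_bigint::BigInt as the to_u64_digits() call suggests):

```rust
use anyhow::{anyhow, Result};
use num_bigint::BigInt;
use num_traits::ToPrimitive;

/// Convert a power table entry's power to u64, failing loudly instead of
/// silently saturating when the value is negative or exceeds u64::MAX.
fn power_to_u64(power: &BigInt) -> Result<u64> {
    power
        .to_u64()
        .ok_or_else(|| anyhow!("power {} does not fit into u64 (negative or overflow)", power))
}
```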
In principle, we could store just the power table's CID in the actor, though we'd need to fetch the whole table from the endpoint at start and keep it updated somewhere off-chain. (Assuming state sync may only happen on bootstrap, CometBFT wouldn't need to know about this.)
// Execute F3-specific logic (certificate validation, proof extraction, state updates)
let (msgs, validator_changes, instance_id) =
    f3.extract_messages_and_validator_changes(state, &msg)?;
Here, or inside extract_messages_and_validator_changes, we have to keep trying until the corresponding entry appears in the proof cache. We cannot return an error if it's not there yet, because that would create non-deterministic behavior in the consensus execution path.
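A sketch of what that retry could look like, wrapping the get_epoch_proof_with_certificate lookup from the diff behind a stand-in trait (the backoff values are arbitrary); since the block is already decided by the time this runs, there is no consensus timer to interfere with:

```rust
use std::time::Duration;

/// Stand-in for the real cache; only the lookup name is taken from the diff.
trait ProofCache {
    type Entry;
    fn get_epoch_proof_with_certificate(&self, height: u64) -> Option<Self::Entry>;
}

/// Block (with backoff) until the proof bundle for `height` appears locally.
/// In the execution path we must not return an error for a missing entry,
/// since that would make consensus execution non-deterministic across validators.
async fn wait_for_proof<C: ProofCache>(cache: &C, height: u64) -> C::Entry {
    let mut delay = Duration::from_millis(100);
    loop {
        if let Some(entry) = cache.get_epoch_proof_with_certificate(height) {
            return entry;
        }
        tokio::time::sleep(delay).await;
        delay = (delay * 2).min(Duration::from_secs(5));
    }
}
```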
    f3.extract_messages_and_validator_changes(state, &msg)?;

// Commit parent finality to gateway
let finality = IPCParentFinality::new(msg.height as i64, vec![]);
Probably need to get the tipset key (the 2nd parameter) from the F3 cert.
/// Generalized top-down finality structure
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct GeneralisedTopDown {
    /// The chain epoch this finality is for (height)
    pub height: ChainEpoch,
    /// The certificate that certifies finality (type-specific, proof is fetched from local cache)
    pub certificate: Certificate,
}
We could omit the height field and commit all the epochs finalized by the cert at once. Though, this won't be so easy when we include the data with integrity proofs, due to potential issues with the block size limits.
// The last tipset in the certificate has no child tipset inside this certificate, so it
// cannot be proven yet. We only treat the epochs we generated proofs for as "finalized
// tipsets" for verification purposes.
let finalized_tipsets = {
    let parents: Vec<FinalizedTipset> =
        tipset_pairs.iter().map(|(p, _)| p.clone()).collect();
    FinalizedTipsets::from(parents.as_slice())
};
IIUC, verify_proof_bundle_with_tipsets wants to verify the child tipsets as well, so we should use the whole cert.ec_chain as finalized_tipsets.
self.verifier
    .verify_proof_bundle_with_tipsets(&proof_bundle, &finalized_tipsets)
    .with_context(|| format!("Failed to verify proof for epoch {}", parent_epoch))?;
Apparently, there's no verification of the continuity of top-down event nonces yet.
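For reference, a minimal sketch of such a check; the message type and the source of the expected next nonce (e.g. gateway state or the previously committed finality) are hypothetical:

```rust
use anyhow::{bail, Result};

/// Stand-in for an extracted top-down message carrying its cross-net nonce.
struct TopDownMsg {
    nonce: u64,
}

/// Verify that the extracted top-down messages continue exactly where the
/// subnet left off: nonces must start at `expected_next_nonce` and be
/// contiguous, with no gaps or duplicates.
fn check_nonce_continuity(msgs: &[TopDownMsg], expected_next_nonce: u64) -> Result<()> {
    let mut expected = expected_next_nonce;
    for msg in msgs {
        if msg.nonce != expected {
            bail!(
                "non-contiguous top-down nonce: got {}, expected {}",
                msg.nonce,
                expected
            );
        }
        expected += 1;
    }
    Ok(())
}
```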
// Epoch must advance by exactly 1 relative to the latest finalized epoch in state.
//
// At genesis this is `None` (no finality yet). In that case we skip the check here; the
// cache lookup (and later execution) will still enforce that we only process epochs we
// have proofs for.
if let Some(prev_finalized) = f3_state.latest_finalized_height {
    if msg.height != prev_finalized + 1 {
        bail!(
            "epoch is not sequential: message height {} != expected {}",
            msg.height,
            prev_finalized + 1
        );
    }
}
Actually, there may be empty epochs with no tipsets, so we can't require epoch numbers to be strictly sequential.
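A sketch of a relaxed version: only require the height to strictly advance, and leave checking that the skipped epochs were genuinely empty to the certificate/proof verification (names are illustrative):

```rust
use anyhow::{bail, Result};

/// Heights may skip over empty parent epochs (no tipsets), so we only require
/// strict advancement here; whether the gap is legitimate is established by
/// certificate and proof verification.
fn check_height_advances(prev_finalized: Option<i64>, msg_height: i64) -> Result<()> {
    if let Some(prev) = prev_finalized {
        if msg_height <= prev {
            bail!(
                "epoch does not advance: message height {} <= latest finalized {}",
                msg_height,
                prev
            );
        }
    }
    Ok(())
}
```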
/// Generalized top-down finality with extensible certificate types
GeneralisedTopDown(GeneralisedTopDown),
When we implement subnet-specific (vote-extension-based) certs in the future, we might want to be able to justify a parent chain extension with a subnet-specific cert while also including available F3 certs to keep the F3 actor state up to date. (We might even end up not using F3 for justifying parent chain extensions at all, but only for updating the state.) Also, if we plan to support self-contained proposals (i.e. include top-down event data with integrity proofs), we would probably prefer delivering those in separate messages of a dedicated type, because one finality cert could justify multiple subsequent integrity proof bundles. So I was thinking of message types like ParentFinalityCert and TopDownBundle, maybe even a separate ParentChainExtension specifying one of the previously certified references chosen by the proposer (see the proposal) and optionally conveying a chain of block headers ending in that reference, which proof bundles could refer to. WDYT?
Closes #1441 and #1442