feat: F3 e2e lifecycle #1469
Conversation
Cursor Bugbot has reviewed your changes and found 2 potential issues.
    ///
    /// This method should only be called from consensus code path which
    /// contains the lightclient verifier. No additional validation is
    /// performed here as it's expected to be done by the verifier.
Missing validation in F3 Light Client update_state
High Severity
The update_state function in state.rs has no validation logic—it unconditionally replaces light_client_state and returns Ok(()). However, multiple tests expect it to reject invalid updates with USR_ILLEGAL_ARGUMENT: test_update_state_non_advancing_height expects rejection when height doesn't advance, test_instance_id_skip_rejected expects rejection when instance_id skips values, and test_empty_epochs_rejected also expects an error. These tests will fail because the validation they expect does not exist in the implementation.
Additional Locations (2)
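For concreteness, here is a minimal sketch of the kind of checks these tests appear to expect, assuming hypothetical LightClientState/State shapes inferred from the diff (light_client_state, instance_id, latest_finalized_height) and a stand-in ActorError wrapper; the real actor code will differ:

```rust
use fvm_shared::error::ExitCode;

// Hypothetical shapes inferred from the diff; the real types live in state.rs.
pub struct LightClientState {
    pub instance_id: u64,
    pub latest_finalized_height: Option<i64>,
}

pub struct State {
    pub light_client_state: LightClientState,
}

// Stand-in for the actor error type the tests call `.exit_code()` on.
pub struct ActorError(ExitCode);

impl ActorError {
    pub fn exit_code(&self) -> ExitCode {
        self.0
    }
}

impl State {
    /// Sketch of the checks the tests appear to expect before the state is replaced.
    pub fn update_state(&mut self, new: LightClientState) -> Result<(), ActorError> {
        let cur = &self.light_client_state;

        // The instance id must not regress and must not skip values.
        if new.instance_id < cur.instance_id || new.instance_id > cur.instance_id + 1 {
            return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT));
        }

        // The finalized height must be present and strictly advancing.
        match (cur.latest_finalized_height, new.latest_finalized_height) {
            (_, None) => return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT)),
            (Some(prev), Some(next)) if next <= prev => {
                return Err(ActorError(ExitCode::USR_ILLEGAL_ARGUMENT));
            }
            _ => {}
        }

        self.light_client_state = new;
        Ok(())
    }
}
```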
    assert!(result.is_err());
    let err = result.unwrap_err();
    assert_eq!(err.exit_code(), ExitCode::USR_ILLEGAL_ARGUMENT);
}
Test name doesn't match test behavior
Low Severity
The test test_empty_epochs_rejected claims to test rejection of "empty finalized_epochs" per its comment on line 399, but it creates a state with Some(10) for latest_finalized_height rather than None. This makes it identical to test_update_state_non_advancing_height instead of testing the distinct case of a missing/empty finalized height. If the intent was to test rejection of None, the test should use create_test_state(1, None, ...).
if !proof_config.enabled {
    tracing::info!("F3 proof service disabled in configuration");
Do I understand correctly that this case is for full subnet nodes that want to follow the subnet with F3-based top-down, but don't follow the parent chain because they are not active validators?
let cached = self
    .proof_cache
    .get_epoch_proof_with_certificate(msg.height)
    .ok_or_else(|| {
        anyhow::anyhow!(
            "proof bundle not found in local cache for height {}",
            msg.height
        )
    })?;
We'd better comply with CometBFT's requirements and ensure that ProcessProposal is fully deterministic. To do that, we have to verify the certificate deterministically against the F3 power table stored in the F3 light client actor's current state. If the certificate is valid, we should not reject the proposal.
To avoid voting for a proposal for which we don't yet have the corresponding data locally, I think we could wait (e.g. by polling) for the corresponding entry to appear in the proof cache before accepting the proposal. This should be pretty safe, but we absolutely have to deterministically verify the validity of the proposed certificate first and immediately reject the proposal if the certificate turns out to be invalid; otherwise we might end up waiting for a non-existent certificate made up by a Byzantine proposer.
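A rough sketch of that ordering, with stand-in traits for the verifier and the proof cache (only the cache lookup mirrors get_epoch_proof_with_certificate from the diff; everything else here is hypothetical):

```rust
use std::time::Duration;

use anyhow::Result;

/// Stand-ins for the real verifier and cache; the trait and method names are hypothetical.
trait CertVerifier {
    /// Deterministically verify the certificate against the F3 power table
    /// held in the light client actor's current state.
    fn verify_against_power_table(&self, cert_bytes: &[u8]) -> Result<bool>;
}

trait ProofCache {
    fn has_epoch_proof(&self, height: u64) -> bool;
}

const POLL_ATTEMPTS: u32 = 20;
const POLL_INTERVAL: Duration = Duration::from_millis(250);

/// Returns Ok(true) to accept the proposal and Ok(false) to reject it.
async fn check_proposed_certificate(
    verifier: &impl CertVerifier,
    cache: &impl ProofCache,
    height: u64,
    cert_bytes: &[u8],
) -> Result<bool> {
    // 1. Deterministic check first: reject immediately if the certificate is
    //    invalid, so a Byzantine proposer cannot make us wait for data that
    //    will never appear.
    if !verifier.verify_against_power_table(cert_bytes)? {
        return Ok(false);
    }

    // 2. The certificate is valid; give the local proof cache a bounded amount
    //    of time to catch up instead of rejecting outright.
    for _ in 0..POLL_ATTEMPTS {
        if cache.has_epoch_proof(height) {
            return Ok(true);
        }
        tokio::time::sleep(POLL_INTERVAL).await;
    }

    // Valid but not yet available locally: whether to reject here or keep
    // waiting is exactly the trade-off discussed in this thread.
    Ok(false)
}
```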
I've looked into this matter more deeply. From the consensus protocol (Tendermint/CometBFT) perspective, this intermittent proposal rejection should be acceptable. I think the PrepareProposal-ProcessProposal coherence requirement (see Requirement 3) is excessively strong and could be relaxed to eventual coherence while still preserving liveness. Alternatively, we could consider validators lagging behind the proposer w.r.t. the parent chain as temporarily (benignly) faulty (since they are not fully operational, even though otherwise correct) and allow them to intermittently reject such proposals for that reason.
There might be another, more subtle, concern related to accountability. In principle, this could affect misbehavior detection mechanisms, e.g. intermittently rejecting validators might accuse accepting validators of apparent misbehavior, or the other way around. Currently, CometBFT detects only two types of misbehavior: duplicate votes and light client attacks, so this concern does not apply. Though this may change in the future, however unlikely.
As of the current CometBFT source code, rejecting a proposal has the following immediate consequences:
- updating the ProposalReceiveCount metric with 'rejected' status;
- logging a diagnostic warning that the proposer may be misbehaving;
- casting a nil prevote in the round, which is indistinguishable from a proposal timeout.
The first two could be slightly misleading if we decide to keep the intermittent proposal rejection.
In any case, we have to ensure that, at each height, all correct validators eventually accept any correct proposer's proposal. This should actually hold in the current implementation even with the intermittent proposal rejection, because correct proposers propose either no parent chain extension or exactly the same one (the next finalized single tipset). However, if we decide to keep the intermittent proposal rejection, we should add comments to the block proposal generation code explaining the additional assumptions the code has to ensure.
So I would still recommend following the approach I suggested in the comment above. I would even hesitate to wait for the entry to appear in the proof cache here, in the ProcessProposal path, because it could interfere with consensus timeouts in subtle ways, and I'm not sure how well CometBFT is tested in such scenarios. On the other hand, we have to do the waiting in BeginBlock/DeliverTx/EndBlock (superseded by FinalizeBlock in v0.38); at that step the block is already decided and there's no consensus timer. BTW, I checked the source code and couldn't find any timeout for the ABCI RPC, so handling those methods can take as long as necessary.
There is a hypothetical scenario where a subnet suddenly gets partitioned from the parent chain in such a way that only a few subnet validators can observe the latest finalized tipsets. In that case, if we don't do the intermittent proposal rejection, the whole subnet could halt until connectivity is restored, trying to execute a proposal for which too few validators have the data in their local cache. Your current approach would tolerate this scenario better, in some sense. However, even then Byzantine validators could theoretically compromise liveness (half of the votes in a super-majority quorum can be Byzantine). I think a better solution would be to always use vote-extension-based certs in proposals (which requires a super-majority quorum and also works for fallback mode), while taking advantage of F3 only locally.
// Epoch must advance by exactly 1 relative to the latest finalized epoch in state.
//
// At genesis this is `None` (no finality yet). In that case we skip the check here; the
// cache lookup (and later execution) will still enforce that we only process epochs we
// have proofs for.
if let Some(prev_finalized) = f3_state.latest_finalized_height {
I'm afraid this would allow the proposer of the first parent chain update since genesis to skip arbitrarily many epochs at the beginning of the certified chain extension. I think we have to set latest_finalized_height in genesis.
Is this what genesis_epoch in the legacy top-down is for?
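If so, a rough sketch of the shape of the fix under that assumption; the state shape and constructor are hypothetical, and the genesis epoch would come from wherever the legacy genesis_epoch is configured:

```rust
// Hypothetical shape, reusing the field names seen in the diff.
pub struct LightClientState {
    pub instance_id: u64,
    pub latest_finalized_height: Option<i64>,
}

/// Seeding latest_finalized_height from the subnet's genesis epoch closes the
/// "None at genesis" gap, so the first parent chain update cannot skip
/// arbitrarily many epochs.
pub fn genesis_light_client_state(genesis_epoch: i64, instance_id: u64) -> LightClientState {
    LightClientState {
        instance_id,
        latest_finalized_height: Some(genesis_epoch),
    }
}
```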
tracing::debug!(instance = instance_id, "updated F3LightClientActor state");

// Mark epoch as committed in cache.
if let Err(e) = self.mark_committed(epoch, instance_id) {
If epoch is not the last one finalized by this certificate then we probably should not mark instance_id yet.
// Convert BigInt -> u64 (saturating if too large).
// Power should be non-negative; we ignore the sign here and keep the magnitude.
let (_sign, digits) = pe.power.to_u64_digits();
let power = if digits.is_empty() {
    0
} else if digits.len() == 1 {
    digits[0]
} else {
    u64::MAX
};
64 bits may not be enough. I haven't checked this thoroughly myself, but here's what Claude thinks:
Storage power values on Filecoin can be extremely large (representing Quality-Adjusted Power across the entire network), so they need arbitrary-precision integers.
Evidence supporting the statement:
- Historical network size exceeded uint64:
  - The Filecoin network peaked at 17 EiB in Q3 2022
  - 17 EiB = 19,599,665,578,316,398,592 bytes
  - uint64 max = 18,446,744,073,709,551,615 bytes
  - The peak exceeded uint64 by ~1 EiB!
- Quality-Adjusted Power multipliers:
  - Verified deals in Filecoin receive a 10x power multiplier
  - Even current capacity (3 EiB) → 30 EiB with 10x QAP
  - 30 EiB vastly exceeds the uint64 range
- Current capacity is still large:
  - Current network: ~3.0 EiB (Q3 2025)
  - While this fits in uint64, it's close enough that:
    - Calculations involving sums and products could overflow
    - Future growth requires headroom
    - Safety margins are essential
So, at the very least, we should raise an error on overflow rather than silently saturating to u64.
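For example, the stricter conversion could go through num_traits::ToPrimitive, which returns None for negative or oversized values (a sketch, assuming pe.power is a num_bigint::BigInt as the to_u64_digits() call suggests):

```rust
use anyhow::{anyhow, Result};
use num_bigint::BigInt;
use num_traits::ToPrimitive;

/// Convert a power table entry's power to u64, failing loudly instead of
/// silently saturating when the value is negative or exceeds u64::MAX.
fn power_to_u64(power: &BigInt) -> Result<u64> {
    power
        .to_u64()
        .ok_or_else(|| anyhow!("power {} does not fit into u64 (negative or overflow)", power))
}
```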
In principle, we could store just the power table's CID in the actor, though we'd need to fetch the whole table from the endpoint at start and keep it updated somewhere off-chain. (Assuming state sync may only happen on bootstrap, CometBFT wouldn't need to know about this.)
// Execute F3-specific logic (certificate validation, proof extraction, state updates)
let (msgs, validator_changes, instance_id) =
    f3.extract_messages_and_validator_changes(state, &msg)?;
Here, or inside extract_messages_and_validator_changes, we have to keep trying until the corresponding entry appears in the proof cache. We cannot return an error if it's not there yet, because that would create non-deterministic behavior in the consensus execution path.
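A sketch of what that retry could look like, wrapping the get_epoch_proof_with_certificate lookup from the diff behind a stand-in trait (the backoff values are arbitrary); since the block is already decided by the time this runs, there is no consensus timer to interfere with:

```rust
use std::time::Duration;

/// Stand-in for the real cache; only the lookup name is taken from the diff.
trait ProofCache {
    type Entry;
    fn get_epoch_proof_with_certificate(&self, height: u64) -> Option<Self::Entry>;
}

/// Block (with backoff) until the proof bundle for `height` appears locally.
/// In the execution path we must not return an error for a missing entry,
/// since that would make consensus execution non-deterministic across validators.
async fn wait_for_proof<C: ProofCache>(cache: &C, height: u64) -> C::Entry {
    let mut delay = Duration::from_millis(100);
    loop {
        if let Some(entry) = cache.get_epoch_proof_with_certificate(height) {
            return entry;
        }
        tokio::time::sleep(delay).await;
        delay = (delay * 2).min(Duration::from_secs(5));
    }
}
```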
    f3.extract_messages_and_validator_changes(state, &msg)?;

// Commit parent finality to gateway
let finality = IPCParentFinality::new(msg.height as i64, vec![]);
Probably need to get the tipset key (the 2nd parameter) from the F3 cert.
/// Generalized top-down finality structure
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct GeneralisedTopDown {
    /// The chain epoch this finality is for (height)
    pub height: ChainEpoch,
    /// The certificate that certifies finality (type-specific, proof is fetched from local cache)
    pub certificate: Certificate,
}
We could omit the height field and commit all the epochs finalized by the cert at once. Though, this won't be so easy when we include the data with integrity proofs, due to potential issues with the block size limits.
// The last tipset in the certificate has no child tipset inside this certificate, so it
// cannot be proven yet. We only treat the epochs we generated proofs for as "finalized
// tipsets" for verification purposes.
let finalized_tipsets = {
    let parents: Vec<FinalizedTipset> =
        tipset_pairs.iter().map(|(p, _)| p.clone()).collect();
    FinalizedTipsets::from(parents.as_slice())
};
IIUC, verify_proof_bundle_with_tipsets wants to verify the child tipsets as well, so we should use the whole cert.ec_chain as finalized_tipsets.
self.verifier
    .verify_proof_bundle_with_tipsets(&proof_bundle, &finalized_tipsets)
    .with_context(|| format!("Failed to verify proof for epoch {}", parent_epoch))?;
Apparently, there's no verification of the continuity of top-down event nonces yet.
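For reference, a minimal sketch of such a check; the message type and the source of the expected next nonce (e.g. gateway state or the previously committed finality) are hypothetical:

```rust
use anyhow::{bail, Result};

/// Stand-in for an extracted top-down message carrying its cross-net nonce.
struct TopDownMsg {
    nonce: u64,
}

/// Verify that the extracted top-down messages continue exactly where the
/// subnet left off: nonces must start at `expected_next_nonce` and be
/// contiguous, with no gaps or duplicates.
fn check_nonce_continuity(msgs: &[TopDownMsg], expected_next_nonce: u64) -> Result<()> {
    let mut expected = expected_next_nonce;
    for msg in msgs {
        if msg.nonce != expected {
            bail!(
                "non-contiguous top-down nonce: got {}, expected {}",
                msg.nonce,
                expected
            );
        }
        expected += 1;
    }
    Ok(())
}
```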
// Epoch must advance by exactly 1 relative to the latest finalized epoch in state.
//
// At genesis this is `None` (no finality yet). In that case we skip the check here; the
// cache lookup (and later execution) will still enforce that we only process epochs we
// have proofs for.
if let Some(prev_finalized) = f3_state.latest_finalized_height {
    if msg.height != prev_finalized + 1 {
        bail!(
            "epoch is not sequential: message height {} != expected {}",
            msg.height,
            prev_finalized + 1
        );
    }
}
Actually, there may be empty epochs with no tipsets, so we can't require epoch numbers to be strictly sequential.
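A sketch of a relaxed version: only require the height to strictly advance, and leave checking that the skipped epochs were genuinely empty to the certificate/proof verification (names are illustrative):

```rust
use anyhow::{bail, Result};

/// Heights may skip over empty parent epochs (no tipsets), so we only require
/// strict advancement here; whether the gap is legitimate is established by
/// certificate and proof verification.
fn check_height_advances(prev_finalized: Option<i64>, msg_height: i64) -> Result<()> {
    if let Some(prev) = prev_finalized {
        if msg_height <= prev {
            bail!(
                "epoch does not advance: message height {} <= latest finalized {}",
                msg_height,
                prev
            );
        }
    }
    Ok(())
}
```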
/// Generalized top-down finality with extensible certificate types
GeneralisedTopDown(GeneralisedTopDown),
When we implement subnet-specific (vote-extension-based) certs in the future, we might want to be able to justify a parent chain extension with a subnet-specific cert while also including available F3 certs to keep the F3 actor state up to date. (We might even end up not using F3 for justifying parent chain extensions at all, but only for updating the state.) Also, if we plan to support self-contained proposals (i.e. include top-down event data with integrity proofs), we would probably prefer delivering those in separate messages of a dedicated type, because one finality cert could justify multiple subsequent integrity proof bundles. So I was thinking of message types like ParentFinalityCert and TopDownBundle, maybe even a separate ParentChainExtension specifying one of the previously certified references chosen by the proposer (see the proposal) and optionally conveying a chain of block headers ending in that reference, which proof bundles could refer to. WDYT?
Closes #1441 and #1442