Skip to content

Comments

Replace dual-sync-async persistence panic with Watch contract#4436

Open
joostjager wants to merge 3 commits intolightningdevkit:mainfrom
joostjager:revert-mixed-mode-check
Open

Replace dual-sync-async persistence panic with Watch contract#4436
joostjager wants to merge 3 commits intolightningdevkit:mainfrom
joostjager:revert-mixed-mode-check

Conversation

@joostjager
Copy link
Contributor

@joostjager joostjager commented Feb 24, 2026

Fix false panic by relaxing the Watch contract from a global "pick one mode" restriction to a per-channel rule: don't return Completed while prior InProgress updates are still pending for that channel.

@ldk-reviews-bot
Copy link

ldk-reviews-bot commented Feb 24, 2026

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@joostjager
Copy link
Contributor Author

joostjager commented Feb 24, 2026

I didn't add a dedicated test for the InProgress -> complete -> Completed progression since it's already well-covered by existing tests that exercise this flow without disabling the contract check: test_monitor_update_fail_cs, test_monitor_update_fail_no_rebroadcast, test_monitor_update_fail_reestablish, monitor_failed_no_reestablish_response, test_monitor_update_fail_claim, test_monitor_update_on_pending_forwards, monitor_update_claim_fail_no_response, do_during_funding_monitor_fail, test_path_paused_mpp, test_temporary_error_during_shutdown, double_temp_error, test_manager_persisted_post_outbound_edge_holding_cell.

The ChainMonitor override path (where update_monitor fails and the return is overridden to InProgress) is also exercised without disabling the check by: test_monitor_and_persister_update_fail, test_concurrent_monitor_claim, test_update_err_monitor_lockdown.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from c20c819 to 77a6a30 Compare February 24, 2026 11:07
@joostjager joostjager marked this pull request as ready for review February 24, 2026 12:07
@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 85.97%. Comparing base (24062c0) to head (c37be93).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/channelmanager.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4436      +/-   ##
==========================================
+ Coverage   85.94%   85.97%   +0.03%     
==========================================
  Files         159      159              
  Lines      104607   104618      +11     
  Branches   104607   104618      +11     
==========================================
+ Hits        89901    89943      +42     
+ Misses      12194    12168      -26     
+ Partials     2512     2507       -5     
Flag Coverage Δ
tests 85.97% <90.90%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's more context in the PR description than the commit message, please update the commit message with the context (PR descriptions should almost always just be the commit messages themselves!). Also, IIRC the chanmon_consistency target is conservative to avoid violating the old API, we should update it to allow returning Completed more often to match the API changes here.

@ldk-reviews-bot
Copy link

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 77a6a30 to 218e89a Compare February 24, 2026 15:11
@joostjager
Copy link
Contributor Author

Updated fuzzer

@@ -2101,21 +2101,57 @@ pub fn do_test<Out: Output + MaybeSend + MaybeSync>(
// harm in doing so.
0x00 => {
*mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably drop mon_style again, no? Given we added it in the being-basically-reverted commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently only used to remember what style to use after a node reload. But I could read from update_ret too and set that again on the new persister.

We could also reduce the number of command bytes by defining a command to toggle the mode.

Commit 0760f99 ("Disallow dual-sync-async persistence without
restarting") added a panic in non-test builds when a Persist
implementation returns both Completed and InProgress from the same
ChannelManager instance. However, this check runs against the status
that ChainMonitor returns to ChannelManager, not the raw Persist
result. When ChannelMonitor::update_monitor fails (e.g. a
counterparty commitment_signed arrives after a funding spend
confirms), ChainMonitor persists the full monitor successfully but
overrides the return value to InProgress. If the user's Persist impl
only ever returns Completed, this override triggers a false
mode-mismatch panic.

This replaces the panic with a cleaner contract at the Watch trait
level: a Watch implementation must not return Completed for a channel
update if there are still pending InProgress updates for that
channel. This matches the pure-async interface where an update cannot
complete until all prior updates have also completed. Persist
implementors are expected to be consistent (always sync or always
async), so ChainMonitor naturally satisfies this contract without
code changes.

The mode tracking and panic checks from 0760f99 are removed and
replaced with a panic that validates the new contract directly on
the in-flight update state. Legacy tests that switch the persister
between modes mid-flight can opt out via
Node::disable_monitor_completeness_assertion().

The chanmon_consistency fuzz target is also updated to propagate
monitor style changes to the live persister immediately rather than
only on reload. Switching to Completed is guarded to only take
effect when no channels have pending InProgress updates, satisfying
the new Watch contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 218e89a to 7e2b509 Compare February 24, 2026 19:24
@joostjager
Copy link
Contributor Author

joostjager commented Feb 24, 2026

Pushed two commits that remove mon_style and also save command bytes by toggling the style.

The only functional change is that nodes now always start in sync mode, but the fuzzer can immediately toggle it to async. On reload, the style is carried over.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DId you run the fuzzer for a while? I'm not sure that we shouldn't be able to hit a failure like the below.

/// [`Persist`]/[`Watch`] without first restarting.
/// A [`Watch`] implementation must not return this for a channel update if there are still
/// pending [`InProgress`] updates for that channel. That is, an update can only be considered
/// complete once all prior updates have also completed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thinking about this a bit more there's a weird race. Async updates are "completed" only after release_pending_monitor_events returns the completed update, but also only after it has been processed, which may be some arbitrary and indeterminate amount of time later.

For the ChainMonitor return case, this is fine (though we need to update ChainMonitor - we should probably be returning InProgress for any new update that comes after one for which we have to return a spurious InProgress. Do you already intend to do that in a followup? should we not do it here?). But for the general API its pretty weird - you can in theory return a Completed after returning InProgress as long as you want some indeterminate and arbitrary amount of time, so in practice you can't...

We could fix this by adding some mutual exclusion in channelmanager.rs where we wait before processing Completed monitor updates until after any pending MonitorEvents are processed - this should be basically zero cost as ~no one is going to be using mixed-async-sync persistence in practice so we'll rarely if ever have any contention.

On the flip side, we could say that implementations are allowed to flip from Completed to InProgress for a channel, but never back (without a restart). Its more complicated to document, but it captures what we need to allow the ChainMonitor behavior.

Copy link
Contributor Author

@joostjager joostjager Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementations are allowed to flip from Completed to InProgress for a channel, but never back

This sounds like the easier way. There is no need to support mixing sync and async for Persist implementations. I do wonder now whether my initial version #4435 wasn't just sufficient. Moving the mixed mode check to the chainmonitor level and keeping everything else the same.

Fuzzer run was planned for tonight.

joostjager and others added 2 commits February 24, 2026 22:12
The mon_style array was a shadow copy of each node's persistence mode,
kept in sync with the persister's update_ret. On node reload, it was
used to restore the persistence mode to the new monitor.

Instead, read the current update_ret directly from the old monitor's
persister during reload, making the persister the single source of
truth. This also removes the initial fuzz input byte that was used to
seed mon_style, so all nodes now always start with Completed as their
persistence mode (the fuzzer can still switch to InProgress via the
0x00-0x02 opcodes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the 6 separate opcodes (0x00-0x02 for InProgress, 0x04-0x06
for Completed) with 3 toggle opcodes (0x00-0x02) that flip the
persistence style for each node. When toggling back to Completed, the
existing guard that prevents switching while monitors are pending is
preserved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 7e2b509 to c37be93 Compare February 24, 2026 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants