
feat: Write transaction fragments directly to storage to reduce consumer memory footprint#3783

Draft
alco wants to merge 25 commits into main from alco/write-txn-fragments-to-storage

Conversation


@alco alco commented Jan 27, 2026

Closes #3415

Summary

For shapes without subquery dependencies (the vast majority), the consumer now streams each transaction fragment directly to storage as it arrives, instead of buffering the entire transaction in memory. This significantly reduces per-shape-consumer memory usage for large transactions.

Shapes with subquery dependencies continue to buffer the full transaction (write_unit=txn) because they need the complete change set for move-in/move-out reasoning.

How it works

The consumer operates in one of two modes, determined at shape creation time:

  • write_unit=txn_fragment (default): Each fragment is filtered, converted to log items, and written to storage immediately. A lightweight PendingTxn struct tracks only metadata (xid, offset, byte count, change count) across fragments. When the final fragment with a commit arrives, storage is notified to advance its transaction boundary.

  • write_unit=txn (shapes with subquery dependencies): Fragments are accumulated into a complete transaction before writing, as before.
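For illustration, the dispatch between the two modes might be sketched as follows (function and field names here are assumptions, not the actual consumer code):

```elixir
# Hypothetical sketch of the per-fragment dispatch described above.
defp handle_fragment(%{write_unit: :txn_fragment} = state, fragment) do
  # Stream path: filter, convert and write this fragment immediately;
  # only PendingTxn metadata is retained across fragments.
  state
  |> write_fragment_to_storage(fragment)
  |> maybe_complete_pending_txn(fragment)
end

defp handle_fragment(%{write_unit: :txn} = state, fragment) do
  # Buffer path: accumulate fragments into a complete transaction,
  # then write it all at once when the commit fragment arrives.
  state = accumulate_fragment(state, fragment)
  if fragment.commit, do: write_complete_txn(state), else: state
end
```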

Key additions

Consumer.PendingTxn — tracks in-progress transaction metadata during fragment streaming. It is the lightweight counterpart to TransactionBuilder: where TransactionBuilder accumulates all changes in memory, PendingTxn only tracks offsets and counters while actual data is written to storage and discarded immediately.
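A hypothetical shape for the struct, based on the metadata listed in the summary (field names are assumptions):

```elixir
# Constant-size metadata per in-progress transaction; no change data
# is retained, which is what keeps the memory footprint flat.
defmodule Consumer.PendingTxn do
  defstruct xid: nil,        # transaction id, used for xid-based skip checks
            offset: nil,     # log offset tracking
            num_changes: 0,  # count of changes written so far
            byte_count: 0    # bytes written so far
end
```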

New storage callbacks: append_fragment_to_log!/2 writes log items without advancing the persisted transaction offset (it can be called multiple times per transaction), and signal_txn_commit!/2 marks the transaction as complete so crash recovery sees the correct committed boundary. The existing full-transaction write path is refactored to share the same underlying logic.
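The intended call pattern for these callbacks might look like this (a sketch; argument shapes are assumptions based on the description above):

```elixir
# Per-fragment: append log items without advancing the persisted
# transaction offset. Safe to call multiple times per transaction.
Storage.append_fragment_to_log!(log_items, storage)

# On the final fragment carrying the commit: mark the transaction
# complete so crash recovery sees the correct committed boundary.
Storage.signal_txn_commit!(commit_offset, storage)
```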

Support.StorageTracer — test helper using Erlang trace sessions to observe which storage functions are called from a process, enabling non-intrusive assertions on the write path taken without wrapper modules or function patching.
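As a rough sketch of the technique (using the OTP 27+ :trace session API; the exact trace flags and match patterns used by StorageTracer are assumptions):

```elixir
# Create an isolated trace session that delivers trace messages to
# the test process, then trace calls made by the consumer process.
session = :trace.session_create(:storage_tracer, self(), [])
:trace.process(session, consumer_pid, true, [:call])
:trace.function(session, {Storage, :append_fragment_to_log!, 2}, true, [:global])

# The test can now assert on which storage functions were called:
assert_receive {:trace, ^consumer_pid, :call,
                {Storage, :append_fragment_to_log!, _args}}
```

Because the session is isolated, this does not interfere with any other tracing and needs no changes to production code.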

Other notable changes

  • Initial snapshot filtering (xid-based skip) is now applied at the first fragment, avoiding accumulation of fragments for transactions that will be discarded.
  • The FlushTracker in ShapeLogCollector is now only invoked for fragments that include a commit, since mid-transaction fragments don't represent a flush boundary.
  • Removed CrashingFileStorage (dead code).
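The FlushTracker change in particular reduces to a simple guard; a sketch (function names here are assumed):

```elixir
# Only fragments that carry a commit mark a flush boundary;
# mid-transaction fragments are ignored by the tracker.
defp maybe_track_flush(state, fragment) do
  if fragment.commit do
    update_in(state.flush_tracker, &FlushTracker.record_flush(&1, fragment))
  else
    state
  end
end
```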

Expected memory footprint reduction

Previously, the consumer held an entire transaction's worth of change structs in memory until commit. Now, each fragment's changes are written to storage and released immediately. The consumer only retains the PendingTxn metadata struct (~100 bytes) per in-progress transaction, reducing peak memory by roughly the size of the buffered change data. The improvement scales linearly with transaction size.

Testing

Consumer and storage tests have been expanded to cover both write_unit modes, including tracing-based assertions on which storage functions are called.


coderabbitai bot commented Jan 27, 2026

Important

Review skipped

Auto reviews are disabled on this repository.



netlify bot commented Jan 27, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 634162c
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6995a5643350ab00082eacdf
😎 Deploy Preview https://deploy-preview-3783--electric-next.netlify.app


codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.22%. Comparing base (03943ad) to head (751cf24).
⚠️ Report is 4 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3783      +/-   ##
==========================================
+ Coverage   87.16%   87.22%   +0.06%     
==========================================
  Files          25       25              
  Lines        2314     2325      +11     
  Branches      580      588       +8     
==========================================
+ Hits         2017     2028      +11     
  Misses        295      295              
  Partials        2        2              
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 92.09% <ø> (+0.05%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 87.22% <ø> (+0.06%) ⬆️
unit-tests 87.22% <ø> (+0.06%) ⬆️



@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from d3dc0f4 to 1c882c2 on January 28, 2026 11:18
@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from 1c882c2 to 71b61f5 on February 9, 2026 15:12

@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from dba7fc3 to c7d9d17 on February 10, 2026 12:42

@alco alco changed the title from "Alco/write txn fragments to storage" to "feat: Write transaction fragments directly to storage to reduce consumer memory footprint" on Feb 10, 2026
@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from c7d9d17 to ccaed2d on February 10, 2026 13:54

alco commented Feb 10, 2026

Manual memory benchmark results

The tests below were done separately with electric@main and electric@this-branch. The database had a single table:

CREATE TABLE IF NOT EXISTS public.users (
  id bigserial PRIMARY KEY,
  name TEXT NOT NULL,
  email TEXT NOT NULL                                                  
);

and Electric was initialized with 200 shapes, each having a where clause of the form N=N, for N in 1..200.

After creating the shapes and firing up Observer, I executed the following INSERT in the database:

insert into                              
  public.users(name, email)
    select
      'username_' || generate_series::text || '_' || repeat('x', 2000),
      'username_' || generate_series::text || '@example.com'
    from
      generate_series(1,10001);

Memory charts

The total memory usage on electric@main grows almost to 9GB during transaction processing. For electric@this-branch the total memory usage stays below 500MB.

electric@main

electric-main-mem-chart

electric@this-branch

electric-fragment-mem-chart

System stats comparison

For each run, I screenshotted the system stats in Observer both before the INSERT and after the memory usage had fallen back down (after GC and consumer hibernation).

The key thing to notice from these is that electric@main has residual process memory usage after the load testing which is not the case for electric@this-branch.

electric@main

electric-main-stats

electric@this-branch

electric-fragment-stats

@alco alco marked this pull request as ready for review on February 10, 2026 14:26

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ccaed2d3ca



alco commented Feb 10, 2026

@claude please review this PR.

  • Does it address the referenced issue?
  • Do you see any obvious issue in the code?
  • Lack of test coverage?
  • Any gnarly or hairy code?


alco commented Feb 11, 2026

I've run the replication throughput benchmark on our cloud fleet with the following params, comparing main (i.e. electricsql/electric:canary) with this PR (pushed to Docker Hub as alco/electric:fragment-direct-pr):

%{
  "row_size" => [150],
  "shape_count" => [5, 100, 1000, 2000],
  "tx_affects_shape" => [true, false],
  "tx_count" => [1000],
  "tx_row_count" => [90, 100, 120, 200, 220],
  "where_clause" => ["name = '\#{name}'", "name ILIKE '\#{name}%'"]
}


balegas commented Feb 11, 2026

Every microsecond counts!


@robacourt robacourt left a comment


LGTM. Great work! The interaction with the Materializer will need addressing; maybe consumers with materializers should be write_unit=txn.

@alco alco added the claude label Feb 11, 2026

claude bot commented Feb 11, 2026

Claude Code Review

Summary

Seventh review iteration of PR #3783 (2026-02-18). The PR successfully addresses the linked issue #3415 by streaming transaction fragments directly to storage, delivering a dramatic memory improvement (9GB → 500MB). The latest commit (751cf24) fixes empty materializer event filtering and ordering assumptions. However, CI test failures remain a merge blocker - 3 MaterializerTest failures suggest issues with event accumulation and notification in fragment mode.

What's Working Well

Memory optimization delivered: The benchmark results demonstrate the PR's core value - memory usage stays below 500MB during large transaction processing vs. nearly 9GB on main.

Solid architecture: The separation between PendingTxn (lightweight metadata tracking) and TransactionBuilder (full buffering) is clean and appropriate. The storage capability checking via supports_txn_fragment_streaming? provides a good safety net.

Comprehensive testing infrastructure: The addition of StorageTracer for non-intrusive test assertions and oracle-based ground-truth validation shows excellent testing practices.

Latest fix (751cf24) is correct: The empty event filtering check (events != %{}) and sorting fix for event ordering properly address the identified issues.

Issues Found

Critical (Must Fix)

1. CI MaterializerTest failures indicate incomplete fix

Files: packages/sync-service/test/electric/shapes/consumer/materializer_test.exs:353-375, 408-429

Issue: According to the latest Blacksmith CI report (2026-02-18), three MaterializerTest failures persist despite the fix in commit 751cf24:

  1. test events are accumulated in correct order across uncommitted fragments - Ordering assertion failures
  2. test same-batch move event cancellation existing value removed and re-added emits no events - Unexpected message received
  3. test same-batch move event cancellation insert and delete in same batch emits no events - Unexpected message received

Root cause analysis:

Looking at the materializer code (materializer.ex:309-322):

defp maybe_flush_pending_events(state, true) do
  events = cancel_matching_move_events(state.pending_events)
  
  if events != %{} do
    for pid <- state.subscribers do
      send(pid, {:materializer_changes, state.shape_handle, events})
    end
  end
  
  %{state | pending_events: %{}}
end

The fix correctly prevents sending %{}, but the test failures suggest:

  • Event accumulation across fragments may not be working correctly (test line 353-375)
  • The commit: flag may not be propagated correctly in all code paths
  • Tests receiving unexpected messages implies empty events are still being sent from somewhere

Possible issues:

  1. The Consumer may be calling Materializer.new_changes/3 without the correct commit: flag in some edge cases
  2. There might be a race condition where maybe_flush_pending_events is called multiple times
  3. The event merging logic in merge_events/2 might have edge cases

Action required:

  1. Run the failing tests locally with debug logging to see what messages are actually being sent
  2. Add logging to track all calls to maybe_flush_pending_events and new_changes with their arguments
  3. Verify that the Consumer's fragment processing path always passes commit: false for non-final fragments
  4. Check if there's a code path that bypasses the events != %{} check

2. Innermost dependency shape handling unresolved (from previous reviews)

File: packages/sync-service/lib/electric/shapes/consumer/state.ex:174-175

Issue: Per @robacourt's comment in review-context, the current logic enables write_unit=:txn_fragment for shapes without dependencies, but innermost dependency shapes (shapes with no dependencies that ARE dependencies for others) need special handling.

Example:

  • Outer shape: where: id IN (SELECT id FROM accounts) - dependencies: [Inner Shape]
  • Inner shape: SELECT id FROM accounts - dependencies: [] - HAS a materializer

The inner shape will be set to write_unit=:txn_fragment, but its materializer may need full transaction visibility for correct move-in/move-out reasoning.

Evidence this may be causing oracle test failures:

  • Oracle tests fail at varying rates (2-10 runs with specific params)
  • Example discrepancy: materialized shows 6 rows, oracle shows 9 rows (missing rows)
  • This pattern is consistent with incomplete transaction visibility in materializers

Recommendation:

  1. Check if a shape has dependent shapes before setting write_unit=:txn_fragment
  2. If a shape has a materializer (regardless of whether it has dependencies), consider whether it needs full transaction visibility
  3. Add explicit test: "innermost dependency shape with materializer receives fragments correctly"
  4. Correlate oracle test failures with shapes that have materializers - if failures are specific to these shapes, this is likely the root cause

Code location to fix: packages/sync-service/lib/electric/shapes/consumer/state.ex:174-175

write_unit =
  if filled_deps == [] do
    # BUG: This doesn't check if this shape IS a dependency for others
    # and has a materializer that needs full txn visibility
    :txn_fragment
  else
    :txn
  end

Suggested fix pattern:

write_unit =
  cond do
    # Shapes with dependencies always buffer full txn
    filled_deps != [] -> :txn
    # Shapes that have dependent outer shapes with materializers need full txn
    has_dependent_shapes?(shape) -> :txn
    # All other shapes can stream fragments
    true -> :txn_fragment
  end

3. Oracle test baseline still missing (from previous reviews)

Status: The author reported oracle test failures on main but no baseline comparison has been documented yet.

Critical questions remain:

  1. Does main fail oracle tests at the same rate with the same seeds?
  2. Does this PR introduce new failures or expose existing bugs?
  3. Is the oracle test harness itself correct?

Why this is a blocker: Without a baseline, we cannot distinguish between:

  • Pre-existing bugs in main
  • Bugs introduced by this PR
  • Bugs in the test harness itself

Action required:

  1. Run oracle tests on main with fixed seeds (document the seeds used)
  2. Run the same tests on this branch with identical seeds
  3. Document findings in PR comment with failure rates and specific discrepancies
  4. If this PR introduces new failures, investigate before merge
  5. If failures are pre-existing, file a separate issue and document it

Important (Should Fix)

4. Incomplete changeset file

File: .changeset/brown-rice-float.md

Issue: The changeset describes the feature but doesn't mention potential breaking changes or requirements:

  • Storage backends must implement new callbacks (append_fragment_to_log!/2, signal_txn_commit!/2, supports_txn_fragment_streaming?/1)
  • InMemoryStorage and TestStorage explicitly return false for fragment streaming support
  • No migration guide for custom storage implementations

Recommendation: Expand the changeset to document:

  1. New storage interface requirements
  2. Backward compatibility strategy (unsupported storages fall back to :txn mode)
  3. Any performance implications for shapes with subquery dependencies

5. Test helper robustness

File: packages/sync-service/test/electric/shapes/consumer/materializer_test.exs:492

Issue: The apply_changes/2 helper calls prep_changes/1 which adds metadata, but there's no verification that the commit: parameter is being passed correctly by callers.

Recommendation: Add a helper that explicitly documents whether changes are committed:

defp apply_uncommitted_changes(ctx, changes), 
  do: Materializer.new_changes(ctx, prep_changes(changes), commit: false)

defp apply_committed_changes(ctx, changes),
  do: Materializer.new_changes(ctx, prep_changes(changes), commit: true)

This makes test intent clearer and reduces the risk of missing the commit: flag.


Test Coverage

Excellent additions:

  • Oracle tests provide ground-truth validation (commit 634162c)
  • Fragment ordering test (commit bb169f2)
  • Event cancellation tests with move-in/move-out logic
  • Storage capability checking tests
  • StorageTracer for non-intrusive testing

Missing:

  • ✅ Fragment streaming with innermost dependency shapes (CRITICAL)
  • Multi-fragment transaction where final fragment is filtered out (edge case for offset mapping)
  • Materializer event accumulation edge cases (commit: false → commit: false → commit: true)
  • Oracle baseline comparison tests on main

Issue Conformance

Linked issue #3415: "Avoid holding whole transaction in consumer memory"

Core objective delivered: 9GB → 500MB memory improvement
Fragment streaming working: For shapes without dependencies
Safety mechanisms in place: Storage capability checking prevents crashes
⚠️ Materializer edge cases: CI failures need resolution
⚠️ Innermost dependency shapes: Unclear if handled correctly
⚠️ Oracle test baseline: Needed to validate correctness


Architecture & Code Quality

Strengths:

  • Clean separation: PendingTxn (metadata) vs TransactionBuilder (full buffering)
  • Defensive programming: Empty event filtering, storage capability checking
  • Excellent observability: StorageTracer, oracle tests
  • Well-structured storage abstraction for heterogeneous backends

Concerns:

  • Materializer event notification has subtle complexity causing test failures
  • Innermost dependency shape scenario needs explicit handling or documentation
  • Oracle test reliability needs baseline to establish confidence

Previous Review Status (Review #6 → Review #7)

What was addressed since review #6:

  • ✅ Empty event filtering added (commit 751cf24)
  • ✅ Ordering assumption removed from tests (commit 751cf24)
  • ✅ Storage capability checking implemented (commits 4e630ab, 7ec4c9a)
  • ✅ Materializer notification split for fragments vs clients (commit a93967e)
  • ✅ Commit fragment offset fix for txn_offset_mapping (commit 78a629a)

What remains from review #6:

  1. MaterializerTest CI failures - Still failing (3 tests)
  2. Oracle test baseline on main - Not yet documented
  3. Innermost dependency shape question - Unresolved per @robacourt feedback

Next Steps Before Merge

Critical (Merge Blockers):

  1. Fix MaterializerTest CI failures (assignee: @alco)

    • Reproduce failures locally
    • Add debug logging to track maybe_flush_pending_events calls
    • Verify commit: flag propagation in all Consumer code paths
    • Fix and verify tests pass consistently
  2. Establish oracle test baseline on main (assignee: @alco)

    • Document seeds used for reproducibility
    • Run tests on main with fixed seeds (10+ runs)
    • Run tests on this branch with same seeds
    • Document comparison: pre-existing vs new failures
    • If new failures, investigate root cause
  3. Resolve innermost dependency shape handling (assignee: @alco + @robacourt)

    • Add test: innermost dependency shape (no deps, has materializer) with fragment streaming
    • Verify materializer sees complete transaction for move-in/move-out logic
    • If broken, add logic to detect "is dependency for other shapes" and use :txn mode
    • Document decision in code comments

Important (Pre-Merge):

  1. Expand changeset file with storage interface breaking changes
  2. Add test helper clarity for committed vs uncommitted changes
  3. Add inline comments explaining when write_unit=:txn vs :txn_fragment is chosen

Overall Assessment

This PR represents a significant and valuable memory optimization with solid architectural decisions. The ~18x memory reduction (9GB → 500MB) is impressive and addresses a real pain point.

However, three merge blockers remain:

  1. CI test failures are unresolved - MaterializerTest failures suggest the empty event fix is incomplete or there are edge cases in event accumulation logic
  2. Innermost dependency shape behavior is unclear - @robacourt's comment highlights a gap in the current write_unit selection logic that may affect shapes with materializers
  3. Oracle test baseline is missing - Cannot validate correctness without comparing failure rates against main

The code quality is high, testing infrastructure is excellent, and the approach is sound. Once these three blockers are addressed, this PR will be ready to merge.

Recommendation:

  1. Immediate priority: Debug and fix MaterializerTest failures - they indicate a real issue
  2. High priority: Add test for innermost dependency shape + materializer scenario
  3. Before merge: Document oracle test baseline comparison with main

Review iteration: 7 | 2026-02-18

@alco alco force-pushed the alco/write-txn-fragments-to-storage branch 2 times, most recently from 5f18340 to 4714a7a, on February 16, 2026 13:08

alco commented Feb 16, 2026

Update (Feb 16)

Rebased on current main and pushed new commits addressing review feedback:

  • Split materializer and client notifications (a93967e): Separated notify_new_changes into distinct functions for notifying the materializer vs. clients. This is needed to selectively notify the appropriate subscriber when processing individual txn fragments.

  • Use commit fragment offset for txn_offset_mapping and client notifications (78a629a): Fixes the flush alignment issue flagged in review — when the final commit fragment is filtered out, the mapping now uses the commit fragment's last_log_offset directly instead of the last relevant-change boundary. PendingTxn no longer needs to track last_log_offset.

  • Add supports_txn_fragment_streaming? callback to Storage behaviour (4e630ab): InMemoryStorage and TestStorage return false, PureFileStorage returns true. This allows runtime capability checking before enabling fragment streaming mode.

  • Downgrade write_unit to :txn when storage lacks fragment streaming support (4714a7a): On consumer init, validates that the storage backend supports txn fragment streaming. If unsupported (e.g. InMemoryStorage), falls back to full-transaction buffering with a warning log. This addresses the concern about InMemoryStorage crashing on fragment callbacks.
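The downgrade check described in the last bullet can be sketched as follows (names taken from the commit description; the exact code is an assumption):

```elixir
# On consumer init: fall back to full-transaction buffering when the
# storage backend can't stream fragments (e.g. InMemoryStorage).
write_unit =
  if write_unit == :txn_fragment and
       not Storage.supports_txn_fragment_streaming?(storage) do
    Logger.warning(
      "Storage does not support txn fragment streaming; " <>
        "falling back to write_unit=:txn"
    )

    :txn
  else
    write_unit
  end
```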

Still TODO: address @robacourt's feedback about shapes that are dependencies (innermost shapes with no dependencies themselves but that DO have a materializer) — these should probably use write_unit=txn to ensure the materializer gets the full transaction change set on commit.

In the current implementation, the Materializer sees changes from all txn fragments, but when it then notifies the consumer with a :materializer_changes message in return, the Consumer may still be assuming that a complete transaction has been processed.



alco commented Feb 16, 2026

Latest changes (3 new commits since last push)

Defer Materializer subscriber notifications until commit fragment

Previously, when write_unit=:txn_fragment, the inner Consumer called Materializer.new_changes() per fragment, and the Materializer immediately sent :materializer_changes (move-in/move-out events) to the outer Consumer after each fragment. This could cause redundant intermediate notifications within a single transaction.

Changes:

  • Added a commit: option to Materializer.new_changes/3 (defaults to true for backward compatibility). When commit: false, events are accumulated in a new pending_events field on the Materializer state without notifying subscribers. When commit: true, accumulated events are flushed to subscribers.
  • The fragment-streaming path in consumer.ex passes commit: false for every fragment. The flush happens via notify_new_changes in maybe_complete_pending_txn which uses commit: true (the default).
  • Events from multiple fragments are merged by prepending newer events to older ones, matching the existing reverse-chronological ordering convention used by increment_value/decrement_value.
  • Added a test verifying correct event ordering across uncommitted fragments.
  • Moved signal_txn_commit! inside the num_changes > 0 branch since there's no need to signal storage when no changes were written.
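Put together, the deferred-notification flow might look like this (a sketch; compute_move_events/2, merge_events/2 and the exact state fields are assumptions):

```elixir
# commit: defaults to true for backward compatibility.
def new_changes(state, changes, opts \\ []) do
  commit? = Keyword.get(opts, :commit, true)
  events = compute_move_events(state, changes)

  # Prepend newer events, matching the reverse-chronological convention.
  state = %{state | pending_events: merge_events(events, state.pending_events)}

  # Flush accumulated events to subscribers only on the commit fragment.
  maybe_flush_pending_events(state, commit?)
end
```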


alco commented Feb 16, 2026

Regarding an earlier agentic comment about the potential impact of direct txn fragment writing to storage on PureFileStorage.trim_log!, I've asked Claude to analyse it and here's the result - https://gist.github.com/alco/588b3d7d7665f63d768d5f1b611499c7#file-fragment_storage_safety_analysis-md.

TL;DR: Fragment Streaming Storage Safety

All 7 crash/consistency scenarios are safe. The three-offset architecture (last_seen_offset / last_seen_txn_offset / last_persisted_txn_offset) provides the core safety guarantee:

  • trim_log!/2 correctly truncates uncommitted fragments — it uses last_persisted_txn_offset (only advanced after a complete txn is finalized + flushed) to remove both chunk index entries and log file bytes beyond the committed boundary.
  • Readers never see uncommitted datastream_main_log caps reads at last_seen_txn_offset, which isn't advanced until signal_txn_commit! calls finalize_txn.
  • ETS consistency is maintained — fragment entries stay in an in-memory buffer until finalize_txn; the disk/ETS split read strategy handles mid-transaction flushes correctly.
  • Compaction has one theoretical edge case (transaction spanning 30+ MB creating 3+ chunks) where the compacted file could retain uncommitted data after a crash, but readers are still bounded by the txn offset, so no data corruption occurs. Practically, this is a near-zero likelihood scenario.

No code changes are needed — the existing invariants hold correctly with the fragment write path.


@robacourt robacourt left a comment


LGTM. To test the effect of this on subqueries, you could cherry-pick 0cc3b49 and run the tests with the number of mutations in the test transactions over the max txn size and see if there are any issues:

ELECTRIC_EXPERIMENTAL_MAX_TXN_SIZE=5 \
MUTATIONS_PER_TXN=15 \
mix test --only oracle test/integration/oracle_property_test.exs


alco commented Feb 16, 2026

@robacourt That's dope! It has found one failure already after running for ~10sec. I'll look into fixing it and see if it can discover anything else.

alco and others added 18 commits February 18, 2026 12:28
To be sure no log items are written to storage, we must verify that
neither of the writing-to-storage functions got called.
On CI the function has been seen to return an empty list where a
non-empty one was expected. Apparently, there may be a delay between the
calls themselves and the moment they end up in the test process'
mailbox.
…rate functions

This is needed to be able to selectively notify the appropriate
subscriber when processing individual txn fragments.
…tions

PendingTxn no longer needs to track last_log_offset since the commit
fragment's offset (txn_fragment.last_log_offset) is now used directly in
maybe_complete_pending_txn. This fixes flush alignment lagging behind
when the final commit fragment is filtered out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
InMemoryStorage and TestStorage return false, PureFileStorage returns
true. This allows runtime capability checking before enabling fragment
streaming mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pport

When initializing a consumer, validate that the storage backend
supports txn fragment streaming before using :txn_fragment mode.
If unsupported (e.g. InMemoryStorage), fall back to full-transaction
buffering with a warning log.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from 412234d to 60f11a8 on February 18, 2026 11:40
alco and others added 3 commits February 18, 2026 12:41
Previously, Materializer sent :materializer_changes to subscribers on
every new_changes call. When write_unit=:txn_fragment, this meant the
outer Consumer could receive multiple intermediate move-in/move-out
notifications within a single transaction, causing redundant work.

Add a commit: option to Materializer.new_changes/3 (default: true).
Changes are accumulated in pending_events and only flushed to
subscribers when commit: true. The fragment-streaming path passes
commit: not is_nil(fragment.commit) to defer until the final fragment.
maybe_complete_pending_txn sends an empty new_changes with commit: true
to handle the edge case where the commit fragment has no matching
changes but earlier fragments accumulated events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that splitting inserts across two new_changes calls (one
uncommitted, one committed) produces the same event order as sending
them in a single call.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from 60f11a8 to 634162c on February 18, 2026 11:41

@alco alco force-pushed the alco/write-txn-fragments-to-storage branch from 674ad7e to 751cf24 on February 18, 2026 16:07

alco commented Feb 18, 2026

@robacourt Pushed 751cf24 to correctly incorporate your recent addition of cancelling out move-ins/move-outs.


alco commented Feb 18, 2026

I've just run oracle tests with max_batch_size=5 and mutations_per_txn=15 and can reliably reproduce this issue:

Poll error for shape=shape_9 where=(level_3_id = 'l3-5') OR (value BETWEEN 'v1' AND 'v10'): %Electric.Client.Error{message: %{"errors" => %{"offset" => ["out of bounds for this shape"]}, "message" => "Invalid request"}, resp: %Electric.Client.Fetch.Response{status: 400, last_offset: nil, shape_handle: nil, schema: nil, next_cursor: nil, request_timestamp: ~U[2026-02-18 18:57:16.838842Z], body: %{"errors" => %{"offset" => ["out of bounds for this shape"]}, "message" => "Invalid request"}, headers: %{"access-control-allow-methods" => ["GET, POST, HEAD, DELETE, OPTIONS"], "access-control-allow-origin" => ["*"], "access-control-expose-headers" => ["electric-cursor,electric-handle,electric-offset,electric-schema,electric-up-to-date,electric-internal-known-error,retry-after"], "cache-control" => ["no-store"], "content-type" => ["application/json; charset=utf-8"], "date" => ["Wed, 18 Feb 2026 18:57:16 GMT"], "electric-server" => ["ElectricSQL/1.4.4"], "surrogate-control" => ["no-store"], "vary" => ["accept-encoding"], "x-request-id" => ["GJVsPM1fclpQUIgAAEJC"]}}}

The key bit is %Electric.Client.Error{message: %{"errors" => %{"offset" => ["out of bounds for this shape"]}, "message" => "Invalid request"}

So apparently Electric returns an incorrect offset that is not at a transaction boundary.

The silver lining is that if I set write_unit: :write_unit_txn in this branch (that is, stick to buffering txn fragments in memory), the oracle test harness only finds inconsistency errors similar to those that can be reproduced on main.

I may be able to fix the handling of multiple txn fragments for shapes without subqueries first, and we can ship that. Then deal with subqueries.

@alco alco marked this pull request as draft on February 18, 2026 19:09

balegas commented Feb 18, 2026

:( if it's easy to reproduce, we should be able to figure out the root cause. There is no rush to ship this PR, so let's do it properly when we know the fix.

Successfully merging this pull request may close these issues.

Avoid holding whole transaction in consumer memory
