ReorderBuffer crash loop: no process error when calling SlotProcessorServer during backfill-to-real-time transition #2103

@hiasr

Hi!
As soon as I start backfilling, Sequin consistently breaks, and I have to delete the DB connection and replication slot to get it working again.
Backfilling a single table works, but backfilling a whole schema breaks it.

Context:

  • Sequin version: v0.14.1
  • Deployment: tried both Kubernetes and plain Docker Compose on EC2
  • Postgres: 16 on RDS (no replica)
  • Sequin config: RabbitMQ sink with 1 PG schema configured
  • Logs: hyperdx_search_results_2026-01-11T10-04-20.csv

Symptoms observed:

  • Once a backfill starts, I can no longer open the sink's backfill page.
  • After a while, or after making a change (like adding a schema to the sink), it starts to break down and the SlotProcessor ends up in an error loop.

LLM log analysis: Gemini suggested the cause may be a race condition triggered by many backfills finishing at the same time.

What Happened?
The Sequin.Runtime.SlotProducer.ReorderBuffer process crashed repeatedly because it tried to make a synchronous call (GenServer.call) to the SlotProcessorServer, but that process was either not alive or not registered.

The specific error message is: ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name

This occurred specifically when the ReorderBuffer received a :flush_batch message and tried to send data to the processor.

The Sequence of Events
Backfill Completion: At approximately 09:58:35.059Z, several backfill operations (e.g., IDs 010253bc, d4a5bc88, 22bd7330) completed in rapid succession.

Pipeline Teardown: As each backfill completed, the system logged a Stopping TableReaderServer message while cleaning up the backfill readers.

Race Condition: It appears that stopping these readers or the completion of the backfills triggered a restart or reconfiguration of the replication pipeline.

The Crash: During this transition, the ReorderBuffer was still running and attempted to "flush" a batch of data it had collected. However, the SlotProcessorServer it was trying to reach had either already been shut down as part of the transition or had not yet finished starting/registering its name.

Flapping: Because several backfills finished nearly at the same time, the system "flapped"—continually trying to restart the pipeline while more backfills were finishing and triggering more restarts. Each time it restarted, the ReorderBuffer tried to flush and crashed again because the SlotProcessorServer wasn't ready.

The "Hunch" (Root Cause)
The likely root cause is a race condition during the backfill-to-real-time transition.

When multiple backfills finish simultaneously, the frequent starting and stopping of the replication workers (SlotProcessor, SlotProducer, ReorderBuffer) creates a window where the ReorderBuffer starts up and receives data before the SlotProcessorServer is fully registered in the process registry (:syn). Since the ReorderBuffer uses a synchronous call to flush data, it crashes immediately if the target process is missing, rather than retrying or waiting.
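
The failure mode described above can be modeled in a few lines. This is an illustrative sketch in Python, not Sequin's Elixir code: the registry dict stands in for the :syn process registry, and NoProcess, sync_call, flush_batch, and the "slot_processor" key are hypothetical names chosen for the example.

```python
class NoProcess(Exception):
    """Raised when no process is registered under the requested name."""


# Hypothetical stand-in for the process registry (name -> handler).
registry = {}


def sync_call(name, message):
    """Synchronous call: fails immediately if the name is not registered."""
    handler = registry.get(name)
    if handler is None:
        raise NoProcess(name)
    return handler(message)


def flush_batch(batch):
    """Model of the buffer's flush: no retry, so a missing target is fatal."""
    return sync_call("slot_processor", ("handle_batch", batch))


# The buffer flushes before the processor has registered: the call raises.
try:
    flush_batch([1, 2, 3])
    crashed = False
except NoProcess:
    crashed = True

# Once the processor is registered, the same flush succeeds.
registry["slot_processor"] = lambda msg: len(msg[1])
delivered = flush_batch([1, 2, 3])
```

The only difference between the failing and the succeeding flush is timing: whether registration happened before the call.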

Summary of the Log Evidence
Error Level Logs: Repeated GenServer ... terminating errors for the ReorderBuffer.

Stack Trace: Points to reorder_buffer.ex:122 in the handle_info function for :flush_batch.

Correlation: Every crash is preceded by a Stopping TableReaderServer ... backfill_status=completed info log.

Process PIDs: You can see new PIDs for the SlotProcessorServer (e.g., <0.8759.0>, <0.8775.0>) being initialized every second, confirming the system was trying to recover but failing.

Possible Fixes/Workarounds
Sequential Backfills: If possible, avoid having many backfills finish at the exact same millisecond (though this is often hard to control).

Resilience in ReorderBuffer: This may require a code change in Sequin to make the ReorderBuffer more resilient to a missing SlotProcessorServer (e.g., by using a retry mechanism or a longer timeout during name lookup).
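
As a rough illustration of the retry idea, here is a minimal sketch in Python rather than Sequin's Elixir. It assumes the flush can safely be re-attempted, and NoProcess, call_with_retry, and flaky_flush are hypothetical names. In Elixir, the comparable approach would be to catch the exit raised by GenServer.call and retry after a short delay.

```python
import time


class NoProcess(Exception):
    """Raised when the target process is not registered yet."""


def call_with_retry(call, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Run `call`, retrying on NoProcess with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except NoProcess:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Usage: the target only becomes available on the third attempt.
state = {"calls": 0}


def flaky_flush():
    state["calls"] += 1
    if state["calls"] < 3:
        raise NoProcess("slot_processor")
    return "flushed"


delays = []  # record the backoff delays instead of actually sleeping
result = call_with_retry(flaky_flush, sleep=delays.append)
```

Bounding the number of attempts keeps a persistent outage visible: after the last attempt the error still propagates, so a supervisor can restart the pipeline instead of the buffer waiting forever.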

Metadata

Labels: bug (Something isn't working)
