Design doc: embed Litestream replication for sqlite-rest #121
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,103 @@ | ||
| # Design: Embed Litestream for SQLite replication | ||
|
|
||
| ## Background | ||
|
|
||
| `sqlite-rest` currently opens a local SQLite database file and serves RESTful access to it. There is no built-in durability story beyond a single node. [Litestream](https://litestream.io/) provides streaming WAL replication and restore for SQLite. Litestream ships a Go library that can be embedded to continuously replicate a database file to durable object storage and restore it at startup. This document proposes how to integrate that library without changing the external REST API. | ||
|
|
||
| ## Goals | ||
|
|
||
| - Offer optional replication for the served SQLite database using the Litestream Go library. | ||
| - Provide an opt-in configuration surface (CLI flags/env) to: | ||
| - Restore a database from a configured Litestream replica before the server starts handling traffic. | ||
| - Continuously replicate WAL/snapshots to one or more replicas (initially a single replica). | ||
| - Align lifecycle with the existing `serve` command: replication should start/stop with the process and respect graceful shutdown. | ||
| - Expose basic observability for replication health (log + Prometheus counters/gauges). | ||
|
|
||
| ## Non-goals | ||
|
|
||
| - Implementing multi-writer/leader election; replication is single-writer with read-only restores. | ||
| - Changing the REST API surface or authentication model. | ||
| - Building a full Litestream CLI wrapper (only the embedded library flows we need). | ||
|
|
||
| ## Current state and constraints | ||
|
|
||
| - The server opens the database via `openDB` using a DSN passed to `serve`. | ||
| - Metrics and pprof servers already share the process lifecycle and respect the same `done` channel. | ||
| - Docker image and CLI use a single database file on local disk; WAL mode is implicitly enabled by the SQLite driver. | ||
|
|
||
| ## Proposed approach | ||
|
|
||
| ### High-level flow | ||
|
|
||
| 1. **Configuration** (new `ReplicationOptions`): | ||
| - `--replication-enabled` (bool, default false). | ||
| - `--replication-config` (string, path to a Litestream YAML config file; preferred because it keeps sqlite-rest changes minimal and delegates detailed tuning such as snapshots, retention, and replicas to Litestream). | ||
| - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`). | ||
| - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`) and leave all other Litestream knobs to the config file. | ||
|
|
||
| 2. **Restore before serving**: | ||
|
Member
describe the behavior of diverging with remote backup
Contributor
Author
Added divergence behavior: default fail-fast on lineage mismatch with remote, operator can force-restore or allow degraded if they accept overwrite risk. (f01868e) |
||
| - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`. | ||
| - Restore should be idempotent (skip when the local DB is already ahead) and rely on Litestream config knobs (snapshot interval/retention/restore lag) for tuning. | ||
| - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. | ||
|
|
||
| 3. **Start replication alongside the server**: | ||
| - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL. | ||
| - Start replication in a goroutine using the same `done` channel used by the HTTP/metrics/pprof servers for coordinated shutdown. | ||
| - Ensure the replicator stops cleanly on context cancellation and flushes pending WAL frames. | ||
|
|
||
| 4. **Observability**: | ||
| - Log key lifecycle events (restore start/finish, replicate start/stop, errors). | ||
| - Add Prometheus metrics (e.g., `replication_last_snapshot_timestamp`, `replication_bytes_replicated_total`, `replication_errors_total`, `replication_lag_seconds`) populated via Litestream stats callbacks or polling the replicator state. | ||
|
|
||
| 5. **Failure handling**: | ||
| - If restore fails: abort startup with a clear error. | ||
| - If replication fails at runtime: surface errors via logs/metrics but keep the HTTP server running; rely on process restarts or admin action to recover. | ||
|
|
||
| ### API surface changes | ||
|
|
||
| - Extend `ServerOptions` (or adjacent option struct) with `ReplicationOptions` and bind new CLI flags on `serve`. | ||
| - Keep defaults disabled to avoid changing existing deployments. | ||
| - No changes to request handlers or DB query path. | ||
|
|
||
| ### Configuration mapping | ||
|
|
||
| - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. Document minimal IAM needs (typically `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` for the configured prefix) so operators can keep replication credentials least-privileged. | ||
| - **File**: support `file://` URLs for local/dev validation. | ||
| - Future: allow multiple replicas (multiple remote destinations for the same SQLite DB) by expanding the config surface (e.g., via Litestream config file); initial scope is a single replica to minimize surface area. | ||
|
|
||
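If sqlite-rest ends up validating the replica destination itself rather than leaving it entirely to Litestream, the S3 and `file://` cases above could be distinguished roughly as follows. This is a sketch only: `replicaTarget` and `parseReplicaURL` are hypothetical names, and the mapping of URL host to bucket and URL path to key prefix is an assumption about how the URL would be interpreted.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// replicaTarget is a hypothetical, driver-neutral view of a replica URL.
type replicaTarget struct {
	Kind   string // "s3" or "file"
	Bucket string // S3 only
	Path   string // key prefix for S3, directory for file
}

// parseReplicaURL accepts only the two schemes in the initial scope.
func parseReplicaURL(raw string) (replicaTarget, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return replicaTarget{}, fmt.Errorf("parse replica URL: %w", err)
	}
	switch u.Scheme {
	case "s3":
		if u.Host == "" {
			return replicaTarget{}, fmt.Errorf("s3 replica URL %q has no bucket", raw)
		}
		return replicaTarget{
			Kind:   "s3",
			Bucket: u.Host,
			Path:   strings.TrimPrefix(u.Path, "/"),
		}, nil
	case "file":
		if u.Path == "" {
			return replicaTarget{}, fmt.Errorf("file replica URL %q has no path", raw)
		}
		return replicaTarget{Kind: "file", Path: u.Path}, nil
	default:
		return replicaTarget{}, fmt.Errorf("unsupported replica scheme %q", u.Scheme)
	}
}

func main() {
	for _, raw := range []string{"s3://my-bucket/db/prefix", "file:///var/backups/db", "ftp://host/x"} {
		target, err := parseReplicaURL(raw)
		fmt.Println(target, err)
	}
}
```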
| ### Lifecycle integration sketch | ||
|
|
||
| ```go | ||
| // Restore must finish before the DB handle is opened; a failure aborts startup. | ||
| if err := restoreIfNeeded(ctx, dbPath, restoreURL, restoreOpts); err != nil { | ||
|     return err | ||
| } | ||
| db := openDB(...) | ||
| replicator := newReplicator(dbPath, replicaURL, tuneOpts) | ||
| go replicator.Start(ctx) // ctx tied to serve command cancellation | ||
| go metricsServer.Start(ctx) | ||
| go pprofServer.Start(ctx) | ||
| server.Start(ctx.Done()) | ||
| // Error handling: monitor the replicator error channel/state changes; log and increment metrics. | ||
| // Runtime replication errors do not stop the HTTP server (see "Failure handling" above). | ||
| ``` | ||
|
|
||
| ### Testing strategy (future implementation) | ||
|
|
||
| - Unit: flag parsing → `ReplicationOptions` defaults/validation. | ||
| - Integration (temporary files): start a Litestream replicator pointing to a `file://` replica, perform writes via HTTP handlers, assert replica files advance (e.g., WAL or snapshot count). | ||
| - Restore path: seed replica, delete local DB, start server with `--replication-enabled --replication-config <path>`, assert DB is restored before serving. | ||
| - Metrics: expose fake replicator stats and assert Prometheus gauges/counters are set. | ||
|
|
||
| ## Migration & compatibility | ||
|
|
||
| - Replication is opt-in and disabled by default; existing CLI invocations keep current behavior. Detailed replication tuning stays in the Litestream config file to minimize sqlite-rest surface changes. | ||
| - Docker image remains the same; enabling replication requires supplying new flags/env and storage credentials. | ||
|
|
||
| ## Open questions | ||
|
|
||
| - What are the sensible defaults for snapshot/retention to balance durability and cost? | ||
|
Member
defaults to disable this feature, delegate the detailed settings to litestream config
Contributor
Author
Documented that replication stays disabled by default and detailed settings are delegated to the Litestream config file. (f01868e)
Member
remove this question
Contributor
Author
Removed the answered question; defaults remain disabled with detailed settings delegated to the Litestream config. (d41a7dd)
Contributor
Author
Removed the remaining question; open questions now only cover sensible defaults for snapshot/retention. (d41a7dd) |
||
|
|
||
| ## Implementation plan (for future PRs) | ||
|
|
||
| 1. Add `ReplicationOptions` with CLI/env bindings and validation. | ||
| 2. Add restore step before `openDB` in `serve`. | ||
| 3. Wire Litestream replicator lifecycle to the server context and add metrics/logging. | ||
| 4. Add targeted tests and minimal docs/README snippet for enabling replication. | ||
change to consume litestream configuration file here to minimize the deltas.
Updated to prefer a Litestream YAML config via `--replication-config` so sqlite-rest changes stay minimal. (f01868e)