The design rationale for talea-store-log, the embedded append-log backend. For the file formats and tunables see the crate reference; for deployment see the how-to; for the ledger-wide design decisions this backend inherits, see Architecture & design.
A ledger commit must be durable before it is acknowledged — an acked-then-lost commit is the one failure a ledger cannot have. Durability costs an fsync, and an fsync costs milliseconds (~3 ms F_FULLFSYNC on a laptop, sub-millisecond on server NVMe). Paying it per request caps a writer at a few hundred commits per second no matter how fast everything else is.
The SQL backends pay that cost plus network round-trips, lock acquisition, and SQL execution — measured on the same laptop, Postgres peaks around 810 commits/s on one book. The question this backend answers: how much of that ceiling is essential, and how much is paying for capabilities (multi-instance coordination, SQL access) a single-node deployment doesn’t use?
talea is already event-sourced: every write is an event, and balances are projections (why). The SQL backends store the events in a database and maintain projections in the same transaction. This backend takes the idea literally: the event log is the storage — one append-only file sequence per book — and the projections live in memory, rebuilt from the log at startup. There is no second source of truth to keep consistent, so there is no transaction to coordinate. What the SQL backends get from ACID, this backend gets from a write protocol:
per-book writer task
┌──────────────────────────────────────┐
│ drain queued commits (group)         │
│ validate against in-memory state     │  rejects reply per-draft,
│ append CRC-framed events             │  never poison batchmates
│ fsync — ONCE for the whole batch     │
│ apply to in-memory state             │
│ ack callers, publish to subscribers  │
└──────────────────────────────────────┘
The order is the invariant: nothing is acknowledged before its bytes are synced, and nothing is applied to readable state before it is durable. A failed fsync kills the writer rather than continuing into unknowable disk state — callers see an error and retry with the same idempotency key, which is always safe.
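The ordering can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the crate's code: it uses a std thread-style channel instead of a Tokio task, writes a plain text line per event instead of a CRC-framed record, tracks a single non-negative balance, and omits publishing to subscribers. The names Commit, State, write_batch, and run_writer are hypothetical.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;
use std::sync::mpsc::{Receiver, Sender, channel};

/// Hypothetical commit request: a signed amount plus a channel to ack on.
struct Commit {
    amount: i64,
    reply: Sender<Result<u64, String>>,
}

/// Hypothetical in-memory projection for one book.
#[derive(Default)]
struct State {
    next_seq: u64,
    balance: i64,
}

/// One batch: validate, append, fsync once, apply, ack (in that order).
fn write_batch(log: &mut File, state: &mut State, batch: Vec<Commit>) -> io::Result<()> {
    let mut accepted = Vec::new();
    let mut balance = state.balance; // working copy; readable state is untouched
    let mut seq = state.next_seq;
    for c in batch {
        match balance.checked_add(c.amount) {
            Some(b) if b >= 0 => {
                // Append only; the bytes are not durable yet.
                writeln!(log, "{} {}", seq, c.amount)?;
                balance = b;
                accepted.push((seq, c));
                seq += 1;
            }
            _ => {
                // A reject answers its own caller and leaves batchmates alone.
                let _ = c.reply.send(Err("rejected by validation".to_string()));
            }
        }
    }
    if accepted.is_empty() {
        return Ok(());
    }
    // One fsync for the whole batch. On failure we return before applying or
    // acking; the dropped reply channels surface as errors to the callers.
    log.sync_all()?;
    state.balance = balance;
    state.next_seq = seq;
    for (seq, c) in accepted {
        let _ = c.reply.send(Ok(seq));
    }
    Ok(())
}

/// Writer loop: block for the first commit, then drain whatever else is queued.
fn run_writer(path: &Path, rx: Receiver<Commit>) -> io::Result<State> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;
    let mut state = State::default();
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        batch.extend(rx.try_iter());
        write_batch(&mut log, &mut state, batch)?;
    }
    Ok(state)
}
```

In this sketch a failed sync_all propagates out of run_writer, so the loop stops instead of continuing on unknown disk state, which mirrors the "kill the writer" rule described above.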
The SQL backends keep sequences gapless with a per-book counter-row lock — the database serializes writers across instances. This backend has no database, so the arbiter moves into the process: one Tokio task owns each book, and all writes flow through it. Sequence assignment is a local increment; idempotency checks are a map lookup; min-balance validation reads memory. The single-writer model is also why commit timestamps stay monotonic with sequence numbers without any clock coordination.
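A rough sketch of what single ownership buys: when one writer holds the state exclusively, the three checks named above reduce to ordinary in-memory operations with no locks. Everything here is an assumption made for illustration (the Book and Outcome types, a single-account delta instead of a full double-entry transaction, one book-wide minimum balance), and persistence is left out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical result of a commit attempt.
#[derive(Debug, PartialEq)]
enum Outcome {
    Committed { seq: u64 },
    /// The idempotency key was seen before; return the original sequence.
    Replayed { seq: u64 },
    Rejected(&'static str),
}

/// Hypothetical per-book state, owned by exactly one writer.
#[derive(Default)]
struct Book {
    last_seq: u64,
    seen: HashMap<String, u64>,     // idempotency key -> sequence
    balances: HashMap<String, i64>, // account -> balance
    min_balance: i64,
}

impl Book {
    /// `&mut self` means the compiler enforces one writer at a time.
    fn commit(&mut self, key: &str, account: &str, delta: i64) -> Outcome {
        // Idempotency check: a map lookup.
        if let Some(&seq) = self.seen.get(key) {
            return Outcome::Replayed { seq };
        }
        // Min-balance validation: reads memory only.
        let current = self.balances.get(account).copied().unwrap_or(0);
        let next = match current.checked_add(delta) {
            Some(n) if n >= self.min_balance => n,
            Some(_) => return Outcome::Rejected("below minimum balance"),
            None => return Outcome::Rejected("balance overflow"),
        };
        // Sequence assignment: a local increment, so it is gapless by construction.
        self.last_seq += 1;
        self.seen.insert(key.to_string(), self.last_seq);
        self.balances.insert(account.to_string(), next);
        Outcome::Committed { seq: self.last_seq }
    }
}
```

A rejected draft never consumes a sequence number in this sketch, which is what keeps the sequence gapless without any coordination.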
Trade-off: the arbiter being in-process is exactly why this backend is single-process (enforced by a directory lock). Two processes can’t share an in-memory arbiter. Multi-instance deployments belong on Postgres, where the database is the arbiter every instance can see.
Throughput = batch size ÷ fsync latency. One lone committer pays a full fsync (~3 ms → ~330 commits/s — the measured c1 floor). Sixty-four concurrent committers share one fsync and the same hardware does ~6,600/s; at c128, ~9,500/s. Nothing about the disk changed — the batches got fuller. This is the same group-commit idea the server’s write router applies to SQL transactions, applied at the fsync instead.
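The formula can be turned around to read the measured numbers. The helper names below are hypothetical, and the second function rests on two assumptions the text does not confirm: that fsync latency stays near 3 ms under load and that fsync is the only cost per batch.

```rust
/// Upper bound on commits per second if every fsync carries `batch_size` commits.
fn throughput_ceiling(batch_size: f64, fsync_latency_s: f64) -> f64 {
    batch_size / fsync_latency_s
}

/// Average commits per fsync implied by an observed rate,
/// assuming the fsync latency is unchanged under load.
fn implied_batch_size(commits_per_s: f64, fsync_latency_s: f64) -> f64 {
    commits_per_s * fsync_latency_s
}
```

With a 3 ms fsync, throughput_ceiling(1.0, 0.003) gives about 333 commits/s, matching the c1 floor. Read the other way, implied_batch_size(6600.0, 0.003) is about 19.8 and implied_batch_size(9500.0, 0.003) is 28.5, so under those assumptions the c64 and c128 runs would average roughly 20 and 28 commits per fsync. The source does not report actual batch sizes, so treat these as back-of-envelope figures only.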
What fills the batches in practice is the wire: with one transaction per HTTP request, request overhead caps arrival rate long before the store saturates. The batch endpoint delivers drafts hundreds at a time, and the same fsync schedule then carries ~35–40 k drafts/s on the same hardware (conditions in the bench README; live trends on the CI bench charts).
Trade-off: worst-case latency is best-case batching. A solo commit can’t amortize anything and pays the full fsync alone.
Every event is a length-prefixed, CRC-checked frame, which makes crash states classifiable instead of mysterious:
Snapshots bound startup time (replay from the snapshot’s sequence instead of genesis) but carry no authority: they’re written atomically, validated on load, and a bad one just means a longer replay. The same principle applies to the idempotency index’s spill files and Bloom filter — disk-resident caches are either CRC-valid or rebuilt from the log. Nothing the store cannot rebuild is ever trusted.
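To make the framing idea concrete, here is a minimal sketch of how a length prefix plus a CRC lets a scanner name what it finds at the end of a log. The layout (4-byte little-endian length, payload, 4-byte CRC-32 over the payload), the Tail categories, and the function names are assumptions for illustration; the actual on-disk format and crash classification are defined in the crate reference.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical classification of what the scanner found at the end of the log.
#[derive(Debug, PartialEq)]
enum Tail {
    /// The log ends exactly on a frame boundary.
    Clean,
    /// The last frame is incomplete; bytes before `valid_len` are good.
    Torn { valid_len: usize },
    /// A complete frame at `offset` fails its CRC check.
    Corrupt { offset: usize },
}

/// Assumed layout: [len: u32 LE][payload][crc32(payload): u32 LE].
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 8);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out
}

/// Walk the log, returning every valid payload and how the log ends.
fn scan(log: &[u8]) -> (Vec<&[u8]>, Tail) {
    let mut frames = Vec::new();
    let mut pos = 0;
    while pos < log.len() {
        let rest = &log[pos..];
        if rest.len() < 4 {
            return (frames, Tail::Torn { valid_len: pos });
        }
        let len = u32::from_le_bytes([rest[0], rest[1], rest[2], rest[3]]) as usize;
        let end = len.saturating_add(8);
        if rest.len() < end {
            // Not enough bytes for the declared frame. This sketch calls that
            // torn; a damaged length field would look the same here.
            return (frames, Tail::Torn { valid_len: pos });
        }
        let payload = &rest[4..4 + len];
        let c = &rest[4 + len..end];
        if u32::from_le_bytes([c[0], c[1], c[2], c[3]]) != crc32(payload) {
            return (frames, Tail::Corrupt { offset: pos });
        }
        frames.push(payload);
        pos += end;
    }
    (frames, Tail::Clean)
}
```

The same check applies to any disk-resident cache: if the bytes fail validation, the reader discards them and falls back to rebuilding from the log.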
read_events, point-in-time trial balance, subscriptions) withhold frames newer than the last applied batch — durable-and-applied or invisible, never a dirty read.
i64::MAX (with a logged warning) instead of failing after an fsync; per-account balances still reject overflow at validation time.
talea-store-log reference — formats, tunables, measured numbers, known limits