Every database you have ever used is, at its core, a log. The tables you query are a convenience, a materialized view of that log, kept up to date so you do not have to replay history on every read. The write-ahead log, the Raft log, the Kafka topic, the event-sourcing journal: all the same shape, serving the same purpose, at different scales.

I did not write this post to make a grand point. I wrote it because I kept noticing the same data structure showing up in every system I read about, and writing it down helped me understand why.

the claim

A database is two things: an append-only log that records every state change, and a set of indexes that materialize the current state for fast reads. The log is the truth. The indexes are a cache. On every write, you append to the log and update the indexes. On every crash, you replay the log to reconstruct the indexes. On every replication, you ship the log to a follower.

This is not a metaphor. It is the literal architecture of Postgres (WAL + heap), MySQL (redo log + binlog + buffer pool), SQLite (WAL + b-tree), RocksDB (WAL + LSM), and every other durable storage engine you can name. The differences are in the shape of the index and the strategy for keeping it consistent with the log.

a database is a log plus indexeswriteLOG (WAL)append-only · ordered↑ fsync here (durable)replayINDEX (b-tree)materialized viewcrashreplay log tailrebuildINDEX restoredreplicaships log → follower rebuilds same index
A write lands in the log first (durable, append-only, fast). The index updates from the log. On crash, the index is rebuilt by replaying the log tail. On replication, the follower consumes the same log. The log is the source of truth. Everything else is derivable.

why the log comes first

The log solves two problems at once that are hard to solve any other way. The first is durability. Appending to a log is a single sequential write, which is the cheapest durable operation on any medium. Updating a b-tree in place is many random writes and a lot of bookkeeping. Doing it durably on every transaction would be slow. So you write the log first (one fsync), tell the client "committed," and update the b-tree lazily in memory. If you crash before the b-tree is flushed, you replay the log and rebuild.

The second is order. A log is a total order on events. In a single-node database that total order is trivial, it is the append order. In a distributed system, the total order is the whole point. It is what makes a group of machines agree on a single history. This is why Raft's log and Postgres's WAL are the same data structure wearing different hats. Raft uses the log to make multiple machines agree. Postgres uses it to make one machine survive a crash. The mechanism, an append-only, totally ordered sequence of records, is identical.

the WAL

Postgres calls it the WAL. MySQL calls it the redo log (and separately, the binlog for replication). SQLite calls it the WAL too. The pattern is universal:

  1. A transaction modifies pages in an in-memory buffer pool. The pages are dirty but not yet written to disk.
  2. Before the transaction commits, its changes are appended to the log and the log is flushed. This is the fsync from the storage post. This is where that contract matters.
  3. The commit returns. The dirty pages may not be on disk yet. They will be written later by a background process (the checkpointer).
  4. On crash recovery, the database reads the log, redoes any transactions whose changes did not make it to the data files, and undoes any that were not committed. The index is consistent with the log again.

The optimization this enables, group commit, is the reason databases can be both durable and fast. Many transactions append their records to the log in a batch, and a single fsync commits all of them. The cost of the honest flush is amortized across the group. Same principle as journaling in filesystems: batch the slow, honest operation so you can afford to do it honestly.

the LSM tree

A b-tree database has a log (the WAL) and an index (the b-tree). The LSM tree asks a different question: what if the index is a log? What if we never update in place at all, and instead make the whole storage engine append-only?

An LSM (Log-Structured Merge) tree writes everything into an in-memory memtable (a sorted structure). When the memtable fills, it is flushed to disk as an immutable SSTable (Sorted String Table) at level 0. As SSTables accumulate, a background compaction process merges them into larger, sorted files at deeper levels, discarding overwritten and deleted keys. Reads check the memtable, then the SSTables from shallowest to deepest, using bloom filters to skip files that definitely do not contain the key.

LSM tree — append, flush, compactwrite →memtablein-memoryL0flushedL1compactedL2compacted↑ background compaction merges + sortsread →memtable?L0L1 ...L2 (bloom skip)writes: one append · reads: check several (bloom filters skip empties)compaction rewrites files to discard overwritten / deleted keys
Writes are always sequential appends. Never a random in-place update. The cost moves from the write path to the read path (you may check several files) and to the background compaction (which rewrites files as they merge). This is the write-optimized extreme. A b-tree is the read-optimized extreme.

The tradeoff has three names:

  • Write amplification: how many bytes get written to disk per byte of logical write. LSM is low (one append). A b-tree is high (page reads, page writes, split propagation).
  • Read amplification: how many I/Os per read. A b-tree is low (a root-to-leaf traversal). LSM is higher (memtable + several SSTables, mitigated by bloom filters).
  • Space amplification: how much extra space is used. LSM has stale copies of overwritten keys until compaction. A b-tree is compact (in-place updates). Compaction is the tax you pay for the cheap writes.

Pick your engine by your workload. Write-heavy, append-mostly, or time-series: LSM (RocksDB, Cassandra, InfluxDB). Read-heavy, point-lookup-heavy, with a working set that fits in RAM: B-tree (Postgres, MySQL/InnoDB). Most OLTP workloads are the second, which is why Postgres is the default and RocksDB is the specialization.

Kafka

If the log is the source of truth and the tables are materialized views, there is no reason the log has to live inside the database. You can externalize it. A dedicated log system that other systems subscribe to. That is Kafka.

Kafka is a log, partitioned for scale, replicated for durability, with a consumer protocol for reading it back in order. Producers append. Consumers read at their own pace. The log retains history for as long as you configure. Once you have it, the architecture shifts:

  • Your database becomes a materialized view of a Kafka topic, not the other way around.
  • Multiple downstream systems (search index, cache, analytics warehouse, ML feature store) can all consume the same log and build their own projections, all consistent because they share the source.
  • You can replay history: spin up a new consumer at offset zero and rebuild a view from scratch. The log is the backup. The views are disposable.
the log as spine — one history, many viewsone committed history · different scopesPostgres tableRaft log (cluster)Kafka topic (org)the log · append-only · totally orderedone machine's durabilitya cluster's agreementan org's event backbone
The log is the committed history. Every downstream system is a projection. A Postgres table, a Raft log, and a Kafka topic are all logs at different scopes: one machine's durability, a cluster's agreement, and an organization's event backbone.

CDC

Change Data Capture is the technique that connects the in-database log to the external log. Instead of application code writing to both Postgres and Kafka (the "dual write" bug factory, you will eventually write to one and not the other and silently diverge), CDC reads the database's own WAL and publishes every committed change to Kafka. The database is the source of truth. Kafka is the fan-out.

Postgres has logical replication (decoding the WAL into a row-level change stream). Debezium is the standard connector that turns that stream into Kafka events. The guarantee you get is meaningful: every change that committed in the database arrives in Kafka, in commit order, exactly once. No dual writes, no divergence, no "the cache is stale and I do not know why."

This is also the honest way to do event sourcing if you must. The event log is the truth. The read models are projections. The database is an event log that happens to also serve as a queryable store. The architecture is the same. The framing is different.

I keep coming back to the log because it is the one data structure that shows up unchanged across every layer of the stack. The storage post's durability is implemented by a WAL. The consensus post's agreement is implemented by a replicated log. The memory-ordering post's publish pattern is, at the hardware level, a log of coherence events. The BFT post's PBFT is a log agreed on by nodes that do not trust each other.

A log is an append-only, totally ordered sequence of records. That is the whole definition. Everything else (durability, replication, consensus, streaming, event sourcing) is what you build on top of it. The reason it keeps appearing is that a log is the simplest structure that gives you both an order and a history. Order lets multiple things agree. History lets you recover and replay.

When I look at a new system now, the first question I ask is: where is the log? If the answer is "there isn't one," I ask how it survives a crash. If the answer involves anything other than replaying a committed history, I get nervous. I am not sure this heuristic has ever been wrong.

references

  1. Ousterhout (2015). "A Technical Overview of RocksDB." GitHub wiki. The LSM tradeoffs (write/read/space amplification) are explained well here.
  2. Kreps, Narkhede, Rao (2011). "Kafka: a Distributed Messaging System for Log Processing." PDF. The original Kafka paper. Short. The log-as-backbone idea is explicit.
  3. Kreps (2014). "The Log: What every software engineer should know about real-time data's unifying abstraction." LinkedIn Engineering Blog. The post that made me start thinking this way.
  4. Debezium documentation. debezium.io. The standard CDC connector for Postgres/MySQL/etc.
  5. Ongaro. "Consensus: Bridging Theory and Practice" (Raft thesis), Chapter 5 on log compaction. raft.github.io. The log management chapter that most Raft papers skip.