your storage is lying to you

Disks flip bits and do not tell anyone. Controllers report writes that never reached the platter. RAID arrays rebuild from parity that was never consistent to begin with. The scary part is not the failure. It is the silence. Your data is wrong, and every layer between you and the rust swears it is fine.

I started caring about this when I was 17 and found a SQL injection in rice.edu. That was a different kind of corruption, but it got me reading about trust boundaries, and storage is the deepest one. A storage stack is a pile of abstractions, each of which is willing to lie to the one above it in order to feel fast. This post is me walking that stack and naming the lies.

the three liars

When you call write() and it returns, almost nothing has happened. Several layers still stand between your bytes and the magnetic domains (or floating gates) that will outlive your process.

Each layer can report success before the layer below has done its job. Durability is the property of the bottom layer. Every layer above is a cache that is willing to pretend otherwise.

The three places that habitually lie:

The OS page cache. Your write() often just copies bytes into RAM and returns. The kernel will flush it later, maybe. A power loss between "write returned" and "kernel flushed" loses the data.
The device write cache. Even after the kernel flushes, the drive frequently accepts the bytes into its own volatile RAM and reports done. A power loss here loses data the OS already thought was safe.
The media itself. A sector that reads back wrong without an error code. This is the worst one, and it gets its own section.
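To make the first two liars concrete, here is a minimal Python sketch of how far userspace can push a write. The function name and the file name are mine, purely for illustration. Note what it cannot do: after os.fsync returns you are still trusting the drive to have honored the flush.

```python
import os
import tempfile


def write_through_the_caches(path: str, payload: bytes) -> None:
    """Write payload and push it past every cache reachable from userspace."""
    with open(path, "wb") as f:
        f.write(payload)      # may still sit in Python's own userspace buffer
        f.flush()             # now in the OS page cache (liar 1), not on disk
        os.fsync(f.fileno())  # ask the kernel to flush to the device; whether
                              # the drive's volatile cache (liar 2) is drained
                              # depends on the drive honoring that request


# Hypothetical usage: write one record into a scratch directory, read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "record.bin")
    write_through_the_caches(demo_path, b"important bytes")
    with open(demo_path, "rb") as f:
        read_back = f.read()
```

Reading the bytes back proves nothing about durability, by the way. The read is served from the page cache, which is exactly the layer that was going to tell you everything is fine.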

bit rot

A platter degrades. A flash cell loses charge. A cosmic ray hits a cell and flips it. The drive's internal ECC corrects small errors transparently (you never see it) until one day the error exceeds the code's correction capacity. Two things can happen, and only one of them is honest:

The drive gives up and returns an unrecoverable read error (URE). Painful, but honest. You know the block is gone.
The drive's ECC silently miscorrects and returns the wrong bytes with no error flag at all. This is silent data corruption. You will not find out until something downstream explodes.

Consumer SATA drives quote a URE rate of around 1 in 10^14 bits read, roughly one unrecoverable read per 12.5 TB read. Enterprise drives quote 1 in 10^15 or better. During a RAID rebuild you read the entire remaining array, and at 10^14 a rebuild that has to read 12 TB has worse than coin-flip odds of hitting a URE (about 60%, if you take the spec-sheet rate at face value and treat errors as independent). This is why people who care do not run RAID5 at any meaningful scale.
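The rebuild arithmetic is short enough to check yourself. This sketch assumes every bit read fails independently at exactly the quoted rate, which is a simplification (real drives fail in correlated, lumpy ways), and the function name is mine.

```python
import math


def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of hitting at least one URE while reading bytes_read bytes.

    Assumes independent bit errors at the quoted rate. Computes
    1 - (1 - p)^bits via log1p/expm1 so tiny p does not round away.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


TB = 10**12  # drive vendors use decimal terabytes

consumer = p_at_least_one_ure(12 * TB, 1e-14)    # about 0.62
enterprise = p_at_least_one_ure(12 * TB, 1e-15)  # about 0.09
```

One order of magnitude in the spec sheet moves the rebuild from "probably fails" to "probably fine", which is most of what you are paying for with an enterprise drive.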

cosmic rays and the case for ECC

DRAM is not durable either. A neutron from a cosmic ray shower can strike a memory cell and flip a bit. The often-quoted soft error rate for non-ECC DRAM at sea level is on the order of one bit flip per 256 MB per month. This varies hugely with altitude (a data center in Denver sees several times the rate of one in Mumbai) and cell geometry. Most flips hit unused memory and never matter. A few hit a pointer. A handful hit a parity bit and corrupt a transfer you will never trace back.

The fix is ECC memory: a Hamming-style code that stores eight check bits per 64-bit word and does SEC-DED (single-error correction, double-error detection). One flipped bit gets corrected on the fly. Two flipped bits get detected and raise an error instead of silently returning wrong data. Three or more, you lose, but three independent flips in one word is vanishingly rare.
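SEC-DED is easier to believe once you have watched it work. Real DIMMs protect 64-bit words with 8 check bits. The sketch below is a toy version of the same idea, an extended Hamming(8,4) code over a 4-bit value, chosen only because it fits on one screen. The layout and names are my own assumptions, not how any particular memory controller does it.

```python
from typing import Tuple

DATA_POSITIONS = (3, 5, 6, 7)   # classic Hamming(7,4) data slots
PARITY_POSITIONS = (1, 2, 4)    # power-of-two slots hold the check bits
# Position 0 holds one extra parity bit over the whole word (the "DED" part).


class DoubleBitError(Exception):
    """Two bits flipped: detectable, but not correctable."""


def encode(nibble: int) -> int:
    """Encode a 4-bit value into an 8-bit SEC-DED codeword."""
    if not 0 <= nibble < 16:
        raise ValueError("nibble must fit in 4 bits")
    word = 0
    for k, pos in enumerate(DATA_POSITIONS):
        word |= ((nibble >> k) & 1) << pos
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 8):
            if pos & p and (word >> pos) & 1:
                parity ^= 1
        word |= parity << p
    word |= bin(word).count("1") & 1  # overall parity into position 0
    return word


def decode(word: int) -> Tuple[int, str]:
    """Return (nibble, status); status is 'clean' or 'corrected'."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> pos) & 1:
            syndrome ^= pos  # XOR of set positions points at a single flip
    overall_odd = bin(word & 0xFF).count("1") & 1
    status = "clean"
    if overall_odd:
        # Odd overall parity means one flip. Syndrome 0 means the flip hit
        # the overall parity bit itself, which lives at position 0.
        word ^= 1 << syndrome
        status = "corrected"
    elif syndrome:
        # Even overall parity with a nonzero syndrome means two flips.
        raise DoubleBitError("uncorrectable double-bit error")
    nibble = 0
    for k, pos in enumerate(DATA_POSITIONS):
        nibble |= ((word >> pos) & 1) << k
    return nibble, status
```

Flip any one bit of encode(n) and decode hands back n with status "corrected". Flip any two and it raises instead of guessing. That raise is the whole point: the failure becomes loud.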

The fact that consumer platforms spent years shipping without ECC, and marketing non-ECC as a "gaming" feature, is one of the quieter scandals in the industry. If you run anything that holds state that matters, you want ECC. The cost is trivial. The benefit is that an entire class of silent corruption becomes a correctable, logged event. Linus Torvalds went on a rant about this on LKML in 2012 and he was right.

the RAID write hole

RAID4 and RAID5 compute a parity block P across a stripe of data blocks. To update one data block D₂ without rewriting the whole stripe, you do the read-modify-write dance: read D₂, read P, compute P' = P ⊕ D₂ ⊕ D₂', write D₂' and P'. Two writes.

If power dies between those two writes, the stripe is left inconsistent: P no longer matches the data. The array will keep serving reads happily. Parity is not checked on read, only on rebuild. So you sail along with a quietly broken stripe until a disk fails and you rebuild, at which point the rebuild "corrects" good data using bad parity, and you have corrupted a block that was never wrong.

A stripe with data D₁ D₂ D₃ and parity P = D₁ ⊕ D₂ ⊕ D₃. Updating D₂ requires writing both D₂' and P'. A crash between the two leaves parity stale. Undetected until a rebuild uses it to "fix" healthy blocks.
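The write hole fits in a few lines of XOR. This is a toy simulation with 4-byte "blocks" and made-up contents, not a model of any real array, but the arithmetic is exactly the arithmetic a rebuild does.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# A consistent stripe: three data blocks and their parity.
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d1, d2, d3)

# Read-modify-write of D2: P' = P xor D2 xor D2', then two separate writes.
d2_new = b"bbbb"
parity_new = xor_blocks(parity, d2, d2_new)

d2_on_disk = d2_new      # the first write lands
parity_on_disk = parity  # power dies here: the parity write never happens

# Much later the disk holding D1 fails, and D1 is rebuilt from the survivors.
rebuilt_d1 = xor_blocks(parity_on_disk, d2_on_disk, d3)

# What the rebuild would have produced had both writes landed.
rebuilt_d1_ok = xor_blocks(parity_new, d2_new, d3)
```

Here rebuilt_d1 comes out as b"aaaa" instead of b"AAAA". Nobody ever wrote to D1. It was damaged by a crash during a write to a different block, and nothing in the array noticed.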

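ZFS closes this by never doing partial parity updates the way classic RAID does. RAID-Z writes every stripe as a full, variable-width stripe to fresh space (copy-on-write) and only flips the pointer once everything is consistent. The old data stays intact until the new stripe is committed. Btrfs is copy-on-write too, but its RAID5/6 mode still updates parity in place and its own documentation warns about the write hole, so the guarantee does not carry over there.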

the fsync contract is not airtight

fsync(fd) is supposed to be the moment you can trust the bytes are on stable storage. The POSIX text says it forces the file's data and metadata to be flushed to the device, and does not return until the device says it is done. The catch is the last clause: until the device says it is done. If the device is lying about its write cache, fsync is a polite request, not a guarantee.

This has bitten real systems in production. The canonical example is PostgreSQL on EXT4 in 2009. EXT4, with its default data=ordered mode, would delay allocating the data blocks of a newly created file. On a crash it could leave the file full of zeros, including the renamed WAL file Postgres was relying on for durability. Postgres lost data that, by its own accounting, had already been fsync'd. The fix was a combination of filesystem behavior changes and database-side workarounds. The broader lesson is that fsync alone is not a contract you can trust without knowing the storage stack underneath.
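On the application side, the usual defensive pattern for replacing a file is: write a temporary file, fsync it, rename it over the target, then fsync the directory so the rename itself is durable. The sketch below is that general pattern under my own naming. It is not a description of what PostgreSQL or ext4 actually changed, and it still depends on the device honoring the flush.

```python
import os
import tempfile


def durable_replace(path: str, payload: bytes) -> None:
    """Replace path with payload so a crash leaves the old or new file whole."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same filesystem as target
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())    # data blocks first, before any rename
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # The rename lives in the directory, so the directory needs its own
        # fsync. Opening a directory like this is not portable to Windows.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The ordering is the entire trick. If you rename before the data is flushed, a crash can leave you with the new name pointing at an empty or zero-filled file.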

How to actually get durability:

FUA (Force Unit Access). A write that bypasses the device's volatile cache and goes straight to media. The honest version of a write.
CACHE FLUSH. A command that tells the drive to drain its volatile cache to media and only acknowledge once done. fsync should issue this. Whether it actually does, and whether the drive actually honors it, is the question.
Disable the drive write cache entirely. Slow, honest, and the only thing some drives actually respect.

The cost is real. Flushing serializes writes and kills throughput. Durability wants synchronous writes and throughput wants batched ones, and that tension is the reason journals, group commit, and separate journal devices exist. You batch the slow operation so you can afford to do it honestly.
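Group commit is simple enough to sketch. The class below is a hypothetical append-only log, single-threaded and with a record format I made up, that buffers records and pays for one fsync per batch instead of one per record. A real implementation would also wake the waiting writers and handle fsync errors, both of which I am skipping.

```python
import os


class GroupCommitLog:
    """Append-only log that makes a whole batch durable with one fsync."""

    def __init__(self, path: str) -> None:
        self._f = open(path, "ab")
        self._pending = 0
        self.fsync_calls = 0  # exposed only so the saving is observable

    def append(self, record: bytes) -> None:
        """Buffer one length-prefixed record. Nothing is durable yet."""
        self._f.write(len(record).to_bytes(4, "big") + record)
        self._pending += 1

    def commit(self) -> int:
        """Flush and fsync once; return how many records that covered."""
        if self._pending == 0:
            return 0
        self._f.flush()
        os.fsync(self._f.fileno())
        self.fsync_calls += 1
        committed, self._pending = self._pending, 0
        return committed

    def close(self) -> None:
        self.commit()
        self._f.close()
```

Append a hundred records and call commit once, and you have paid for one flush rather than a hundred. The price is that no caller may be told "done" until commit returns, which is exactly the latency-for-throughput trade the paragraph above describes.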

how honest systems work

The systems that actually stay honest (ZFS and Btrfs, plus the block checksumming in Ceph and GFS) share one idea: end-to-end checksums. ZFS is the cleanest example of organizing them as a Merkle tree.

Every block of data gets a checksum (ZFS defaults to the fast, non-cryptographic fletcher4 and offers SHA-256 when you want a cryptographic hash). Every group of blocks gets a hash of its children's hashes. Recurse to the root. The root hash commits to the entire tree. Change one byte anywhere and the root changes.

Corrupting a single leaf changes its hash, which changes its parent, which changes the root. A read verifies the block against the root by walking one path. If the stored root is trusted, a single bad block cannot hide.

On every read, the filesystem recomputes the block's hash and walks it up to the root. If it does not match, the block is corrupt. If there is redundancy (a mirror, or a parity copy that is itself checksummed), the system fetches a good copy, hands it to you, and heals the bad one in the background. This is self-healing, and it is the difference between a system that detects corruption and one that corrects it.
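Here is the verify-and-heal loop as a toy. Assumptions, all mine: SHA-256 as the hash, a plain binary tree that duplicates the last node on odd levels, and two in-memory lists standing in for mirrored disks. Real filesystems keep each child's checksum inside the parent's block pointer rather than passing around separate proof paths, but the logic is the same.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level: List[bytes]) -> List[bytes]:
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(blocks: List[bytes]) -> bytes:
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = _next_level(level)
    return level[0]


def merkle_path(blocks: List[bytes], index: int) -> List[bytes]:
    """Sibling hashes from leaf to root, computed when the data is written."""
    level = [h(b) for b in blocks]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        path.append(level[index ^ 1])  # the sibling sits next door
        level = _next_level(level)
        index //= 2
    return path


def verify(block: bytes, index: int, path: List[bytes], root: bytes) -> bool:
    """Walk one path up to the root and compare against the trusted root."""
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


def read_with_heal(mirrors: List[List[bytes]], index: int,
                   path: List[bytes], root: bytes) -> bytes:
    """Return the first copy that verifies and overwrite any copy that differs."""
    good = next((m[index] for m in mirrors
                 if verify(m[index], index, path, root)), None)
    if good is None:
        raise IOError("every copy of the block is corrupt")
    for m in mirrors:
        if m[index] != good:
            m[index] = good  # self-heal the bad mirror
    return good
```

Corrupt block 2 on one mirror, call read_with_heal for index 2, and you get the good bytes back while the bad copy is quietly repaired. Without the second mirror the same code can only raise, which is the detect-versus-correct distinction in one function.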

There is an even more aggressive version: T10 DIF/DIX, which appends an 8-byte integrity field to every sector on the wire (a guard tag, an application tag, and a reference tag) so that a block arriving at the media is already self-describing and gets verified at every hop, not just by the filesystem on top. The principle is the same: verify at the boundary, do not trust the layer below.
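To show what that 8-byte field buys you, here is a sketch of building and checking it. I am assuming the Type 1 layout: a 2-byte guard tag holding a CRC-16 of the sector data with polynomial 0x8BB7, a 2-byte application tag, and a 4-byte reference tag holding the low 32 bits of the LBA, all big-endian. The function names are hypothetical, and real hardware does this in the HBA and the drive, not in Python.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard, application and reference tags to one sector."""
    guard = crc16_t10dif(sector)
    ref_tag = lba & 0xFFFFFFFF
    return sector + struct.pack(">HHI", guard, app_tag, ref_tag)


def check(protected: bytes, expected_lba: int) -> bytes:
    """Verify the tags at a hop and return the bare sector, or raise."""
    sector, field = protected[:-8], protected[-8:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", field)
    if guard != crc16_t10dif(sector):
        raise ValueError("guard tag mismatch: data was damaged in flight")
    if ref_tag != expected_lba & 0xFFFFFFFF:
        raise ValueError("reference tag mismatch: block is for another LBA")
    return sector
```

The reference tag is the part I find clever. A block written intact but to the wrong address passes any checksum of its contents, and only the embedded LBA catches it.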

the pattern I keep seeing

I did not plan for this post to connect to the others, but it does. Checksums live at the top because the filesystem knows the device lies. End-to-end integrity beats hop-by-hop because each hop can lie, and only the endpoints know the full contract. A database does not trust the OS page cache. The OS does not trust the drive cache. TLS verifies at the application boundary, not the link layer. Every system that is hard to corrupt does the same thing: it assumes the layer below is dishonest and checks at its own boundary with something the layer below cannot fake.

I think this is the only design pattern that actually works for integrity. Every alternative I have seen is a slower version of the same idea or a faster version that does not work.

thanks: to people who read drafts of this and pointed out I was conflating T10 DIF and DIX, which are related but not the same thing. DIF puts the integrity field in the sector on the wire. DIX keeps it in a separate memory buffer that the HBA verifies. The distinction matters if you are implementing it. I have collapsed them here for readability.

references

Bonwick, J. et al. "ZFS: The Last Word in File Systems" (2007). Still the best explanation of end-to-end checksumming in a real filesystem.
PostgreSQL wiki. "Corrupt data after power failure on ext4." wiki.postgresql.org. The 2009 incident, documented by the people it happened to.
Torvalds, L. "ECC memory and Intel" (LKML thread, Jan 2012). yarchive.net archive. The rant about consumer platforms dropping ECC.
SNIA. "T10 Protection Information (DIF/DIX)" technical note. snia.org.
Prabhakaran, V. et al. "IRON File Systems" (SOSP 2005). The paper that systematically studied what filesystems do when the disk lies. Worth reading if this post was interesting.