Filesystem Data Integrity: Detecting Silent Corruption

Summary
Understand how modern filesystems use Merkle-tree checksums and mirrored pools to detect, repair, and proactively scrub silent data corruption that ext4 and XFS miss entirely.

The silent corruption problem

Traditional filesystems like ext4 and XFS have a fundamental flaw: they trust the storage layer completely. If a disk returns corrupted bytes, the filesystem hands them straight to your application — no questions asked.

Corruption happens more often than you'd expect, and most of it never trips a SMART warning:

  • Bit rot. Cosmic rays, magnetic decay, and NAND charge leak flip individual bits on stored media over months to years.
  • Firmware bugs. RAID controllers and SSDs occasionally return wrong data because of bugs in their write coalescing or wear-leveling code paths.
  • Misdirected writes. The controller writes data to the wrong LBA, overwriting an unrelated block. The drive reports success.
  • Phantom writes. The drive reports success but never actually persists the block — a subsequent read returns stale data from cache or the previous version on disk.
  • DMA / memory errors. Bits flip in DRAM during transfer between CPU, controller, and disk. Without ECC RAM, you have no way to detect it.

The worst part is the silence. The disk returns success, the filesystem sees no error, the operating system has no signal. Your data is corrupted and nobody knows until you read it back months later and find a photo with a colored band across it, or a database that refuses to recover.

The checksum solution

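Modern filesystems such as ZFS and Btrfs solve this by computing a checksum of every block and storing it separately from the data. The default is a fast checksum (Fletcher-4 on ZFS, CRC32C on Btrfs), with cryptographic hashes such as SHA-256 available as an option. On every read, they verify that the recomputed checksum matches the stored one. APFS applies the same idea to its metadata only.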

The first contrast to understand is what ext4 does versus what a checksum-aware filesystem does on a corrupted read:

Reading a file after silent disk corruption:

  1. Look up metadata → Block 5280, checksum: abc123
  2. Fetch Block 5280 → corrupted data (bit flipped)
  3. Verify: sha256(data) = xyz789 ≠ abc123
  4. Return I/O ERROR (corruption detected!)

With checksums: The filesystem computes a hash of the fetched data and compares it to the stored checksum. Mismatch means corruption—return an error rather than bad data. With RAID, it can try a mirror copy and self-heal.
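
The verify-on-read step can be sketched in a few lines of Python. This is a toy model, not any filesystem's actual code: `ToyStore`, `ChecksumError`, and the two in-memory maps are hypothetical names, and SHA-256 is used only because it is in the standard library.

```python
import hashlib


class ChecksumError(IOError):
    """Raised when a block's contents do not match its stored checksum."""


def checksum(data):
    # SHA-256 for illustration only; real filesystems often default to
    # faster non-cryptographic checksums.
    return hashlib.sha256(data).hexdigest()


class ToyStore:
    """Hypothetical block store: data and checksums live in separate maps."""

    def __init__(self):
        self.blocks = {}      # block_no -> bytearray, stands in for the disk
        self.checksums = {}   # block_no -> hex digest, stands in for metadata

    def write(self, block_no, data):
        self.blocks[block_no] = bytearray(data)
        self.checksums[block_no] = checksum(data)

    def read(self, block_no):
        data = bytes(self.blocks[block_no])
        if checksum(data) != self.checksums[block_no]:
            # Refuse to hand damaged bytes to the caller.
            raise ChecksumError(f"block {block_no}: checksum mismatch")
        return data


store = ToyStore()
store.write(5280, b"hello, world")
assert store.read(5280) == b"hello, world"

store.blocks[5280][0] ^= 0x01   # simulate one silently flipped bit
try:
    store.read(5280)
    detected = False
except ChecksumError:
    detected = True
```

The caller gets an exception instead of bad data, which is the whole difference from the ext4 behaviour described earlier.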

Where the checksum lives matters. A checksum stored next to its data block can be damaged by the same write that damages the data, and it cannot catch a phantom write at all, because the stale block and its stale checksum still agree. ZFS instead stores each block's checksum in its parent metadata block, the block that points to it, and Btrfs keeps data checksums in a separate checksum tree. To go undetected, corruption (or an attacker) would have to alter the data, the parent's stored checksum, and every ancestor's checksum up to the root in a mutually consistent way.

Why the tree shape matters

The hashes chain upward into a Merkle tree. Corrupting any leaf changes its recomputed hash, which then no longer matches the value stored in its parent. Each parent is in turn covered by the hash stored in its own parent, so the chain of trust runs all the way to the root. A read starts at the root and verifies each block on the way down, so data is returned only if every block on its path checks out.

In production, the tree is many levels deep. ZFS stores a 256-bit checksum in every block pointer (Fletcher-4 by default, with SHA-256 selectable through the checksum property), and the root of the tree is referenced from the uberblock, which is itself protected by a SHA-256 checksum stored in the pool's label. Every block in the entire filesystem is reachable from that one root.
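
A small binary Merkle tree shows the chaining. This is a sketch under simplifying assumptions: real filesystems use wide, variable-fanout trees with checksums embedded in block pointers, and the helper names (`build_levels`, `verify_block`) are made up for this example.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return the hash levels of the tree: leaf hashes first, root last."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        # A parent hashes the concatenation of its (up to two) children.
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def verify_block(index, data, levels):
    """Recompute the path from one block to the root and compare roots.

    The sibling hashes are read from `levels`; in a real filesystem they
    would themselves be verified on the way down from the root.
    """
    node = h(data)
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            if index % 2 == 0:
                node = h(node + level[sibling])
            else:
                node = h(level[sibling] + node)
        else:
            node = h(node)   # odd node out: parent covers it alone
        index //= 2
    return node == levels[-1][0]


blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
levels = build_levels(blocks)

ok = verify_block(2, blocks[2], levels)       # intact block
bad = verify_block(2, b"block-2!", levels)    # damaged block
```

Only the hashes along one path are recomputed, so checking a single block costs work proportional to the depth of the tree rather than its size.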

This is what makes the integrity guarantee end-to-end. The filesystem proves that the bytes returned to the application are the bytes that were originally written, not bytes that happened to read back from the same sectors.

Self-healing with redundancy

Detection is half the work. With a mirror or a RAID-Z group, a checksum-aware filesystem can both detect corruption and repair it, by reading the good copy from a sibling disk and rewriting the bad block.

Three properties make this work in a way traditional RAID can't match:

  1. The filesystem owns the redundancy. A hardware RAID controller can't repair silent corruption because it can't tell which copy is wrong: two mirror sides that disagree look equally plausible, and parity can show that a stripe is inconsistent but not which member is bad. The filesystem knows because the checksum lives one level up the tree.
  2. The repair is online. The application never sees an error code; it receives the verified bytes from the good disk. The bad block is rewritten in the background.
  3. The repair is logged. Every successful self-heal is recorded. Operators can see the corruption rate trending over time and decide when to retire a drive before it fails outright.
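
The three properties can be sketched together as a toy mirror. Everything here is illustrative: `ToyMirror`, its in-memory "disks", and the shape of `repair_log` are assumptions made for the example, not how ZFS or Btrfs lay anything out.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


class ToyMirror:
    """Hypothetical N-way mirror; checksums are kept apart from every disk."""

    def __init__(self, copies=2):
        self.disks = [{} for _ in range(copies)]   # block_no -> bytes
        self.checksums = {}                        # block_no -> expected hash
        self.repair_log = []                       # (block_no, disk_index)

    def write(self, block_no, data):
        for disk in self.disks:
            disk[block_no] = data
        self.checksums[block_no] = checksum(data)

    def read(self, block_no):
        expected = self.checksums[block_no]
        bad = []
        for i, disk in enumerate(self.disks):
            data = disk[block_no]
            if checksum(data) == expected:
                # Rewrite each copy that failed verification, and log it.
                for j in bad:
                    self.disks[j][block_no] = data
                    self.repair_log.append((block_no, j))
                return data
            bad.append(i)
        raise IOError(f"block {block_no}: no copy matches its checksum")


mirror = ToyMirror()
mirror.write(7, b"payload")
mirror.disks[0][7] = b"pAyload"      # silent corruption on the first disk

result = mirror.read(7)              # caller sees verified bytes, no error
```

After the read, `mirror.disks[0][7]` holds the good bytes again and `mirror.repair_log` records which disk needed the repair. Copies that sit after the first good one are not checked on this path; finding those is the job of a scrub.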

Scrubbing: proactive detection

Corruption that's never read stays hidden. Scrubbing is the maintenance operation that reads every block in the pool, verifies its checksum, and repairs whatever it can — so that bit rot doesn't accumulate silently to the point where corruption on the primary drive coincides with corruption on the mirror.

zpool scrub tank
zpool status tank
# 9.81T scanned at 1.2G/s, 9.78T issued, 12.4T total
# 2 errors, 2 repaired in 02:17:34
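
Conceptually, a scrub is a loop over every copy of every block. The sketch below assumes a simple mirrored layout held in Python dicts; the function name and data shapes are hypothetical, and a real scrub walks the on-disk tree rather than a flat map.

```python
import hashlib


def checksum(data):
    return hashlib.sha256(data).digest()


def scrub(disks, checksums):
    """Verify every copy of every block; repair from a good copy if one exists.

    disks: list of dicts mapping block_no -> bytes, one dict per mirror side.
    checksums: dict mapping block_no -> expected hash.
    Returns (errors_found, errors_repaired).
    """
    found = repaired = 0
    for block_no, expected in checksums.items():
        good = None
        bad = []
        for i, disk in enumerate(disks):
            if checksum(disk[block_no]) == expected:
                good = disk[block_no]
            else:
                bad.append(i)
        found += len(bad)
        if good is not None:
            for i in bad:
                disks[i][block_no] = good
            repaired += len(bad)
    return found, repaired


data = {n: f"block {n}".encode() for n in range(4)}
disks = [dict(data), dict(data)]
sums = {n: checksum(b) for n, b in data.items()}

disks[0][1] = b"rotted"   # damage on one side
disks[1][3] = b"rotted"   # unrelated damage on the other side

first_pass = scrub(disks, sums)
second_pass = scrub(disks, sums)
```

Both errors are repairable here because they hit different blocks. Had the same block rotted on both sides before the scrub ran, it would be counted as found but not repaired, which is the case regular scrubbing is meant to prevent.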

Cadence depends on the data:

  • Critical or archival data: weekly. The interval has to be short enough that damage to one copy is found and repaired before the other copy is likely to be damaged too.
  • General-purpose pools: monthly. The default on most ZFS-on-Linux installs.
  • Mostly-read workloads: blocks the workload already reads daily are verified on each read, so the scrub can run less often. It still has to run, because a normal read checks only the copy it fetches, not the other mirror side or the parity.

The goal is to read every block more often than the median time between independent corruption events on two different disks. For consumer SATA drives that's roughly monthly. For enterprise SSDs the safe interval can stretch to quarterly.

Filesystem comparison

Filesystem | Data checksums | Self-healing | Scrubbing | Notes
-----------|----------------|--------------|-----------|------
ext4 | No (metadata only) | No | No | Optional metadata_csum protects the journal and inode tables but not file data.
XFS | No (metadata only) | No | No | CRC32C on metadata since 2014. Data integrity requires dm-integrity below.
NTFS | No | No | No | ReFS adds data integrity streams on Windows Server; NTFS itself does not.
Btrfs | Yes (CRC32C default; optional xxhash, SHA-256, BLAKE2) | With raid1/raid10 | btrfs scrub | Parity RAID still discouraged for production at the time of writing.
ZFS | Yes (Fletcher-4 default; SHA-256 and others configurable) | With mirror or RAID-Z | zpool scrub | End-to-end integrity from application down to the device label.
APFS | No (metadata only) | No (single-disk filesystem) | No exposed command | User data is not checksummed; recovery relies on backups such as Time Machine.

The cost of integrity

Checksums aren't free, but the overhead is small enough that on modern hardware you measure it in noise:

  • CPU. 1–5 % for SHA-256 on a busy server; closer to noise with hardware-accelerated CRC32C or xxhash. On CPUs with SHA-NI or the ARMv8 crypto extensions, SHA-256 drops to a few cycles per byte.
  • Space. ~0.1–0.5 % for checksum storage, depending on block size and which algorithm is configured.
  • Latency. Hidden by the disk I/O time. Verification happens in parallel with the next block fetch.

For most workloads you won't notice. The narrow exceptions are high-frequency trading and write-amplification-sensitive databases, where you can leave metadata checksummed and disable data checksums on specific datasets.
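
Both numbers are easy to estimate on your own hardware. The sketch below is a rough measurement, not a benchmark: it uses Python's `zlib.crc32` (plain CRC-32, standing in for CRC32C, which is not in the standard library) and `hashlib.sha256`, and the space figure counts only one checksum per block, ignoring any tree overhead. The block sizes are example values.

```python
import hashlib
import time
import zlib


def space_overhead(checksum_bytes, block_bytes):
    """Fraction of extra space if each block stores exactly one checksum."""
    return checksum_bytes / block_bytes


def throughput_mib_s(fn, payload, rounds=20):
    """Very rough single-threaded throughput of fn over payload, in MiB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(payload)
    elapsed = time.perf_counter() - start
    return rounds * len(payload) / (1024 * 1024) / elapsed


payload = bytes(4 * 1024 * 1024)   # 4 MiB of zeros
candidates = [
    ("crc32", zlib.crc32),
    ("sha256", lambda b: hashlib.sha256(b).digest()),
]
for name, fn in candidates:
    print(f"{name}: {throughput_mib_s(fn, payload):.0f} MiB/s")

# One 32-byte checksum per 16 KiB block, and one 4-byte CRC per 4 KiB block.
print(f"{space_overhead(32, 16 * 1024):.3%}")
print(f"{space_overhead(4, 4 * 1024):.3%}")
```

The two space figures come out at about 0.2 % and 0.1 %, inside the range quoted above. The throughput lines depend entirely on the machine, so compare them with your disks' sequential read rate rather than with each other.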

Why dm-integrity isn't quite enough

On ext4 or XFS you can stack dm-integrity underneath to get block-level checksums without changing filesystems. It works, but it isn't equivalent:

  • dm-integrity can't repair anything by itself. It turns a corrupted read into an I/O error, but it holds no second copy; recovery depends on an mdraid mirror stacked above it, which can then reread the sector from the sibling device. Without that mirror you still lose the read.
  • Each layer protects its own bytes, not the data path. Memory errors during DMA between layers go undetected.
  • There's no scrub of the upper layer's data, only of the integrity tag the lower layer wrote.

dm-integrity is a real improvement over bare ext4, but it doesn't reach the end-to-end guarantee a checksum-aware filesystem gives.

When integrity matters most

The critical use cases are the ones where you cannot recreate the bytes:

  • Long-term archival storage — family photos, scanned documents, backups that won't be read for years.
  • Databases and financial records — silently flipped bits in a ledger or a column store are catastrophic.
  • Scientific data — instrument readings, simulation outputs, anything where reproducibility costs serious compute.
  • Source-of-truth caches — once a corrupted block is read into the cache, the corruption can propagate to every consumer until eviction.

If you're running ext4 or XFS without dm-integrity or filesystem-level checksums, you are trusting cosmic rays, firmware bugs, and disk aging never to corrupt your data. On a long enough timeline, they will.

Practical recommendations

  1. New storage systems should be ZFS, or Btrfs in a mirrored profile (not parity RAID), with weekly scrubs. The marginal operational cost is small.
  2. Existing ext4 or XFS that holds anything you can't recreate should be sandwiched with dm-integrity plus mdraid mirroring. The result is weaker than ZFS but vastly stronger than the default.
  3. All servers should use ECC RAM. Without it, a memory error during DMA can corrupt every read your filesystem checksum would otherwise have protected.
  4. Scrub before you back up. A backup of a corrupted block silently propagates the corruption into your archival copies.
  5. Track scrub error counts. A drive that starts repairing more blocks per scrub than its baseline is a leading indicator of impending failure — replace it before it fails outright.
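
Recommendation 5 can be automated with very little code. The sketch below assumes you already collect a repaired-block count per drive after each scrub; the function name, the threshold `factor`, and the drive names and counts are all made up for illustration, not tuned values.

```python
from statistics import mean


def drives_to_watch(history, factor=3.0, min_scrubs=3):
    """Return drives whose latest repair count exceeds factor x baseline.

    history: dict mapping a drive name to repaired-block counts, one per
    scrub, oldest first. The baseline is the mean of all but the latest.
    """
    flagged = []
    for drive, counts in history.items():
        if len(counts) < min_scrubs:
            continue   # not enough scrubs yet to form a baseline
        baseline = mean(counts[:-1])
        latest = counts[-1]
        # Treat a baseline below 1 as 1, so a clean history still needs
        # more than `factor` repairs before the drive is flagged.
        if latest > factor * max(baseline, 1):
            flagged.append(drive)
    return flagged


# Hypothetical scrub history: repaired blocks per scrub, oldest first.
history = {
    "sda": [0, 1, 0, 2],
    "sdb": [0, 1, 2, 14],
    "sdc": [5],
}
watch = drives_to_watch(history)
```

With this sample history only `sdb` is flagged: its latest scrub repaired far more blocks than its earlier average, while `sda` stays within its baseline and `sdc` has too little history to judge.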
