Filesystem Data Integrity: Detecting Silent Corruption

Understand how modern filesystems use checksums to detect silent data corruption that traditional filesystems miss entirely.

The Silent Corruption Problem

Traditional filesystems like ext4 and XFS have a fundamental gap: they do not checksum file contents. Both can checksum their own metadata, but for data blocks they trust the storage layer completely. If a disk returns corrupted data, the filesystem serves it to your application, no questions asked.

This corruption happens more often than you'd expect:

  • Bit rot: Cosmic rays and magnetic decay flip bits over time
  • Firmware bugs: RAID controllers and SSDs sometimes return wrong data
  • Misdirected writes: Data written to the wrong block location
  • Memory errors: Corruption during DMA transfers (without ECC RAM)

The worst part? These are silent failures. The disk reports success; the filesystem sees no error. Your data is corrupted, but nobody knows.

The Checksum Solution

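Modern filesystems such as ZFS and Btrfs solve this by computing a checksum of every block and storing it separately from the data. On every read, they verify that the checksum matches. The default checksum is usually a fast non-cryptographic one (CRC32C in Btrfs, Fletcher-4 in ZFS), with cryptographic hashes such as SHA-256 available as an option. APFS is a partial case: it checksums its own metadata but not file contents.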

Reading a file after silent disk corruption:

  1. Look up metadata → Block 5280, checksum: abc123
  2. Fetch Block 5280 → corrupted data (bit flipped)
  3. Verify: sha256(data) = xyz789 ≠ abc123
  4. Return I/O ERROR (corruption detected!)

With checksums: The filesystem computes a hash of the fetched data and compares it to the stored checksum. A mismatch means corruption, so the filesystem returns an error rather than bad data. If a redundant copy exists (RAID or a mirror), it can read that copy instead and self-heal.

The key insight: store the checksum in the parent metadata, not alongside the data. A single corrupted or misdirected block then cannot also damage the checksum that would detect it.
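The read path described above can be sketched in a few lines of Python. This is a toy model, not how any real filesystem is implemented: `ToyStore`, `ChecksumError`, and the dictionary-backed "disk" are hypothetical names invented for illustration, and SHA-256 is used only because the walk-through above uses it.

```python
import hashlib


class ChecksumError(OSError):
    """Raised when a block's contents do not match its stored checksum."""


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class ToyStore:
    """Hypothetical in-memory model: data blocks and metadata live apart."""

    def __init__(self):
        self.blocks = {}    # block number -> raw bytes (the "disk")
        self.metadata = {}  # file name -> (block number, checksum)

    def write(self, name: str, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data
        # The checksum goes into the parent metadata, not next to the data.
        self.metadata[name] = (block_no, checksum(data))

    def read(self, name: str) -> bytes:
        block_no, expected = self.metadata[name]
        data = self.blocks[block_no]
        if checksum(data) != expected:
            raise ChecksumError(f"block {block_no}: checksum mismatch")
        return data


store = ToyStore()
store.write("report.txt", 5280, b"quarterly numbers")

# Simulate silent corruption: flip one bit on "disk", leave metadata alone.
damaged = bytearray(store.blocks[5280])
damaged[0] ^= 0x01
store.blocks[5280] = bytes(damaged)

try:
    store.read("report.txt")
except ChecksumError as exc:
    print("detected:", exc)
```

Because the checksum sits in `metadata` rather than in the block itself, the bit flip in `blocks` cannot hide itself by damaging its own checksum.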

Self-Healing with Redundancy

Detection is only half the solution. With RAID or mirroring, checksum filesystems can actually repair corruption:

  1. Read block from Disk 1 → checksum mismatch (corrupted)
  2. Read same block from Disk 2 → checksum matches (good copy)
  3. Return good data to application
  4. Overwrite corrupted block on Disk 1 with good data
  5. Log: "1 block repaired"

This happens transparently—your application never sees an error because the filesystem healed itself.
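The five steps above could look roughly like this in Python. It is a minimal sketch under stated assumptions: each "disk" is a plain dictionary, the function name `read_with_repair` is hypothetical, and CRC32 from the standard library stands in for whatever checksum a real filesystem uses.

```python
import zlib


def read_with_repair(mirrors, block_no, expected_crc):
    """Try each mirror in turn; on the first valid copy, rewrite the
    mirrors that failed before it and return (data, blocks_repaired).

    mirrors: list of dicts mapping block number -> bytes (stand-in disks).
    """
    failed = []
    for disk in mirrors:
        data = disk[block_no]
        if zlib.crc32(data) == expected_crc:
            # Good copy found: overwrite every bad copy seen so far.
            for bad_disk in failed:
                bad_disk[block_no] = data
            return data, len(failed)
        failed.append(disk)
    raise OSError(f"block {block_no}: no mirror holds a valid copy")


payload = b"important data"
disk1 = {7: b"importent data"}  # one corrupted byte
disk2 = {7: payload}            # good copy

data, repaired = read_with_repair([disk1, disk2], 7, zlib.crc32(payload))
print(f"{repaired} block repaired")
```

After the call, `disk1[7]` holds the good data again and the caller only ever sees `payload`. If every mirror fails verification, the sketch raises an error instead of returning bad data.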

Scrubbing: Proactive Detection

Corruption that isn't read stays hidden. Scrubbing reads every block to find problems before you need the data:

Scrub: Read all 819,200 blocks → Verify checksums → Repair if possible
Result: Found 2 corruptions, repaired both from mirror

Run scrubs monthly for normal data, weekly for critical data. The goal is to find bit rot while a good copy still exists to repair from.
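A scrub is essentially the verify-and-repair read applied to every block, whether or not anyone asked for it. The following Python sketch illustrates the idea under simplifying assumptions: disks are dictionaries, checksums are CRC32 values held in a separate table, and the names `scrub` and `flip_first_bit` are hypothetical helpers written for this example.

```python
import zlib


def scrub(mirrors, checksums):
    """Verify every block on every mirror; repair bad copies from a good one."""
    report = {"checked": 0, "repaired": 0, "unrecoverable": 0}
    for block_no, expected in checksums.items():
        report["checked"] += 1
        valid = [d[block_no] for d in mirrors if zlib.crc32(d[block_no]) == expected]
        if not valid:
            # Every copy is bad: detection without redundancy cannot repair.
            report["unrecoverable"] += 1
            continue
        for disk in mirrors:
            if zlib.crc32(disk[block_no]) != expected:
                disk[block_no] = valid[0]
                report["repaired"] += 1
    return report


def flip_first_bit(data: bytes) -> bytes:
    """Simulate bit rot by flipping the lowest bit of the first byte."""
    return bytes([data[0] ^ 0x01]) + data[1:]


blocks = {n: bytes([n]) * 16 for n in range(8)}
checksums = {n: zlib.crc32(b) for n, b in blocks.items()}
disk_a, disk_b = dict(blocks), dict(blocks)

disk_a[3] = flip_first_bit(disk_a[3])  # rot on one mirror
disk_b[5] = flip_first_bit(disk_b[5])  # rot on the other, different block

result = scrub([disk_a, disk_b], checksums)
print(result)
```

Both corruptions are repairable here only because each damaged block still has a good copy on the other mirror. Had the same block rotted on both disks before the scrub ran, it would be counted as unrecoverable, which is the argument for scrubbing on a schedule.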

Filesystem Comparison

Filesystem   Data Checksums       Self-Healing   Scrubbing
ext4         No                   No             No
XFS          No                   No             No
NTFS         No                   No             No
Btrfs        Yes                  With RAID      Yes
ZFS          Yes                  With RAID      Yes
APFS         No (metadata only)   No             No

The Cost

Checksums aren't free, but the overhead is minimal:

  • CPU: 1-5% for verification (negligible with modern CPUs)
  • Space: ~0.1-0.5% for checksum storage
  • Latency: Hidden by disk I/O time

For most workloads, you won't notice. For the rare cases where it matters (high-frequency trading), you can disable data checksums while keeping metadata protected, for example with the nodatasum mount option in Btrfs or the checksum=off dataset property in ZFS.
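The space figure follows from simple arithmetic: checksum size divided by the size of the block it covers. The sketch below works through a few combinations. The block and checksum sizes are illustrative assumptions rather than measurements of any particular filesystem, and the calculation ignores the overhead of the metadata structures that hold the checksums.

```python
def checksum_overhead(block_size_bytes: int, checksum_bytes: int) -> float:
    """Fraction of extra space needed to store one checksum per block."""
    return checksum_bytes / block_size_bytes


cases = [
    ("4-byte CRC per 4 KiB block", 4096, 4),
    ("32-byte hash per 4 KiB block", 4096, 32),
    ("32-byte hash per 128 KiB block", 128 * 1024, 32),
]

for label, block_size, csum_size in cases:
    print(f"{label}: {checksum_overhead(block_size, csum_size):.3%}")
```

The overhead shrinks with larger blocks and grows with wider checksums, so a small CRC on small blocks and a full hash on large blocks both land at or below the low end of the range quoted above, while a full hash on small blocks costs more.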

When Integrity Matters Most

Critical use cases:

  • Long-term archival storage (photos, documents, backups)
  • Databases and financial records
  • Scientific data and research
  • Any data you can't recreate

The uncomfortable truth: If you're using ext4 or XFS without additional protection, you're trusting that cosmic rays, firmware bugs, and disk aging will never corrupt your data. On a long enough timeline, they will.

Practical Recommendations

  1. New storage systems: Use ZFS or Btrfs with mirroring for automatic detection and repair
  2. Existing ext4/XFS: Consider dm-integrity for block-level checksums
  3. Critical data: Use RAID + checksums + regular scrubs
  4. Backups: Scrub before backup to ensure you're not backing up corrupted data
  5. Servers: Use ECC RAM to protect data in memory during transfer