Filesystem Data Integrity: Detecting Silent Corruption

Understand how modern filesystems use checksums to detect silent data corruption that traditional filesystems miss entirely.

The Silent Corruption Problem

Traditional filesystems like ext4 and XFS have a fundamental limitation: they trust the storage layer to return file contents intact. Recent versions checksum their own metadata, but not your data. If a disk returns corrupted file data, the filesystem serves it to your application, no questions asked.

This corruption happens more often than you'd expect:

  • Bit rot: Cosmic rays and magnetic decay flip bits over time
  • Firmware bugs: RAID controllers and SSDs sometimes return wrong data
  • Misdirected writes: Data written to the wrong block location
  • Memory errors: Corruption during DMA transfers (without ECC RAM)

The worst part? These are silent failures. The disk reports success; the filesystem sees no error. Your data is corrupted, but nobody knows.

The Checksum Solution

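Modern filesystems such as ZFS and Btrfs solve this by computing a checksum of every block and storing it separately from the data. The checksum does not have to be cryptographic: ZFS defaults to Fletcher-4 and Btrfs to CRC32C, with stronger hashes such as SHA-256 available as options. On every read, they verify that the checksum matches. APFS checksums its own metadata but not file contents.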

Reading a file after silent disk corruption:

  1. Lookup metadata → Block 5280, checksum: abc123
  2. Fetch Block 5280 → corrupted data (bit flipped)
  3. Verify: sha256(data) = xyz789 ≠ abc123
  4. → Return I/O ERROR (corruption detected!)

With checksums: The filesystem computes a hash of the fetched data and compares it to the stored checksum. Mismatch means corruption—return an error rather than bad data. With RAID, it can try a mirror copy and self-heal.

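The key insight: store the checksum in the parent metadata, not alongside the data. A fault that damages a block then cannot also damage the checksum that would detect it. A misdirected write is caught too, because the block found at the expected location does not match what its parent recorded.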
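As a rough illustration of verify-on-read with the checksum held apart from the data, here is a minimal Python sketch. The `ChecksummedStore` class, its two dictionaries, and the helper names are hypothetical stand-ins for a disk and its parent metadata, and CRC32 from the standard library is used only for brevity; it is not the algorithm any particular filesystem uses by default.

```python
import zlib


class ChecksummedStore:
    """Toy block store that keeps checksums apart from the data they cover."""

    def __init__(self):
        self.blocks = {}  # block number -> bytes, standing in for the disk
        self.parent = {}  # block number -> CRC32, standing in for parent metadata

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.parent[block_no] = zlib.crc32(data)

    def read(self, block_no):
        data = self.blocks[block_no]
        # Verify on every read; refuse to hand back data that fails the check.
        if zlib.crc32(data) != self.parent[block_no]:
            raise OSError(f"checksum mismatch on block {block_no}")
        return data


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted, simulating bit rot."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def is_detected(store, block_no):
    """True if reading the block raises a checksum error."""
    try:
        store.read(block_no)
    except OSError:
        return True
    return False


store = ChecksummedStore()
store.write(5280, b"important payload")
store.write(5281, b"a neighbouring file")

clean_read_ok = not is_detected(store, 5280)

# Bit rot: one bit flips on disk after the write completed.
store.blocks[5280] = flip_bit(store.blocks[5280], 3)
bit_flip_detected = is_detected(store, 5280)

# Misdirected write: intact data lands in the wrong block, bypassing write().
store.blocks[5281] = b"data meant for block 9000"
misdirected_detected = is_detected(store, 5281)

print(clean_read_ok, bit_flip_detected, misdirected_detected)
```

The misdirected-write case is the one that a checksum stored inside the block itself would miss: the stray data is internally consistent, and only the parent's record of what should be there exposes the mismatch.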

Self-Healing with Redundancy

Detection is only half the solution. With RAID or mirroring, checksum filesystems can actually repair corruption:

  1. Read block from Disk 1 → checksum mismatch (corrupted)
  2. Read same block from Disk 2 → checksum matches (good copy)
  3. Return good data to application
  4. Overwrite corrupted block on Disk 1 with good data
  5. Log: "1 block repaired"

This happens transparently—your application never sees an error because the filesystem healed itself.
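The repair steps above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any real filesystem implements it: each mirror is modelled as a plain dictionary, the function name `read_with_repair` and the log format are made up for the example, and CRC32 stands in for the filesystem's checksum.

```python
import zlib


def read_with_repair(mirrors, expected_crc, block_no, log):
    """Return the first copy that matches the checksum and fix the bad ones."""
    good = None
    bad_disks = []
    for disk_id, disk in enumerate(mirrors):
        data = disk[block_no]
        if zlib.crc32(data) == expected_crc:
            good = data
            break
        bad_disks.append(disk_id)

    if good is None:
        # No mirror holds a valid copy, so the error must reach the caller.
        raise OSError(f"block {block_no}: all copies failed verification")

    # Overwrite every copy that failed verification with the good data.
    for disk_id in bad_disks:
        mirrors[disk_id][block_no] = good
        log.append(f"block {block_no} repaired on disk {disk_id}")
    return good


payload = b"ledger entry 42"
disk1 = {7: b"ledger entry 43"}  # silently corrupted copy
disk2 = {7: payload}             # intact copy
log = []

result = read_with_repair([disk1, disk2], zlib.crc32(payload), 7, log)
print(result, log)
```

Note that the sketch stops at the first good copy, so a bad copy on a later mirror would go unnoticed until a scrub or a later read reaches it.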

Scrubbing: Proactive Detection

Corruption that isn't read stays hidden. Scrubbing reads every block to find problems before you need the data:

Scrub: Read all 819,200 blocks → Verify checksums → Repair if possible
Result: Found 2 corruptions, repaired both from mirror

Run scrubs monthly for normal data, weekly for critical data. The goal is to find bit rot while a good copy still exists to repair from, before the redundant copy degrades as well.
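A scrub is essentially the read-verify-repair loop applied to every block instead of only the ones an application asks for. The following Python sketch shows the idea on a tiny two-way mirror; the `scrub` function, the report keys, and the dictionary-based disks are assumptions made for the example, with CRC32 again standing in for the real checksum.

```python
import zlib


def scrub(mirrors, checksums):
    """Verify every block on every mirror and repair from any good copy.

    'corrupt' and 'repaired' count individual copies; 'unrecoverable'
    counts blocks for which no mirror held a valid copy.
    """
    report = {"checked": 0, "corrupt": 0, "repaired": 0, "unrecoverable": 0}
    for block_no, expected in checksums.items():
        copies = [disk[block_no] for disk in mirrors]
        ok = [zlib.crc32(copy) == expected for copy in copies]
        report["checked"] += 1
        if all(ok):
            continue
        report["corrupt"] += ok.count(False)
        if any(ok):
            good = copies[ok.index(True)]
            for disk, is_ok in zip(mirrors, ok):
                if not is_ok:
                    disk[block_no] = good
                    report["repaired"] += 1
        else:
            report["unrecoverable"] += 1
    return report


# Build eight blocks, mirror them, then damage one copy on each disk.
data = {n: f"block-{n}".encode() * 4 for n in range(8)}
checksums = {n: zlib.crc32(d) for n, d in data.items()}
disk_a = dict(data)
disk_b = dict(data)
disk_a[2] = b"\x00" + disk_a[2][1:]
disk_b[5] = disk_b[5][:-1] + b"?"

report = scrub([disk_a, disk_b], checksums)
print(report)
```

Because the two damaged copies sit on different blocks, each one still has an intact partner to repair from. Had the same block been damaged on both disks, the scrub could only report it as unrecoverable.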

Filesystem Comparison

Filesystem   Data Checksums        Self-Healing   Scrubbing
ext4         No                    No             No
XFS          No                    No             No
NTFS         No                    No             No
Btrfs        Yes                   With RAID      Yes
ZFS          Yes                   With RAID      Yes
APFS         No (metadata only)    No             No

The Cost

Checksums aren't free, but the overhead is minimal:

  • CPU: 1-5% for verification (negligible with modern CPUs)
  • Space: ~0.1-0.5% for checksum storage
  • Latency: Hidden by disk I/O time

For most workloads, you won't notice. For the rare cases where it matters (high-frequency trading), you can disable data checksums while keeping metadata protected, for example with the checksum=off dataset property in ZFS or the nodatasum mount option in Btrfs.
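The space figure follows from simple arithmetic: checksum size divided by the size of the block it covers. The short Python sketch below works through a few pairings. The sizes are illustrative assumptions rather than measurements, and the calculation ignores the metadata structures that hold the checksums, so real overhead will differ somewhat.

```python
def checksum_overhead_percent(checksum_bytes, block_bytes):
    """Space used by one checksum as a percentage of the block it covers."""
    return 100.0 * checksum_bytes / block_bytes


KIB = 1024

# (label, checksum size in bytes, block size in bytes); illustrative pairings only
examples = [
    ("32-bit checksum per 4 KiB block", 4, 4 * KIB),
    ("256-bit checksum per 8 KiB block", 32, 8 * KIB),
    ("256-bit checksum per 128 KiB block", 32, 128 * KIB),
]

for label, checksum_bytes, block_bytes in examples:
    percent = checksum_overhead_percent(checksum_bytes, block_bytes)
    print(f"{label}: {percent:.3f}%")
```

Larger blocks amortize the checksum over more data, which is why the overhead shrinks as block size grows.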

When Integrity Matters Most

Critical use cases:

  • Long-term archival storage (photos, documents, backups)
  • Databases and financial records
  • Scientific data and research
  • Any data you can't recreate

The uncomfortable truth: If you're using ext4 or XFS without additional protection, you're trusting that cosmic rays, firmware bugs, and disk aging will never corrupt your data. On a long enough timeline, they will.

Practical Recommendations

  1. New storage systems: Use ZFS or Btrfs with mirroring for automatic detection and repair
  2. Existing ext4/XFS: Consider dm-integrity for block-level checksums (this detects corruption; layer it under RAID if you also want repair)
  3. Critical data: Use RAID + checksums + regular scrubs
  4. Backups: Scrub before backup to ensure you're not backing up corrupted data
  5. Servers: Use ECC RAM to protect data in memory during transfer

Related Topics

  • Copy-on-Write - Enables atomic checksum updates
  • ZFS - End-to-end checksums with self-healing
  • Btrfs - Linux-native checksums and scrubbing
  • Snapshots - Point-in-time recovery
