The Silent Data Corruption Problem
Traditional filesystems (ext4, XFS, FAT) have a critical flaw: they trust the storage layer. If a disk returns corrupted data, the filesystem serves it—no questions asked.
Sources of corruption:
- Bit rot: Cosmic rays, magnetic decay, aging media
- Buggy firmware: RAID controller errors, SSD bugs
- Silent failures: Disk returns wrong data (no error reported)
- Memory errors: Corrupted during transfer (no ECC RAM)
- Misdirected writes: Wrong block written (cache/firmware bugs)
The problem: Traditional filesystems detect corruption only during reads—and often not even then.
Modern Integrity Solutions
Checksum-based filesystems (ZFS, Btrfs, ReFS) solve this with:
- End-to-End Checksums: Verify data from disk to application
- Self-Healing: Automatic corruption repair (with redundancy)
- Scrubbing: Proactive corruption detection
- Metadata Protection: Checksums for all metadata too
Checksum Detection: Finding Silent Corruption
Step 1: Initial Write (Checksum Computed)
What's happening:
- Application writes PDF data (128 KB)
- Filesystem computes a checksum: sha256(data) = abc123def456
- Data is written to Block 5280
- The checksum is stored in the PARENT metadata (not with the data)
- This separation ensures corruption can't hide (see the sketch below)
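To make the separation concrete, here is a minimal shell sketch that mimics the layout with ordinary files; `block.dat` and `parent.meta` are illustrative names, not anything ZFS actually creates:

```bash
# Write a "data block" and keep its checksum in a separate "parent" file
dd if=/dev/urandom of=block.dat bs=128K count=1
sha256sum block.dat | awk '{print $1}' > parent.meta
```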
Checksum Mechanisms
ZFS Checksums
Every block checksummed:
```
Data Block: [file data, 128KB]
Checksum:   sha256(data)
Location:   stored in parent metadata (NOT with the data)
```
Why parent storage?
- Corruption can't affect its own checksum
- Read path: Fetch metadata (checksum) → Fetch data → Verify
- Mismatch = Corruption detected
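Continuing the toy sketch from above, a read recomputes the checksum and compares it against the stored value; any mismatch signals corruption:

```bash
# Verify on read: recompute and compare against the stored checksum
stored=$(cat parent.meta)
actual=$(sha256sum block.dat | awk '{print $1}')
if [ "$stored" = "$actual" ]; then
    echo "checksum OK - data returned to application"
else
    echo "CHECKSUM MISMATCH - corruption detected"
fi
```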
Checksum algorithms:
- fletcher2: Fast, weak (legacy)
- fletcher4: Fast, good (default)
- sha256: Strong, slower (critical data)
- sha512: Strongest, slowest
Configure per dataset:
```bash
zfs set checksum=sha256 tank/important
zfs set checksum=fletcher4 tank/bulk   # the default
```
Btrfs Checksums
Data and metadata checksummed:
```
Checksum:     crc32c (default)
Alternatives: xxhash, sha256, blake2b
Location:     stored in the parent tree node
```
Configurable at creation time:

```bash
# Set the checksum algorithm at mkfs time
mkfs.btrfs --checksum xxhash /dev/sda1
```

The algorithm is fixed when the filesystem is created; unlike ZFS, it cannot be changed per file or per dataset afterwards.
Nodatasum option:

```bash
# Disable checksums for specific files (faster, but no protection).
# chattr +C disables CoW, which on Btrfs also disables checksums;
# it only takes effect for new or empty files.
chattr +C /var/lib/mysql/data
```
ext4 Metadata Checksums
ext4 has limited checksums (metadata only):
```bash
# Enable metadata checksums at format time
mkfs.ext4 -O metadata_csum /dev/sda1

# Or convert an existing (unmounted) filesystem
tune2fs -O metadata_csum /dev/sda1   # requires an e2fsck run afterwards
```
What's protected:
- Superblock
- Group descriptors
- Inode tables
- Directory entries
- Journal
What's NOT protected:
- File data (no data checksums!)
- Can't detect silent data corruption
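To confirm whether an existing ext4 filesystem has the feature enabled (device path illustrative):

```bash
# Print the superblock header and inspect the feature list
dumpe2fs -h /dev/sda1 | grep -i 'features'
# Look for "metadata_csum" among the listed features
```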
Corruption Detection Flow
Read Path with Checksums
Traditional filesystem (ext4):
```
1. Application: read(file, offset)
2. Filesystem:  look up block number
3. Disk:        return block data
4. Filesystem:  return data to application

❌ No verification - corrupt data is silently served
```
Checksum filesystem (ZFS/Btrfs):
```
1. Application: read(file, offset)
2. Filesystem:  look up block + checksum
3. Disk:        return block data
4. Filesystem:  compute checksum of the data
5. Compare computed vs stored checksum
   ✅ Match    → return data
   ❌ Mismatch → corruption detected!
6. If a redundant copy exists:
   - try the mirror/parity copy
   - verify its checksum
   - return the good copy
   - repair the corrupted copy
```
Write Path with Checksums
ZFS/Btrfs write:
```
1. Application: write(data)
2. Filesystem:  compute checksum(data)
3. Write data to a new location (CoW)
4. Update parent metadata with:
   - pointer to the new data block
   - checksum value
5. Commit transaction (atomic)
```
Integrity guarantee:
- Checksum stored BEFORE data is referenced
- Corruption during write detected on next read
- Old data preserved (CoW) until commit
Self-Healing
Requirements for Self-Healing
Need redundancy:
- RAID-1/10: Mirror copies
- RAID-5/6: Parity reconstruction
- ZFS RAID-Z: Parity with checksums
- Btrfs RAID: Mirror or RAID-5/6
Self-healing flow:
```
1. Read block from disk 1
2. Checksum mismatch → corruption!
3. Try the mirror copy (disk 2)
4. Checksum matches → good copy found
5. Repair the corrupted block:
   - write the good data back to disk 1
   - verify the checksum
6. Log the repair: "Corrected 1 block"
```
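The same logic can be sketched in shell, reusing `parent.meta` from the earlier toy example; `copy1.dat` and `copy2.dat` stand in for the two halves of a mirror:

```bash
# Set up two "mirror" copies of the earlier data block (illustrative names)
cp block.dat copy1.dat
cp block.dat copy2.dat

good=$(cat parent.meta)
if [ "$(sha256sum copy1.dat | awk '{print $1}')" != "$good" ]; then
    echo "copy1 corrupt - trying the mirror"
    if [ "$(sha256sum copy2.dat | awk '{print $1}')" = "$good" ]; then
        cp copy2.dat copy1.dat   # write the good data back over the bad copy
        echo "repaired copy1 from copy2"
    else
        echo "no good copy found - unrecoverable"
    fi
fi
```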
ZFS Self-Healing
Automatic on every read:
```bash
# Read a file - healing is automatic if corruption is found
cat /tank/data/file.txt
# ZFS detects the corruption and repairs it from mirror/parity

# Check healing stats
zpool status -v tank   # the scan line reports how much was repaired
```
Scrub for proactive healing:
```bash
# Read and verify EVERYTHING
zpool scrub tank

# Monitor progress
zpool status tank
# scan: scrub in progress, 45% done
```
Btrfs Self-Healing
Automatic on read (with RAID):
```bash
# Reading a corrupted file
cat /mnt/btrfs/file
# Btrfs detects the corruption and repairs it from the mirror

# Check error counters
btrfs device stats /mnt   # corruption/repair counts per device
```
Scrub for proactive healing:
```bash
# Scrub all data and metadata
btrfs scrub start /mnt

# Monitor progress
btrfs scrub status /mnt
# Shows errors found and corrected
```
Scrubbing: Proactive Verification
What Is Scrubbing?
Scrub = Read every block, verify checksums, repair corruption
Purpose:
- Find corruption before you need the data
- Detect bit rot early (before it spreads)
- Verify RAID parity consistency
- Background integrity maintenance
ZFS Scrubbing
Manual scrub:
```bash
# Start a scrub
zpool scrub tank

# Check status
zpool status tank
#   scan: scrub in progress since Sun Jan 12 12:00:00 2025
#         45.2G scanned at 1.5G/s, 12.1G to go
#         0 repaired, 78.9% done

# Stop a running scrub (if needed)
zpool scrub -s tank
```
Automatic scrubbing (recommended):
```bash
# Weekly scrub via systemd timer (units ship with OpenZFS >= 2.1.3)
systemctl enable --now zfs-scrub-weekly@tank.timer

# Or via cron
0 2 * * 0 zpool scrub tank   # every Sunday at 2 AM
```
Scrub results:
```bash
zpool status -v tank
# errors: No known data errors                                        ✅
# OR
# errors: Permanent errors have been detected in the following files:
#         /tank/important/file.txt                                    ❌
#         (corrupted block, no redundancy to repair)
```
Btrfs Scrubbing
Manual scrub:
```bash
# Start a scrub
btrfs scrub start /mnt

# Monitor
btrfs scrub status /mnt
#   Scrub started: Sun Jan 12 12:00:00 2025
#   Status:        running
#   Total to scan: 100GB

# Detailed per-device stats once it completes
btrfs scrub status -d /mnt
```
Automatic scrubbing:
```bash
# Monthly scrub via systemd (timer units ship with some btrfs-progs
# packages; the instance name is the systemd-escaped mount path)
systemctl enable --now btrfs-scrub@mnt.timer

# Or via cron
0 3 1 * * btrfs scrub start /mnt   # monthly
```
Scrub results:
```bash
btrfs scrub status -d /mnt
# Per-device:
#   Data extents scrubbed: 12345
#   Checksum errors:       10
#   Corrected errors:      10   ✅
#   Uncorrectable errors:  0
```
Corruption Types and Detection
Detectable Corruption
With checksums (ZFS/Btrfs):
- ✅ Bit flips in data
- ✅ Bit flips in metadata
- ✅ Misdirected writes (block written to wrong location)
- ✅ Torn writes (partial block write)
- ✅ Firmware bugs returning wrong data
- ✅ Memory corruption during transfer
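You can reproduce detection safely on a loopback file. This is a scratch demo only: paths and the corruption offset are arbitrary, and you may need a different offset (or several attempts) to actually hit the file's extents. Never point dd at a real disk:

```bash
# Build a throwaway Btrfs filesystem inside a file
truncate -s 1G /tmp/btrfs.img
mkfs.btrfs /tmp/btrfs.img
mkdir -p /mnt/test && mount -o loop /tmp/btrfs.img /mnt/test
dd if=/dev/urandom of=/mnt/test/file bs=1M count=64
umount /mnt/test

# Corrupt the raw image behind the filesystem's back
dd if=/dev/urandom of=/tmp/btrfs.img bs=1 count=64 seek=500000000 conv=notrunc

mount -o loop /tmp/btrfs.img /mnt/test
cat /mnt/test/file > /dev/null   # fails with an I/O error if file data was hit
dmesg | tail                     # Btrfs logs checksum-failure messages
```

With a single device and no redundancy, Btrfs can detect the corruption but not repair it, which is exactly the detect-only case described above.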
Undetectable Corruption
Even with checksums:
- ❌ Corruption during write (before the checksum is computed)
  - Mitigation: ECC RAM
- ❌ Application writes wrong data
  - Mitigation: application-level checksums (see the manifest example below)
- ❌ Encryption key corruption
  - Mitigation: key backup and verification
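For the application-level gap, a simple hash manifest provides an end-to-end check above the filesystem (paths illustrative):

```bash
# Record the hashes once, at a time the data is known good
sha256sum /data/important/* > /root/SHA256SUMS

# Verify later, independent of any filesystem guarantees
sha256sum -c /root/SHA256SUMS
```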
Unrecoverable Corruption
Corruption detected but can't repair:
```
1. Read block: checksum mismatch
2. Try redundant copy: also corrupted (or doesn't exist)
3. Try parity reconstruction: parity also corrupted
4. Result: permanent data loss
```
ZFS response:
```bash
zpool status -v tank
# errors: Permanent errors have been detected in the following files:
#         /tank/data/important.txt
```
Btrfs response:
```bash
# The read returns an I/O error
# dmesg shows: "checksum error, no good copy found"
```
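When this happens, the only fix is a backup. One plausible ZFS recovery sequence (pool and file names from the example above):

```bash
zpool status -v tank   # note the files with permanent errors
# restore /tank/data/important.txt from backup, then:
zpool clear tank       # reset the pool's error counters
zpool scrub tank       # confirm no further errors remain
```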
Checksum Overhead
Performance Impact
Read path:
- Checksum verification: ~1-5% CPU overhead
- Modern CPUs with hardware support (SSE4.2 for crc32c, SHA extensions for sha256): negligible
- Usually bottlenecked by disk, not checksum
Write path:
- Checksum computation: ~2-10% CPU overhead
- Depends on algorithm (fletcher4 < sha256)
- Often hidden by disk latency
Benchmarks:
```
No checksum (ext4):  1000 MB/s read
ZFS (fletcher4):      980 MB/s read  (-2%)
ZFS (sha256):         920 MB/s read  (-8%)
Btrfs (crc32c):       990 MB/s read  (-1%)

Bottleneck: usually disk speed, not the checksum
```
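These figures vary with hardware; a rough way to compare on your own system is a cold-cache sequential read (paths illustrative, run as root):

```bash
# Drop the page cache so the read actually hits the disk
sync && echo 3 > /proc/sys/vm/drop_caches
dd if=/tank/bigfile of=/dev/null bs=1M status=progress
```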
Space Overhead
Checksum storage:
- ZFS: 1/1024 blocks (~0.1% overhead)
- Btrfs: Stored in metadata (~0.5% overhead)
- ext4 metadata_csum: less than 1% for metadata only
Negligible space cost for significant protection
Comparison: Integrity Features
| Filesystem | Data Checksums | Metadata Checksums | Self-Healing | Scrubbing |
|---|---|---|---|---|
| ext4 | ❌ No | ✅ Optional | ❌ No | ❌ No |
| XFS | ❌ No | ✅ Yes | ❌ No | ❌ No |
| Btrfs | ✅ Yes | ✅ Yes | ✅ With RAID | ✅ Yes |
| ZFS | ✅ Yes | ✅ Yes | ✅ With RAID | ✅ Yes |
| NTFS | ❌ No | ❌ No | ❌ No | ❌ No |
| APFS | ❌ No | ✅ Yes | ❌ No | ❌ No |
Integrity leaders: ZFS and Btrfs (APFS checksums metadata only, not file data)
Best Practices
1. Use Checksums
For critical data:
```bash
# ZFS: use strong checksums for important datasets
zfs set checksum=sha256 tank/important

# Btrfs: checksums are enabled by default
mkfs.btrfs /dev/sda1   # crc32c enabled
```
2. Enable Redundancy
Checksums detect corruption, redundancy repairs it:
```bash
# ZFS: RAID-Z or mirror
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc

# Btrfs: RAID1 or RAID10
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
```
3. Regular Scrubbing
Monthly for normal use, weekly for critical:
```bash
# ZFS monthly scrub (units ship with OpenZFS >= 2.1.3)
systemctl enable zfs-scrub-monthly@tank.timer

# Btrfs weekly scrub via cron
0 3 * * 0 btrfs scrub start /mnt
```
4. Monitor Errors
Check for corruption regularly:
```bash
# ZFS
zpool status -v tank | grep -i error

# Btrfs
btrfs device stats /mnt
```
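A minimal cron-able health check might look like the following; the mail address and mount point are placeholders:

```bash
#!/bin/sh
# Alert if any ZFS pool is unhealthy
if ! zpool status -x | grep -q 'all pools are healthy'; then
    zpool status -v | mail -s "ZFS errors on $(hostname)" admin@example.com
fi

# btrfs device stats --check exits non-zero if any error counter is non-zero
btrfs device stats --check /mnt || echo "btrfs device errors on /mnt"
```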
5. Use ECC RAM
Protect in-memory data:
- Checksums protect on-disk data
- ECC RAM protects in-memory data
- Recommended for ZFS/Btrfs servers
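To check whether a machine actually has ECC RAM (requires root; the EDAC counters appear only when a kernel EDAC driver is loaded):

```bash
# DMI reports the module's error-correction type
dmidecode --type memory | grep -i 'error correction'

# Corrected-error counts, if EDAC is active
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
```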
6. Test Restores
Verify backups can detect corruption:
```bash
# Scrub before backing up
zpool scrub tank
# Wait for completion, then back up
# (ensures the backup contains no silent corruption)
```
Limitations
What Checksums Don't Protect
- Application-level corruption: the app writes wrong data
  - Solution: application checksums (e.g., database page checksums)
- Corruption during write: data corrupted before the checksum is computed
  - Solution: ECC RAM
- No redundancy: corruption can be detected but not repaired
  - Solution: RAID or replication
- Complete disk failure: all copies lost
  - Solution: offsite backups
Performance Considerations
When to disable checksums:
- Never for metadata (always checksum metadata)
- Rarely for data (only if proven bottleneck)
- Databases: may already checksum their own pages; if you disable filesystem checksums, rely on those or on DM-Integrity
Disable data checksums (Btrfs):
```bash
# Per-file (also disables CoW; takes effect on new/empty files only)
chattr +C /var/lib/mysql/data

# Or as a mount option (entire filesystem)
mount -o nodatasum /dev/sda1 /mnt
```
Advanced: DM-Integrity (ext4/XFS)
Device-mapper integrity for non-checksum filesystems:
```bash
# Create the integrity device
integritysetup format /dev/sda1
integritysetup open /dev/sda1 integrity-dev

# Format with ext4
mkfs.ext4 /dev/mapper/integrity-dev

# Mount
mount /dev/mapper/integrity-dev /mnt
```
Provides:
- Block-level checksums (below filesystem)
- Works with any filesystem
- Performance: 10-30% overhead
- See: `man integritysetup`
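To inspect the mapping afterwards (device name from the example above):

```bash
integritysetup status integrity-dev   # shows tag size, mode, underlying device
dmesg | grep -i integrity             # kernel logs detected checksum mismatches
```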
Related Concepts
- Copy-on-Write: Enables atomic checksum updates
- Snapshots: Immutable copies for data protection
- ZFS: End-to-end checksums and self-healing
- Btrfs: Checksums and scrubbing
- RAID: Redundancy for self-healing
Key Takeaways
- Silent Corruption: Traditional filesystems serve corrupted data unknowingly
- Checksums: Detect corruption at read time (ZFS, Btrfs, APFS)
- Self-Healing: Automatic repair with redundancy (RAID)
- Scrubbing: Proactive verification (find corruption early)
- Overhead: Minimal (~1-5% CPU, less than 1% space)
- Best Practice: Checksums + Redundancy + Scrubbing + ECC RAM
- Limitations: Can't fix corruption without redundancy
- ext4/XFS: Use DM-Integrity for block-level checksums
