The Silent Data Corruption Problem
Traditional filesystems (ext4, XFS, FAT) have a critical flaw: they trust the storage layer. If a disk returns corrupted data, the filesystem serves it—no questions asked.
Sources of corruption:
- Bit rot: Cosmic rays, magnetic decay, aging media
- Buggy firmware: RAID controller errors, SSD bugs
- Silent failures: Disk returns wrong data (no error reported)
- Memory errors: Corrupted during transfer (no ECC RAM)
- Misdirected writes: Wrong block written (cache/firmware bugs)
The problem: Traditional filesystems detect corruption only during reads—and often not even then.
Modern Integrity Solutions
Checksum-based filesystems (ZFS, Btrfs, ReFS) solve this with:
- End-to-End Checksums: Verify data from disk to application
- Self-Healing: Automatic corruption repair (with redundancy)
- Scrubbing: Proactive corruption detection
- Metadata Protection: Checksums for all metadata too
Checksum Detection: Finding Silent Corruption
Step 1: Initial Write (Checksum Computed)
What's happening:
- Application writes PDF data (128 KB)
- Filesystem computes a checksum: sha256(data) = abc123def456
- Data is written to Block 5280
- The checksum is stored in the PARENT metadata (not with the data)
- This separation ensures corruption can't hide (see the sketch below)
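To make the separation concrete, here is a minimal shell sketch that mimics the layout with ordinary files; `block.dat` and `parent.meta` are illustrative names, not anything ZFS actually creates:

```bash
# Write a "data block" and keep its checksum in a separate "parent" file
dd if=/dev/urandom of=block.dat bs=128K count=1
sha256sum block.dat | awk '{print $1}' > parent.meta
```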
Checksum Mechanisms
ZFS Checksums
Every block checksummed:
```
Data Block: [file data, 128KB]
Checksum:   sha256(data)
Location:   stored in parent metadata (NOT with the data)
```
Why parent storage?
- Corruption can't affect its own checksum
- Read path: Fetch metadata (checksum) → Fetch data → Verify
- Mismatch = Corruption detected
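Continuing the toy sketch from above, a read recomputes the checksum and compares it against the stored value; any mismatch signals corruption:

```bash
# Verify on read: recompute and compare against the stored checksum
stored=$(cat parent.meta)
actual=$(sha256sum block.dat | awk '{print $1}')
if [ "$stored" = "$actual" ]; then
    echo "checksum OK - data returned to application"
else
    echo "CHECKSUM MISMATCH - corruption detected"
fi
```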
Checksum algorithms:
- fletcher2: Fast, weak (legacy)
- fletcher4: Fast, good (default)
- sha256: Strong, slower (critical data)
- sha512: Strongest, slowest
Configure per dataset:
```bash
zfs set checksum=sha256 tank/important
zfs set checksum=fletcher4 tank/bulk   # the default
```
Btrfs Checksums
Data and metadata checksummed:
```
Checksum:     crc32c (default)
Alternatives: xxhash, sha256, blake2b
Location:     stored in the parent tree node
```
Configurable at creation time:

```bash
# Set the checksum algorithm at mkfs time
mkfs.btrfs --checksum xxhash /dev/sda1
```

The algorithm is fixed when the filesystem is created; unlike ZFS, it cannot be changed per file or per dataset afterwards.
Nodatasum option:

```bash
# Disable checksums for specific files (faster, but no protection).
# chattr +C disables CoW, which on Btrfs also disables checksums;
# it only takes effect for new or empty files.
chattr +C /var/lib/mysql/data
```
ext4 Metadata Checksums
ext4 has limited checksums (metadata only):
```bash
# Enable metadata checksums at format time
mkfs.ext4 -O metadata_csum /dev/sda1

# Or convert an existing (unmounted) filesystem
tune2fs -O metadata_csum /dev/sda1   # requires an e2fsck run afterwards
```
What's protected:
- Superblock
- Group descriptors
- Inode tables
- Directory entries
- Journal
What's NOT protected:
- File data (no data checksums!)
- Can't detect silent data corruption
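To confirm whether an existing ext4 filesystem has the feature enabled (device path illustrative):

```bash
# Print the superblock header and inspect the feature list
dumpe2fs -h /dev/sda1 | grep -i 'features'
# Look for "metadata_csum" among the listed features
```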
Corruption Detection Flow
Read Path with Checksums
Traditional filesystem (ext4):
```
1. Application: read(file, offset)
2. Filesystem:  look up block number
3. Disk:        return block data
4. Filesystem:  return data to application

❌ No verification - corrupt data is silently served
```
Checksum filesystem (ZFS/Btrfs):
```
1. Application: read(file, offset)
2. Filesystem:  look up block + checksum
3. Disk:        return block data
4. Filesystem:  compute checksum of the data
5. Compare computed vs stored checksum
   ✅ Match    → return data
   ❌ Mismatch → corruption detected!
6. If a redundant copy exists:
   - try the mirror/parity copy
   - verify its checksum
   - return the good copy
   - repair the corrupted copy
```
Write Path with Checksums
ZFS/Btrfs write:
```
1. Application: write(data)
2. Filesystem:  compute checksum(data)
3. Write data to a new location (CoW)
4. Update parent metadata with:
   - pointer to the new data block
   - checksum value
5. Commit transaction (atomic)
```
Integrity guarantee:
- Checksum stored BEFORE data is referenced
- Corruption during write detected on next read
- Old data preserved (CoW) until commit
Self-Healing
Requirements for Self-Healing
Need redundancy:
- RAID-1/10: Mirror copies
- RAID-5/6: Parity reconstruction
- ZFS RAID-Z: Parity with checksums
- Btrfs RAID: Mirror or RAID-5/6
Self-healing flow:
```
1. Read block from disk 1
2. Checksum mismatch → corruption!
3. Try the mirror copy (disk 2)
4. Checksum matches → good copy found
5. Repair the corrupted block:
   - write the good data back to disk 1
   - verify the checksum
6. Log the repair: "Corrected 1 block"
```
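The same logic can be sketched in shell, reusing `parent.meta` from the earlier toy example; `copy1.dat` and `copy2.dat` stand in for the two halves of a mirror:

```bash
# Set up two "mirror" copies of the earlier data block (illustrative names)
cp block.dat copy1.dat
cp block.dat copy2.dat

good=$(cat parent.meta)
if [ "$(sha256sum copy1.dat | awk '{print $1}')" != "$good" ]; then
    echo "copy1 corrupt - trying the mirror"
    if [ "$(sha256sum copy2.dat | awk '{print $1}')" = "$good" ]; then
        cp copy2.dat copy1.dat   # write the good data back over the bad copy
        echo "repaired copy1 from copy2"
    else
        echo "no good copy found - unrecoverable"
    fi
fi
```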
ZFS Self-Healing
Automatic on every read:
```bash
# Read a file - healing is automatic if corruption is found
cat /tank/data/file.txt
# ZFS detects the corruption and repairs it from mirror/parity

# Check healing stats
zpool status -v tank   # the scan line reports how much was repaired
```
Scrub for proactive healing:
```bash
# Read and verify EVERYTHING
zpool scrub tank

# Monitor progress
zpool status tank
# scan: scrub in progress, 45% done
```
Btrfs Self-Healing
Automatic on read (with RAID):
```bash
# Reading a corrupted file
cat /mnt/btrfs/file
# Btrfs detects the corruption and repairs it from the mirror

# Check error counters
btrfs device stats /mnt   # corruption/repair counts per device
```
Scrub for proactive healing:
```bash
# Scrub all data and metadata
btrfs scrub start /mnt

# Monitor progress
btrfs scrub status /mnt
# Shows errors found and corrected
```
Scrubbing: Proactive Verification
What Is Scrubbing?
Scrub = Read every block, verify checksums, repair corruption
Purpose:
- Find corruption before you need the data
- Detect bit rot early (before it spreads)
- Verify RAID parity consistency
- Background integrity maintenance
ZFS Scrubbing
Manual scrub:
```bash
# Start a scrub
zpool scrub tank

# Check status
zpool status tank
#   scan: scrub in progress since Sun Jan 12 12:00:00 2025
#         45.2G scanned at 1.5G/s, 12.1G to go
#         0 repaired, 78.9% done

# Stop a running scrub (if needed)
zpool scrub -s tank
```
Automatic scrubbing (recommended):
```bash
# Weekly scrub via systemd timer (units ship with OpenZFS >= 2.1.3)
systemctl enable --now zfs-scrub-weekly@tank.timer

# Or via cron
0 2 * * 0 zpool scrub tank   # every Sunday at 2 AM
```
Scrub results:
```bash
zpool status -v tank
# errors: No known data errors                                        ✅
# OR
# errors: Permanent errors have been detected in the following files:
#         /tank/important/file.txt                                    ❌
#         (corrupted block, no redundancy to repair)
```
Btrfs Scrubbing
Manual scrub:
```bash
# Start a scrub
btrfs scrub start /mnt

# Monitor
btrfs scrub status /mnt
#   Scrub started: Sun Jan 12 12:00:00 2025
#   Status:        running
#   Total to scan: 100GB

# Detailed per-device stats once it completes
btrfs scrub status -d /mnt
```
Automatic scrubbing:
```bash
# Monthly scrub via systemd (timer units ship with some btrfs-progs
# packages; the instance name is the systemd-escaped mount path)
systemctl enable --now btrfs-scrub@mnt.timer

# Or via cron
0 3 1 * * btrfs scrub start /mnt   # monthly
```
Scrub results:
```bash
btrfs scrub status -d /mnt
# Per-device:
#   Data extents scrubbed: 12345
#   Checksum errors:       10
#   Corrected errors:      10   ✅
#   Uncorrectable errors:  0
```
Corruption Types and Detection
Detectable Corruption
With checksums (ZFS/Btrfs):
- ✅ Bit flips in data
- ✅ Bit flips in metadata
- ✅ Misdirected writes (block written to wrong location)
- ✅ Torn writes (partial block write)
- ✅ Firmware bugs returning wrong data
- ✅ Memory corruption during transfer
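You can reproduce detection safely on a loopback file. This is a scratch demo only: paths and the corruption offset are arbitrary, and you may need a different offset (or several attempts) to actually hit the file's extents. Never point dd at a real disk:

```bash
# Build a throwaway Btrfs filesystem inside a file
truncate -s 1G /tmp/btrfs.img
mkfs.btrfs /tmp/btrfs.img
mkdir -p /mnt/test && mount -o loop /tmp/btrfs.img /mnt/test
dd if=/dev/urandom of=/mnt/test/file bs=1M count=64
umount /mnt/test

# Corrupt the raw image behind the filesystem's back
dd if=/dev/urandom of=/tmp/btrfs.img bs=1 count=64 seek=500000000 conv=notrunc

mount -o loop /tmp/btrfs.img /mnt/test
cat /mnt/test/file > /dev/null   # fails with an I/O error if file data was hit
dmesg | tail                     # Btrfs logs checksum-failure messages
```

With a single device and no redundancy, Btrfs can detect the corruption but not repair it, which is exactly the detect-only case described above.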
Undetectable Corruption
Even with checksums:
- ❌ Corruption during write (before the checksum is computed)
  - Mitigation: ECC RAM
- ❌ Application writes wrong data
  - Mitigation: application-level checksums (see the manifest example below)
- ❌ Encryption key corruption
  - Mitigation: key backup and verification
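For the application-level gap, a simple hash manifest provides an end-to-end check above the filesystem (paths illustrative):

```bash
# Record the hashes once, at a time the data is known good
sha256sum /data/important/* > /root/SHA256SUMS

# Verify later, independent of any filesystem guarantees
sha256sum -c /root/SHA256SUMS
```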
Unrecoverable Corruption
Corruption detected but can't repair:
```
1. Read block: checksum mismatch
2. Try redundant copy: also corrupted (or doesn't exist)
3. Try parity reconstruction: parity also corrupted
4. Result: permanent data loss
```
ZFS response:
```bash
zpool status -v tank
# errors: Permanent errors have been detected in the following files:
#         /tank/data/important.txt
```
Btrfs response:
```bash
# The read returns an I/O error
# dmesg shows: "checksum error, no good copy found"
```
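When this happens, the only fix is a backup. One plausible ZFS recovery sequence (pool and file names from the example above):

```bash
zpool status -v tank   # note the files with permanent errors
# restore /tank/data/important.txt from backup, then:
zpool clear tank       # reset the pool's error counters
zpool scrub tank       # confirm no further errors remain
```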
Checksum Overhead
Performance Impact
Read path:
- Checksum verification: ~1-5% CPU overhead
- Modern CPUs with hardware support (SSE4.2 for crc32c, SHA extensions for sha256): negligible
- Usually bottlenecked by disk, not checksum
Write path:
- Checksum computation: ~2-10% CPU overhead
- Depends on algorithm (fletcher4 < sha256)
- Often hidden by disk latency
Benchmarks:
```
No checksum (ext4):  1000 MB/s read
ZFS (fletcher4):      980 MB/s read  (-2%)
ZFS (sha256):         920 MB/s read  (-8%)
Btrfs (crc32c):       990 MB/s read  (-1%)

Bottleneck: usually disk speed, not the checksum
```
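These figures vary with hardware; a rough way to compare on your own system is a cold-cache sequential read (paths illustrative, run as root):

```bash
# Drop the page cache so the read actually hits the disk
sync && echo 3 > /proc/sys/vm/drop_caches
dd if=/tank/bigfile of=/dev/null bs=1M status=progress
```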
Space Overhead
Checksum storage:
- ZFS: 1/1024 blocks (~0.1% overhead)
- Btrfs: Stored in metadata (~0.5% overhead)
- ext4 metadata_csum: less than 1% for metadata only
Negligible space cost for significant protection
Comparison: Integrity Features
| Filesystem | Data Checksums | Metadata Checksums | Self-Healing | Scrubbing |
|---|---|---|---|---|
| ext4 | ❌ No | ✅ Optional | ❌ No | ❌ No |
| XFS | ❌ No | ✅ Yes | ❌ No | ❌ No |
| Btrfs | ✅ Yes | ✅ Yes | ✅ With RAID | ✅ Yes |
| ZFS | ✅ Yes | ✅ Yes | ✅ With RAID | ✅ Yes |
| NTFS | ❌ No | ❌ No | ❌ No | ❌ No |
| APFS | ❌ No | ✅ Yes | ❌ No | ❌ No |
Integrity leaders: ZFS and Btrfs (APFS checksums metadata only, not file data)
Best Practices
1. Use Checksums
For critical data:
```bash
# ZFS: use strong checksums for important datasets
zfs set checksum=sha256 tank/important

# Btrfs: checksums are enabled by default
mkfs.btrfs /dev/sda1   # crc32c enabled
```
2. Enable Redundancy
Checksums detect corruption, redundancy repairs it:
```bash
# ZFS: RAID-Z or mirror
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc

# Btrfs: RAID1 or RAID10
mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
```
3. Regular Scrubbing
Monthly for normal use, weekly for critical:
```bash
# ZFS monthly scrub (units ship with OpenZFS >= 2.1.3)
systemctl enable zfs-scrub-monthly@tank.timer

# Btrfs weekly scrub via cron
0 3 * * 0 btrfs scrub start /mnt
```
4. Monitor Errors
Check for corruption regularly:
```bash
# ZFS
zpool status -v tank | grep -i error

# Btrfs
btrfs device stats /mnt
```
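A minimal cron-able health check might look like the following; the mail address and mount point are placeholders:

```bash
#!/bin/sh
# Alert if any ZFS pool is unhealthy
if ! zpool status -x | grep -q 'all pools are healthy'; then
    zpool status -v | mail -s "ZFS errors on $(hostname)" admin@example.com
fi

# btrfs device stats --check exits non-zero if any error counter is non-zero
btrfs device stats --check /mnt || echo "btrfs device errors on /mnt"
```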
5. Use ECC RAM
Protect in-memory data:
- Checksums protect on-disk data
- ECC RAM protects in-memory data
- Recommended for ZFS/Btrfs servers
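To check whether a machine actually has ECC RAM (requires root; the EDAC counters appear only when a kernel EDAC driver is loaded):

```bash
# DMI reports the module's error-correction type
dmidecode --type memory | grep -i 'error correction'

# Corrected-error counts, if EDAC is active
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null
```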
6. Test Restores
Verify backups can detect corruption:
```bash
# Scrub before backing up
zpool scrub tank
# Wait for completion, then back up
# (ensures the backup contains no silent corruption)
```
Limitations
What Checksums Don't Protect
- Application-level corruption: the app writes wrong data
  - Solution: application checksums (e.g., database page checksums)
- Corruption during write: data corrupted before the checksum is computed
  - Solution: ECC RAM
- No redundancy: corruption can be detected but not repaired
  - Solution: RAID or replication
- Complete disk failure: all copies lost
  - Solution: offsite backups
Performance Considerations
When to disable checksums:
- Never for metadata (always checksum metadata)
- Rarely for data (only if proven bottleneck)
- Databases: may already checksum their own pages; if you disable filesystem checksums, rely on those or on DM-Integrity
Disable data checksums (Btrfs):
```bash
# Per-file (also disables CoW; takes effect on new/empty files only)
chattr +C /var/lib/mysql/data

# Or as a mount option (entire filesystem)
mount -o nodatasum /dev/sda1 /mnt
```
Advanced: DM-Integrity (ext4/XFS)
Device-mapper integrity for non-checksum filesystems:
```bash
# Create the integrity device
integritysetup format /dev/sda1
integritysetup open /dev/sda1 integrity-dev

# Format with ext4
mkfs.ext4 /dev/mapper/integrity-dev

# Mount
mount /dev/mapper/integrity-dev /mnt
```
Provides:
- Block-level checksums (below filesystem)
- Works with any filesystem
- Performance: 10-30% overhead
- See: `man integritysetup`
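To inspect the mapping afterwards (device name from the example above):

```bash
integritysetup status integrity-dev   # shows tag size, mode, underlying device
dmesg | grep -i integrity             # kernel logs detected checksum mismatches
```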
Related Concepts
- Copy-on-Write: Enables atomic checksum updates
- Snapshots: Immutable copies for data protection
- ZFS: End-to-end checksums and self-healing
- Btrfs: Checksums and scrubbing
- RAID: Redundancy for self-healing
Key Takeaways
- Silent Corruption: Traditional filesystems serve corrupted data unknowingly
- Checksums: Detect corruption at read time (ZFS, Btrfs, APFS)
- Self-Healing: Automatic repair with redundancy (RAID)
- Scrubbing: Proactive verification (find corruption early)
- Overhead: Minimal (~1-5% CPU, less than 1% space)
- Best Practice: Checksums + Redundancy + Scrubbing + ECC RAM
- Limitations: Can't fix corruption without redundancy
- ext4/XFS: Use DM-Integrity for block-level checksums
