ZFS: The Ultimate Filesystem

Master the ZFS filesystem: pooled storage, RAID-Z, snapshots, and end-to-end checksums. Learn enterprise-grade data integrity on Linux.


Why Your Data Needs ZFS

Every storage system lies to you. Disks report writes as successful even when the data was silently corrupted. RAID controllers introduce errors. Memory glitches flip bits. Over time, your data degrades without a single error message—this is called bit rot.

Traditional filesystems trust hardware implicitly. When your disk says "write complete," ext4 believes it. When your RAID controller says "all mirrors healthy," XFS trusts it. But hardware fails in subtle ways that these filesystems can't detect.

ZFS trusts nothing. Every block is checksummed. Every checksum is stored in the parent block. Corruption anywhere in the chain is detected and—with redundancy—automatically repaired.
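
The payoff in day-to-day use is that integrity becomes something you can query. A quick check, assuming at least one pool is already imported:

# One-line health summary across all pools
sudo zpool status -x
# Prints "all pools are healthy" when no errors or degraded devices are known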

ZFS: The Paranoid Librarian

Think of ZFS like a librarian who trusts no one—not the shelves, not the book bindings, not even their own memory.

1. Checksums on every book — Writes a verification code on every spine, checks it before every read
2. Never erases originals — Writes amendments on new pages, keeps old ones safe until complete
3. Instant photographs — Takes snapshots of the entire library state without copying anything
4. Keeps backup copies — Maintains duplicates and reconstructs damaged books automatically

See ZFS in Action

Before diving into theory, let's see ZFS work. This interactive demo walks you through pool creation, RAID-Z parity, snapshots, and disk failure recovery:

Pool Creation and vdev Management

Initial State: Raw Disks

# Identify available disks
lsblk | grep sd

Four raw disks: /dev/sdb, /dev/sdc, /dev/sdd, /dev/sde

  • No filesystem or pool configured
  • Total capacity: 4 × 1TB = 4TB raw
  • No redundancy or data protection
  • Ready to be configured into a ZFS pool
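
From here, a single command turns the raw disks into a redundant pool. A minimal sketch, assuming the four disks above are empty and using a pool name of tank with RAID-Z1 (one reasonable layout for four disks; in production, prefer /dev/disk/by-id paths over sdX names):

# Create a RAID-Z1 pool named "tank" from the four disks
# (roughly 3TB usable: one disk's worth of capacity goes to parity)
sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Confirm layout, health, and capacity
sudo zpool status tank
sudo zpool list tank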


Copy-on-Write: Never Lose Data Mid-Write

Most filesystems overwrite data in place. If power fails mid-write, you get corrupted data and need to run fsck. ZFS never overwrites—it writes new data to new locations, then atomically updates pointers.

Traditional FS vs. Copy-on-Write

Traditional (ext4, NTFS)

Like erasing and rewriting in a library book. If someone bumps your elbow mid-write, the book is damaged forever. No undo.

ZFS Copy-on-Write

Like writing amendments on new pages, keeping originals intact. Even if disaster strikes, the old pages still exist—just switch back.

Watch exactly how this works—blocks allocated, snapshots created, rollbacks performed—all with zero copying:

Copy-on-Write Visualizer

Watch how ZFS writes new data without ever overwriting existing blocks. This is why snapshots are instant and rollbacks are safe!

Step 1: File on Disk

A file "report.txt" (512KB) exists with 4 data blocks (B1-B4). Each block is 128KB and contains part of the document: B1 holds the header, B2 and B3 the chapters, B4 the conclusion.

File metadata points to: [B1 → B2 → B3 → B4]
No snapshots yet. Free block pool: B5, B6, B7, B8.

At this stage the layout looks like a traditional filesystem: metadata points directly to data blocks, with no history and no safety net.

Why Copy-on-Write Matters

  • Instant snapshots - No copying, just save pointers
  • Crash safe - Old data preserved until new write completes
  • Space efficient - Unchanged blocks shared between versions
  • No fsck needed - Filesystem always consistent

Why Copy-on-Write is Revolutionary

  1. Snapshots are instant — Just save the current block pointers (no data copying!)
  2. Snapshots are free — Only changed blocks consume new space
  3. Always consistent — Old state remains valid until new state is complete
  4. No fsck needed — Filesystem is always consistent, even after crash
  5. Time travel — Roll back to any snapshot instantly
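
You can feel this on any ZFS system in a few commands. A short sketch, assuming a dataset tank/home mounted at /tank/home:

# Snapshots are instant regardless of size: only block pointers are saved
sudo zfs snapshot tank/home@before-edit

# Change some data
echo "new draft" | sudo tee /tank/home/report.txt >/dev/null

# Show what diverged from the snapshot (only changed blocks consume space)
sudo zfs diff tank/home@before-edit

# Time travel: discard everything written since the snapshot
sudo zfs rollback tank/home@before-edit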

The ARC: Your Data's Memory Palace

You've heard "ZFS needs lots of RAM." Here's why: the Adaptive Replacement Cache (ARC) keeps hot data in memory, dramatically speeding up reads.

ARC: The Smart Filing Clerk

Imagine a filing clerk who keeps two stacks on their desk:

  • MRU stack — The 1000 most recently requested files (good for sequential work)
  • MFU stack — The 1000 most frequently requested files (good for hot data)
  • Adaptive sizing — If MFU hits are higher, that stack grows automatically
  • Graceful eviction — When the desk fills up, least-useful files go to a nearby cabinet (L2ARC on SSD)
  • Memory pressure aware — When the office needs space, the clerk shrinks their stacks without crashing

Try the cache simulator—watch entries move between lists, apply memory pressure, and see hit ratios change:

ARC Cache Simulator

The Adaptive Replacement Cache keeps frequently AND recently used data in RAM. Watch how it adapts to access patterns and memory pressure!


How ARC Adapts

  • MRU list - Recently accessed files (good for sequential scans)
  • MFU list - Frequently accessed files (good for hot data)
  • Adaptive sizing - Lists grow/shrink based on hit patterns
  • Memory pressure - ARC gracefully shrinks when system needs RAM
  • L2ARC overflow - Evicted entries can live on SSD
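
To see how well the ARC is working on a real system, the kernel exposes counters in /proc/spl/kstat/zfs/arcstats. A small sketch that computes the overall hit ratio from the hits and misses fields:

# Overall ARC hit ratio from the kernel's kstat counters
awk '/^hits /   { hits = $3 }
     /^misses / { miss = $3 }
     END { printf "ARC hit ratio: %.1f%% (%d hits / %d misses)\n",
                  hits * 100 / (hits + miss), hits, miss }' \
    /proc/spl/kstat/zfs/arcstats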

ARC Configuration

# Check current ARC size
cat /proc/spl/kstat/zfs/arcstats | grep "^size"

# Limit ARC to 8GB (in /etc/modprobe.d/zfs.conf)
options zfs zfs_arc_max=8589934592

# Add L2ARC device (SSD cache for evicted entries)
sudo zpool add tank cache /dev/nvme0n1

End-to-End Integrity: Trust Nothing

ZFS doesn't trust hardware. Every block is checksummed, and that checksum is stored in the parent block—creating a Merkle tree from data up to the root (uberblock).
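
The checksum algorithm is a per-dataset property, and detected mismatches are counted per device. A short sketch, with illustrative dataset names:

# Checksums are on by default (fletcher4); stronger hashes can be chosen per dataset
sudo zfs get checksum tank
sudo zfs set checksum=sha256 tank/important

# Any mismatches ZFS has caught show up in the CKSUM column
sudo zpool status tank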

RAID-Z: The Secret Sharing Protocol

Imagine 3 friends each know part of a secret phrase:

Alice: "The treasure is..."
Bob: "...buried under..."
Carol: "...the old oak tree"

If Bob forgets his part, it can be reconstructed by XORing Alice's and Carol's parts with a stored parity value. Any one missing piece can be rebuilt from the surviving pieces plus the parity. This is the core idea behind RAID-Z.
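
The arithmetic really is that simple. A toy sketch in shell (illustrative bytes, not real ZFS code):

# Toy XOR parity: parity is the XOR of all data pieces, so any one lost
# piece equals the parity XORed with the survivors.
a=0x41; b=0x42; c=0x43              # three "data disk" bytes
parity=$(( a ^ b ^ c ))             # what the parity disk would store
b_rebuilt=$(( parity ^ a ^ c ))     # "disk b" fails; rebuild its byte
printf 'lost: 0x%02X  rebuilt: 0x%02X\n' "$b" "$b_rebuilt"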

See how checksum verification cascades through the block tree, detecting corruption and self-healing:

Checksum Cascade & Self-Healing

Every ZFS block is checksummed by its parent. Corrupt a block and watch the verification cascade detect and (with redundancy) repair it automatically!

Block tree (in the demo, click a data block to corrupt it, then run verification):

Uberblock (checksum: a7b3f9)
└─ Metadata (checksum: c2d8e1)
   └─ Indirect block (checksum: f4a2b7)
      ├─ Data block 1 (checksum: 3e9c1d): "Hello, World!"
      ├─ Data block 2 (checksum: 8f2a4b): "ZFS is amazing"
      └─ Data block 3 (checksum: d5e7c3): "Data integrity!"

Why End-to-End Checksums Matter

  • Silent corruption - Disk errors can corrupt data without OS knowing
  • Bit rot - Data degrades over time, especially on HDDs
  • Controller bugs - RAID controllers can corrupt data in transit
  • ZFS solution - Checksum in parent means corruption is ALWAYS detected
  • Self-healing - With redundancy, ZFS automatically repairs corrupted blocks
  • Scrub regularly! - Proactive verification catches issues before you need the data

Scrubbing: Proactive Integrity Verification

# Start a scrub (verify all data integrity)
sudo zpool scrub tank

# Check scrub progress
sudo zpool status -v tank

# Schedule weekly scrubs (cron example)
0 2 * * 0 /sbin/zpool scrub tank
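
To watch self-healing end to end, you can stage a disposable pool on file-backed vdevs, deliberately damage one side of a mirror, and scrub. A throwaway sketch (never do this to real disks; the paths and sizes are arbitrary):

# Build a scratch mirror out of two sparse files
truncate -s 256M /tmp/d1 /tmp/d2
sudo zpool create demo mirror /tmp/d1 /tmp/d2
sudo dd if=/dev/urandom of=/demo/junk bs=1M count=100 status=none

# Damage the middle of one side (skipping the front preserves the ZFS labels)
sudo dd if=/dev/urandom of=/tmp/d1 bs=1M seek=16 count=64 conv=notrunc status=none

# Scrub: ZFS should report CKSUM errors on /tmp/d1 and repair them from /tmp/d2
sudo zpool scrub demo
sudo zpool status demo

# Clean up
sudo zpool destroy demo && rm /tmp/d1 /tmp/d2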

Common Pitfalls

Watch Out For

1. Memory miscalculation

ZFS wants roughly 1GB of RAM per TB of storage for basic operation. Undersize it and performance tanks because the ARC can't cache enough.

2. Deduplication RAM trap

Enabling dedup requires ~5GB RAM per TB. A 20TB pool with dedup needs 100GB+ RAM. Most home users should avoid dedup entirely; the sketch at the end of this section shows how to estimate whether your data would even benefit.

3. Pool expansion limitations

You can add vdevs to a pool, but removing them is limited: only recent OpenZFS releases support removing mirror or single-disk top-level vdevs, and raidz vdevs still can't be removed. Plan your pool topology carefully upfront.

4. CDDL licensing concerns

ZFS is licensed under the CDDL, which is widely considered incompatible with the GPL. It can't be merged into the Linux kernel directly, so it ships as an out-of-tree kernel module.

5. Full pool performance cliff

Above 80% full, ZFS performance degrades significantly. Above 90%, copy-on-write struggles to find free blocks. Keep pools under 80%.
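
Two quick sanity checks related to the pitfalls above, assuming an illustrative pool named tank:

# Pitfall 5: watch pool capacity and stay under 80%
sudo zpool list -o name,size,alloc,free,capacity tank
cap=$(sudo zpool list -H -o capacity tank | tr -d '%')
[ "$cap" -ge 80 ] && echo "WARNING: tank is ${cap}% full"

# Pitfall 2: estimate whether dedup would even help, without enabling it
# (zdb -S simulates the dedup table; it can take a while on large pools)
sudo zdb -S tank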


Quick Reference: Essential Commands

Pool Operations

# Create pools with different redundancy
sudo zpool create tank mirror /dev/sdb /dev/sdc                             # RAID1 (mirror)
sudo zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf  # RAID6-like

# Pool management
sudo zpool status -v tank                     # Health and status
sudo zpool iostat -v tank 1                   # I/O statistics (1-second interval)
sudo zpool add tank mirror /dev/sdg /dev/sdh  # Expand pool with new vdev

Dataset Operations

# Create datasets with properties
sudo zfs create -o compression=lz4 -o quota=100G tank/documents
sudo zfs create -o compression=zstd -o recordsize=1M tank/media

# List datasets with useful columns
sudo zfs list -o name,used,referenced,compressratio

Snapshots and Clones

# Create snapshot (instant!)
sudo zfs snapshot tank/home@backup-$(date +%Y%m%d)

# List snapshots
sudo zfs list -t snapshot

# Rollback to snapshot (destroys newer data)
sudo zfs rollback tank/home@backup-20250120

# Create writable clone from snapshot
sudo zfs clone tank/vm@golden tank/vm-test

Send/Receive (Backup)

# Full send to remote
sudo zfs send tank/home@snap | ssh remote sudo zfs receive backup/home

# Incremental send (only changes since @snap1)
sudo zfs send -i @snap1 tank/home@snap2 | ssh remote sudo zfs receive backup/home

# Encrypted send (raw blocks, preserves encryption)
sudo zfs send -w tank/secure@snap | ssh remote sudo zfs receive backup/secure

ZFS vs Other Filesystems

[Comparison chart: ZFS, Btrfs, ext4, and XFS rated on data integrity, snapshots, compression, deduplication, built-in RAID, maturity, performance, and RAM usage.]

When to Use ZFS

Perfect for:

  • NAS and storage servers (where data integrity is paramount)
  • Virtualization hosts (instant VM cloning from snapshots)
  • Database servers (tune recordsize for your workload!)
  • Backup systems (incremental send/receive is incredible)
  • Any system where silent corruption is unacceptable

Consider alternatives when:

  • Less than 8GB RAM available (ECC RAM preferred)
  • Root filesystem on desktop (complex recovery scenarios)
  • Embedded systems with limited resources
  • Strict GPL licensing requirements

Essential ZFS Takeaways

1. Copy-on-Write enables instant snapshots, no fsck, and crash-safe writes
2. Checksums everywhere means silent corruption is detected (and repaired with redundancy)
3. ARC cache is adaptive, memory-pressure aware, and dramatically speeds up reads
4. RAID-Z provides software RAID with per-block parity—no write hole
5. Pools contain datasets—no more partitioning, just create datasets as needed
6. Send/receive enables incremental, resumable, encrypted replication
7. Scrub weekly for proactive corruption detection before you need the data
8. Plan for RAM: ~1GB/TB base, 5GB/TB with dedup (avoid dedup if possible)

Best Practices

  1. Pool Design: Use mirrors for IOPS workloads, RAID-Z2 for capacity with large drives (>4TB)
  2. Keep pools under 80% full — performance degrades significantly above this threshold
  3. Use LZ4 compression — nearly free CPU cost, often speeds up I/O due to reduced bytes transferred
  4. Tune recordsize for the workload (1M for large media files, 16K for databases, 128K default); a short sketch follows this list
  5. Weekly scrubs — catch corruption before you need the data
  6. Test restores — snapshots are useless if you've never verified you can restore from them
  7. Avoid dedup unless you have 5GB RAM per TB AND you're certain you have lots of duplicate data

If you found this explanation helpful, consider sharing it with others.
