XFS: High-Performance Parallel Filesystem

Summary
XFS internals end-to-end: allocation groups for parallel metadata operations under per-group locks, B+ trees instead of bitmaps, extent-based allocation that scales to terabytes, and delayed allocation that turns scattered writes into contiguous extents.

What XFS is

XFS is a high-performance journaling filesystem created by Silicon Graphics in 1993 for IRIX, ported to Linux in 2001, and now the default filesystem for Red Hat Enterprise Linux and CentOS Stream. Its design priorities — parallelism, large files, and extent-based allocation — make it the workhorse for media servers, scientific computing, and database tablespaces.

Think of XFS as the Formula 1 car of filesystems: purpose-built for speed when handling multi-terabyte datasets and concurrent writers. It trades simplicity for raw throughput.

The core problem XFS solves

How do you achieve maximum disk throughput when multiple processes write at the same time? Traditional filesystems serialise metadata operations through a global lock, creating a bottleneck regardless of how fast the storage hardware is. A 12-disk RAID-0 array can sustain 10 GB/s — but if every metadata update has to acquire the same mutex, you bottleneck at the speed of one CPU core handling the lock.

XFS dismantles that bottleneck three ways: by sharding the filesystem into independent regions, by indexing everything with B+ trees instead of bitmaps, and by deferring physical placement until the kernel has seen the full write pattern.

Allocation groups: divide and conquer

XFS slices the filesystem into Allocation Groups (AGs) — independent regions, each with its own metadata structures and its own lock. Every AG carries:

  • A free-space B+ tree indexed by starting block, used when the allocator wants free space near a particular block (for example, next to a file's existing data).
  • A second free-space B+ tree indexed by extent length, used when the allocator wants a run of at least n blocks, such as for a big write.
  • An inode B+ tree that maps inode numbers to their on-disk inode chunks.
  • A per-AG lock held only while modifying that AG's structures.

[Diagram: four allocation groups, AG 0 to AG 3. In the serial case AG 0 is writing while AG 1, AG 2, and AG 3 wait, so writing 4 files takes 4x as long as writing one.]

Serial: Traditional filesystems serialize metadata operations through a single lock. Four files must be written one after another, leaving most disk bandwidth unused.

A 4 TB filesystem divided into 16 AGs has AGs of 256 GB each (mkfs.xfs picks agcount automatically from the volume size and geometry, and -d agcount=N overrides it). With 16 independent locks, 16 threads working in different AGs can perform metadata operations simultaneously without blocking on each other, and the per-AG B+ trees mean each operation is O(log n) over the AG's own free-space index, not over the whole filesystem.
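The per-AG locking idea can be sketched as a toy model. This is not XFS code: ToyAG, run_writers, and the block counts are hypothetical names and numbers chosen for illustration, and a simple counter stands in for the real free-space trees.

```python
import threading

class ToyAG:
    """One allocation group: a private free-block counter with its own lock."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.lock = threading.Lock()

    def allocate(self, nblocks):
        # Only this AG's lock is taken, so writers in other AGs are not blocked.
        with self.lock:
            if self.free_blocks < nblocks:
                return False
            self.free_blocks -= nblocks
            return True

def run_writers(agcount, blocks_per_ag, allocs_per_writer):
    """Start one writer thread per AG and return the free blocks left in each."""
    ags = [ToyAG(blocks_per_ag) for _ in range(agcount)]

    def writer(ag):
        for _ in range(allocs_per_writer):
            ag.allocate(1)

    threads = [threading.Thread(target=writer, args=(ag,)) for ag in ags]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [ag.free_blocks for ag in ags]
```

Calling run_writers(16, 1000, 100) leaves 900 free blocks in every AG. Because each writer touches only its own AG, no thread ever waits on a lock held by another; a single global lock would force all 16 writers through one queue.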

The same partitioning that makes XFS fast also makes it expensive to shrink: shrinking would require rebalancing all the per-AG indices, which the maintainers have repeatedly judged not worth the engineering complexity.

Why B+ trees instead of bitmaps

Most older filesystems track free space as a bitmap — one bit per block, scanned linearly when you need to allocate. On a 4 TB filesystem with 4 KB blocks that's a billion bits, and finding a 100 GB contiguous run means scanning a 128 MB bitmap until you find one.

XFS replaces the bitmap with two B+ trees per AG, one keyed by block start and one keyed by extent length. To allocate a 100 GB extent, you look up the by-length tree for a length ≥ 25 million blocks, and the answer comes back in O(log n): about three levels on a tree containing a million free extents, assuming a few hundred records per 4 KB tree block. Allocating the first available run of any size? Look up the start-block tree at the smallest key.
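A minimal sketch of the two lookups, assuming sorted Python lists as stand-ins for the two B+ trees (binary search gives the same logarithmic behaviour). ToyFreeSpace and its method names are hypothetical, not XFS identifiers.

```python
import bisect

class ToyFreeSpace:
    """Free extents of one AG, indexed by start block and by length."""

    def __init__(self, extents):
        # extents: iterable of (start_block, length) pairs
        self.by_start = sorted(extents)
        self.by_length = sorted((length, start) for start, length in extents)

    def find_by_size(self, want):
        """Return the smallest free extent with length >= want, or None."""
        i = bisect.bisect_left(self.by_length, (want, 0))
        if i == len(self.by_length):
            return None
        length, start = self.by_length[i]
        return (start, length)

    def find_at_or_after(self, target):
        """Return the first free extent starting at or after target, or None."""
        i = bisect.bisect_left(self.by_start, (target, 0))
        if i == len(self.by_start):
            return None
        return self.by_start[i]
```

With free extents [(0, 10), (100, 500), (1000, 50), (5000, 26_000_000)], find_by_size(25_000_000) returns the 26-million-block extent without scanning the others, and find_at_or_after(0) returns the lowest-addressed free run.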

The trees scale to terabyte-sized filesystems without the allocator slowing down. Bitmaps don't.
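To see why the trees keep scaling, it helps to estimate how tall they get. The sketch below counts tree levels for a given number of free-extent records; the fanout values passed in are assumptions (a few hundred entries per 4 KB block), not figures taken from the XFS on-disk format.

```python
def btree_levels(records, leaf_fanout, node_fanout):
    """Levels needed to index `records` entries (1 means a single leaf block)."""
    blocks = -(-records // leaf_fanout)  # ceiling division: number of leaf blocks
    levels = 1
    while blocks > 1:
        # Each pass adds one layer of interior nodes above the current layer.
        blocks = -(-blocks // node_fanout)
        levels += 1
    return levels
```

Under these assumed fanouts, btree_levels(1_000_000, 500, 300) is 3 and btree_levels(1_000_000_000, 500, 300) is 4: a thousandfold increase in free extents adds one level, whereas a linear bitmap scan grows a thousandfold.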

Extent-based allocation

XFS doesn't track individual blocks at all. It describes file contents as extents — contiguous ranges of blocks described by just three integers per record: start block, length, and the offset within the file.
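As an illustration of how three integers per record are enough to locate any block of a file, here is a small sketch. The Extent type and map_block function are hypothetical names for this example, and the extent list is assumed to be sorted by file offset.

```python
import bisect
from typing import NamedTuple, Optional

class Extent(NamedTuple):
    file_offset: int  # first file block covered by this extent
    start_block: int  # first disk block of the run
    length: int       # number of contiguous blocks

def map_block(extents, file_block) -> Optional[int]:
    """Translate a file block number to a disk block, or None for a hole."""
    offsets = [e.file_offset for e in extents]
    # Find the last extent whose file_offset is <= file_block.
    i = bisect.bisect_right(offsets, file_block) - 1
    if i < 0:
        return None
    e = extents[i]
    if file_block >= e.file_offset + e.length:
        return None  # falls in a gap between extents
    return e.start_block + (file_block - e.file_offset)
```

For a file described by Extent(0, 1000, 100) and Extent(200, 5000, 50), file block 99 maps to disk block 1099, file block 210 maps to 5010, and file block 150 is a hole.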

For a 100 MB contiguous file, XFS stores one extent record (16 bytes). With 4 KB blocks, the same file under an ext2/ext3-style block-pointer layout needs 25,600 block pointers (100 KB at 4 bytes each) plus indirect and double-indirect block layers. The gap narrows with fragmentation, but even on a heavily aged disk a 100 MB file rarely fragments into more than a few dozen extents, so XFS still tracks orders of magnitude less metadata than block pointers.
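The arithmetic behind that comparison is easy to reproduce. The sketch below assumes 4 KB blocks and 16-byte extent records as stated in the text; the pointer width is left as a parameter because it depends on the filesystem being compared, and indirect blocks are ignored.

```python
BLOCK_SIZE = 4096         # assumed filesystem block size in bytes
EXTENT_RECORD_BYTES = 16  # size of one extent record, per the text

def extent_metadata_bytes(num_extents):
    """Bytes of mapping metadata for a file stored in num_extents extents."""
    return num_extents * EXTENT_RECORD_BYTES

def pointer_metadata_bytes(file_bytes, pointer_bytes):
    """Bytes of direct block pointers for a file (indirect blocks not counted)."""
    blocks = -(-file_bytes // BLOCK_SIZE)  # ceiling division
    return blocks * pointer_bytes
```

For a 100 MiB file, pointer_metadata_bytes gives 102,400 bytes with 4-byte pointers and 204,800 bytes with 8-byte pointers, against 16 bytes for a single extent and 576 bytes even if the file has fragmented into 36 extents.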

Fewer metadata entries gives you three compounding wins: faster file creation (one B+ tree insert instead of one per block), smaller in-memory caches (the kernel only has to keep extent records, not every pointer), and shallower B+ tree traversal on lookup.

Delayed allocation

XFS doesn't decide where on disk the data actually goes when you call write(). The kernel returns immediately, the file is dirty in the page cache, and only when the cache flushes — at fsync, dirty-page expiry, or memory pressure — does XFS commit the bytes to physical blocks.

The reason this matters is that by the time XFS picks block locations, it has seen the entire write pattern. Five back-to-back 8 KB writes appending to the same file get placed as one contiguous extent, not five scattered runs.

The immediate-allocation path (what filesystems without delayed allocation, such as ext3, do) has to pick a placement at the moment of each write, with no visibility into what comes next. By the time the file is fully written, you've spent I/Os writing five different runs and the file's metadata records five separate extents. Delayed allocation collapses both costs into one.
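The difference can be simulated with a toy model. This is only a sketch: ToyDisk is a hypothetical bump allocator that hands out the next free blocks, and the scattering in the immediate case comes from an assumed second writer whose writes interleave with the first.

```python
def merge_extents(runs):
    """Merge (start, length) runs that are physically adjacent on disk."""
    merged = []
    for start, length in runs:
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged

class ToyDisk:
    """Hands out the next free blocks in order; nothing is ever freed."""

    def __init__(self):
        self.next_free = 0

    def allocate(self, nblocks):
        start = self.next_free
        self.next_free += nblocks
        return (start, nblocks)

def immediate(writes):
    """Allocate blocks at each write. writes: (filename, nblocks) in arrival order."""
    disk, files = ToyDisk(), {}
    for name, nblocks in writes:
        files.setdefault(name, []).append(disk.allocate(nblocks))
    return {name: merge_extents(runs) for name, runs in files.items()}

def delayed(writes):
    """Only count blocks at write time, then allocate once per file at flush."""
    disk, pending = ToyDisk(), {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    return {name: [disk.allocate(total)] for name, total in pending.items()}
```

With writes = [("a", 2), ("b", 2)] * 5 (five 8 KB writes to each file at 4 KB per block), immediate(writes) leaves file a in five separate 2-block extents, while delayed(writes) places it in the single extent (0, 10).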

There are two situations where delayed allocation hurts:

  • Crash before flush. If the machine loses power between the write() and the flush, the reserved space is returned but the bytes are lost. Applications that require durability call fsync() to force the flush, which forfeits the delayed-allocation win for that file.
  • Apparent free-space inconsistency. Space reserved for delayed allocation is counted as used before any blocks are written, so tools like df report the reserved space, not the committed space. During the gap before the flush, a freshly written file can appear to occupy more space than has actually been allocated on disk.
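For the crash-before-flush case, an application that needs durability forces the flush itself. Below is a minimal sketch using Python's standard library; durable_write is a hypothetical helper name, and the temporary directory is only there to make the example self-contained.

```python
import os
import tempfile

def durable_write(path, data: bytes):
    """Write data and return its size once the kernel has flushed it to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to write the dirty pages to the device
    return os.stat(path).st_size

with tempfile.TemporaryDirectory() as tmp:
    size = durable_write(os.path.join(tmp, "example.bin"), b"x" * 8192)
```

The fsync call is where the filesystem has to pick physical blocks, so calling it after every small write gives up the batching described above; syncing once per logical unit of work keeps most of the benefit.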

Concurrent metadata in practice

The three design moves combine: AGs let metadata operations run in parallel, B+ trees let each operation finish in logarithmic time, and extent-based plus delayed allocation collapse N operations into roughly one. A default mkfs.xfs run reports the features and AG geometry involved:

    mkfs.xfs -f /dev/nvme0n1
    mkfs.xfs reset by feature: reflink=1, finobt=1, sparse=1, rmapbt=0
    agcount=16, agsize=29687104 blks

The defaults are pretty good. Two parameters worth tuning on top of them:

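  • sunit / swidth for RAID stripes. Set sunit to the RAID stripe unit (usually 64 KB or 128 KB) and swidth to sunit × the number of data disks in the stripe. mkfs.xfs takes both values in 512-byte units; the su and sw suboptions accept a byte size and a disk count instead. XFS aligns writes to the stripe boundary so a 1 MB write doesn't split across two physical stripes and double the parity recalculation.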
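  • -L label for any volume you might move between hosts. Mounting by label survives device renumbering, as mounting by UUID does, and is easier to read in fstab. Don't confuse this with xfs_repair -L, which zeroes the journal and is a last resort when the log cannot be replayed.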
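The stripe arithmetic can be sketched as a small helper. The function names and the data_disks parameter are hypothetical; the one external fact relied on is that mkfs.xfs expresses sunit and swidth in 512-byte units. For parity RAID, data_disks excludes the parity disks (for example, 4 for a 5-disk RAID-5).

```python
SECTOR = 512  # mkfs.xfs expresses sunit and swidth in 512-byte units

def stripe_geometry(stripe_unit_bytes, data_disks):
    """Return (sunit, swidth) in 512-byte units for the given RAID layout."""
    if stripe_unit_bytes % SECTOR:
        raise ValueError("stripe unit must be a multiple of 512 bytes")
    sunit = stripe_unit_bytes // SECTOR
    swidth = sunit * data_disks
    return sunit, swidth

def mkfs_options(stripe_unit_bytes, data_disks):
    """Format the geometry as a -d option string for mkfs.xfs."""
    sunit, swidth = stripe_geometry(stripe_unit_bytes, data_disks)
    return f"-d sunit={sunit},swidth={swidth}"
```

For a 64 KB stripe unit across 4 data disks, mkfs_options(64 * 1024, 4) yields "-d sunit=128,swidth=512". On many software RAID and LVM setups mkfs.xfs detects this geometry on its own, so it is worth checking the reported values before overriding them.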

Key characteristics

Aspect                 XFS
Max file size          8 EiB
Max volume             8 EiB
Journaling             Metadata only (file data is not journaled)
Online grow            Yes (xfs_growfs)
Online shrink          No; would require rebalancing all per-AG indices
Snapshots              No (use LVM thin pools or dm-snapshot underneath)
Reflink (CoW clones)   Yes since Linux 4.9 (mkfs.xfs -m reflink=1, default since xfsprogs 5.1)
Data checksums         No (see filesystem integrity)
Recommended fsck       xfs_repair -n (no-modify check), then xfs_repair

When XFS is the right choice

Reach for XFS when:

  • The workload is dominated by large files — media servers, simulation output, columnar databases. The extent representation pays off the most when files have long contiguous runs.
  • You have many concurrent writers — web app servers, build farms, multi-stream ingest pipelines. Per-AG locks scale up to dozens of cores before metadata becomes the bottleneck.
  • You're sizing a volume in the TB to PB range — XFS scales further than ext4 with less per-volume tuning.
  • You want reflink-backed clones: cp --reflink for cheap per-file copies (containers, build artefacts, golden-image snapshots).

Pick something else when:

  • You need filesystem snapshots — use Btrfs, ZFS, or LVM thin pools.
  • You need to shrink the filesystem after over-provisioning — use ext4.
  • The workload is millions of tiny files (mail servers, ccache directories) — ext4's smaller per-file metadata footprint wins.
  • You're on a small embedded device where XFS's in-memory caches outsize the device — pick something simpler.
  • You're setting up a desktop root filesystem, where ext4's simpler recovery path is friendlier.

The trade-off, in one paragraph

XFS optimises one thing: maximum throughput for large, parallel workloads. The cost of that focus is that you cannot shrink the filesystem, you don't get built-in snapshots, and the in-memory B+ tree caches are larger than ext4's bitmap caches. For the right workload — multi-TB datasets, many concurrent writers, fast storage — nothing in the Linux ecosystem beats XFS. For general-purpose desktops, ext4's simplicity often wins; for integrity-first workloads, ZFS does.
