What XFS is
XFS is a high-performance journaling filesystem created by Silicon Graphics in 1993 for IRIX, ported to Linux in 2001, and now the default filesystem for Red Hat Enterprise Linux and CentOS Stream. Its design priorities — parallelism, large files, and extent-based allocation — make it the workhorse for media servers, scientific computing, and database tablespaces.
Think of XFS as the Formula 1 car of filesystems: purpose-built for speed when handling multi-terabyte datasets and concurrent writers. It trades simplicity for raw throughput.
The core problem XFS solves
How do you achieve maximum disk throughput when multiple processes write at the same time? Traditional filesystems serialise metadata operations through a global lock, creating a bottleneck regardless of how fast the storage hardware is. A 12-disk RAID-0 array can sustain 10 GB/s — but if every metadata update has to acquire the same mutex, you bottleneck at the speed of one CPU core handling the lock.
XFS dismantles that bottleneck three ways: by sharding the filesystem into independent regions, by indexing everything with B+ trees instead of bitmaps, and by deferring physical placement until the kernel has seen the full write pattern.
Allocation groups: divide and conquer
XFS slices the filesystem into Allocation Groups (AGs) — independent regions, each with its own metadata structures and its own lock. Every AG carries:
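- A free-space B+ tree indexed by starting block, used when the allocator wants free space near a particular block (for example, right after a file's last extent).
- A second free-space B+ tree indexed by extent length, used when the allocator wants a run of at least n blocks, such as a large free extent for a big write.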
- An inode B+ tree that maps inode numbers to their on-disk inode chunks.
- A per-AG lock held only while modifying that AG's structures.
By contrast, a filesystem that serialises metadata operations through a single lock has to write four files one after another, leaving most of the disk bandwidth unused.
A 4 TB filesystem split into 16 AGs has 256 GB per AG (mkfs.xfs picks agcount automatically based on volume size and geometry). With 16 independent locks, up to 16 threads can perform metadata operations simultaneously without blocking each other, provided each is working in a different AG. The per-AG B+ trees also mean each operation is O(log n) over the AG's own free-space index, not over the whole filesystem.
The same parallelism that makes XFS fast also makes it expensive to shrink: shrinking would require rebalancing all the per-AG indices, which the maintainers have repeatedly judged not worth the engineering complexity.
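The per-AG locking idea can be sketched as a toy model. This is an illustration under stated assumptions, not XFS code: the AllocationGroup class, its free-block counter, and run_writers are hypothetical names, and Python's global interpreter lock means the sketch shows the locking structure and not a real speed-up.

```python
import threading


class AllocationGroup:
    """Toy stand-in for one AG: a private free-block counter and a private lock."""

    def __init__(self, ag_number, free_blocks):
        self.ag_number = ag_number
        self.free_blocks = free_blocks
        self.lock = threading.Lock()  # protects only this AG's counter

    def allocate(self, nblocks):
        # Only writers targeting this AG contend here; other AGs are untouched.
        with self.lock:
            if nblocks > self.free_blocks:
                return False
            self.free_blocks -= nblocks
            return True


def run_writers(ag_count=16, writes_per_thread=1000, blocks_per_write=8):
    """Start one writer thread per AG; return (total blocks allocated, groups)."""
    groups = [AllocationGroup(i, free_blocks=1_000_000) for i in range(ag_count)]
    allocated = [0] * ag_count  # each thread updates only its own slot

    def writer(index):
        for _ in range(writes_per_thread):
            if groups[index].allocate(blocks_per_write):
                allocated[index] += blocks_per_write

    threads = [threading.Thread(target=writer, args=(i,)) for i in range(ag_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(allocated), groups
```

Because every writer takes a different lock, no thread ever waits on another. Pointing all writers at a single group would reproduce the global-lock behaviour described earlier.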
Why B+ trees instead of bitmaps
Most older filesystems track free space as a bitmap — one bit per block, scanned linearly when you need to allocate. On a 4 TB filesystem with 4 KB blocks that's a billion bits, and finding a 100 GB contiguous run means scanning a 128 MB bitmap until you find one.
XFS replaces the bitmap with two B+ trees per AG, one keyed by starting block and one by extent length. To allocate a 100 GB extent, you search the length-keyed tree for an entry of at least 25 million blocks, and the answer comes back in O(log n). Because each tree node holds hundreds of records, a tree containing a million free extents is only about three levels deep. Allocating the first available run of any size? Look up the start-block tree at the smallest key.
The trees scale to terabyte-sized filesystems without the allocator slowing down. Bitmaps don't.
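The difference between the two lookups can be sketched as follows. This is a minimal illustration, assuming sorted Python lists searched with bisect as a stand-in for the on-disk B+ trees. FreeSpaceIndex and its method names are hypothetical, not XFS identifiers.

```python
import bisect


def first_fit_bitmap(bitmap, want):
    """Linear scan: start of the first run of `want` free (0) bits, or None."""
    run_start, run_len = None, 0
    for block, used in enumerate(bitmap):
        if used:
            run_start, run_len = None, 0
            continue
        if run_start is None:
            run_start = block
        run_len += 1
        if run_len == want:
            return run_start
    return None


class FreeSpaceIndex:
    """Two sorted views of the same free extents: by start and by length."""

    def __init__(self, extents):
        # extents: iterable of (start_block, length) pairs
        extents = list(extents)
        self.by_start = sorted(extents)
        self.by_length = sorted((length, start) for start, length in extents)

    def smallest_fitting(self, want):
        """Binary search for the smallest free extent with length >= want."""
        i = bisect.bisect_left(self.by_length, (want, 0))
        if i == len(self.by_length):
            return None
        length, start = self.by_length[i]
        return start, length

    def nearest_at_or_after(self, block):
        """Binary search for the first free extent starting at or after `block`."""
        i = bisect.bisect_left(self.by_start, (block, 0))
        return self.by_start[i] if i < len(self.by_start) else None
```

The bitmap scan touches every bit until it finds a fit, while both index lookups take a logarithmic number of comparisons regardless of how large the volume is.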
Extent-based allocation
XFS doesn't track individual blocks at all. It describes file contents as extents — contiguous ranges of blocks described by just three integers per record: start block, length, and the offset within the file.
For a 100 MB contiguous file, XFS stores one extent record (16 bytes). The same file under an ext2/ext3-style block-pointer layout needs 25,600 block pointers (200 KB at 8 bytes per pointer, or 100 KB at the 4 bytes ext2 and ext3 use) plus indirect and double-indirect blocks. The gap narrows as the file fragments, but even on a heavily aged disk a 100 MB file rarely fragments into more than a few dozen extents, so XFS still tracks orders of magnitude less metadata than block pointers.
Fewer metadata entries give you three compounding wins: faster file creation (one B+ tree insert instead of one per block), smaller in-memory caches (the kernel only has to keep extent records, not every pointer), and shallower B+ tree traversal on lookup.
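The extent record and the metadata arithmetic above can be sketched like this. It is a simplified model, not the on-disk format: the Extent class, lookup, and metadata_bytes are hypothetical names, and the 8-byte pointer default is chosen only to reproduce the 200 KB figure.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # bytes; the 4 KB block size used in the text


@dataclass(frozen=True)
class Extent:
    file_offset: int  # first file block covered by this extent
    start_block: int  # first disk block of the run
    length: int       # number of contiguous blocks


def lookup(extents, file_block):
    """Translate a file block number to a disk block number, or None for a hole."""
    for ext in extents:
        if ext.file_offset <= file_block < ext.file_offset + ext.length:
            return ext.start_block + (file_block - ext.file_offset)
    return None


def metadata_bytes(file_bytes, extent_count=1, extent_record_bytes=16, pointer_bytes=8):
    """Return (extent metadata bytes, block-pointer metadata bytes) for one file."""
    blocks = -(-file_bytes // BLOCK_SIZE)  # ceiling division
    return extent_count * extent_record_bytes, blocks * pointer_bytes
```

For a contiguous 100 MiB file, metadata_bytes(100 * 1024 * 1024) returns (16, 204800). Passing extent_count=36 for a badly fragmented file still gives only 576 bytes of extent records.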
Delayed allocation
XFS doesn't decide where on disk the data actually goes when you call write(). The kernel returns immediately, the file is dirty in the page cache, and only when the cache flushes — at fsync, dirty-page expiry, or memory pressure — does XFS commit the bytes to physical blocks.
The reason this matters is that by the time XFS picks block locations, it has seen the whole buffered write pattern. Five back-to-back 8 KB writes appending to the same file get placed as one contiguous extent, not five scattered runs.
The immediate-allocator path (what filesystems without delayed allocation, such as ext3, do) has to pick a placement at the moment of each write, with no visibility into what comes next. By the time the file is fully written, you've spent I/O operations writing five different runs and the file's metadata records five separate extents. Delayed allocation collapses both costs into one.
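The contrast between the two paths can be sketched with a toy allocator. This assumes a simple bump allocator and a competing writer that takes one block between each of our writes. Both function names and the foreign_blocks_between parameter are hypothetical.

```python
def allocate_immediately(writes, next_free_block, foreign_blocks_between=1):
    """Place each write as it arrives; other writers grab blocks in between."""
    extents = []
    for nblocks in writes:
        extents.append((next_free_block, nblocks))
        # Assumption: another process allocates blocks before our next write.
        next_free_block += nblocks + foreign_blocks_between
    return extents


def allocate_delayed(writes, next_free_block):
    """Buffer every write, then place the whole batch as one run at flush time."""
    total = sum(writes)
    return [(next_free_block, total)] if total else []
```

With five 8 KB writes (two 4 KB blocks each), allocate_immediately([2] * 5, 1000) yields five separate (start, length) extents, while allocate_delayed([2] * 5, 1000) yields the single extent (1000, 10).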
There are two situations where delayed allocation hurts:
- Crash before flush. If the machine loses power between the write() and the flush, the reserved space is returned but the bytes are lost. Applications that require durability call fsync() to force the flush, which forfeits the delayed-allocation win for that file.
- Apparent free-space inconsistency. Tools like df report the reserved space, not the committed space. The reserved-but-unwritten state can make a freshly-written file look bigger than it does on disk during the gap before the flush.
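For the crash-before-flush case, an application that needs durability might do something like the following. This is a minimal sketch using Python's standard os module, and durable_write is a hypothetical helper name.

```python
import os


def durable_write(path, data):
    """Write `data` to `path` and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than requested
            view = view[written:]
        os.fsync(fd)  # forces allocation and writeback for this file
    finally:
        os.close(fd)
```

For a newly created file, making the directory entry durable as well typically requires an fsync() on the containing directory. That step is left out here to keep the sketch short.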
Concurrent metadata in practice
The three design moves combine: AGs let metadata operations run in parallel, B+ trees let each operation finish in logarithmic time, and extent-based plus delayed allocation collapse N operations into roughly one. A stock mkfs.xfs run already enables the relevant features and splits the volume into AGs:

    mkfs.xfs -f /dev/nvme0n1
    mkfs.xfs reset by feature: reflink=1, finobt=1, sparse=1, rmapbt=0
    agcount=16, agsize=29687104 blks
The defaults are pretty good. Two parameters worth tuning on top of them:
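- sunit/swidth for RAID stripes. Set sunit to the RAID stripe unit (usually 64 KB or 128 KB) and swidth to sunit × the number of data disks in the stripe. mkfs.xfs takes both values in 512-byte sectors; the su and sw options accept a byte size and a disk count instead. XFS aligns writes to the stripe boundary so a 1 MB write doesn't split across two physical stripes and double the parity recalculation.
- -L label for any volume you might move between hosts. Mount-by-label survives device renumbering, as mount-by-UUID does, and a label is easier to recognise than a UUID. This is unrelated to xfs_repair -L, which zeroes the log and is a last resort when the journal cannot be replayed.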
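The stripe arithmetic can be worked through in a few lines. This is a sketch only: stripe_geometry and mkfs_command are hypothetical helpers, and /dev/md0 and the label in the usage note are placeholder values. The command is built with the su/sw forms, which take a byte size and a data-disk count.

```python
SECTOR_BYTES = 512  # mkfs.xfs expresses sunit and swidth in 512-byte sectors


def stripe_geometry(stripe_unit_kib, data_disks):
    """Return (sunit, swidth) in 512-byte sectors for a striped array."""
    if stripe_unit_kib <= 0 or data_disks <= 0:
        raise ValueError("stripe unit and disk count must be positive")
    sunit = stripe_unit_kib * 1024 // SECTOR_BYTES
    return sunit, sunit * data_disks


def mkfs_command(device, stripe_unit_kib, data_disks, label=None):
    """Build a mkfs.xfs argument list using the su/sw forms of the same settings."""
    args = ["mkfs.xfs", "-d", f"su={stripe_unit_kib}k,sw={data_disks}"]
    if label is not None:
        args += ["-L", label]
    return args + [device]
```

For a 64 KB stripe unit across four data disks, stripe_geometry(64, 4) returns (128, 512), and mkfs_command("/dev/md0", 64, 4, "scratch") produces the arguments for mkfs.xfs -d su=64k,sw=4 -L scratch /dev/md0. The function only builds the list; it never runs the command.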
Key characteristics
| Aspect | XFS |
|---|---|
| Max file size | 8 EiB |
| Max volume | 8 EiB |
| Journaling | Metadata only (file data is never journaled) |
| Online grow | Yes (xfs_growfs) |
| Online shrink | No — would require rebalancing all per-AG indices |
| Snapshots | No (use LVM thin pools or dm-snapshot underneath) |
| Reflink (CoW clones) | Yes since Linux 4.9 (mkfs.xfs -m reflink=1, default since xfsprogs 5.1) |
| Data checksums | No (metadata is checksummed; file data is not) |
| Recommended fsck | xfs_repair -n (no-modify check), then xfs_repair |
When XFS is the right choice
Reach for XFS when:
- The workload is dominated by large files — media servers, simulation output, columnar databases. The extent representation pays off the most when files have long contiguous runs.
- You have many concurrent writers — web app servers, build farms, multi-stream ingest pipelines. Per-AG locks scale up to dozens of cores before metadata becomes the bottleneck.
- You're sizing a volume in the TB to PB range — XFS scales further than ext4 with less per-volume tuning.
- You want reflink-backed clones: cp --reflink for cheap per-file copies (containers, build artefacts, golden-image snapshots).
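A reflink clone can also be requested programmatically. The sketch below assumes Linux and uses the FICLONE ioctl, with an ordinary copy as the fallback when the filesystem does not support reflinks. clone_file is a hypothetical helper name, and the ioctl number is written out as a literal because not every Python version exports it as a constant.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # Linux ioctl: make dst share all of src's extents


def clone_file(src, dst):
    """Clone src to dst via reflink when possible; return True if blocks are shared."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            pass  # no reflink support, or src and dst are on different filesystems
    shutil.copyfile(src, dst)  # ordinary byte-for-byte copy as the fallback
    return False
```

On a reflink-enabled XFS volume the call returns True and the clone consumes no extra data blocks until one of the two files is modified. Elsewhere it returns False after a regular copy.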
Pick something else when:
- You need filesystem snapshots — use Btrfs, ZFS, or LVM thin pools.
- You need to shrink the filesystem after over-provisioning — use ext4.
- The workload is millions of tiny files (mail servers, ccache directories) — ext4's smaller per-file metadata footprint wins.
- You're on a small embedded device where XFS's in-memory caches outsize the device — pick something simpler.
- You're setting up a desktop root filesystem, where ext4's simpler recovery path is friendlier.
The trade-off, in one paragraph
XFS optimises one thing: maximum throughput for large, parallel workloads. The cost of that focus is that you cannot shrink the filesystem, you don't get built-in snapshots, and the in-memory B+ tree caches are larger than ext4's bitmap caches. For the right workload — multi-TB datasets, many concurrent writers, fast storage — nothing in the Linux ecosystem beats XFS. For general-purpose desktops, ext4's simplicity often wins; for integrity-first workloads, ZFS does.
