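Linux Storage and Filesystems in Depth: XFS Implementation Details

Posted on 2026-04-12 Edited on 2026-04-13 In Linux Kernel, Storage

XFS is one of the most widely deployed high-performance filesystems in Linux production environments and has been the default filesystem since RHEL/CentOS 7. Based on the Linux 6.4-rc1 kernel source (fs/xfs/), this article takes a deep technical look at XFS, covering the on-disk format, core data structures, B-Tree management, the log subsystem, and the parallel design.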

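1. XFS History and Design Goals

Silicon Graphics (SGI) designed XFS in 1993 for the IRIX operating system and ported it to Linux in 2001. It grew out of the media production industry's need for very large files and very high I/O throughput: SGI's customers at the time had to edit uncompressed high-resolution video in real time.

XFS is built around three core goals:

High performance: delayed allocation reduces fragmentation, B+ Trees manage extents (runs of contiguous blocks), and the internal structures are highly parallel, so XFS performs well under both sequential and random I/O.

Very large scale: XFS supports filesystems of up to 8 EiB (64-bit block numbers) and single files of up to 8 EiB, well beyond many comparable filesystems. Inode numbers are also 64 bits wide, so the number of files is bounded in practice by available space rather than by the inode number width.

High concurrency: the Allocation Group (AG) design splits the filesystem into independent space management units. Allocations in different AGs can proceed fully in parallel, which scales close to linearly on multi-core and multi-disk systems.

The XFS code in Linux 6.4-rc1 uses the v5 superblock format (with CRC checksums) and has added a series of modern features such as sparse inodes, the reverse-mapping B-Tree (rmapbt), and the reference count B-Tree (refcountbt).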

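2. XFS On-Disk Layout: Allocation Group Design

2.1 Overall structure

XFS divides the disk space into a number of equally sized Allocation Groups (AGs). Each AG is a largely independent space management unit that contains: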

AG 0                      AG 1                     AG N
+--------+---+---+---+    +--------+---+---+---+
| SB copy| AGF| AGI|AGFL|  | SB copy| AGF| AGI|AGFL|  ...
+--------+---+---+---+    +--------+---+---+---+
| inode chunks           | | inode chunks           |
| free space B-Trees     | | free space B-Trees     |
| data blocks            | | data blocks            |
+------------------------+ +------------------------+

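SB (Superblock): block 0 of each AG. Only the superblock in AG 0 is authoritative; the copies in the other AGs are redundant backups that are rewritten only on events such as growfs.
AGF (AG Freespace Header): records free-block information for the AG, including the roots of the two free-space B-Trees.
AGI (AG Inode Header): records inode allocation information for the AG, including the root of the inode B-Tree.
AGFL (AG Freelist): a fixed-size array of block pointers that serves as an emergency reserve of free blocks for B-Tree splits.

A single AG is at most 1 TiB in size (the limit enforced on sb_agblocks), so the AG count ranges from a handful on small filesystems to several hundred on large ones.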

2.2 Superblock structure

The superblock is defined in fs/xfs/libxfs/xfs_format.h. The in-core version is struct xfs_sb and the on-disk version is struct xfs_dsb:

// fs/xfs/libxfs/xfs_format.h

#define XFS_SB_MAGIC    0x58465342  /* 'XFSB' */
#define XFS_SB_VERSION_5  5         /* CRC enabled filesystem */

typedef struct xfs_sb {
    uint32_t    sb_magicnum;    /* magic number == XFS_SB_MAGIC */
    uint32_t    sb_blocksize;   /* logical block size, bytes */
    xfs_rfsblock_t  sb_dblocks; /* number of data blocks */
    xfs_rfsblock_t  sb_rblocks; /* number of realtime blocks */
    xfs_rtblock_t   sb_rextents;/* number of realtime extents */
    uuid_t      sb_uuid;        /* user-visible file system unique id */
    xfs_fsblock_t   sb_logstart;/* starting block of log if internal */
    xfs_ino_t   sb_rootino;     /* root inode number */
    xfs_agblock_t   sb_agblocks;/* size of an allocation group */
    xfs_agnumber_t  sb_agcount; /* number of allocation groups */
    uint16_t    sb_versionnum;  /* header version == XFS_SB_VERSION */
    uint16_t    sb_inodesize;   /* inode size, bytes */
    uint64_t    sb_icount;      /* allocated inodes */
    uint64_t    sb_ifree;       /* free inodes */
    uint64_t    sb_fdblocks;    /* free data blocks */

    /* version 5 superblock fields start here */
    uint32_t    sb_features_compat;
    uint32_t    sb_features_ro_compat;
    uint32_t    sb_features_incompat;
    uint32_t    sb_features_log_incompat;
    uint32_t    sb_crc;         /* superblock crc */
    xfs_lsn_t   sb_lsn;        /* last write sequence */
    uuid_t      sb_meta_uuid;   /* metadata file system unique id */
} xfs_sb_t;

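The v5 superblock (XFS_SB_VERSION_5) adds sb_crc, a CRC32c checksum over the superblock itself, and sb_lsn, the log sequence number of the last write to the superblock. Log recovery compares such per-structure LSNs against the log so that it does not replay a change older than what is already on disk.

Typical bits in the incompatible feature mask (sb_features_incompat):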

// fs/xfs/libxfs/xfs_format.h

#define XFS_SB_FEAT_INCOMPAT_FTYPE      (1 << 0)  /* filetype in dirent */
#define XFS_SB_FEAT_INCOMPAT_SPINODES   (1 << 1)  /* sparse inode chunks */
#define XFS_SB_FEAT_INCOMPAT_META_UUID  (1 << 2)  /* metadata UUID */
#define XFS_SB_FEAT_INCOMPAT_BIGTIME    (1 << 3)  /* large timestamps */
#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */
#define XFS_SB_FEAT_INCOMPAT_NREXT64    (1 << 5)  /* large extent counters */

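Read-only compatible features (sb_features_ro_compat) include the free inode B-Tree (finobt), the reverse-mapping B-Tree (rmapbt), reflink (which brings the reference count B-Tree), and inobt block counts: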

#define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)  /* free inode btree */
#define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)  /* reverse map btree */
#define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)  /* reflinked files */
#define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)  /* inobt block counts */
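
The point of splitting features into incompat and ro_compat masks is that a driver can decide how to mount a filesystem from the bits it does not recognize. The following is a minimal user-space sketch of that decision, not the kernel's code: the `FEAT_*` names, `demo_mount_mode()`, and the `mount_mode` enum are hypothetical, and only the bit values are taken from the listings above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values copied from the incompat listing; names are illustrative. */
#define FEAT_INCOMPAT_FTYPE       (1u << 0)
#define FEAT_INCOMPAT_SPINODES    (1u << 1)
#define FEAT_INCOMPAT_META_UUID   (1u << 2)
#define FEAT_INCOMPAT_BIGTIME     (1u << 3)
#define FEAT_INCOMPAT_NEEDSREPAIR (1u << 4)
#define FEAT_INCOMPAT_NREXT64     (1u << 5)

/* Hypothetical: every incompat bit this sketch claims to understand. */
#define FEAT_INCOMPAT_KNOWN \
    (FEAT_INCOMPAT_FTYPE | FEAT_INCOMPAT_SPINODES | \
     FEAT_INCOMPAT_META_UUID | FEAT_INCOMPAT_BIGTIME | \
     FEAT_INCOMPAT_NEEDSREPAIR | FEAT_INCOMPAT_NREXT64)

static_assert(FEAT_INCOMPAT_KNOWN == 0x3fu, "six known incompat bits");

enum mount_mode { MOUNT_REFUSE, MOUNT_RO_ONLY, MOUNT_RW };

/*
 * Unknown incompat bit    -> the on-disk format cannot be interpreted safely.
 * Unknown ro_compat bit   -> reading is safe, writing is not.
 * ro_known is the set of ro_compat bits the caller understands.
 */
static enum mount_mode
demo_mount_mode(uint32_t incompat, uint32_t ro_compat, uint32_t ro_known)
{
    if (incompat & ~FEAT_INCOMPAT_KNOWN)
        return MOUNT_REFUSE;
    if (ro_compat & ~ro_known)
        return MOUNT_RO_ONLY;
    return MOUNT_RW;
}
```

For example, `demo_mount_mode(FEAT_INCOMPAT_FTYPE, 0, 0xf)` yields `MOUNT_RW`, while any incompat bit above bit 5 yields `MOUNT_REFUSE`.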

2.3 AGF and AGI structures

The AGF (AG Freespace Header) manages the free blocks of its AG through two B-Trees: the BNO tree, sorted by block number, and the CNT tree, sorted by extent size:

// fs/xfs/libxfs/xfs_format.h

typedef struct xfs_agf {
    __be32  agf_magicnum;   /* magic number == XFS_AGF_MAGIC */
    __be32  agf_versionnum; /* header version == XFS_AGF_VERSION */
    __be32  agf_seqno;      /* sequence # starting from 0 */
    __be32  agf_length;     /* size in blocks of a.g. */
    __be32  agf_roots[XFS_BTNUM_AGF];   /* root blocks */
    __be32  agf_levels[XFS_BTNUM_AGF];  /* btree levels */
    __be32  agf_flfirst;    /* first freelist block's index */
    __be32  agf_fllast;     /* last freelist block's index */
    __be32  agf_flcount;    /* count of blocks in freelist */
    __be32  agf_freeblks;   /* total free blocks */
    __be32  agf_longest;    /* longest free space */
    __be32  agf_btreeblks;  /* # of blocks held in AGF btrees */
    uuid_t  agf_uuid;       /* uuid of filesystem */
    __be64  agf_lsn;        /* last write sequence */
    __be32  agf_crc;        /* crc of agf sector */
} xfs_agf_t;

The AGI (AG Inode Header) manages the inodes of its AG. It holds the inobt root and 64 hash buckets of unlinked inodes:

// fs/xfs/libxfs/xfs_format.h

typedef struct xfs_agi {
    __be32  agi_magicnum;   /* magic number == XFS_AGI_MAGIC */
    __be32  agi_seqno;      /* sequence # starting from 0 */
    __be32  agi_length;     /* size in blocks of a.g. */
    __be32  agi_count;      /* count of allocated inodes */
    __be32  agi_root;       /* root of inode btree */
    __be32  agi_level;      /* levels in inode btree */
    __be32  agi_freecount;  /* number of free inodes */
    __be32  agi_newino;     /* new inode just allocated */
    __be32  agi_unlinked[XFS_AGI_UNLINKED_BUCKETS];  /* 64 buckets */
    uuid_t  agi_uuid;
    __be32  agi_crc;
    __be32  agi_free_root;  /* root of the free inode btree */
    __be32  agi_free_level; /* levels in free inode btree */
} xfs_agi_t;

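The agi_unlinked array heads the lists of inodes that have been unlink()ed but not yet freed because a process still holds them open. These lists are central to crash recovery: log recovery walks them to finish cleaning up such inodes.

2.4 Per-AG in-core cache: xfs_perag

The kernel keeps one struct xfs_perag per AG. It caches key AGF/AGI fields so that operations do not have to read the disk every time: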

// fs/xfs/libxfs/xfs_ag.h

struct xfs_perag {
    struct xfs_mount *pag_mount;    /* owner filesystem */
    xfs_agnumber_t  pag_agno;       /* AG this structure belongs to */
    atomic_t        pag_ref;        /* passive reference count */
    atomic_t        pag_active_ref; /* active reference count */
    uint8_t         pagf_levels[XFS_BTNUM_AGF]; /* btree levels */
    uint32_t        pagf_flcount;   /* count of blocks in freelist */
    xfs_extlen_t    pagf_freeblks;  /* total free blocks */
    xfs_extlen_t    pagf_longest;   /* longest free space */
    xfs_agino_t     pagi_freecount; /* number of free inodes */
    xfs_agino_t     pagi_count;     /* number of allocated inodes */

    /* Precalculated geometry info */
    xfs_agblock_t   block_count;
    xfs_agblock_t   min_block;
    xfs_agino_t     agino_min;
    xfs_agino_t     agino_max;
};

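With this design, the allocation decision (which AG to allocate from) can be made purely in memory; only the actual allocation has to lock the target AG and touch the disk.

3. Core Data Structure: the XFS Inode

3.1 On-disk inode versus in-core inode

XFS stores small objects directly inside the inode: the contents of small directories, short symbolic links, and small extended attributes live inline in the inode (local format). Regular file data is not stored inline. For files with few extents, the extent list is stored directly in the inode's data fork; only when the number of extents exceeds what fits there is the extent information moved into a B-Tree.

The in-core XFS inode structure (xfs_inode_t) is defined in fs/xfs/xfs_inode.h: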

// fs/xfs/xfs_inode.h

typedef struct xfs_inode {
    /* Inode linking and identification information. */
    struct xfs_mount    *i_mount;   /* fs mount struct ptr */
    struct xfs_dquot    *i_udquot;  /* user dquot */
    struct xfs_dquot    *i_gdquot;  /* group dquot */
    struct xfs_dquot    *i_pdquot;  /* project dquot */

    /* Inode location stuff */
    xfs_ino_t           i_ino;      /* inode number (agno/agino) */
    struct xfs_imap     i_imap;     /* location for xfs_imap() */

    /* Extent information. */
    struct xfs_ifork    *i_cowfp;   /* copy on write extents */
    struct xfs_ifork    i_df;       /* data fork */
    struct xfs_ifork    i_af;       /* attribute fork */

    /* Transaction and locking information. */
    struct xfs_inode_log_item *i_itemp; /* logging information */
    struct rw_semaphore i_lock;     /* inode lock */
    atomic_t            i_pincount; /* inode pin count */

    spinlock_t          i_flags_lock;
    unsigned long       i_flags;    /* see defined flags below */
    uint64_t            i_delayed_blks; /* count of delay alloc blks */
    xfs_fsize_t         i_disk_size;/* number of bytes in file */
    xfs_rfsblock_t      i_nblocks;  /* # of direct & btree blocks */
    prid_t              i_projid;   /* owner's project id */
    xfs_extlen_t        i_extsize;  /* basic/minimum extent size */
    union {
        xfs_extlen_t    i_cowextsize; /* basic cow extent size */
        uint16_t        i_flushiter;  /* incremented on flush */
    };
    uint8_t             i_forkoff;  /* attr fork offset >> 3 */
    uint16_t            i_diflags;  /* XFS_DIFLAG_... */
    uint64_t            i_diflags2; /* XFS_DIFLAG2_... */
    struct timespec64   i_crtime;   /* time created */

    /* unlinked list pointers */
    xfs_agino_t         i_next_unlinked;
    xfs_agino_t         i_prev_unlinked;

    /* VFS inode */
    struct inode        i_vnode;    /* embedded VFS inode */

    spinlock_t          i_ioend_lock;
    struct work_struct  i_ioend_work;
    struct list_head    i_ioend_list;
} xfs_inode_t;

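xfs_inode_t embeds a struct inode (the generic Linux VFS inode). The kernel converts between the two with the XFS_I() and VFS_I() helpers; XFS_I() is built on container_of.

3.2 inode forks: data fork and attribute fork

XFS inodes use forks to manage file data extents and extended attribute extents separately, via i_df (data fork) and i_af (attribute fork). The format of each fork is given by its if_format field: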

// fs/xfs/libxfs/xfs_inode_fork.h

struct xfs_ifork {
    int64_t         if_bytes;       /* bytes in if_u1 */
    struct xfs_btree_block *if_broot; /* file's incore btree root */
    unsigned int    if_seq;         /* fork mod counter */
    int             if_height;      /* height of the extent tree */
    union {
        void        *if_root;       /* extent tree root */
        char        *if_data;       /* inline file data */
    } if_u1;
    xfs_extnum_t    if_nextents;    /* # of extents in this fork */
    short           if_broot_bytes; /* bytes allocated for root */
    int8_t          if_format;      /* format of this fork */
    uint8_t         if_needextents; /* extents have not been read */
};

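if_format takes one of three main values:

XFS_DINODE_FMT_LOCAL: the contents are stored inline in the inode's literal area (small directories, short symbolic links, small extended attributes).
XFS_DINODE_FMT_EXTENTS: the extent list is stored directly in the inode; if_u1.if_root points to the in-memory extent tree.
XFS_DINODE_FMT_BTREE: there are too many extents to fit in the inode, so a B-Tree is used and if_broot points to its root.

The i_forkoff field determines how the inode literal area is divided between the data fork and the attribute fork: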

// fs/xfs/xfs_inode.h

static inline unsigned int xfs_inode_fork_boff(struct xfs_inode *ip)
{
    return ip->i_forkoff << 3;
}

static inline unsigned int xfs_inode_data_fork_size(struct xfs_inode *ip)
{
    if (xfs_inode_has_attr_fork(ip))
        return xfs_inode_fork_boff(ip);
    return XFS_LITINO(ip->i_mount);
}

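i_forkoff is in units of 8 bytes, so when an attribute fork exists, i_forkoff << 3 is the number of bytes available to the data fork in the literal area; without one, the data fork gets the whole literal area.

3.3 Inode number encoding

An XFS inode number is 64 bits wide: the high bits are the AG number and the low bits are the inode number within the AG (agino), which in turn encodes the block within the AG and the inode's slot within that block. The inode number therefore carries its own location, and no separate table mapping inode numbers to disk positions is needed. This is one of the notable design differences from ext2/3/4, which place inodes in fixed, preallocated inode tables.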
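
The encoding can be made concrete with a small decoder. This is a user-space sketch under stated assumptions: `struct demo_geom`, `struct demo_inoloc`, and `demo_ino_decode()` are hypothetical names, and the two shift widths stand in for the superblock's log2 values of blocks per AG and inodes per block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: log2(blocks per AG) and log2(inodes per block). */
struct demo_geom {
    unsigned agblklog;
    unsigned inopblog;
};

/* Where an inode lives: AG, block within the AG, slot within the block. */
struct demo_inoloc {
    uint32_t agno;
    uint32_t agbno;
    uint32_t offset;
};

static struct demo_inoloc
demo_ino_decode(const struct demo_geom *g, uint64_t ino)
{
    unsigned agino_bits = g->agblklog + g->inopblog;
    struct demo_inoloc loc;
    uint64_t agino;

    assert(agino_bits > 0 && agino_bits <= 32);

    /* Low bits: AG-relative inode number; high bits: AG number. */
    agino = ino & ((UINT64_C(1) << agino_bits) - 1);
    loc.agno = (uint32_t)(ino >> agino_bits);

    /* The agino splits again into block-in-AG and slot-in-block. */
    loc.agbno = (uint32_t)(agino >> g->inopblog);
    loc.offset = (uint32_t)(agino & ((UINT64_C(1) << g->inopblog) - 1));
    return loc;
}
```

With `agblklog = 16` and `inopblog = 3` (for instance 4096-byte blocks holding eight 512-byte inodes), inode number 128 decodes to AG 0, block 16, slot 0.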

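4. Extent Management and B-Trees

4.1 Representing an extent: the bmbt record

XFS uses an extent (bmbt record) to describe the mapping from a run of contiguous logical file blocks to physical disk blocks. One extent takes only 16 bytes (128 bits), tightly packing four fields:

[127]     state      - 1-bit flag (0 = normal, 1 = unwritten/preallocated)
[126:73]  startoff   - 54-bit logical file offset (in blocks)
[72:21]   startblock - 52-bit filesystem block number
[20:0]    blockcount - 21-bit extent length (in blocks)

An unwritten extent means disk space has been preallocated (for example by fallocate()) but no data has been written yet. Reads return zeros instead of stale disk contents, and once a write completes, a transaction converts the extent to the written state.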
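
To show how four fields fit into 128 bits, here is a hedged pack/unpack sketch. It assumes the layout documented in xfs_format.h: one flag bit at the top, then 54 bits of startoff, 52 bits of startblock, and 21 bits of blockcount. The `demo_*` types and functions are hypothetical, and the two words are kept in host byte order, whereas the on-disk record is big-endian.

```c
#include <assert.h>
#include <stdint.h>

/* Unpacked extent, loosely modeled on the kernel's in-core record. */
struct demo_irec {
    uint64_t startoff;    /* logical file offset, in blocks (54 bits) */
    uint64_t startblock;  /* filesystem block number (52 bits) */
    uint32_t blockcount;  /* length in blocks (21 bits) */
    int      unwritten;   /* 1 = preallocated, not yet written */
};

/* Packed record: l0 is the high word, l1 the low word (host-endian here). */
struct demo_rec {
    uint64_t l0;
    uint64_t l1;
};

static struct demo_rec demo_pack(const struct demo_irec *r)
{
    struct demo_rec d;

    assert(r->startoff < (UINT64_C(1) << 54));
    assert(r->startblock < (UINT64_C(1) << 52));
    assert(r->blockcount < (UINT32_C(1) << 21));

    /* l0: flag in bit 63, startoff in bits 62..9, top 9 startblock bits. */
    d.l0 = ((uint64_t)(r->unwritten != 0) << 63) |
           (r->startoff << 9) |
           (r->startblock >> 43);
    /* l1: low 43 startblock bits in bits 63..21, blockcount in 20..0. */
    d.l1 = (r->startblock << 21) | r->blockcount;
    return d;
}

static struct demo_irec demo_unpack(const struct demo_rec *d)
{
    struct demo_irec r;

    r.unwritten = (int)(d->l0 >> 63);
    r.startoff = (d->l0 >> 9) & ((UINT64_C(1) << 54) - 1);
    r.startblock = ((d->l0 & 0x1ff) << 43) | (d->l1 >> 21);
    r.blockcount = (uint32_t)(d->l1 & ((UINT64_C(1) << 21) - 1));
    return r;
}
```

The 21-bit blockcount is why a single extent covers at most 2^21 - 1 blocks; longer contiguous runs need several records.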

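4.2 The B-Tree framework (xfs_btree)

All XFS B-Trees share one underlying framework, defined in fs/xfs/libxfs/xfs_btree.h. Pointers inside tree nodes (union xfs_btree_ptr) come in two forms: short pointers (__be32, AG-relative block numbers) and long pointers (__be64, filesystem-wide block numbers):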

// fs/xfs/libxfs/xfs_btree.h

union xfs_btree_ptr {
    __be32  s;  /* short form ptr - AG-relative block number */
    __be64  l;  /* long form ptr  - filesystem block number  */
};

union xfs_btree_key {
    struct xfs_bmbt_key     bmbt;
    xfs_bmdr_key_t          bmbr;   /* bmbt root block */
    xfs_alloc_key_t         alloc;
    struct xfs_inobt_key    inobt;
    struct xfs_rmap_key     rmap;
    struct xfs_refcount_key refc;
};

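The B-Tree block header (struct xfs_btree_block) records the magic number, the level (0 for leaves), numrecs, and the left and right sibling pointers. On v5 filesystems it additionally carries UUID, LSN, and CRC fields for integrity verification.

4.3 bmbt (Block Map B-Tree)

When a file has more extents than can be stored inline in the inode, XFS moves the extent list into the bmbt (Block Map B-Tree). The bmbt uses long pointers (filesystem-wide block numbers); keys are logical file offsets and records are the full 128-bit extent descriptors.

fs/xfs/libxfs/xfs_bmap_btree.h defines the addressing macros for bmbt blocks: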

// fs/xfs/libxfs/xfs_bmap_btree.h

// Length of the bmbt block header (longer on v5 filesystems, which carry a CRC)
#define XFS_BMBT_BLOCK_LEN(mp) \
    (xfs_has_crc(((mp))) ? \
        XFS_BTREE_LBLOCK_CRC_LEN : XFS_BTREE_LBLOCK_LEN)

// Address of the index-th record (1-based) in a leaf block
#define XFS_BMBT_REC_ADDR(mp, block, index) \
    ((xfs_bmbt_rec_t *) \
        ((char *)(block) + \
         XFS_BMBT_BLOCK_LEN(mp) + \
         ((index) - 1) * sizeof(xfs_bmbt_rec_t)))

// Address of the index-th pointer (1-based) in an interior node
#define XFS_BMBT_PTR_ADDR(mp, block, index, maxrecs) \
    ((xfs_bmbt_ptr_t *) \
        ((char *)(block) + \
         XFS_BMBT_BLOCK_LEN(mp) + \
         (maxrecs) * sizeof(xfs_bmbt_key_t) + \
         ((index) - 1) * sizeof(xfs_bmbt_ptr_t)))

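The bmbt root is handled specially: it lives in the inode fork itself rather than in a block of its own. In memory, i_df.if_broot points to it (the bmap root, or broot for short). Only the levels below the root occupy separate disk blocks; when the root fills up, the tree grows a new level beneath it.

4.4 Free-space B-Trees (allocbt)

Each AG maintains two free-space B-Trees, both using short pointers (AG-relative block numbers):

BNO tree: sorted by starting block number; used to merge adjacent free regions.
CNT tree: sorted by extent size; used for best-fit allocation.

The roots and heights of both trees are stored in the AGF (agf_roots[], agf_levels[]). An allocation updates both trees together to keep them consistent.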
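
Sorting by size is what makes best-fit cheap: the first record that is large enough is also the tightest fit. The sketch below models the CNT tree as a plain sorted array and is only an illustration of that lookup; `struct demo_free` and `demo_best_fit()` are hypothetical, and the real allocator walks a B-Tree and applies alignment and locality rules that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One free extent, as both free-space trees record it. */
struct demo_free {
    uint32_t startblock;
    uint32_t blockcount;
};

/*
 * cnt[] must be sorted by ascending blockcount (ties by startblock),
 * mirroring the CNT tree's key order. Returns the index of the smallest
 * free extent that can hold `want` blocks, or -1 if none is big enough.
 */
static ptrdiff_t
demo_best_fit(const struct demo_free *cnt, size_t n, uint32_t want)
{
    size_t lo = 0, hi = n;

    assert(want > 0);

    /* Lower-bound binary search on blockcount. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cnt[mid].blockcount < want)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo < n ? (ptrdiff_t)lo : -1;
}
```

After carving `want` blocks out of the chosen extent, the remainder would be reinserted into both orderings, which is why the two trees are always updated together.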

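4.5 inobt and finobt

Each AG also has two inode-related B-Trees:

inobt (Inode B-Tree): keyed by agino, it records the allocation state of inode chunks. Each record covers 64 consecutive inodes, with a 64-bit mask marking which of them are free.
finobt (Free Inode B-Tree): present only when XFS_SB_FEAT_RO_COMPAT_FINOBT is enabled. It indexes only the chunks that have free inodes, which greatly speeds up the search during inode allocation.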
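
A chunk record can be pictured as a start inode plus a free bitmask. The following sketch allocates the lowest free inode from one such record; `struct demo_inobt_rec` and `demo_alloc_from_chunk()` are hypothetical simplifications (the on-disk record has further fields, for example for sparse chunks), and a set bit is assumed to mean "free".

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INODES_PER_CHUNK 64

/* Simplified inode chunk record: 64 inodes starting at startino. */
struct demo_inobt_rec {
    uint32_t startino;   /* AG-relative number of the first inode */
    uint32_t freecount;  /* number of set bits in `free` */
    uint64_t free;       /* bit i set => inode startino + i is free */
};

/*
 * Take the lowest-numbered free inode from the chunk.
 * Returns its AG-relative inode number, or -1 if the chunk is full.
 */
static int64_t demo_alloc_from_chunk(struct demo_inobt_rec *rec)
{
    unsigned i;

    for (i = 0; i < DEMO_INODES_PER_CHUNK; i++) {
        uint64_t bit = UINT64_C(1) << i;

        if (rec->free & bit) {
            assert(rec->freecount > 0);
            rec->free &= ~bit;   /* mark as allocated */
            rec->freecount--;
            return (int64_t)rec->startino + i;
        }
    }
    return -1;
}
```

A chunk whose mask reaches zero has nothing left to offer, which is the case the finobt avoids visiting at all.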

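5. The Log Subsystem (XFS Journal)

5.1 Log architecture overview

XFS uses WAL (Write-Ahead Logging) to guarantee crash consistency. The XFS log records metadata only and never file data, whereas ext4 can optionally journal data as well (data=journal). Every metadata change must reach the log before the modified metadata is written back to its final location.

The XFS log system is made up of the following layers: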

Application-level transaction
   ↓
xfs_trans (transaction management)
   ↓
CIL (Committed Item List, delayed logging)
   ↓
iclog (In-Core Log, ring of buffers)
   ↓
On-disk log

5.2 Log record format and the LSN

Each log record has a fixed 512-byte header:

// fs/xfs/libxfs/xfs_log_format.h

typedef struct xlog_rec_header {
    __be32  h_magicno;      /* log record identifier */
    __be32  h_cycle;        /* write cycle of log */
    __be32  h_version;      /* LR version */
    __be32  h_len;          /* len in bytes; 64-bit aligned */
    __be64  h_lsn;          /* lsn of this LR */
    __be64  h_tail_lsn;     /* lsn of 1st LR with uncommitted buffers */
    __le32  h_crc;          /* crc of log record */
    __be32  h_prev_block;   /* block number to previous LR */
    __be32  h_num_logops;   /* number of log operations in this LR */
    __be32  h_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE];
    __be32  h_fmt;          /* format of log record */
    uuid_t  h_fs_uuid;      /* uuid of FS */
    __be32  h_size;         /* iclog size */
} xlog_rec_header_t;

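The LSN (Log Sequence Number) is the central concept of the XFS log. It is a 64-bit integer whose high 32 bits are the cycle (how many times the log has wrapped) and whose low 32 bits are the block offset within the log: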

// fs/xfs/libxfs/xfs_log_format.h

#define CYCLE_LSN(lsn) ((uint)((lsn)>>32))
#define BLOCK_LSN(lsn) ((uint)(lsn))

static inline xfs_lsn_t xlog_assign_lsn(uint cycle, uint block)
{
    return ((xfs_lsn_t)cycle << 32) | block;
}

Because LSNs increase monotonically, XFS can tell which metadata updates are already durable simply by comparing LSNs.
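
Comparing two LSNs means comparing cycles first and block offsets second. The sketch below is a user-space analogue of that rule; the `demo_*` names are hypothetical, and the comparison returns -1/0/1, which need not match the exact values the kernel's own comparison helper returns.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t demo_lsn_t;

/* High 32 bits: how many times the log has wrapped. */
static uint32_t demo_cycle(demo_lsn_t lsn)
{
    return (uint32_t)((uint64_t)lsn >> 32);
}

/* Low 32 bits: block offset inside the log. */
static uint32_t demo_block(demo_lsn_t lsn)
{
    return (uint32_t)lsn;
}

static demo_lsn_t demo_assign_lsn(uint32_t cycle, uint32_t block)
{
    /* Keep the signed 64-bit value non-negative in this sketch. */
    assert(cycle <= 0x7fffffffu);
    return (demo_lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Returns -1 if a is older than b, 1 if newer, 0 if equal. */
static int demo_lsn_cmp(demo_lsn_t a, demo_lsn_t b)
{
    if (demo_cycle(a) != demo_cycle(b))
        return demo_cycle(a) < demo_cycle(b) ? -1 : 1;
    if (demo_block(a) != demo_block(b))
        return demo_block(a) < demo_block(b) ? -1 : 1;
    return 0;
}
```

An LSN from cycle 6, block 0 therefore compares as newer than one from cycle 5, block 100, even though its block offset is smaller: the log wrapped in between.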

Every log operation carries an xlog_op_header:

// fs/xfs/libxfs/xfs_log_format.h

typedef struct xlog_op_header {
    __be32  oh_tid;         /* transaction id of operation */
    __be32  oh_len;         /* bytes in data region */
    __u8    oh_clientid;    /* who sent me this (XFS_TRANSACTION=0x69) */
    __u8    oh_flags;       /* XLOG_START_TRANS / XLOG_COMMIT_TRANS ... */
    __u16   oh_res2;        /* 32 bit align */
} xlog_op_header_t;

#define XLOG_START_TRANS    0x01  /* Start a new transaction */
#define XLOG_COMMIT_TRANS   0x02  /* Commit this transaction */
#define XLOG_CONTINUE_TRANS 0x04  /* Cont this trans into new region */
#define XLOG_END_TRANS      0x10  /* End a continued transaction */

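5.3 The In-Core Log (iclog) ring

The central buffer for log writes is xlog_in_core (iclog). The iclogs form a circular list of 2 to 8 buffers (logbufs mount option, default 8), each 16 KiB to 256 KiB in size (logbsize mount option, default 32 KiB):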

// fs/xfs/xfs_log_priv.h

typedef struct xlog_in_core {
    wait_queue_head_t   ic_force_wait;  /* waiters for a synchronous log force */
    wait_queue_head_t   ic_write_wait;  /* waiters for write completion */
    struct xlog_in_core *ic_next;
    struct xlog_in_core *ic_prev;
    struct xlog         *ic_log;
    u32                 ic_size;        /* total size of the iclog */
    u32                 ic_offset;      /* current write offset */
    enum xlog_iclog_state ic_state;     /* state machine */
    unsigned int        ic_flags;
    void                *ic_datap;
    struct list_head    ic_callbacks;   /* commit callback list */

    atomic_t            ic_refcnt ____cacheline_aligned_in_smp;
    xlog_in_core_2_t    *ic_data;       /* data area in on-disk format */
    struct bio          ic_bio;
    struct bio_vec      ic_bvec[];
} xlog_in_core_t;

The iclog state machine:

// fs/xfs/xfs_log_priv.h

enum xlog_iclog_state {
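    XLOG_STATE_ACTIVE,      /* currently accepting writes */
    XLOG_STATE_WANT_SYNC,   /* closed to new writes, waiting to be submitted */
    XLOG_STATE_SYNCING,     /* being written to disk */
    XLOG_STATE_DONE_SYNC,   /* write to disk has completed */
    XLOG_STATE_CALLBACK,    /* completion callbacks are running */
    XLOG_STATE_DIRTY,       /* dirty, waiting to be reused */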
};
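
Read top to bottom, the enum describes the life cycle of one buffer in the ring. The sketch below models only that normal progression and is an assumption-laden simplification: the `DEMO_*` enum and `demo_iclog_next()` are hypothetical, and the kernel's real transitions also handle shutdown and other special cases that are not shown.

```c
#include <assert.h>

/* Mirror of the six states above, plus a count for wrap-around. */
enum demo_iclog_state {
    DEMO_ACTIVE,
    DEMO_WANT_SYNC,
    DEMO_SYNCING,
    DEMO_DONE_SYNC,
    DEMO_CALLBACK,
    DEMO_DIRTY,
    DEMO_NSTATES
};

/*
 * Happy-path successor of a state: each iclog moves through the states
 * in order and becomes ACTIVE again once it has been cleaned for reuse.
 */
static enum demo_iclog_state demo_iclog_next(enum demo_iclog_state s)
{
    assert((int)s >= 0 && (int)s < DEMO_NSTATES);
    return (enum demo_iclog_state)(((int)s + 1) % DEMO_NSTATES);
}
```

Because several iclogs exist, one buffer can be in SYNCING while another is ACTIVE, which is what lets log writes overlap with new transaction activity.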

5.4 Log tickets

Every transaction must reserve log space before it writes to the log. The reservation is tracked by an xlog_ticket:

// fs/xfs/xfs_log_priv.h

typedef struct xlog_ticket {
    struct list_head    t_queue;    /* reserve/write queue */
    struct task_struct  *t_task;    /* task that owns this ticket */
    xlog_tid_t          t_tid;      /* transaction identifier */
    atomic_t            t_ref;      /* ticket reference count */
    int                 t_curr_res; /* current reservation */
    int                 t_unit_res; /* unit reservation */
    char                t_ocnt;     /* original unit count */
    char                t_cnt;      /* current unit count */
    uint8_t             t_flags;    /* XLOG_TIC_PERM_RESERV etc. */
    int                 t_iclog_hdrs;
} xlog_ticket_t;

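The XLOG_TIC_PERM_RESERV flag marks a permanent reservation: the transaction can be rolled (committed and continued) without reserving space from scratch each time, which matters for long transaction chains such as directory operations.

Log space accounting maintains the grant heads with atomic operations, avoiding a global lock: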

// fs/xfs/xfs_log.c

static void
xlog_grant_sub_space(
    struct xlog     *log,
    atomic64_t      *head,
    int             bytes)
{
    int64_t head_val = atomic64_read(head);
    int64_t new, old;

    do {
        int cycle, space;
        xlog_crack_grant_head_val(head_val, &cycle, &space);

        space -= bytes;
        if (space < 0) {
            space += log->l_logsize;
            cycle--;
        }

        old = head_val;
        new = xlog_assign_grant_head_val(cycle, space);
        head_val = atomic64_cmpxchg(head, old, new);
    } while (head_val != old);
}

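The CAS loop here (atomic64_cmpxchg) is the lock-free core of XFS log space accounting: multiple CPUs can update a grant head concurrently without taking a lock, and a CPU that loses the race simply retries with the fresh value.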
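
The same pattern can be tried outside the kernel with C11 atomics. This is a user-space analogue, not the kernel function: `demo_pack_head()`, `demo_crack_head()`, and `demo_grant_sub_space()` are hypothetical names, and the grant head is assumed to pack a cycle in the high 32 bits and a byte offset in the low 32 bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Pack (cycle, space) into one 64-bit word so it can be swapped atomically. */
static int64_t demo_pack_head(int cycle, int space)
{
    return ((int64_t)cycle << 32) | (uint32_t)space;
}

static void demo_crack_head(int64_t val, int *cycle, int *space)
{
    *cycle = (int)(val >> 32);
    *space = (int)(val & 0xffffffff);
}

/* Move the head back by `bytes`, wrapping into the previous cycle. */
static void
demo_grant_sub_space(_Atomic int64_t *head, int logsize, int bytes)
{
    int64_t old = atomic_load(head);
    int64_t new_val;

    assert(bytes >= 0 && bytes <= logsize);

    do {
        int cycle, space;

        demo_crack_head(old, &cycle, &space);
        space -= bytes;
        if (space < 0) {
            space += logsize;
            cycle--;
        }
        new_val = demo_pack_head(cycle, space);
        /* On failure, `old` is reloaded with the current value: retry. */
    } while (!atomic_compare_exchange_weak(head, &old, new_val));
}
```

Packing both fields into a single word is the design choice that makes this work: cycle and space always change together in one atomic step, so no reader can observe a half-updated head.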

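5.5 CIL (Committed Item List): delayed logging

Modern XFS uses the CIL (Committed Item List) mechanism, known as delayed logging (introduced in Linux 2.6.35 and made the default in 2.6.39), to defer the actual log writes of transactions until checkpoint time. This greatly reduces the log write amplification caused by many small transactions: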

// fs/xfs/xfs_log_priv.h

struct xfs_cil_ctx {
    struct xfs_cil      *cil;
    xfs_csn_t           sequence;       /* chkpt sequence # */
    xfs_lsn_t           start_lsn;      /* first LSN of chkpt commit */
    xfs_lsn_t           commit_lsn;     /* chkpt commit record lsn */
    struct xlog_in_core *commit_iclog;
    struct xlog_ticket  *ticket;
    atomic_t            space_used;     /* aggregate size of regions */
    struct list_head    busy_extents;   /* busy extents in chkpt */
    struct list_head    log_items;      /* log items in chkpt */
    struct list_head    lv_chain;       /* logvecs being pushed */
    struct work_struct  push_work;
};

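How the CIL works: when a transaction commits, the contents of its dirty log items are copied into in-memory log vectors and the items are attached to the per-CPU CIL lists. Only when the CIL has grown large enough (the background push threshold is the smaller of 1/8 of the log size and 32 MB) or a push is forced is the whole batch written to the iclogs as a single checkpoint. This merges many small random writes into large sequential ones and markedly improves I/O efficiency.

5.6 Log recovery

When the filesystem is mounted again after a crash, XFS performs log recovery (xfs_log_recover.c). The recovery flow:

Scan the log and use the cycle numbers to find the range of valid log records (from h_tail_lsn to the most recent commit record).
Replay the log in transaction order: extract the operations of each committed transaction and write the resulting state back into the metadata buffers.
Process the agi_unlinked lists, freeing the inodes and blocks that were unlinked but not yet released before the crash.
Once the recovered metadata has been written back, the log tail can move forward; the XLOG_UNMOUNT_TYPE record that marks the log as clean is written at the next clean unmount.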

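6. Delayed Allocation

Delayed allocation (delalloc) is the key mechanism that lets XFS keep fragmentation low under both sequential and random writes.

Traditional filesystems allocate disk blocks at write() time. XFS does not: it allocates real disk blocks only when the dirty pages in the page cache are about to be flushed (at writeback), and in the meantime holds the space with a delalloc extent, a special in-memory reservation.

The benefit is clear: when an application appends to a file, XFS sees the whole written range by writeback time and can allocate contiguous disk blocks in one go, instead of allocating a scattered small block for every 4 KB write.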
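
The effect can be illustrated with a toy model that only reserves ranges at write time and merges appends into the previous range. It is a deliberately small sketch, not how the kernel tracks delalloc extents: `struct demo_delalloc`, `demo_delalloc_write()`, and the fixed array size are hypothetical, and only the append pattern is merged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_RANGES 16

/* A reserved-but-unallocated range of file blocks. */
struct demo_range {
    uint64_t start;
    uint64_t len;
};

struct demo_delalloc {
    struct demo_range r[DEMO_MAX_RANGES];
    size_t n;
};

/*
 * Record a buffered write of `len` blocks at file block `start`.
 * No disk block is chosen here; an append just extends the last range.
 * Returns 0 on success, -1 if the toy table is full.
 */
static int
demo_delalloc_write(struct demo_delalloc *d, uint64_t start, uint64_t len)
{
    assert(len > 0);

    if (d->n > 0) {
        struct demo_range *last = &d->r[d->n - 1];

        if (last->start + last->len == start) {
            last->len += len;   /* contiguous append: grow in place */
            return 0;
        }
    }
    if (d->n == DEMO_MAX_RANGES)
        return -1;
    d->r[d->n].start = start;
    d->r[d->n].len = len;
    d->n++;
    return 0;
}
```

Three one-block appends leave a single three-block range, so a writeback pass over this table would ask the allocator for one contiguous extent instead of three separate blocks.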

The related file size update logic lives in fs/xfs/xfs_aops.c:

// fs/xfs/xfs_aops.c

int
xfs_setfilesize(
    struct xfs_inode    *ip,
    xfs_off_t           offset,
    size_t              size)
{
    struct xfs_mount    *mp = ip->i_mount;
    struct xfs_trans    *tp;
    xfs_fsize_t         isize;
    int                 error;

    error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
    if (error)
        return error;

    xfs_ilock(ip, XFS_ILOCK_EXCL);
    isize = xfs_new_eof(ip, offset + size);
    if (!isize) {
        xfs_iunlock(ip, XFS_ILOCK_EXCL);
        xfs_trans_cancel(tp);
        return 0;
    }

    ip->i_disk_size = isize;
    xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
    xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

    return xfs_trans_commit(tp);
}

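Here xfs_trans_log_inode records the inode change in the transaction and xfs_trans_commit commits the transaction to the CIL. All of this runs in the completion path after the data I/O has finished, so the logged file size never points past data that has not reached the disk. This is WAL (Write-Ahead Logging) at work in XFS.

The implementation of fsync() also shows the WAL semantics clearly: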

// fs/xfs/xfs_file.c

static int
xfs_fsync_flush_log(
    struct xfs_inode    *ip,
    bool                datasync,
    int                 *log_flushed)
{
    int     error = 0;
    xfs_csn_t seq;

    xfs_ilock(ip, XFS_ILOCK_SHARED);
    seq = xfs_fsync_seq(ip, datasync);
    if (seq) {
        error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
                                  log_flushed);
        spin_lock(&ip->i_itemp->ili_lock);
        ip->i_itemp->ili_fsync_fields = 0;
        spin_unlock(&ip->i_itemp->ili_lock);
    }
    xfs_iunlock(ip, XFS_ILOCK_SHARED);
    return error;
}

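This helper covers only the metadata half of fsync(). Its caller, xfs_file_fsync(), first writes back and waits for the file's dirty pages, and then forces the log (xfs_log_force_seq) up to the sequence that last modified the inode. Because metadata changes are already recorded in the log, forcing the log is enough to make them durable; the inode does not have to be written in place.

7. Parallelism in XFS

XFS's performance under concurrency comes from systematic parallel design, not just from lock tuning.

7.1 AG-level parallelism

Each AG has its own:

AGF and AGI headers, each serialized by its own buffer lock
Free-space B-Trees
inode B-Trees
perag (in-core per-AG state)

When creating a file, XFS picks an AG according to a load-balancing policy (file counts, free space) and then confines the operation to that AG, without interfering with operations in other AGs. On multi-core machines this lets concurrent creation of many files scale close to linearly.

7.2 inode lock design

The XFS inode lock (i_lock) is a read-write semaphore (struct rw_semaphore; older kernels wrapped it in mrlock_t) rather than a plain spinlock, so multiple readers can hold it concurrently: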

// fs/xfs/xfs_inode.h

struct rw_semaphore i_lock;     /* inode lock */

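XFS further refines the inode lock semantics, using different modes for different operations:

XFS_ILOCK_SHARED: for read operations and for reading the commit sequence in fsync.
XFS_ILOCK_EXCL: for write operations, extent allocation, and size updates.

7.3 AIL (Active Item List)

The AIL (Active Item List) is the per-filesystem list of log items whose changes have been committed to the on-disk log but not yet written back in place, sorted by LSN. It drives metadata writeback: the xfsaild kernel thread periodically scans the AIL, pushes the dirty metadata with the oldest LSNs back to disk, then advances the log tail LSN and frees log space.

7.4 Parallel work queues (pwork)

fs/xfs/xfs_pwork.c implements a set of parallel work queues for running filesystem operations across multiple CPUs, such as the parallel inode walk used by quotacheck.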

8. XFS Quotas and Inode Management

8.1 Quota types

XFS supports three quota types, and the inode in xfs_inode.h references the corresponding dquots directly:

struct xfs_dquot    *i_udquot;  /* user dquot */
struct xfs_dquot    *i_gdquot;  /* group dquot */
struct xfs_dquot    *i_pdquot;  /* project dquot */

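Project quotas originated in XFS (ext4 has since gained them as well). They apply a quota to a directory tree rather than to a user, and are commonly used for virtualized storage and multi-tenant setups.

8.2 The dquot structure

struct xfs_dquot is defined in fs/xfs/xfs_dquot.h and tracks quota resources along three dimensions: blocks, inodes, and realtime blocks: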

// fs/xfs/xfs_dquot.h

struct xfs_dquot {
    struct list_head    q_lru;
    struct xfs_mount    *q_mount;
    xfs_dqtype_t        q_type;
    xfs_dqid_t          q_id;

    struct xfs_dquot_res q_blk;     /* regular blocks */
    struct xfs_dquot_res q_ino;     /* inodes */
    struct xfs_dquot_res q_rtb;     /* realtime blocks */

    struct xfs_dq_logitem q_logitem;
    xfs_qcnt_t          q_prealloc_lo_wmark;
    xfs_qcnt_t          q_prealloc_hi_wmark;
};

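Each xfs_dquot_res has the fields reserved (reserved plus used), count (used), hardlimit, softlimit, and timer (when the grace period expires).

8.3 Inode cache and deferred inactivation

XFS manages the in-memory inode cache in xfs_icache.c and implements deferred inode inactivation through the per-CPU xfs_inodegc structure: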

// fs/xfs/xfs_mount.h

struct xfs_inodegc {
    struct llist_head   list;
    struct delayed_work work;
    unsigned int        items;
    unsigned int        shrinker_hits;
};

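When the VFS drops its last reference to an inode that still needs work (for example freeing an unlinked inode), XFS does not do that work immediately. It queues the inode on the per-CPU xfs_inodegc list and lets a background worker process the queue in batches. This lowers the latency of operations such as unlink() and improves single-threaded file deletion performance.

9. Common XFS Administration Tools

XFS comes with a complete set of user-space tools (xfsprogs) covering checking, repair, backup, and more:

Tool	Function
`mkfs.xfs`	Creates an XFS filesystem; can set the AG count, block size, inode size, stripe unit, and more
`xfs_info`	Shows the geometry of a mounted XFS filesystem (AG count, block size, and so on)
`xfs_admin`	Changes filesystem parameters (UUID, label, feature flags)
`xfs_repair`	Repairs a damaged XFS filesystem offline; `fsck.xfs` itself is only a stub that does nothing
`xfsdump` / `xfsrestore`	XFS-specific backup and restore tools with support for incremental backups and extended attributes
`xfs_db`	XFS debugger for inspecting on-disk structures directly (superblock, AG headers, inodes, and so on)
`xfs_io`	File-level I/O debugging tool supporting fallocate, hole punching, fiemap, and more
`xfs_growfs`	Grows a mounted filesystem online (growing only; shrinking is not generally supported)
`xfs_freeze`	Freezes and thaws the filesystem, for snapshots
`xfs_scrub`	Online metadata consistency check (more thorough when rmapbt is enabled)

Common command examples: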

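# Create an XFS filesystem with 4K blocks and 512-byte inodes
mkfs.xfs -b size=4096 -i size=512 /dev/sdb

# Show the filesystem geometry
xfs_info /mnt/xfs

# Inspect the extent list of an inode (for debugging)
xfs_db -r /dev/sdb -c "inode 128" -c "bmap"

# Grow online (enlarge the partition first, then the filesystem)
xfs_growfs /mnt/xfs

# Repair offline (the filesystem must be unmounted first)
xfs_repair /dev/sdb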

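10. XFS Performance Tuning

10.1 Format-time parameters

Choosing sensible parameters at mkfs.xfs time has a lasting effect on performance:

**-b size=N**: block size, default 4096. mkfs.xfs accepts 512 to 65536, but the Linux 6.4 kernel cannot mount a filesystem whose block size exceeds the page size, so on 4 KiB-page systems such as x86-64 the practical maximum is 4096. For random small I/O (for example databases), keep 4096.
**-i size=N**: inode size, default 512; can be set to 1024 or 2048. A larger inode holds more inline extents and so delays the switch to B-Tree format, at the cost of more space per inode.
**-d su=N,sw=M**: RAID stripe unit and stripe width. When set correctly, XFS aligns AG boundaries and data allocations to the stripe geometry and avoids writes that straddle stripes.
**-l size=N**: log size. A larger log (up to 2 GB) can hold more outstanding transactions and reduces how often metadata must be flushed to free log space, but it lengthens crash recovery.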

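10.2 Mount options

noatime      - disable atime updates, reducing metadata writes
logbsize=N   - log buffer (iclog) size; 256k (the maximum) is recommended
allocsize=N  - speculative preallocation size, e.g. allocsize=64m can reduce fragmentation
largeio      - optimize for large I/O by setting the preferred I/O size reported by stat
inode64      - allow inodes anywhere on the disk instead of only in the first 1 TB (the default since Linux 3.7)

10.3 Runtime tuning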

# Show XFS statistics
cat /proc/fs/xfs/stat

# Adjust the readahead size (affects sequential read performance)
blockdev --setra 16384 /dev/sdb

# For NVMe, the I/O scheduler can be changed
echo none > /sys/block/nvme0n1/queue/scheduler

10.4 Recommendations for typical scenarios

Scenario	Key parameters
Large-file sequential I/O (video, backup)	`allocsize=1g`, `noatime`
High-IOPS random I/O (databases)	`-b size=4096`, `-i size=2048`, `logbsize=256k`
Huge numbers of small files (mail, code repositories)	`-i size=512 -i maxpct=25`, `noatime`
RAID arrays	`-d su=<stripe_unit>,sw=<stripe_width>`

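11. XFS vs Ext4 vs Btrfs

Feature	XFS	Ext4	Btrfs
Max filesystem size	8 EiB	1 EiB	16 EiB
Max file size	8 EiB	16 TiB	16 EiB
Journaling model	Metadata-only WAL	Metadata / data / mixed	CoW (no traditional journal)
Online resize	Grow only	Grow online; shrink offline only	Grow and shrink
Online defragmentation	Supported (`xfs_fsr`)	Supported (`e4defrag`)	Supported (`btrfs filesystem defragment`; optional `autodefrag`)
CoW / reflink	Supported (v5, reflink)	Not supported	Core feature
Snapshots	Not supported	Not supported	Supported
Checksums	Metadata CRC only	Metadata CRC only	Data and metadata
Built-in RAID	Not supported	Not supported	Supported (RAID 5/6 unstable)
Concurrency scaling	Excellent (AG parallelism)	Good	Fair (CoW write amplification)
Large-file sequential I/O	Excellent	Good	Good
Huge numbers of small files	Good	Good	Weaker (high metadata overhead)
Crash consistency	High (production proven)	High (production proven)	Fairly high (still improving)
Typical use	High-performance storage, enterprise workloads	General purpose, desktop	Workloads needing snapshots/CoW

Summary:

XFS remains the first choice on Linux for large files, high throughput, and high concurrency (and is the RHEL default). Its AG-parallel architecture and mature log subsystem deliver predictable performance and reliability.
Ext4 is the solid choice for general-purpose use, has the most mature tooling, and is widely used on desktops and Debian-family distributions.
Btrfs's CoW and snapshot features give it an edge where data protection and flexible storage are needed, but its write amplification under I/O-intensive workloads deserves attention.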

References

Linux 6.4-rc1 kernel source: fs/xfs/
XFS Filesystem Structure (official on-disk format documentation)
XFS Project Wiki
man 5 xfs, man 8 mkfs.xfs, man 8 xfs_repair
Dave Chinner, "XFS: How XFS Works", 2019 Linux Storage & Filesystem Conference