Linux Networking Deep Dive: Introduction and Roadmap
Complete guide to Linux kernel networking - from hardware interrupts to application delivery. Learn packet flow, debugging techniques, and performance optimization.
When you run top, you can see the wa value. What does it mean, and how is it calculated?
%Cpu(s): 0.1 us, 2.4 sy, 0.0 ni, 93.6 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
Values related to processor utilization are displayed on the third line. They provide insight into exactly what the CPUs are doing.
- us: percent of time spent running user processes.
- sy: percent of time spent running the kernel.
- ni: percent of time spent running processes with manually configured nice values.
- id: percent of time spent idle (if high, the CPUs are mostly unoccupied).
- wa: percent of time spent waiting for I/O (if high, the CPU is often blocked on I/O access).
- hi: percent of time spent handling hardware interrupts.
- si: percent of time spent handling software interrupts.
- st: percent of time a virtual CPU spends waiting for access to a physical CPU.

In the context of the top command on Unix-like operating systems, the wa field represents the percentage of time the CPU spends waiting for I/O operations to complete. It stands for "waiting for I/O."
The calculation includes the time the CPU is idle while waiting for data to be read from or written to storage devices such as hard drives or SSDs. High values in the “wa” field may indicate that the system is spending a significant amount of time waiting for I/O operations to complete, which could be a bottleneck if not addressed.
The formula for calculating the “wa” value is as follows:
wa% = (time spent waiting for I/O / total CPU time) × 100
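The arithmetic above can be sketched in Go. The sample /proc/stat lines below are invented and the function names are mine, not from top's source; the point is that top samples the cumulative counters twice and divides the iowait delta by the total delta:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPULine parses the cumulative jiffies from a /proc/stat "cpu" line:
// user nice system idle iowait irq softirq steal
func parseCPULine(line string) []uint64 {
	fields := strings.Fields(line)[1:] // drop the "cpu" label
	out := make([]uint64, len(fields))
	for i, f := range fields {
		v, _ := strconv.ParseUint(f, 10, 64)
		out[i] = v
	}
	return out
}

// waPercent computes the iowait percentage between two samples:
// wa% = (delta iowait / delta total) * 100
func waPercent(prev, cur []uint64) float64 {
	var prevTotal, curTotal uint64
	for _, v := range prev {
		prevTotal += v
	}
	for _, v := range cur {
		curTotal += v
	}
	dIOWait := cur[4] - prev[4] // iowait is the 5th field
	dTotal := curTotal - prevTotal
	if dTotal == 0 {
		return 0
	}
	return float64(dIOWait) / float64(dTotal) * 100
}

func main() {
	// Two hypothetical snapshots of the aggregate "cpu" line.
	prev := parseCPULine("cpu 100 0 2400 93600 3900 0 0 0")
	cur := parseCPULine("cpu 101 0 2424 94536 3939 0 0 0")
	fmt.Printf("wa = %.1f%%\n", waPercent(prev, cur))
}
```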
We can also use vmstat 1 to watch the I/O wait state of the system, though here the Procs columns report counts of processes (b is the number of processes blocked in uninterruptible sleep), not percentages:
# vmstat 1
Procs
In summary, a high "wa" value in the output of top suggests that your system is spending a considerable amount of time waiting for I/O operations to complete, which could impact overall system performance. Investigating and optimizing disk I/O can be beneficial in such cases, possibly by improving disk performance, optimizing file systems, or identifying and addressing specific I/O-intensive processes. We can use iotop to find the processes with high or slow I/O in the system:
Current DISK READ: 341.78 M/s | Current DISK WRITE: 340.53 M/s
Running strace on the top command shows that the wa value comes from /proc/stat. The /proc/stat file is produced by the proc ops, and the iowait time is read from kernel_cpustat.cpustat[CPUTIME_IOWAIT], according to the code in fs/proc/stat.c.
static const struct proc_ops stat_proc_ops = {
In the Linux kernel, the calculation of the “wa” (waiting for I/O) value that is reported by tools like top is handled within the kernel’s scheduler. The specific code responsible for updating the CPU usage statistics can be found in the kernel/sched/cputime.c file.
One of the key functions related to this is account_idle_time().
/*
This function is part of the kernel's scheduler and is responsible for updating the various CPU time statistics, including the time spent waiting for I/O. It takes into account different CPU states, such as idle time and time spent waiting for I/O. When threads are blocked waiting for I/O, the idle CPU time is accumulated into CPUTIME_IOWAIT.
Here is a basic outline of how the “wa” time is accounted for in the Linux kernel:
# block/fops.c
# block/blk-core.c
# kernel/sched/core.c
# kernel/time/timer.c
static int
Investigate the blktrace implementation to understand the trace hooks of the block I/O stack.
blktrace uses the Linux kernel tracepoint infrastructure to track requests in flight through the block I/O stack. It traces everything that goes through to block devices, while recording timing information.
First, we use the dd command to generate a write I/O (read I/O works the same way).
// write to a file with direct io by dd command
Use blktrace and blkparse to analyze the results:
# blktrace -d /dev/vda -o res

The following table shows the various actions which may be
The output is long, so only part of it is pasted here. I traced different sectors and behaviors of the block layer.
The blkparse output still makes it hard to see the overall metrics and statistics, so we can use btt to get statistics for the whole trace, as a table and a picture. This is part of the results:
# blkparse -q -i res -d sda.blk.bin
Q------->G------------>I--------->M------------------->D----------------------------->C
Q2Q — time between requests sent to the block layer
Q2G — time from a block I/O is queued to the time it gets a request allocated for it
G2I — time from a request is allocated to the time it is inserted into the device’s queue
Q2M — time from a block I/O is queued to the time it gets merged with an existing request
I2D — time from a request is inserted into the device’s queue to the time it is actually issued to the device (time the I/O is “idle” on the request queue)
M2D — time from a block I/O is merged with an existing request until the request is issued to the device
D2C — service time of the request by the device (time the I/O is “active” in the driver and on the device)
Q2C — total time spent in the block layer for a request
In fact, blktrace's implementation uses various Linux kernel tracepoints to trace the different phases of an I/O. Here are some of the key tracepoints used by blktrace:
- block_bio_queue: hit when a bio is queued. (Q)
- block_getrq: hit when get_request() is called to allocate a request. (G)
- block_sleeprq: hit when a process goes to sleep waiting for a request to become available. (S)
- block_plug: hit when a plug is inserted into the request queue. (P)
- block_unplug: hit when a plug is removed from the request queue. (U)
- block_rq_insert: hit when a request is inserted into the request queue. (I)
- block_rq_issue: hit when a request is issued to the device. (D)
- block_rq_complete: hit when a request is completed. (C)
- block_bio_backmerge: hit when a bio is merged with the last bio in a request. (M)
- block_bio_frontmerge: hit when a bio is merged with the first bio in a request. (F)
- block_bio_bounce: hit when a bio is bounced. (B)

To dig deeper into the block layer, I read the code and found the following:
/*


Bio -> Request -> plug request list -> staging request queue in scheduler -> hardware request queue


These pictures record the main code and function flow from the filesystem through the block layer down to the driver layer. They also show the mapping from each tracepoint to the action letter in the blktrace output. With these tracepoints, we can understand the block I/O flow much more clearly.
environment:
5.15 kernel
steps:
1. Enable cgroup v2 and bfq, and set bfq as the scheduler on the target device.
2. Enable io.cost via /sys/fs/cgroup/io.cost.qos, and set the cgroup io.weight.
3. Start two or more fio processes reading or writing the target device, including at least 1 direct and 1 buffered I/O, each in a different cgroup.
4. Wait 10-30 minutes.
Kernel Dump 1 (Set hardlockup_panic=1 by sysctl):
[ 1408.828715] Call Trace:
Kernel dump 2:
[60977.577690] BUG: spinlock recursion on CPU#21, fio/11770
Kernel dump 3 (enable lockdep):
[ 4398.422037] WARNING: inconsistent lock state
Lock chain:
spin_lock_irq(&bfqd->lock);
blk_mq_sched_try_merge:
adjust_inuse_and_calc_cost:
spin_lock_irq(&ioc->lock);
spin_unlock_irq(&ioc->lock);
spin_lock_irqsave(&bfqd->lock, flags);
…
spin_unlock_irqrestore(&bfqd->lock, flags);
A process running on a CPU holds spin_lock_irq(&bfqd->lock) and then takes a second lock with spin_lock_irq(&ioc->lock). After finishing the calculation, it releases the second lock with spin_unlock_irq(&ioc->lock); spin_unlock_irq unconditionally re-enables interrupts. An IRQ then preempts the CPU, and when the handler reaches bfq_finish_requeue_request it tries to take the first lock with spin_lock_irqsave(&bfqd->lock, flags). That lock is still held by the interrupted process on the same CPU, so this is a deadlock. The lockdep module can detect this case and report the risk in dmesg (stack 3).
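The scenario can be modeled very loosely with ordinary Go mutexes. This is a toy illustration, not kernel code: TryLock stands in for an IRQ handler that, on a real spinlock, would spin forever:

```go
package main

import (
	"fmt"
	"sync"
)

// Toy model of the bfqd->lock self-deadlock. bfqdLock is an ordinary
// Go mutex, not a kernel spinlock, and the "IRQ handler" is just a
// function called while the lock is held, as an interrupt would run
// after spin_unlock_irq re-enables interrupts on the same CPU.
var bfqdLock sync.Mutex

// irqHandler models bfq_finish_requeue_request() running in interrupt
// context: it needs bfqd->lock, which the interrupted task still owns.
func irqHandler() bool {
	if !bfqdLock.TryLock() {
		return false // a real spinlock would spin here forever: deadlock
	}
	bfqdLock.Unlock()
	return true
}

func main() {
	bfqdLock.Lock() // process context: spin_lock_irq(&bfqd->lock)
	// ... adjust_inuse_and_calc_cost() re-enabled IRQs here ...
	if !irqHandler() {
		fmt.Println("IRQ cannot take bfqd->lock held by the preempted task: deadlock")
	}
	bfqdLock.Unlock()
}
```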
This kernel patch fixes the issue in kernel 6.3.13:
https://lwn.net/Articles/937933/
https://www.spinics.net/lists/stable/msg669695.html
In this patch, the fix replaces spin_lock_irq with spin_lock_irqsave in the function adjust_inuse_and_calc_cost, so that unlocking restores the previous interrupt state instead of unconditionally re-enabling interrupts.
We will cherry-pick the patch into our 5.15 kernel.
RAID is a technology for improving the performance and reliability of data storage. The full name is Redundant Array of Inexpensive Disks.
mdadm is currently the standard RAID management tool on Linux. It supports multiple RAID levels and can create and manage arrays, add and remove devices, and show detailed information about RAID devices. It is very powerful.
$yum install mdadm -y
devicemapper can also create RAID devices.
Use dd to create four image files to serve as loop devices:
$dd if=/dev/zero of=disk1.img bs=512 count=4096000
RAID 0 stripes data, writing to multiple disks (at least 2) simultaneously. There is no data redundancy; ideally, each disk is managed by a separate controller.
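The striping rule can be sketched as a tiny address map. This is an illustrative model, not mdadm's code; stripeMap is a hypothetical helper:

```go
package main

import "fmt"

// stripeMap maps a logical chunk number to (disk, chunk offset on that
// disk) for a hypothetical RAID 0 layout: consecutive chunks rotate
// across the member disks.
func stripeMap(chunk, nDisks int) (disk, offset int) {
	return chunk % nDisks, chunk / nDisks
}

func main() {
	// With 2 disks, chunks alternate: disk0, disk1, disk0, ...
	for chunk := 0; chunk < 6; chunk++ {
		d, off := stripeMap(chunk, 2)
		fmt.Printf("logical chunk %d -> disk %d, offset %d\n", chunk, d, off)
	}
}
```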
mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=2 /dev/loop2 /dev/loop3
Check the result:
$fdisk -l
// Create a raid0 device from 4 devices; 16384000 is the total number of sectors across the 4 devices, and 0 is the offset within each device
Check the result:
$lsblk
RAID 1 stores every piece of data twice. If one drive fails, the controller uses the other drive, or simply copies the other drive's data onto the replacement.
The pros and cons of RAID 1 are obvious. It is worth noting that it does not guarantee hot-swappable disks; that is, when a disk fails, the machine may have to be shut down before the disk can be replaced.
$mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/loop2 /dev/loop3
In devicemapper, the mirror target differs from raid1.
// 4096000 is the sector count of a single device (mirroring does not increase capacity), 2 is the number of devices, and '-' means no metadata device
The most common RAID level. It requires at least 3 drives: within each stripe, one block holds parity and the other blocks hold striped data, and the parity blocks are distributed across all drives. When one disk fails, the remaining blocks plus the parity block can be used to recover its data; but if two devices fail within the same stripe, the whole RAID is lost. In other words, RAID 5 tolerates a single drive failure.
The advantage of RAID 5 is that it balances read/write performance and safety, surviving a single drive failure. The drawbacks are a relatively complex implementation and slow recovery; if another drive fails during recovery, the whole RAID is lost.
// 'spare device' designates a hot spare; create a RAID5 from 3 disks
Show the details:
$mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar 7 16:57:39 2017
     Raid Level : raid5
     Array Size : 4093952 (3.90 GiB 4.19 GB)
  Used Dev Size : 2046976 (1999.34 MiB 2096.10 MB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent
    Update Time : Tue Mar 7 16:57:57 2017
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 512K
           Name : st-integrat-node00:1 (local to host st-integrat-node00)
           UUID : 071d2e75:2029e9a2:2e427dd9:d2ab804f
         Events : 18
    Number   Major   Minor   RaidDevice State
       0       7        4        0      active sync   /dev/loop4
       1       7        5        1      active sync   /dev/loop5
       4       7        6        2      active sync   /dev/loop6
       3       7        7        -      spare         /dev/loop7
// Here a raid5 is created from 4 devices with no metadata device, so the size is 3x one disk's sector count, with one disk's worth used for parity.
Check the result:
// 'resync' means synchronization is in progress; 4096000/4096000 means it is fully synced.
RAID 6 improves on RAID 5 by using two parity blocks per stripe.
Creating RAID 6 is similar to RAID 5.
The advantage of RAID 10 is that data recovery is fast, but compared with RAID 5 and 6, it costs more in devices.
$mdadm --create --verbose /dev/md1 --level=10 --raid-devices=4 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7
Check the result:
$mdadm --detail /dev/md1
// Create a RAID10 from 4 drives; half of them are mirrors, so the size is 2x one drive's sector count.
Check the result:
$dmsetup status test-raid10
If you have the latest lvm2 tools, you could try the following.
Deactivate the pool if it is active; before deactivating the pool, you must deactivate the volumes created from it. If a volume was created by lvm, just umount it and run lvchange; but if it was created via devicemapper, you need to remove the volume manually and record its deviceId, deviceName, and table so you can activate it again later. Deactivation is, in effect, the removal of the file links /dev/vg/volume and /dev/mapper/vg-volume.
lvchange -an vg
repair the metadata and activate the pool
lvconvert --repair vg/pool
activate the pool
lvchange -ay vg
Below are the steps which happen while running the consistency check:
But in my lvm (version 2.02.166(2)-RHEL7 (2016-11-16)), repair does not create a new pmspare; it directly uses free VG space to create the new metadata volume:
[root@tosqatest4 ~]# lvs -a silver_vg
So the pmspare device is just a free-space device, not a mirror of the metadata; if it is missing, we can add a PV to the VG for the repair.
With older tools, you need to go through these manual steps:
create a temporary small LV
lvcreate -an -Zn -L10 --name temp vg
replace the pool's metadata volume with this temp LV
lvconvert --thinpool vg/pool --poolmetadata temp
(say ‘y’ to swap)
activate & repair the metadata from the 'temp' volume. You will likely need another volume in which to store the repaired metadata,
so create:
lvcreate -Lat_least_as_big_as_temp --name repaired vg
if everything went fine, visually compare the 'transaction_id' of the repaired metadata (thin_dump /dev/vg/repaired)
swap the deactivated repaired volume back into your thin pool
lvchange -an vg/repaired
try to activate the pool; if it doesn't work, report the further problems.
Metadata space exhaustion can lead to inconsistent thin pool metadata and inconsistent file systems, so the response requires offline checking and repair.
1. Deactivate the thin pool LV, or reboot the system if this is not possible.
2. Repair the thin pool with lvconvert --repair. See "Metadata check and repair".
3. Extend the pool metadata space with lvextend VG/ThinPoolLV_tmeta. See "Manually manage free metadata space of a thin pool LV".
4. Check and repair the file system with fsck.
The KMP algorithm is used for string matching and comes up constantly in computing. Most articles about it are hard to follow. This post draws on Ruan Yifeng's blog post, which is fairly accessible but does not explain concretely how the Next table is computed, only the concept. Here, a more complete implementation and explanation are given.
The basic idea of KMP is very simple. As an example, let us determine whether the string "ABCDABD" occurs in the string "BBC ABCDAB ABCDABCDABDE".
First, compare the first character of the string "BBC ABCDAB ABCDABCDABDE" with the first character of the pattern "ABCDABD". Since B does not match A, move the pattern one position to the right.
Because B does not match A, the pattern moves right again.
This continues until the string has a character equal to the pattern's first character.
Then compare the next character of the string and the pattern; it matches again.
Continue until a character of the string differs from the corresponding character of the pattern.
At this point, the most natural reaction is to move the whole pattern one position to the right and compare again from the beginning. That works, but it is inefficient, because you move the "search position" back to positions already compared and compare them again.
A basic fact: when the space fails to match D, you actually already know that the preceding six characters are "ABCDAB". The idea of KMP is to exploit this knowledge and avoid moving the "search position" back over characters already compared, only ever moving it forward; this is what makes it efficient.
How can this be done? For the pattern, compute a Partial Match Table. How this table is produced is explained later; for now, it is enough to know how to use it.
We know that when the space mismatches D, the preceding six characters "ABCDAB" have matched. The table shows that the partial match value of the last matched character B is 2, so the number of positions to move is given by the formula below:
shift = number of characters matched - corresponding partial match value
Since 6 - 2 = 4, move the pattern 4 positions to the right.
Because the space does not match C, the pattern keeps moving. Now the number of matched characters is 2 ("AB") and the corresponding partial match value is 0, so shift = 2 - 0 = 2, and the pattern moves 2 positions.
Because the space does not match A, move one more position.
Compare position by position until C mismatches D. Then shift = 6 - 2, so move the pattern another 4 positions.
Compare position by position until the last character of the pattern matches: a complete match, and the search is finished. To continue searching (i.e., to find all matches), shift = 7 - 0, so the pattern would move another 7 positions; we will not repeat the process here.
Next, how the Partial Match Table is produced.
First, two concepts: "prefix" and "suffix". The "prefixes" of a string are all of its leading combinations excluding the last character; the "suffixes" are all of its trailing combinations excluding the first character.
The "partial match value" is the length of the longest element shared by the prefixes and the suffixes. Take "ABCDABD" as an example:
- The prefixes and suffixes of "A" are both empty; the length of the common element is 0;
- "AB" has prefixes [A] and suffixes [B]; the common element has length 0;
- "ABC" has prefixes [A, AB] and suffixes [BC, C]; the common element has length 0;
- "ABCD" has prefixes [A, AB, ABC] and suffixes [BCD, CD, D]; the common element has length 0;
- "ABCDA" has prefixes [A, AB, ABC, ABCD] and suffixes [BCDA, CDA, DA, A]; the common element is "A", length 1;
- "ABCDAB" has prefixes [A, AB, ABC, ABCD, ABCDA] and suffixes [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", length 2;
- "ABCDABD" has prefixes [A, AB, ABC, ABCD, ABCDA, ABCDAB] and suffixes [BCDABD, CDABD, DABD, ABD, BD, D]; the common element has length 0.
The essence of "partial matching" is that the head and tail of a string sometimes repeat. For example, "ABCDAB" contains two occurrences of "AB", so its partial match value is 2 (the length of "AB"). When the pattern shifts, moving the first "AB" forward 4 positions (string length - partial match value) lands it on the position of the second "AB".
#### Implementation

From the idea above, the core of KMP is computing the Next table (the partial match table); everything else is just shifting by the formula. Let's see how to construct this table with an algorithm. As described above, Next is the partial match value: the length of the longest element common to the prefixes and suffixes.
Let's first write this code in the most straightforward way, then optimize it:
func Next(str string) []int {
This code computes the next array in a natural coding style, but its complexity is O(n^2), which clearly falls short of the KMP idea. Let's optimize it. By the properties of next, we need not backtrack to 0 each time; instead, we exploit next to backtrack to the already-matched substring, and if that fails, try the previous one. See the algorithm:
func Next(str string) []int {
The improved algorithm runs in O(2n) time.
This style may not look elegant, but it is more readable. Some versions initialize the value to -1 and then merge k := next[q - 1] with k = next[k], and merge next[q] = k + 1 with next[q] = 0. I strongly dislike this kind of showing off: it looks concise but hides the idea of the algorithm and hurts understandability. Likewise, I avoid lambda-style code unless the language optimizes it, because otherwise it only torments whoever reads the code.
With this idea in hand, writing the KMP search itself is very simple:
func kmp(str1, str2 string) bool {
The complexity of the KMP algorithm is O(M+N). Other styles can be more concise, but again, for readability, I would rather be a bit verbose.
func main() {
Running the example above, the total number of loop iterations is 15, of which computing the next array takes only 7.
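Putting the pieces together, here is a self-contained sketch of the optimized next computation plus the search loop in Go. The function names are mine, not the post's original code, but it follows the approach described above:

```go
package main

import "fmt"

// next computes the partial match table in O(n), reusing previously
// computed values instead of re-scanning (the optimization described
// above).
func next(pat string) []int {
	n := make([]int, len(pat))
	k := 0
	for q := 1; q < len(pat); q++ {
		for k > 0 && pat[k] != pat[q] {
			k = n[k-1] // fall back to the next shorter matched prefix
		}
		if pat[k] == pat[q] {
			k++
		}
		n[q] = k
	}
	return n
}

// kmpIndex returns the index of the first occurrence of pat in s, or -1.
func kmpIndex(s, pat string) int {
	if len(pat) == 0 {
		return 0
	}
	n := next(pat)
	q := 0 // number of pattern characters currently matched
	for i := 0; i < len(s); i++ {
		for q > 0 && pat[q] != s[i] {
			q = n[q-1] // shift = matched - partial match value
		}
		if pat[q] == s[i] {
			q++
		}
		if q == len(pat) {
			return i - len(pat) + 1
		}
	}
	return -1
}

func main() {
	fmt.Println(kmpIndex("BBC ABCDAB ABCDABCDABDE", "ABCDABD")) // 15
}
```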
Notes on a kernel bug.
A component called lvconvert to create a storage pool, but QA reported that the process failed and the machine rebooted. A kdump capture produced the following log:
[ 983.179047] BUG kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f) (Tainted: G B ------------ T): Objects remaining in kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f
It looks like a slab cache allocation error during lvconvert. At this point, there was little to go on, so I searched Google. The docker project has a similar case: https://github.com/moby/moby/issues/29879. All three reported sites ran 3.10 kernels, pointing to a kernel problem; upgrading to 4.16.7-1.el7.elrepo.x86_64 made it go away.
An error suddenly appeared during make build:
Sending build context to Docker daemon 220 MB
Trying to run 07c99697b16e gives the same error. The docker log shows: "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n"
8月 21 15:29:01 zhenghc-tos dockerd[2517]: time="2018-08-21T15:29:01.778212288+08:00" level=error msg="Handler for POST /v1.24/containers/79a0906242e0180dfd7aeff30e9fb179f7b7c37f0ba20533f83b4cd40b409d2a/start returned error: invali
Docker felt quite sluggish, so I cleaned up a large number of images, deleted all stopped containers, and dropped the system caches, and the problem went away. The deeper cause was never pinned down: dropping the caches made little actual difference, and docker reported no resource problem. Another odd point is that only this image could not be entered; other images could be started with run -it, and the COPY . /go/src/github.com/transwarp/warpdrive/ step took unusually long.
https://github.com/opencontainers/runc/issues/1130 records a related problem and attributes it to runc.
The Go sort implementation is in /src/go4.org/sort/sort.go.
func Sort(data Interface) {
quickSort and quickSort_func are basically the same, except that the latter takes a user-defined comparison function, so it is not repeated here.
As you can see, the main body is quickSort, so the sort is not guaranteed to be stable. When the slice length is at most 12, it degrades to Shellsort, but the Shellsort does only one pass with a gap of 6, after which it degrades again to simple insertion sort. That part is very simple: i moves from front to back, j moves from i backward, and each step compares j-1 and j, swapping the smaller element toward the front.
When the length exceeds 12, whether maxDepth has reached 0 decides between heapsort and quicksort. maxDepth is computed as:
// maxDepth returns a threshold at which quicksort should switch
It returns twice the current tree height. Each round of quicksort partitioning (doPivot) decrements this value by 1; when it reaches 0, quicksort is assumed to have hit a bad case, and if the remaining data is still longer than 12, switching to heapsort is more efficient.
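The calculation can be reproduced as a sketch; this mirrors the stdlib's loop, which returns twice the number of bits needed to represent n:

```go
package main

import "fmt"

// maxDepth returns 2 * ceil(lg(n+1)): the recursion-depth budget for
// quicksort. Each doPivot round decrements the budget; at 0 the sort
// switches to heapsort to bound the worst case.
func maxDepth(n int) int {
	depth := 0
	for i := n; i > 0; i >>= 1 {
		depth++
	}
	return depth * 2
}

func main() {
	fmt.Println(maxDepth(12))   // 8
	fmt.Println(maxDepth(1000)) // 20
}
```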
Let's first look at the main body of quicksort, and then at the heapsort implementation. partition is the core idea of quicksort; it is implemented in doPivot:
func doPivot(data Interface, lo, hi int) (midlo, midhi int) {