A Deep Dive into Linux Storage and Filesystems (Part 4): The Page Cache and the Buffer Cache

Posted on 2026-04-12 Edited on 2026-04-13 In Linux Kernel , Storage

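The page cache is one of the most performance-critical subsystems in the Linux kernel. It is an in-memory buffering layer in front of the disk, so that the vast majority of file reads and writes never have to touch the disk. Based on the Linux 6.4-rc1 source, this article works through the page cache implementation from its data structures to its core algorithms.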

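1. Design Goals and Basic Principles

1.1 Why a Page Cache Is Needed

Random-access latency on disk (roughly 5-10 ms on a hard drive, roughly 0.1 ms on an SSD) is several orders of magnitude higher than memory access latency (roughly 100 ns). If every file read and write went straight to disk, I/O-intensive applications would be unacceptably slow.

The design goals of the Linux page cache fall into four areas:

Read cache: disk data is kept in memory, so repeated reads of the same data are served directly from memory
Write cache: writes land in cached pages first and the kernel writes them back in batches (writeback), which aggregates writes
Shared memory mappings: multiple processes can share one file's page cache through mmap, avoiding copies between kernel and user space
Readahead: access patterns are used to predict and load upcoming data into the cache early, hiding I/O latency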

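1.2 From the Buffer Cache to a Unified Page Cache

Up to and including Linux 2.2, the kernel maintained two separate caches: the page cache (file data, in units of memory pages) and the buffer cache (raw block-device data, in units of buffer_head). The two were independent, so the same disk data could occupy memory twice, which was a serious waste.

Linux 2.4 unified them: the buffer cache no longer exists on its own. Instead, buffer_head structures attach to page cache pages (folios in current kernels) as metadata describing how the page maps to disk blocks.

1.3 Folio: An Abstraction for the Future

Linux 5.16 introduced struct folio, and 6.4 uses it widely. A folio represents one or more physically contiguous pages (for example, a transparent huge page, THP) and is a higher-level abstraction than struct page. Wherever this article discusses kernel internals, it uses the folio interfaces.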

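2. Core Data Structures

2.1 `struct address_space`: The Core of the File-to-Page Mapping

address_space is the central data structure of the page cache. Every cacheable object (an inode or a block device) has an associated address_space instance, which tracks all of that object's pages in the page cache and supplies the set of operations used on them.

It is defined in include/linux/fs.h: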

/**
 * struct address_space - Contents of a cacheable, mappable object.
 * @host: Owner, either the inode or the block_device.
 * @i_pages: Cached pages.
 * @invalidate_lock: Guards coherency between page cache contents and
 *   file offset->disk block mappings in the filesystem during invalidates.
 *   It is also used to block modification of page cache contents through
 *   memory mappings.
 * @gfp_mask: Memory allocation flags to use for allocating pages.
 * @i_mmap_writable: Number of VM_SHARED mappings.
 * @nr_thps: Number of THPs in the pagecache (non-shmem only).
 * @i_mmap: Tree of private and shared mappings.
 * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
 * @nrpages: Number of page entries, protected by the i_pages lock.
 * @writeback_index: Writeback starts here.
 * @a_ops: Methods.
 * @flags: Error bits and flags (AS_*).
 * @wb_err: The most recent error which has occurred.
 * @private_lock: For use by the owner of the address_space.
 * @private_list: For use by the owner of the address_space.
 * @private_data: For use by the owner of the address_space.
 */
struct address_space {
    struct inode            *host;
    struct xarray           i_pages;
    struct rw_semaphore     invalidate_lock;
    gfp_t                   gfp_mask;
    atomic_t                i_mmap_writable;
#ifdef CONFIG_READ_ONLY_THP_FOR_FS
    /* number of thp, only for non-shmem files */
    atomic_t                nr_thps;
#endif
    struct rb_root_cached   i_mmap;
    struct rw_semaphore     i_mmap_rwsem;
    unsigned long           nrpages;
    pgoff_t                 writeback_index;
    const struct address_space_operations *a_ops;
    unsigned long           flags;
    errseq_t                wb_err;
    spinlock_t              private_lock;
    struct list_head        private_list;
    void                    *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;

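Field details:

Field	Type	Purpose
`host`	`struct inode *`	The inode that owns this address_space (for a block device, the block device's inode)
`i_pages`	`struct xarray`	The core index, keyed by file page offset (`pgoff_t`), holding pointers to all cached folios. XArray replaced the older radix_tree and offers better concurrency support
`invalidate_lock`	`struct rw_semaphore`	Keeps page cache contents coherent with the filesystem's file offset to disk block mappings during truncate and invalidation; also used to block modification of page cache contents through mmap
`gfp_mask`	`gfp_t`	Allocation flags used when allocating cache pages. Filesystems usually clear `__GFP_FS` here (that is, use `GFP_NOFS`) so that reclaim cannot re-enter the filesystem
`i_mmap_writable`	`atomic_t`	Number of `VM_SHARED` mappings of this mapping; while it is non-zero, the read and write paths must take extra care of D-cache coherency
`i_mmap`	`struct rb_root_cached`	Interval tree (built on a red-black tree) of all VMAs (virtual memory areas) that map this file; used for reverse mapping (RMAP), for example to unmap pages on truncate
`nrpages`	`unsigned long`	Number of pages currently cached (protected by the `i_pages` lock)
`writeback_index`	`pgoff_t`	Where the last writeback pass stopped; used by the range_cyclic writeback strategy
`a_ops`	`const struct address_space_operations *`	The set of operations implemented by the filesystem or block device
`flags`	`unsigned long`	Error bits and flags, including `AS_EIO` (I/O error) and `AS_ENOSPC` (out of space)
`wb_err`	`errseq_t`	The most recent writeback error, tracked with the errseq mechanism so that each opener can observe the error independently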

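Why __attribute__((aligned(sizeof(long)))): it guarantees that the structure is aligned to a multiple of the pointer size. The lowest bit of the mapping field in struct page is used to mark anonymous pages (PAGE_MAPPING_ANON = 1), and forced alignment ensures that the lowest bit of an address_space pointer is always 0, so the two cannot be confused.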
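The low-bit trick is easy to try in user space. The following is a minimal sketch rather than kernel code: `struct toy_mapping`, `TOY_MAPPING_ANON`, and the `toy_*` helpers are hypothetical names standing in for `struct address_space` and `PAGE_MAPPING_ANON`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct address_space. The alignment
 * attribute guarantees that bit 0 of any pointer to it is 0. */
struct toy_mapping {
    unsigned long nrpages;
} __attribute__((aligned(sizeof(long))));

#define TOY_MAPPING_ANON ((uintptr_t)1)  /* plays the role of PAGE_MAPPING_ANON */

/* Store a pointer with the "anonymous" tag in its lowest bit. */
static inline uintptr_t toy_tag_anon(const void *p)
{
    /* Tagging is only safe because alignment keeps bit 0 clear. */
    assert(((uintptr_t)p & TOY_MAPPING_ANON) == 0);
    return (uintptr_t)p | TOY_MAPPING_ANON;
}

/* True if the stored word carries the anonymous tag. */
static inline bool toy_is_anon(uintptr_t word)
{
    return (word & TOY_MAPPING_ANON) != 0;
}

/* Strip the tag bit to recover the original pointer. */
static inline void *toy_untag(uintptr_t word)
{
    return (void *)(word & ~TOY_MAPPING_ANON);
}
```

If the structure were allowed to sit at an odd address, a plain pointer could have bit 0 set and would be indistinguishable from a tagged one, which is the confusion the alignment attribute rules out.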

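XArray versus radix tree: i_pages moved from the old radix_tree_root to struct xarray. XArray has better concurrency support (a built-in xa_lock instead of an external spinlock), allows insertion and deletion during iteration, and has built-in marks. PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_WRITEBACK, and PAGECACHE_TAG_TOWRITE are used to tag and quickly find dirty pages and pages under writeback.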

/* XArray tags, for tagging dirty and writeback pages in the pagecache. */
#define PAGECACHE_TAG_DIRTY     XA_MARK_0
#define PAGECACHE_TAG_WRITEBACK XA_MARK_1
#define PAGECACHE_TAG_TOWRITE   XA_MARK_2

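2.2 `struct address_space_operations`: The Operation Set

Every filesystem (ext4, XFS, btrfs, and so on), as well as the block device layer, implements address_space_operations to tell the page cache how to read and write data. It is defined in include/linux/fs.h: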

struct address_space_operations {
    int (*writepage)(struct page *page, struct writeback_control *wbc);
    int (*read_folio)(struct file *, struct folio *);

    /* Write back some dirty pages from this mapping. */
    int (*writepages)(struct address_space *, struct writeback_control *);

    /* Mark a folio dirty.  Return true if this dirtied it */
    bool (*dirty_folio)(struct address_space *, struct folio *);

    void (*readahead)(struct readahead_control *);

    int (*write_begin)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len,
                struct page **pagep, void **fsdata);
    int (*write_end)(struct file *, struct address_space *mapping,
                loff_t pos, unsigned len, unsigned copied,
                struct page *page, void *fsdata);

    /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
    sector_t (*bmap)(struct address_space *, sector_t);
    void (*invalidate_folio) (struct folio *, size_t offset, size_t len);
    bool (*release_folio)(struct folio *, gfp_t);
    void (*free_folio)(struct folio *folio);
    ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
    int (*migrate_folio)(struct address_space *, struct folio *dst,
            struct folio *src, enum migrate_mode);
    int (*launder_folio)(struct folio *);
    bool (*is_partially_uptodate) (struct folio *, size_t from,
            size_t count);
    void (*is_dirty_writeback) (struct folio *, bool *dirty, bool *wb);
    int (*error_remove_page)(struct address_space *, struct page *);

    /* swapfile support */
    int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
                sector_t *span);
    void (*swap_deactivate)(struct file *file);
    int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};

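Key methods:

**read_folio**: reads the contents of a single folio from storage. When a page is missing from the page cache, this method triggers the actual disk I/O (it replaced the older readpage).
**readahead**: batched readahead, more efficient than calling read_folio page by page. The filesystem iterates over the pages to read with readahead_folio() and submits the I/O in one go.
**writepage / writepages**: write dirty pages back to disk. writepages is the batched version and lets the filesystem merge multiple contiguous pages into a single I/O request.
**dirty_folio**: marks a folio dirty. A common implementation is filemap_dirty_folio, which sets the folio's dirty flag, tags it PAGECACHE_TAG_DIRTY in the XArray, and marks the inode dirty so that writeback will find it.
**write_begin / write_end**: the two phases of a buffered write. write_begin prepares the target page (allocating and locking it if needed); write_end finalizes state after the data has been copied.
**invalidate_folio**: called when truncate or a similar operation invalidates a page; it tears down private data on the page such as buffer_heads.
**release_folio**: called during memory reclaim to decide whether the private data on a folio (such as buffer_heads) can be freed.
**direct_IO**: direct I/O that bypasses the page cache (the O_DIRECT flag).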

2.3 `struct file_ra_state`: Readahead State

Every open file (struct file) embeds a file_ra_state that records the readahead window for that file handle:

struct file_ra_state {
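    pgoff_t start;          /* page offset where the current readahead window starts */
    unsigned int size;      /* total number of pages in the current window */
    unsigned int async_size;/* pages in the asynchronous part (tail of the window) */
    unsigned int ra_pages;  /* maximum readahead in pages, copied from the bdi */
    unsigned int mmap_miss; /* page cache misses seen through mmap accesses */
    loff_t prev_pos;        /* position of the last byte of the previous read */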
};

The readahead window, as drawn in the kernel comments:

                    |<----- async_size ---------|
|------------------- size -------------------->|
|==================#===========================|
^start             ^page marked with PG_readahead

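The PG_readahead flag is set on the first page of the asynchronous part. When the application reaches that page, the next round of asynchronous readahead starts, forming a readahead pipeline that overlaps disk I/O with application processing time.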
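The position of the marked page follows directly from the three window fields. Here is a small sketch of that arithmetic, assuming a trimmed-down copy of the structure; `struct toy_ra_state` and the `toy_*` helpers are hypothetical names, not kernel API.

```c
#include <assert.h>

/* Hypothetical, trimmed-down mirror of struct file_ra_state. */
struct toy_ra_state {
    unsigned long start;      /* first page index of the window */
    unsigned int size;        /* window length in pages */
    unsigned int async_size;  /* trailing pages read asynchronously */
};

/* Index of the page that carries PG_readahead: the first page of
 * the async part, i.e. start + size - async_size. */
static inline unsigned long toy_ra_marker_index(const struct toy_ra_state *ra)
{
    assert(ra->async_size <= ra->size);
    return ra->start + ra->size - ra->async_size;
}

/* First page index past the end of the window. */
static inline unsigned long toy_ra_window_end(const struct toy_ra_state *ra)
{
    return ra->start + ra->size;
}
```

For a window with `start = 100`, `size = 32`, and `async_size = 16`, the marker sits on page 116: the reader still has 16 cached pages ahead of it when the next batch of I/O is issued.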

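3. Lookup and Allocation in the Page Cache

3.1 Looking Up the Page Cache

Page cache lookups go through the XArray interface. filemap_get_folio() is the lookup entry point; it searches mapping->i_pages using the page offset as the key:

// Fast path: look up the folio in the XArray with xa_load
folio = xa_load(&mapping->i_pages, index);

The full lookup logic is visible in filemap_get_pages() (mm/filemap.c):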

static int filemap_get_pages(struct kiocb *iocb, size_t count,
        struct folio_batch *fbatch, bool need_uptodate)
{
    struct file *filp = iocb->ki_filp;
    struct address_space *mapping = filp->f_mapping;
    struct file_ra_state *ra = &filp->f_ra;
    pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
    pgoff_t last_index;
    struct folio *folio;
    int err = 0;

    /* "last_index" is the index of the page beyond the end of the read */
    last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
retry:
    if (fatal_signal_pending(current))
        return -EINTR;

    filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
    if (!folio_batch_count(fbatch)) {
        if (iocb->ki_flags & IOCB_NOIO)
            return -EAGAIN;
        page_cache_sync_readahead(mapping, ra, filp, index,
                last_index - index);
        filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
    }
    if (!folio_batch_count(fbatch)) {
        if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
            return -EAGAIN;
        err = filemap_create_folio(filp, mapping,
                iocb->ki_pos >> PAGE_SHIFT, fbatch);
        if (err == AOP_TRUNCATED_PAGE)
            goto retry;
        return err;
    }
    // ... handle the readahead flag and the uptodate state
}

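The lookup path has three levels:

Batched lookup (filemap_get_read_batch): scans contiguous pages directly in the XArray and fills the folio_batch
Synchronous readahead: on a cache miss, calls page_cache_sync_readahead and then looks up again
New page allocation (filemap_create_folio): if readahead still did not populate the cache (for example, on a random read), allocates a new folio and issues a single-folio read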
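Before any of these levels run, filemap_get_pages() converts the byte range of the read into page indices. The sketch below reproduces that arithmetic in user space, assuming 4 KiB pages; `TOY_PAGE_SHIFT`, `struct toy_page_range`, and `toy_read_range()` are hypothetical names.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_page_range {
    unsigned long index;       /* first page touched by the read */
    unsigned long last_index;  /* one past the last page touched */
};

/* Map the byte range [pos, pos + count) to page cache indices, the
 * same arithmetic filemap_get_pages() applies to ki_pos and count. */
static inline struct toy_page_range toy_read_range(unsigned long long pos,
                                                   unsigned long count)
{
    struct toy_page_range r;

    assert(count > 0);
    r.index = (unsigned long)(pos >> TOY_PAGE_SHIFT);
    /* Equivalent of DIV_ROUND_UP(pos + count, PAGE_SIZE). */
    r.last_index = (unsigned long)((pos + count + TOY_PAGE_SIZE - 1)
                                   >> TOY_PAGE_SHIFT);
    return r;
}
```

A two-byte read at offset 4095 straddles a page boundary, so it yields `index = 0` and `last_index = 2`, which is why even tiny reads can require two folios.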

3.2 Adding a New Page to the Page Cache

filemap_add_folio() (mm/filemap.c) inserts a newly allocated folio into the page cache:

int filemap_add_folio(struct address_space *mapping, struct folio *folio,
                pgoff_t index, gfp_t gfp)
{
    void *shadow = NULL;
    int ret;

    __folio_set_locked(folio);
    ret = __filemap_add_folio(mapping, folio, index, gfp, &shadow);
    if (unlikely(ret))
        __folio_clear_locked(folio);
    else {
        /*
         * The folio might have been evicted from cache only
         * recently, in which case it should be activated like
         * any other repeatedly accessed folio.
         * The exception is folios getting rewritten; evicting other
         * data from the working set, only to cache data that will
         * get overwritten with something else, is a waste of memory.
         */
        WARN_ON_ONCE(folio_test_active(folio));
        if (!(gfp & __GFP_WRITE) && shadow)
            workingset_refault(folio, shadow);
        folio_add_lru(folio);
    }
    return ret;
}

The lower-level __filemap_add_folio() does the actual insertion:

noinline int __filemap_add_folio(struct address_space *mapping,
        struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
{
    XA_STATE(xas, &mapping->i_pages, index);
    // ...
    folio_ref_add(folio, nr);    // take the page cache references
    folio->mapping = mapping;    // set the back pointer
    folio->index = xas.xa_index; // record the page offset

    do {
        xas_lock_irq(&xas);
        xas_for_each_conflict(&xas, entry) {
            old = entry;
            if (!xa_is_value(entry)) {
                xas_set_err(&xas, -EEXIST); // a folio is already present
                goto unlock;
            }
        }
        xas_store(&xas, folio);  // store into the XArray
        mapping->nrpages += nr;  // update the page count
        __lruvec_stat_mod_folio(folio, NR_FILE_PAGES, nr); // update the NR_FILE_PAGES statistic
unlock:
        xas_unlock_irq(&xas);
    } while (xas_nomem(&xas, gfp));
    // ...
}

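The shadow mechanism deserves attention. When a page is evicted from the cache, the kernel leaves a "shadow" value in its XArray slot (produced by workingset_eviction()). When the same page is loaded again, workingset_refault() finds the shadow entry and, if the page was evicted only recently, concludes that it belongs to the active working set and puts it straight onto the active LRU list so that it is not evicted again quickly.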

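4. The Page Cache Flow for File Reads

4.1 `filemap_read`: The System Call Read Path

The read() system call passes through the VFS layer to generic_file_read_iter() and then to filemap_read() (mm/filemap.c). This is the standard buffered read path for generic filesystems: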

ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
        ssize_t already_read)
{
    struct file *filp = iocb->ki_filp;
    struct file_ra_state *ra = &filp->f_ra;
    struct address_space *mapping = filp->f_mapping;
    struct inode *inode = mapping->host;
    struct folio_batch fbatch;
    int i, error = 0;
    bool writably_mapped;
    loff_t isize, end_offset;

    if (unlikely(iocb->ki_pos >= inode->i_sb->s_maxbytes))
        return 0;
    if (unlikely(!iov_iter_count(iter)))
        return 0;

    iov_iter_truncate(iter, inode->i_sb->s_maxbytes);
    folio_batch_init(&fbatch);

    do {
        cond_resched();

        if (unlikely(iocb->ki_pos >= i_size_read(inode)))
            break;

        error = filemap_get_pages(iocb, iter->count, &fbatch,
                          iov_iter_is_pipe(iter));
        if (error < 0)
            break;

        isize = i_size_read(inode);
        if (unlikely(iocb->ki_pos >= isize))
            goto put_folios;
        end_offset = min_t(loff_t, isize, iocb->ki_pos + iter->count);

        writably_mapped = mapping_writably_mapped(mapping);

        if (!pos_same_folio(iocb->ki_pos, ra->prev_pos - 1,
                                        fbatch.folios[0]))
            folio_mark_accessed(fbatch.folios[0]);

        for (i = 0; i < folio_batch_count(&fbatch); i++) {
            struct folio *folio = fbatch.folios[i];
            size_t fsize = folio_size(folio);
            size_t offset = iocb->ki_pos & (fsize - 1);
            size_t bytes = min_t(loff_t, end_offset - iocb->ki_pos,
                             fsize - offset);
            size_t copied;

            if (end_offset < folio_pos(folio))
                break;
            if (i > 0)
                folio_mark_accessed(folio);
            if (writably_mapped)
                flush_dcache_folio(folio);  // handle D-cache aliasing

            copied = copy_folio_to_iter(folio, offset, bytes, iter);

            already_read += copied;
            iocb->ki_pos += copied;
            ra->prev_pos = iocb->ki_pos;

            if (copied < bytes) {
                error = -EFAULT;
                break;
            }
        }
put_folios:
        for (i = 0; i < folio_batch_count(&fbatch); i++)
            folio_put(fbatch.folios[i]);
        folio_batch_init(&fbatch);
    } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);

    file_accessed(filp);
    return already_read ? already_read : error;
}

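The filemap_read loop, step by step:

Call filemap_get_pages() to get a batch of ready folios (triggering readahead or synchronous I/O if needed)
Call folio_mark_accessed() on each folio to update its LRU activity
For mappings with writable mmaps, call flush_dcache_folio() to keep the D-cache coherent
Copy data from the kernel pages to the user-space iov_iter with copy_folio_to_iter()
Repeat until the requested byte count has been read or end of file is reached

What folio_mark_accessed() means for the LRU: the read path calls it for each folio it touches (skipping the first folio of a batch when the read continues inside the folio where the previous read ended). The first call only sets the referenced flag; a later call on an inactive folio that is already referenced promotes it to the active LRU. This is the heart of the LRU approximation: the kernel keeps no exact access counts and instead emulates LRU eviction with a two-list scheme (active/inactive).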

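4.2 `filemap_fault`: The mmap Page Fault Path

When a process accesses a file through mmap and takes a page fault, the memory management fault handler calls the mapping's vm_ops->fault method, which for most filesystems is filemap_fault() (mm/filemap.c):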

vm_fault_t filemap_fault(struct vm_fault *vmf)
{
    int error;
    struct file *file = vmf->vma->vm_file;
    struct file *fpin = NULL;
    struct address_space *mapping = file->f_mapping;
    struct inode *inode = mapping->host;
    pgoff_t max_idx, index = vmf->pgoff;
    struct folio *folio;
    vm_fault_t ret = 0;
    bool mapping_locked = false;

    max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
    if (unlikely(index >= max_idx))
        return VM_FAULT_SIGBUS;

    /*
     * Do we have something in the page cache already?
     */
    folio = filemap_get_folio(mapping, index);
    if (likely(!IS_ERR(folio))) {
        /* Page is already cached: kick off async readahead without waiting */
        if (!(vmf->flags & FAULT_FLAG_TRIED))
            fpin = do_async_mmap_readahead(vmf, folio);
        if (unlikely(!folio_test_uptodate(folio))) {
            filemap_invalidate_lock_shared(mapping);
            mapping_locked = true;
        }
    } else {
        /* Major fault: the page is not in the cache */
        count_vm_event(PGMAJFAULT);
        count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
        ret = VM_FAULT_MAJOR;
        fpin = do_sync_mmap_readahead(vmf);  // synchronous readahead
retry_find:
        if (!mapping_locked) {
            filemap_invalidate_lock_shared(mapping);
            mapping_locked = true;
        }
        folio = __filemap_get_folio(mapping, index,
                          FGP_CREAT|FGP_FOR_MMAP,
                          vmf->gfp_mask);
        if (IS_ERR(folio)) {
            if (fpin)
                goto out_retry;
            filemap_invalidate_unlock_shared(mapping);
            return VM_FAULT_OOM;
        }
    }

    // Lock the folio, waiting for I/O to complete...
    if (!lock_folio_maybe_drop_mmap(vmf, folio, &fpin))
        goto out_retry;

    // Check whether the folio was truncated
    if (unlikely(folio->mapping != mapping)) {
        folio_unlock(folio);
        folio_put(folio);
        goto retry_find;
    }

    // Check whether the data is ready
    if (unlikely(!folio_test_uptodate(folio))) {
        if (!mapping_locked) {
            folio_unlock(folio);
            folio_put(folio);
            goto retry_find;
        }
        goto page_not_uptodate;
    }

    // Success: set vmf->page and return VM_FAULT_LOCKED
    vmf->page = folio_file_page(folio, index);
    return ret | VM_FAULT_LOCKED;
    // ...
}

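filemap_fault differs from filemap_read in these key ways:

Minor fault (cache hit): the page is already cached, so it is returned locked (VM_FAULT_LOCKED) and the fault handler maps it into the PTE
Major fault (cache miss): the data must be read from disk; the PGMAJFAULT event is counted and synchronous readahead is triggered
The mmap path does not go through copy_folio_to_iter; the page tables map the process's virtual address directly onto the physical page frame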

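5. Readahead in Detail

5.1 The Design Philosophy of Readahead

The core idea is that sequential reading is the most common I/O pattern, so loading upcoming data early can hide disk I/O latency from the user program almost entirely. The kernel keeps per-file-handle readahead state in file_ra_state and adjusts the readahead window size dynamically.

The comment at the top of mm/readahead.c describes in detail how the readahead window works: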

To overlap application thinking time and disk I/O time, we do ‘readahead pipelining’: Do not wait until the application consumed all readahead pages and stalled on the missing page at readahead_index; Instead, submit an asynchronous readahead I/O as soon as there are only async_size pages left in the readahead window.

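5.2 Initialization and Window Growth

The readahead window grows dynamically from small to large, controlled by two helper functions:

/* Initial window: round the request up to a power of two, then use 4x, 2x, or max */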
static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
    unsigned long newsize = roundup_pow_of_two(size);

    if (newsize <= max / 32)
        newsize = newsize * 4;
    else if (newsize <= max / 4)
        newsize = newsize * 2;
    else
        newsize = max;

    return newsize;
}

/* Next window: small windows grow 4x, medium ones 2x, large ones stay at max */
static unsigned long get_next_ra_size(struct file_ra_state *ra,
                      unsigned long max)
{
    unsigned long cur = ra->size;

    if (cur < max / 16)
        return 4 * cur;
    if (cur <= max / 2)
        return 2 * cur;
    return max;
}

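This exponential growth means that readahead does not waste memory on small files, while for sequential reads of large files the window quickly ramps up to its maximum (set by ra->ra_pages, which is copied from bdi->ra_pages; the default is 128 KB and can be raised to several MB).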
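To see how quickly the window ramps up, the two sizing rules can be replayed in user space. This is a sketch under stated assumptions: the `toy_*` functions are hypothetical re-implementations of the rules shown above, `toy_roundup_pow_of_two()` stands in for the kernel's `roundup_pow_of_two()`, and `toy_next_ra_size()` takes the current size directly instead of a `struct file_ra_state`.

```c
#include <assert.h>

/* Round up to the next power of two (user-space stand-in for the
 * kernel's roundup_pow_of_two()). */
static unsigned long toy_roundup_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0 && n <= (~0UL >> 1) + 1);  /* result must be representable */
    while (p < n)
        p <<= 1;
    return p;
}

/* Same rule as get_init_ra_size(): 4x for small requests,
 * 2x for medium ones, otherwise the maximum. */
static unsigned long toy_init_ra_size(unsigned long size, unsigned long max)
{
    unsigned long newsize = toy_roundup_pow_of_two(size);

    if (newsize <= max / 32)
        return newsize * 4;
    if (newsize <= max / 4)
        return newsize * 2;
    return max;
}

/* Same rule as get_next_ra_size(), taking the current size directly. */
static unsigned long toy_next_ra_size(unsigned long cur, unsigned long max)
{
    if (cur < max / 16)
        return 4 * cur;
    if (cur <= max / 2)
        return 2 * cur;
    return max;
}
```

With `max = 32` pages (128 KB with 4 KB pages) and a one-page first read, these helpers produce the window sizes 4, 8, 16, 32, 32: the maximum is reached after three growth steps. The sizes actually used by the kernel can differ slightly, because ondemand_readahead() may merge the current and next windows.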

5.3 `page_cache_ra_unbounded`: The Core of Readahead I/O

page_cache_ra_unbounded() is the low-level function that actually submits readahead I/O:

void page_cache_ra_unbounded(struct readahead_control *ractl,
        unsigned long nr_to_read, unsigned long lookahead_size)
{
    struct address_space *mapping = ractl->mapping;
    unsigned long index = readahead_index(ractl);
    gfp_t gfp_mask = readahead_gfp_mask(mapping);
    unsigned long i;

    /*
     * Enter a NOFS allocation scope so that page allocation cannot
     * re-enter the filesystem and deadlock.
     */
    unsigned int nofs = memalloc_nofs_save();

    filemap_invalidate_lock_shared(mapping);
    /*
     * Phase 1: preallocate all required pages and add them to the page cache
     */
    for (i = 0; i < nr_to_read; i++) {
        struct folio *folio = xa_load(&mapping->i_pages, index + i);

        if (folio && !xa_is_value(folio)) {
            /* Page already present: submit the current batch and skip it */
            read_pages(ractl);
            ractl->_index++;
            i = ractl->_index + ractl->_nr_pages - index - 1;
            continue;
        }

        folio = filemap_alloc_folio(gfp_mask, 0);
        if (!folio)
            break;
        if (filemap_add_folio(mapping, folio, index + i, gfp_mask) < 0) {
            folio_put(folio);
            read_pages(ractl);
            ractl->_index++;
            i = ractl->_index + ractl->_nr_pages - index - 1;
            continue;
        }
        /* Set PG_readahead on the first page of the async part */
        if (i == nr_to_read - lookahead_size)
            folio_set_readahead(folio);
        ractl->_workingset |= folio_test_workingset(folio);
        ractl->_nr_pages++;
    }

    /*
     * Phase 2: submit the I/O (errors are ignored; a later read_folio will retry)
     */
    read_pages(ractl);
    filemap_invalidate_unlock_shared(mapping);
    memalloc_nofs_restore(nofs);
}

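Why two phases: all pages are allocated and added to the page cache first, and the I/O is submitted afterwards in one go. Allocating and submitting page by page could trigger reclaim writeback in the middle, interleaving reads and writes (kernel comment: This avoids the very bad behaviour which would occur if page allocations are causing VM writeback).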

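5.4 `ondemand_readahead`: The On-Demand Readahead State Machine

ondemand_readahead() is the central decision function of readahead. It picks a readahead strategy based on the current access pattern: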

static void ondemand_readahead(struct readahead_control *ractl,
        struct folio *folio, unsigned long req_size)
{
    struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
    struct file_ra_state *ra = ractl->ra;
    unsigned long max_pages = ra->ra_pages;
    unsigned long add_pages;
    pgoff_t index = readahead_index(ractl);
    pgoff_t expected, prev_index;
    unsigned int order = folio ? folio_order(folio) : 0;

    /* Case 1: start of file, go straight to the initial window */
    if (!index)
        goto initial_readahead;

    /* Case 2: sequential access at the expected position, grow and advance the window */
    expected = round_up(ra->start + ra->size - ra->async_size,
            1UL << order);
    if (index == expected || index == (ra->start + ra->size)) {
        ra->start += ra->size;
        ra->size = get_next_ra_size(ra, max_pages);
        ra->async_size = ra->size;
        goto readit;
    }

    /* Case 3: hit a PG_readahead folio but the state does not match (interleaved reads) */
    if (folio) {
        pgoff_t start;
        rcu_read_lock();
        start = page_cache_next_miss(ractl->mapping, index + 1, max_pages);
        rcu_read_unlock();
        if (!start || start - index > max_pages)
            return;
        ra->start = start;
        ra->size = start - index;
        ra->size += req_size;
        ra->size = get_next_ra_size(ra, max_pages);
        ra->async_size = ra->size;
        goto readit;
    }

    /* Case 4: sequential cache miss (prev_index is adjacent to index) */
    prev_index = (unsigned long long)ra->prev_pos >> PAGE_SHIFT;
    if (index - prev_index <= 1UL)
        goto initial_readahead;

    /* Case 5: detect a sequential pattern from page cache history */
    if (try_context_readahead(ractl->mapping, ra, index, req_size, max_pages))
        goto readit;

    /* Case 6: standalone small random read, read it without polluting readahead state */
    do_page_cache_ra(ractl, req_size, 0);
    return;

initial_readahead:
    ra->start = index;
    ra->size = get_init_ra_size(req_size, max_pages);
    ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;

readit:
    /* Self-triggering optimization: merge the current window with the next one */
    if (index == ra->start && ra->size == ra->async_size) {
        add_pages = get_next_ra_size(ra, max_pages);
        if (ra->size + add_pages <= max_pages) {
            ra->async_size = add_pages;
            ra->size += add_pages;
        } else {
            ra->size = max_pages;
            ra->async_size = max_pages >> 1;
        }
    }

    ractl->_index = ra->start;
    page_cache_ra_order(ractl, ra, order);
}

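The six cases of the readahead state machine cover nearly all real-world access patterns:

Sequential file read: initial window -> exponential growth -> hold at the maximum window
Random read: readahead state is left untouched and only the requested pages are read
Interleaved sequential reads: the stream's position is inferred from page cache history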

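6. Dirty Page Writeback

6.1 How Dirty Pages Arise

When an application writes file data through write() or mmap, the kernel marks the corresponding folio dirty (sets PG_dirty) and, through a_ops->dirty_folio(), tags it in the mapping and queues the inode for writeback. Dirty pages are not written to disk immediately. The kernel uses a delayed-write (write-behind) strategy, letting dirty data accumulate and then writing it back in batches to aggregate write I/O.

6.2 Writeback Workers and Triggers

Each backing device (backing_dev_info) has one or more bdi_writeback structures (wb), and writeback for each wb runs as work items on a kernel workqueue. Writeback is triggered from three sources:

Periodic writeback (kupdate): wakes up every dirty_writeback_interval (5 seconds by default) and writes back dirty data older than dirty_expire_interval (30 seconds by default)
Background writeback: starts in the background once the share of dirty pages exceeds the dirty_background_ratio threshold (10% by default)
Explicit writeback (sync/fsync): system calls such as sync(), fsync(), and fdatasync() trigger synchronous writeback

wb_do_writeback() is the main routine of one writeback worker pass: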

static long wb_do_writeback(struct bdi_writeback *wb)
{
    struct wb_writeback_work *work;
    long wrote = 0;

    set_bit(WB_writeback_running, &wb->state);
    /* Process explicit writeback requests from the work list (such as sync) */
    while ((work = get_next_work_item(wb)) != NULL) {
        trace_writeback_exec(wb, work);
        wrote += wb_writeback(wb, work);
        finish_writeback_work(wb, work);
    }

    /* Check for a pending write-everything request (WB_start_all) */
    wrote += wb_check_start_all(wb);

    /* Check periodic kupdate writeback (WB_REASON_PERIODIC) */
    wrote += wb_check_old_data_flush(wb);

    /* Check background threshold writeback (WB_REASON_BACKGROUND) */
    wrote += wb_check_background_flush(wb);
    clear_bit(WB_writeback_running, &wb->state);

    return wrote;
}

6.3 `wb_writeback`: The Writeback Loop

wb_writeback() is the core function that actually performs writeback:

static long wb_writeback(struct bdi_writeback *wb,
             struct wb_writeback_work *work)
{
    long nr_pages = work->nr_pages;
    unsigned long dirtied_before = jiffies;
    struct inode *inode;
    long progress;
    struct blk_plug plug;

    blk_start_plug(&plug);   // batch I/O submission (lets requests merge)
    spin_lock(&wb->list_lock);
    for (;;) {
        if (work->nr_pages <= 0)    // wrote enough pages, stop
            break;

        /* Background/periodic writeback yields to other queued work */
        if ((work->for_background || work->for_kupdate) &&
            !list_empty(&wb->work_list))
            break;

        /* Background writeback: dirty pages dropped below the threshold, stop */
        if (work->for_background && !wb_over_bg_thresh(wb))
            break;

        /* kupdate: set the expiry cutoff (only write back data dirtied before it) */
        if (work->for_kupdate) {
            dirtied_before = jiffies -
                msecs_to_jiffies(dirty_expire_interval * 10);
        } else if (work->for_background)
            dirtied_before = jiffies;

        if (list_empty(&wb->b_io))
            queue_io(wb, work, dirtied_before);   // move inodes onto b_io

        if (work->sb)
            progress = writeback_sb_inodes(work->sb, wb, work);
        else
            progress = __writeback_inodes_wb(wb, work);

        if (progress)
            continue;

        if (list_empty(&wb->b_more_io))
            break;

        /* Wait for an inode that is under writeback to finish, then retry */
        inode = wb_inode(wb->b_more_io.prev);
        spin_lock(&inode->i_lock);
        spin_unlock(&wb->list_lock);
        inode_sleep_on_writeback(inode);
        spin_lock(&wb->list_lock);
    }
    spin_unlock(&wb->list_lock);
    blk_finish_plug(&plug);   // submit all accumulated I/O

    return nr_pages - work->nr_pages;
}

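What blk_plug does: during writeback, blk_start_plug() holds bios back instead of sending them to the block layer one at a time. When the plug is released (blk_finish_plug()), they are submitted together, so I/O to adjacent blocks can be merged, which improves disk throughput.

6.4 The Three Writeback inode Lists

bdi_writeback keeps three main inode lists (plus b_dirty_time, which holds inodes whose only dirty state is a lazytime timestamp update):

b_dirty: all dirty inodes, ordered by the time they were first dirtied
b_io: inodes to process in the current pass; queue_io() moves eligible inodes from b_dirty onto this list
b_more_io: inodes that still need more writeback after this pass (for example, because their share of the pass was used up or they could not be written right now), parked here to be retried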

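7. The Buffer Cache and Its Relationship to the Page Cache

7.1 `struct buffer_head`: Mapping Disk Blocks to Pages

In the unified page cache architecture, buffer_head is no longer a standalone cache unit. It is metadata describing the page-to-disk-block mapping, attached to a folio in the page cache. It is defined in include/linux/buffer_head.h: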

struct buffer_head {
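    unsigned long b_state;          /* buffer state bitmap (BH_Uptodate, BH_Dirty, ...) */
    struct buffer_head *b_this_page;/* circular list of the page's buffers */
    union {
        struct page *b_page;        /* the page this bh is mapped to */
        struct folio *b_folio;      /* the folio this bh is mapped to */
    };

    sector_t b_blocknr;             /* start block number (logical block on the device) */
    size_t b_size;                  /* size of the mapping */
    char *b_data;                   /* pointer to the data within the page */

    struct block_device *b_bdev;
    bh_end_io_t *b_end_io;          /* I/O completion callback */
    void *b_private;                /* private data for b_end_io */
    struct list_head b_assoc_buffers;/* list linkage when associated with another mapping */
    struct address_space *b_assoc_map;  /* the mapping this buffer is associated with */
    atomic_t b_count;               /* reference count */
    spinlock_t b_uptodate_lock;     /* used by the first bh in a page to serialise
                                     * I/O completion of the other buffers in the page */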
};

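Key b_state bits:

State bit	Meaning
`BH_Uptodate`	The buffer contains valid data
`BH_Dirty`	The data has been modified and must be written back
`BH_Lock`	The buffer is locked (I/O in progress)
`BH_Mapped`	A mapping to a disk block has been established
`BH_New`	The disk block was newly allocated by get_block and has not been initialized yet
`BH_Async_Read` / `BH_Async_Write`	Asynchronous I/O in progress
`BH_Delay`	Delayed allocation: space is reserved, but the block has not yet been allocated on disk (for example, ext4 delalloc)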

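7.2 Multiple buffer_heads on One Page

On filesystems whose block size is smaller than the page size (for example, 1 KB blocks with 4 KB pages), one folio carries several buffer_heads. They form a circular list linked through b_this_page and hang off the page's private field (attached with attach_page_private() or its folio counterpart, folio_attach_private()):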

folio->private
     |
     v
[bh0] -> [bh1] -> [bh2] -> [bh3] -> [bh0] (circular)
 block0    block1   block2   block3

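Each buffer_head records the mapping between a region of the page (the b_data pointer and b_size) and a disk block (b_blocknr). The filesystem fills in these mappings through a get_block_t callback (for example, ext4's ext4_get_block()).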
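The ring structure can be modeled in a few lines of user-space C. This is an illustrative sketch only: `struct toy_bh` is a hypothetical, heavily reduced stand-in for `struct buffer_head`, and the contiguous block numbers are an assumption made for the example, since a real filesystem decides them in its get_block_t callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct buffer_head. */
struct toy_bh {
    struct toy_bh *b_this_page;    /* next buffer in the same page (circular) */
    unsigned long long b_blocknr;  /* disk block number */
    size_t b_size;                 /* bytes covered by this buffer */
    size_t b_offset;               /* byte offset inside the page */
};

/* Link n buffers of blocksize bytes into a ring covering one page.
 * Block numbers are assigned contiguously from first_block purely
 * for illustration. */
static struct toy_bh *toy_link_buffers(struct toy_bh *bhs, size_t n,
                                       size_t blocksize,
                                       unsigned long long first_block)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        bhs[i].b_blocknr = first_block + i;
        bhs[i].b_size = blocksize;
        bhs[i].b_offset = i * blocksize;
        bhs[i].b_this_page = &bhs[(i + 1) % n];  /* last wraps to first */
    }
    return &bhs[0];
}

/* Walk the ring once with the do { ... } while (bh != head) idiom
 * and return the number of bytes the buffers cover. */
static size_t toy_ring_bytes(struct toy_bh *head)
{
    struct toy_bh *bh = head;
    size_t total = 0;

    do {
        total += bh->b_size;
        bh = bh->b_this_page;
    } while (bh != head);
    return total;
}
```

Four 1 KB buffers linked this way cover exactly one 4 KB page, and the walk terminates when it comes back around to the head rather than at a NULL pointer.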

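7.3 Direct I/O Bypasses the Page Cache

When a file is opened with O_DIRECT, reads and writes move data directly between the user buffer and the block device through a_ops->direct_IO() (or, on newer filesystems, the iomap direct I/O path), bypassing the page cache. This removes the copy between kernel and user space and the caching overhead, which suits applications such as databases that manage their own cache. Two caveats apply: requests must be aligned (typically to the device's logical block size), and mixing O_DIRECT with buffered (cached) I/O on the same file can cause coherency problems.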
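Applications usually check alignment before issuing direct I/O. Below is a minimal sketch of such a check; `toy_dio_aligned()` is a hypothetical helper, and the exact alignment a given filesystem and device require is an assumption the caller has to supply (512 or 4096 bytes are common values).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the buffer address, the file offset, and the length
 * are all multiples of align. align must be a power of two, for
 * example the device's logical block size. */
static inline bool toy_dio_aligned(const void *buf, unsigned long long offset,
                                   size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    /* OR the three values together: any low bit set in any of them
     * survives the OR and shows up under the mask. */
    return (((uintptr_t)buf | offset | len) & (align - 1)) == 0;
}
```

A buffer obtained from posix_memalign() with the same alignment, combined with offsets and lengths that are multiples of it, passes this check.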

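8. Handling Memory Pressure in the Page Cache

8.1 The Two-List LRU Approximation

Linux keeps an lruvec per NUMA node (and per memory cgroup when cgroups are in use; kernels before 4.8 kept the lists per zone). Each lruvec holds five LRU lists: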

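LRU_INACTIVE_ANON   - anonymous pages (heap/stack/private mmap), inactive
LRU_ACTIVE_ANON     - anonymous pages, active
LRU_INACTIVE_FILE   - file pages (page cache), inactive
LRU_ACTIVE_FILE     - file pages, active
LRU_UNEVICTABLE     - unevictable pages (mlock and similar)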

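The file page cache uses LRU_INACTIVE_FILE and LRU_ACTIVE_FILE. A page moves between LRU states as follows:

New: filemap_add_folio() calls folio_add_lru() -> added to the inactive list
Access: folio_mark_accessed() -> if the folio is inactive and already referenced, it is promoted to the active list
Reclaim pressure: during reclaim, shrink_active_list() demotes pages from the tail of the active list to the inactive list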
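The access transition is a small two-flag state machine. The sketch below is a simplified model of it, assuming an evictable file folio and ignoring batching and the unevictable case; `struct toy_folio` and `toy_mark_accessed()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-flag model of a file folio's LRU state. */
struct toy_folio {
    bool referenced;  /* models PG_referenced */
    bool active;      /* true when the folio is on the active list */
};

/* Simplified model of folio_mark_accessed(): the first access only
 * sets referenced; a further access to an inactive, already
 * referenced folio promotes it and clears the flag again. */
static inline void toy_mark_accessed(struct toy_folio *f)
{
    assert(f != NULL);
    if (!f->referenced) {
        f->referenced = true;
    } else if (!f->active) {
        f->active = true;
        f->referenced = false;
    }
}
```

A folio therefore needs two accesses to reach the active list, which keeps data that is read exactly once (a large sequential scan, for example) from pushing hot data out.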

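8.2 `folio_check_references`: Deciding Whether a Folio Can Be Reclaimed

Before actually evicting a folio, shrink_folio_list() (named shrink_page_list() before the folio conversion) checks how it has been referenced with folio_check_references():

FOLIOREF_ACTIVATE: actively referenced through PTEs (the Accessed bit is set in page table entries), so reactivate it
FOLIOREF_KEEP: referenced but not worth activating (for example, a single access), so keep it on the inactive list
FOLIOREF_RECLAIM: can be reclaimed (a dirty page is written back first, then reclaimed)
FOLIOREF_RECLAIM_CLEAN: reclaim only if clean; dirty folios are left to the normal writeback path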

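8.3 Workingset Detection

Workingset detection (mm/workingset.c) is an important optimization for page cache reclaim. When a folio is evicted, a shadow value is left in its XArray slot recording the lruvec's eviction counter (its "nonresident age") at that moment. When the same page is loaded again, workingset_refault() computes the refault distance, that is, how far the counter advanced between eviction and refault:

If the distance is short (no larger than the memory the active working set could have used): the page belongs to the working set and goes straight onto the active list, so it is not evicted again quickly
If the distance is long: the page goes onto the inactive list through the normal path

This mechanism is effective against thrashing. When memory is very tight and freshly evicted pages are reloaded again and again, workingset detection activates them directly, cutting down the evict-and-reload cycle.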
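One way to picture the comparison is to model the "time" as a counter that advances with every eviction. The sketch below is a deliberate simplification under stated assumptions: `struct toy_lru` and the `toy_*` helpers are hypothetical, the real shadow entry also packs node and cgroup information, and the real size comparison takes more than the active file list into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-LRU state used for refault decisions. */
struct toy_lru {
    unsigned long age;           /* advances on every eviction */
    unsigned long active_pages;  /* pages currently on the active list */
};

/* On eviction, return the snapshot that would be stored in the
 * shadow entry, then advance the counter. */
static inline unsigned long toy_evict(struct toy_lru *lru)
{
    assert(lru != NULL);
    return lru->age++;
}

/* On refault, the distance is how far the counter moved while the
 * page was out of the cache. If it fits within the active list, the
 * page would have survived had it been active, so activate it. */
static inline bool toy_refault_should_activate(const struct toy_lru *lru,
                                               unsigned long shadow_age)
{
    unsigned long distance = lru->age - shadow_age;  /* unsigned, wraps safely */

    return distance <= lru->active_pages;
}
```

With three active pages, a page that comes back after three evictions is activated, while one that comes back after four is treated as new.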

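9. Page Cache Tuning Parameters

Linux exposes a set of parameters under /proc/sys/vm/ that control page cache behavior. The most important ones follow.

9.1 Dirty Page Writeback

Parameter	Default	Description
`dirty_ratio`	20	Once dirty pages exceed 20% of available memory, writing processes are throttled and must wait for writeback
`dirty_background_ratio`	10	Once dirty pages exceed 10% of available memory, background writeback starts (asynchronous, does not block the process)
`dirty_expire_centisecs`	3000	Dirty data older than 30 seconds is written back by the next periodic pass (unit: hundredths of a second)
`dirty_writeback_centisecs`	500	Wakeup interval of the periodic writeback, 5 seconds by default (0 disables periodic writeback)

Tuning advice: for write-heavy applications, lowering dirty_background_ratio (for example, to 5%) can reduce I/O bursts. Applications with strict durability requirements, such as databases, should use O_DIRECT, O_SYNC, or explicit fsync() rather than rely on delayed writeback.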

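9.2 Memory Pressure and Reclaim

Parameter	Default	Description
`vfs_cache_pressure`	100	Controls how strongly the kernel reclaims VFS caches (dentry/inode) relative to the page cache. At 0 the kernel never reclaims them, which can lead to out-of-memory conditions; values above 100, such as 1000, reclaim them aggressively
`min_free_kbytes`	computed automatically	Minimum amount of free memory the system keeps in reserve; it sets the watermarks below which allocations fall into direct reclaim
`watermark_scale_factor`	10	Controls the spacing between memory watermarks, which affects when background reclaim (kswapd) starts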

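9.3 Readahead Tuning

Parameter	Location	Description
`read_ahead_kb`	`/sys/block/<dev>/queue/read_ahead_kb`	Maximum readahead window for the device, in KB; sets the bdi's `ra_pages`
`posix_fadvise(POSIX_FADV_SEQUENTIAL)`	System call from the application	Hints sequential access; the kernel uses a larger readahead window for that file (twice the default)
`posix_fadvise(POSIX_FADV_RANDOM)`	System call from the application	Disables readahead (for random access)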
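From application code, the two hints in the table are a single call each. The following sketch wraps them; `enum toy_pattern` and `toy_hint_access()` are hypothetical names, while `posix_fadvise()` and the `POSIX_FADV_*` constants are the standard POSIX interface.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_fadvise() under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>   /* EBADF, EINVAL, ...: possible return values */
#include <fcntl.h>

enum toy_pattern { TOY_SEQUENTIAL, TOY_RANDOM };

/* Hint the expected access pattern for the whole file behind fd.
 * Returns 0 on success or an errno value (not -1), which is how
 * posix_fadvise() itself reports errors. */
static inline int toy_hint_access(int fd, enum toy_pattern pattern)
{
    assert(pattern == TOY_SEQUENTIAL || pattern == TOY_RANDOM);
    /* Offset 0 with length 0 means "from the start to the end of the file". */
    return posix_fadvise(fd, 0, 0,
                         pattern == TOY_SEQUENTIAL ? POSIX_FADV_SEQUENTIAL
                                                   : POSIX_FADV_RANDOM);
}
```

A typical caller opens the file, calls `toy_hint_access(fd, TOY_SEQUENTIAL)` once before a large scan, and ignores a non-zero result, since the hint is advisory and the read works either way.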

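9.4 Live Observation

The following interfaces show the state of the page cache:

# Overall memory usage; the buff/cache column (cached in older versions) includes the page cache
free -h

# Detailed memory breakdown
grep -E "Cached|Dirty|Writeback|Mapped" /proc/meminfo

# Block I/O and memory statistics once per second (needs procps-ng)
vmstat -w 1

# Memory mapped by a specific process, including shared file pages
grep -E "^(Size|Rss|Shared_Clean|Shared_Dirty|Private)" /proc/<pid>/smaps

# cachestat/cachetop (from BCC) show the cache hit ratio live
cachestat 1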
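Monitoring tools usually read these counters programmatically rather than through grep. As a sketch of how that might look, the hypothetical helper `toy_meminfo_kb()` below extracts one field from a buffer in /proc/meminfo format; reading the file into the buffer is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find "Key:   <number> kB" in a /proc/meminfo-style text buffer and
 * return the number, or -1 if the key is absent or malformed. */
static long toy_meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    assert(klen > 0);
    while (line && *line) {
        /* The key must start a line and be followed by ':' so that
         * "Cached" does not match the "SwapCached" line. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;  /* step past the newline to the next line */
    }
    return -1;
}
```

Sampling `Dirty` and `Writeback` once per second with a helper like this gives a rough view of how quickly the writeback workers drain dirty data.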

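10. Summary

The Linux page cache is a carefully designed, multi-layered system. Its core design ideas can be summarized as follows:

A unified abstraction: address_space brings files, block devices, shared memory, and other objects under page cache management behind one interface, and address_space_operations provides a flexible polymorphic extension point
XArray as the core index: keyed by file page offset, XArray gives O(log n) lookups and built-in marks (dirty/writeback tags), which make batch processing of dirty and writeback pages efficient
A readahead pipeline: the PG_readahead flag and the file_ra_state window hide I/O latency through pipelining, and the ondemand_readahead state machine covers patterns from sequential to random reads
Delayed writes and batched writeback: dirty pages accumulate and are handled by dedicated writeback workers, with blk_plug merging adjacent I/O to raise write throughput
Two-level LRU and workingset detection: the active/inactive lists approximate LRU, and workingset shadow entries keep working-set pages from being evicted repeatedly, protecting hot data under memory pressure
Buffer cache integration: buffer_head attaches to page cache folios as metadata, which removes double caching and unifies file data caching with block device metadata caching

Understanding how the page cache works is the foundation for Linux storage performance tuning and for diagnosing I/O problems. The next article goes into the VFS layer and analyzes the inode and dentry caches and the full call chain of file operations.

The source analysis in this article is based on Linux 6.4-rc1 (commit ac9a78681b92). Main reference files: mm/filemap.c, mm/readahead.c, fs/fs-writeback.c, fs/buffer.c, include/linux/fs.h, include/linux/pagemap.h, include/linux/buffer_head.h.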