Part 3: The Scheduler - Brain of vLLM

Introduction

The scheduler is vLLM’s central orchestrator. At every iteration, it makes critical decisions: which requests to process, how many tokens to compute, when to preempt requests, and how to maximize GPU utilization. This post dives deep into the scheduler’s algorithms and implementation.

The Scheduling Challenge

The Problem Space

At any moment, the scheduler must handle:

  • Waiting requests: New prompts needing their first tokens
  • Running requests: Ongoing generations in decode phase
  • Resource constraints: Limited GPU memory, compute budget
  • Competing objectives: Minimize latency vs. maximize throughput
  • Dynamic workload: Requests arrive and finish asynchronously

No Simple Solution

Unlike traditional batch processing, LLM serving faces unique challenges:

  1. Variable-length requests: Can’t predict completion time
  2. Memory grows with sequence length: Not just compute-bound
  3. Prefill vs decode asymmetry: A prefill step processes the whole prompt at once and takes far longer than a single decode step
  4. Shared KV cache: Memory decisions affect all requests

Continuous Batching

vLLM’s key innovation: continuous batching (also called iteration-level batching).

Traditional Static Batching

# Static batching - wait for batch to fill
batch = []
while len(batch) < batch_size:
    batch.append(wait_for_request())

# Process entire batch
outputs = model.forward(batch)

# Wait for ALL requests to finish
while any_request_incomplete(batch):
    outputs = model.forward(batch)

# All done - start next batch

Problems:

  • Head-of-line blocking: Fast requests wait for slow ones
  • Low GPU utilization: Batch shrinks as requests finish
  • High latency: Must wait for batch to fill

Continuous Batching

# Continuous batching - add/remove every iteration
running = []

while True:
    # Remove finished requests
    running = [r for r in running if not r.finished]

    # Add new requests if there's capacity
    while can_fit_new_request() and waiting_requests:
        running.append(waiting_requests.pop())

    # Process current batch
    outputs = model.forward(running)

    # Immediately continue - no waiting!

Benefits:

  • No head-of-line blocking: Requests leave when done
  • Constant GPU utilization: Always max batch size
  • Lower latency: New requests start immediately
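The gap between the two approaches can be seen in a toy, self-contained simulation (not vLLM code): count how many iterations each strategy needs to finish a set of requests, assuming each request needs `length` decode steps and one step runs per iteration.

```python
def static_batching_iters(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_iters(lengths: list[int], batch_size: int) -> int:
    """Finished requests are replaced immediately from the waiting queue."""
    waiting = sorted(lengths)   # pop() takes the longest request first
    running: list[int] = []
    iters = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop())
        # One decode step for everyone; drop requests that just finished
        running = [r - 1 for r in running if r > 1]
        iters += 1
    return iters
```

With one slow request and three fast ones, continuous batching finishes sooner because the fast requests never wait for the slow one's batch to drain.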

Scheduler Architecture

Location: vllm/v1/core/sched/scheduler.py

Core Data Structures

class Scheduler:
    def __init__(self, vllm_config, ...):
        # Request storage
        self.requests: dict[str, Request] = {}

        # Queues
        self.waiting: RequestQueue = create_request_queue(policy)
        self.running: list[Request] = []
        self.skipped_waiting: RequestQueue = create_request_queue(policy)

        # Resource managers
        self.kv_cache_manager = KVCacheManager(...)
        self.encoder_cache_manager = EncoderCacheManager(...)

        # Constraints
        self.max_num_running_reqs = scheduler_config.max_num_seqs
        self.max_num_scheduled_tokens = scheduler_config.max_num_batched_tokens
        self.max_model_len = model_config.max_model_len

Request States

Requests flow through states:

WAITING → RUNNING → FINISHED

With possible transitions:

  • WAITING → RUNNING: Scheduled for first time
  • RUNNING → RUNNING: Continues in batch
  • RUNNING → PREEMPTED: Temporarily removed (if memory tight)
  • PREEMPTED → WAITING: Returns to queue
  • RUNNING → FINISHED: Generation complete
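The transition list above can be written as a tiny validator. This is an illustrative sketch using the state names from the post, not vLLM's actual request-status enum.

```python
VALID_TRANSITIONS = {
    ("WAITING", "RUNNING"),    # scheduled for the first time
    ("RUNNING", "RUNNING"),    # continues in batch
    ("RUNNING", "PREEMPTED"),  # temporarily removed
    ("PREEMPTED", "WAITING"),  # returns to queue
    ("RUNNING", "FINISHED"),   # generation complete
}

def is_valid_lifecycle(states: list[str]) -> bool:
    """Check that every consecutive pair is an allowed transition."""
    return all((a, b) in VALID_TRANSITIONS for a, b in zip(states, states[1:]))
```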

Request Queues

Waiting Queue: New requests waiting to start

class RequestQueue:
    def __init__(self, policy: SchedulingPolicy):
        self.policy = policy
        if policy == SchedulingPolicy.FCFS:
            # First-Come-First-Serve
            self.queue = deque()
        elif policy == SchedulingPolicy.PRIORITY:
            # Priority-based (higher priority first)
            self.queue = []  # heapq

    def push(self, request: Request):
        if self.policy == SchedulingPolicy.FCFS:
            self.queue.append(request)
        else:
            # Negated so the highest priority pops first; in practice a
            # tie-breaker is also stored so requests are never compared
            heapq.heappush(self.queue, (-request.priority, request))

Running List: Requests currently being processed

# Simply a list - all are processed each iteration
self.running: list[Request] = [req1, req2, ..., req_n]

The Scheduling Algorithm

High-Level Flow

def schedule(self) -> SchedulerOutput:
    """Main scheduling function called every iteration."""

    # 1. Allocate resources to requests
    scheduled_new_reqs = []
    scheduled_running_reqs = []
    preempted_reqs = []

    # Token budget for this iteration
    token_budget = self.max_num_scheduled_tokens

    # 2. Schedule running requests (priority!)
    for req in self.running:
        num_tokens_to_compute = min(
            req.num_tokens - req.num_computed_tokens,
            token_budget
        )

        if num_tokens_to_compute > 0:
            # Allocate KV cache blocks
            blocks = self.kv_cache_manager.allocate_slots(
                req, num_tokens_to_compute
            )

            if blocks is not None:
                scheduled_running_reqs.append(req)
                token_budget -= num_tokens_to_compute
            else:
                # Out of memory - preempt
                preempted_reqs.append(req)

    # 3. Schedule waiting requests (if budget remains)
    while token_budget > 0 and self.waiting:
        req = self.waiting.peek()

        # Can it fit in the remaining budget?
        if not self.can_fit_request(req, token_budget):
            break  # Can't fit more

        req = self.waiting.pop()
        scheduled_new_reqs.append(req)
        token_budget -= req.num_prompt_tokens

    # 4. Build output
    return SchedulerOutput(
        scheduled_new_reqs=scheduled_new_reqs,
        scheduled_running_reqs=scheduled_running_reqs,
        preempted_reqs=preempted_reqs,
        num_scheduled_tokens=self.max_num_scheduled_tokens - token_budget
    )

Step 1: Token Budget

Each iteration has a token budget (typically 2048-8192):

token_budget = self.max_num_scheduled_tokens  # e.g., 4096

# Why limit tokens?
# 1. GPU memory for activations
# 2. Computation time (TTFT)
# 3. CUDA kernel efficiency (not too large)

Trade-off:

  • Larger budget: Higher throughput, higher latency
  • Smaller budget: Lower latency, lower throughput

Step 2: Schedule Running Requests

Running requests get priority:

for request in self.running:
    # How many tokens does this request need?
    num_remaining = request.num_tokens - request.num_computed_tokens

    if num_remaining == 0:
        # Request is finished
        self.finish_request(request)
        continue

    # In decode phase: compute 1 token per iteration
    # In prefill phase: may compute multiple tokens
    num_to_schedule = min(num_remaining, token_budget)

    # Try to allocate KV cache blocks
    blocks, num_cached = self.kv_cache_manager.allocate_slots(
        request,
        num_tokens=request.num_computed_tokens + num_to_schedule,
        num_computed_tokens=request.num_computed_tokens
    )

    if blocks is None:
        # Out of KV cache memory - decide whether to preempt
        if self.should_preempt(request):
            preempted_reqs.append(request)
            self.preempt_request(request)
        continue

    # Successfully scheduled
    scheduled_running_reqs.append(request)
    request.num_scheduled_tokens = num_to_schedule
    token_budget -= num_to_schedule

Key insight: Running requests are usually in decode mode (1 token/iter), so they consume minimal token budget.
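A back-of-the-envelope check of that insight, with illustrative numbers: even a large decode batch consumes only a sliver of the token budget.

```python
token_budget = 4096
num_decoding_requests = 256            # each needs exactly 1 token this iteration
decode_tokens = num_decoding_requests * 1
budget_left_for_prefill = token_budget - decode_tokens
# 3840 tokens remain for admitting new prefills
```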

Step 3: Schedule Waiting Requests

If token budget remains, admit new requests:

while token_budget > 0 and len(self.waiting) > 0:
    request = self.waiting.peek()  # Don't pop yet

    # Check if request can fit
    if not self.can_admit_request(request, token_budget):
        break  # Can't fit any more

    # Pop and schedule
    request = self.waiting.pop()

    # Find prefix cache hits
    cached_blocks, num_cached = \
        self.kv_cache_manager.get_computed_blocks(request)

    num_to_compute = request.num_prompt_tokens - num_cached

    # Allocate blocks for new tokens
    blocks = self.kv_cache_manager.allocate_slots(
        request,
        num_tokens=num_to_compute,
        cached_blocks=cached_blocks
    )

    if blocks is None:
        # Out of memory - back to waiting queue
        self.waiting.push_front(request)
        break

    # Scheduled!
    scheduled_new_reqs.append(request)
    token_budget -= num_to_compute

    # Move to running list
    self.running.append(request)

Step 4: Handle Preemptions

When a request is preempted:

def preempt_request(self, request: Request):
    """Preempt a running request to free memory."""

    # Free all KV cache blocks
    for block in request.blocks:
        self.kv_cache_manager.free(block)

    # Clear blocks
    request.blocks = []
    request.num_computed_tokens = 0  # Must recompute!

    # Move back to waiting queue
    self.running.remove(request)
    self.waiting.push(request)

    # Increment preemption counter (for logging)
    request.num_preemptions += 1

Preemption is expensive: All computation is lost!

Strategy: Preempt requests with least progress to minimize waste.

Advanced Scheduling Strategies

Chunked Prefill

Long prompts are split into chunks:

# Prompt with 10,000 tokens, chunk_size=2048

# Iteration 1: Compute tokens 0-2047
schedule(request, num_tokens=2048)

# Iteration 2: Compute tokens 2048-4095
schedule(request, num_tokens=2048)

# ... continue until all prefill done

# Then switch to decode mode (1 token/iter)

Benefits:

  • Lower TTFT: First chunk processes faster
  • Fair scheduling: Don’t block all other requests
  • Memory efficient: Spread computation across iterations

Implementation:

if request.num_computed_tokens < request.num_prompt_tokens:
    # Still in prefill phase
    max_prefill_tokens = min(
        self.max_num_scheduled_tokens,
        self.chunk_size  # e.g., 2048
    )
    num_to_schedule = min(
        request.num_prompt_tokens - request.num_computed_tokens,
        max_prefill_tokens
    )
else:
    # Decode phase - 1 token per iteration
    num_to_schedule = 1

Priority Scheduling

Requests can have priorities:

request_high_priority = Request(
    request_id="urgent",
    priority=100  # Higher = more important
)

request_low_priority = Request(
    request_id="background",
    priority=1
)

# Scheduler processes high priority first

Use cases:

  • Paid tiers: Premium users get priority
  • Interactive vs batch: Interactive gets priority
  • Deadlines: Time-sensitive requests boosted
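The heap ordering behind priority scheduling can be demonstrated in a few lines: priorities are negated so the highest one pops first from Python's min-heap (the request IDs here are made-up examples).

```python
import heapq

waiting = []
for req_id, priority in [("background", 1), ("urgent", 100), ("normal", 10)]:
    heapq.heappush(waiting, (-priority, req_id))

# Pop in scheduling order: highest priority first
order = [heapq.heappop(waiting)[1] for _ in range(3)]
# order == ["urgent", "normal", "background"]
```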

Speculative Decoding Aware

For speculative decoding, schedule draft and target models:

# Draft model proposes k=5 tokens
draft_tokens = draft_model.generate(k=5)

# Main model verifies all k tokens in ONE iteration
verified = main_model.verify(draft_tokens)  # Returns 0-5

if verified == 5:
    # All accepted! 5x speedup
    request.num_tokens += 5
else:
    # Partial accept - roll back the rest
    request.num_tokens += verified

Scheduler must:

  • Allocate KV cache for draft tokens
  • Handle rollbacks when verification fails

Request Reordering

Dynamically reorder to optimize cache hits:

def reorder_for_cache_hits(waiting: list[Request]) -> list[Request]:
    """Reorder waiting requests to maximize prefix cache hits."""

    # Score each request by expected cache hit rate
    scored = [
        (estimate_cache_hit_rate(req), req)
        for req in waiting
    ]

    # Sort by score, descending (key avoids comparing Request objects on ties)
    scored.sort(key=lambda pair: pair[0], reverse=True)

    return [req for _, req in scored]

Resource Management

KV Cache Allocation

The scheduler must ensure KV cache fits:

def can_admit_request(self, request: Request, token_budget: int) -> bool:
    """Check if a request can be admitted."""

    # 1. Check token budget
    num_tokens_needed = request.num_prompt_tokens
    if num_tokens_needed > token_budget:
        return False

    # 2. Check if KV cache has enough free blocks
    num_blocks_needed = (num_tokens_needed + block_size - 1) // block_size
    num_free_blocks = self.kv_cache_manager.get_num_free_blocks()

    if num_blocks_needed > num_free_blocks:
        return False

    # 3. Check max running requests
    if len(self.running) >= self.max_num_running_reqs:
        return False

    return True

Memory Pressure Handling

When memory is tight:

if self.kv_cache_manager.get_num_free_blocks() < MIN_FREE_BLOCKS:
    # High memory pressure!

    # Option 1: Preempt running requests with the least progress
    preempt_candidates = sorted(
        self.running,
        key=lambda r: r.num_computed_tokens
    )

    for req in preempt_candidates[:NUM_TO_PREEMPT]:
        self.preempt_request(req)

    # Option 2: Evict cached blocks
    self.kv_cache_manager.evict_cached_blocks(NUM_BLOCKS_TO_EVICT)

Encoder Cache Management

For multi-modal models (images, audio):

def can_fit_encoder_inputs(self, request: Request) -> bool:
    """Check if the encoder cache has capacity."""

    num_encoder_tokens = len(request.encoder_input_ids)
    available = self.encoder_cache_manager.get_available_capacity()

    return num_encoder_tokens <= available

Performance Metrics

Scheduler Stats

The scheduler tracks performance:

@dataclass
class SchedulerStats:
    num_running: int
    num_waiting: int
    num_preempted: int

    num_scheduled_tokens: int
    token_budget: int
    utilization: float  # scheduled / budget

    kv_cache_usage: float  # 0.0 - 1.0
    num_cache_hits: int
    num_cache_misses: int

Optimization Objectives

The scheduler balances multiple objectives:

  1. Maximize throughput: Schedule as many tokens as possible
  2. Minimize latency: Prioritize running requests, use chunked prefill
  3. Maximize cache hits: Reorder requests with common prefixes
  4. Fairness: Don’t starve low-priority requests
  5. Memory efficiency: Minimize preemptions

Example: Scheduling Timeline

Let’s trace a complete scheduling scenario:

Initial State

Waiting: [A (100 tokens), B (50 tokens)]
Running: []
Token budget: 200
Free KV blocks: 1000

Iteration 1

# Schedule A (new)
A.num_scheduled_tokens = 100
token_budget = 200 - 100 = 100

# Schedule B (new)
B.num_scheduled_tokens = 50
token_budget = 100 - 50 = 50

# Running: [A, B]
# Waiting: []

GPU computes: A prefill (100 tokens) + B prefill (50 tokens)

Iteration 2

# Both in decode mode now
A.num_scheduled_tokens = 1
B.num_scheduled_tokens = 1
token_budget = 200 - 2 = 198

# New request C arrives (2000 tokens)
# Can fit with chunked prefill
C.num_scheduled_tokens = 198

# Running: [A, B, C]

GPU computes: A decode (1 token) + B decode (1 token) + C prefill (198 tokens)

Iteration 3

# A finishes (generated 50 tokens)
# Free A's KV blocks

# B continues
B.num_scheduled_tokens = 1

# C continues prefill (B's decode token leaves 199 of the 200 budget)
C.num_scheduled_tokens = 199

# Running: [B, C]

Iteration 10

# C finishes prefill, now in decode
# New request D (500 tokens) with high priority

# Memory pressure! Preempt B
self.preempt_request(B)

# Schedule D with chunked prefill (C's decode token leaves 199 of the budget)
D.num_scheduled_tokens = 199

# Running: [C, D]
# Waiting: [B]

Scheduling Policies

vLLM supports different policies:

FCFS (First-Come-First-Serve)

# Simple queue
waiting.append(new_request)
next_request = waiting.pop(0)

Pros: Simple, fair
Cons: Head-of-line blocking

Priority-based

# Priority heap (negated priority so the highest priority pops first)
heapq.heappush(waiting, (-priority, request))
neg_priority, next_request = heapq.heappop(waiting)

Pros: Important requests first
Cons: Starvation of low-priority

Shortest-Job-First

# Sort by estimated completion time
waiting.sort(key=lambda r: r.max_tokens)
next_request = waiting.pop(0)

Pros: Minimizes average latency
Cons: Long requests starve

Debugging Scheduler Issues

Common problems and solutions:

High Preemption Rate

# Symptom: Many requests restarting
if stats.num_preempted > THRESHOLD:
# Solutions:
# 1. Increase KV cache size
# 2. Decrease max_num_seqs
# 3. Enable chunked prefill

Low GPU Utilization

# Symptom: token_budget not fully used
if stats.utilization < 0.8:
# Solutions:
# 1. Increase max_num_seqs
# 2. Reduce chunk_size (more parallelism)

High Latency

# Symptom: TTFT too high
if avg_ttft > TARGET_TTFT:
# Solutions:
# 1. Enable chunked prefill
# 2. Prioritize interactive requests
# 3. Reduce max_num_scheduled_tokens

Key Takeaways

  1. Continuous batching eliminates head-of-line blocking by adding/removing requests every iteration

  2. Token budget controls batch size and latency/throughput trade-off

  3. Running requests get priority to minimize decode latency

  4. Preemption is a last resort when memory is tight, discarding all progress

  5. Chunked prefill balances long prompts with interactive requests

  6. Prefix caching integration requires scheduler awareness for efficient admission

What’s Next

In Part 4, we’ll explore Request Processing - how requests flow through the system from tokenization to streaming output, and how state is managed across iterations.

The scheduler is vLLM’s brain, making thousands of split-second decisions to maximize GPU utilization while meeting latency SLAs. Next, we’ll see how requests actually flow through the system end-to-end.

Part 2: PagedAttention - The Core Innovation

Introduction

PagedAttention is vLLM’s breakthrough innovation that revolutionized LLM serving. Before PagedAttention, serving LLMs efficiently was plagued by memory fragmentation and waste. This post dives deep into how PagedAttention works, why it matters, and how it’s implemented in vLLM.

The Memory Problem in LLM Serving

Understanding KV Cache

In transformer models, attention layers compute key (K) and value (V) vectors for each token. During generation:

  1. Prefill phase: Compute K, V for all prompt tokens
  2. Decode phase: For each new token, compute its K, V and attend to all previous K, V

The challenge: We must store all previous K, V tensors (the “KV cache”) to avoid recomputation.

Memory size: For Llama-3-8B:

  • 32 attention layers
  • 4096 hidden dimensions
  • FP16 precision (2 bytes)
  • Per token: 32 layers × 4096 dim × 2 (K,V) × 2 bytes = 512 KB

For a 2048-token sequence: 1 GB just for KV cache!
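The arithmetic above can be captured in a small helper (bytes per token = layers × hidden dim × 2 for K and V × bytes per element):

```python
def kv_bytes_per_token(num_layers: int, hidden_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes stored per token."""
    return num_layers * hidden_dim * 2 * dtype_bytes

# Llama-3-8B-style numbers from above: 512 KB per token, 1 GB at 2048 tokens
per_token = kv_bytes_per_token(num_layers=32, hidden_dim=4096)
```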

The Fragmentation Problem

Traditional serving systems pre-allocate contiguous memory for each request’s maximum length:

Request 1 (max 2048 tokens, actual 512):
[████░░░░░░░░░░░░] 75% wasted

Request 2 (max 2048 tokens, actual 1800):
[████████████████] 12% wasted

Request 3 (max 2048 tokens, actual 200):
[██░░░░░░░░░░░░░░] 90% wasted

Problems:

  1. Over-allocation: Must allocate for worst-case length
  2. Fragmentation: Can’t use unused space from one request for another
  3. Low Throughput: Memory is the bottleneck, not compute

Real impact: Without PagedAttention, you might serve 10 concurrent requests. With PagedAttention, you can serve 50+ requests with the same memory!

The PagedAttention Solution

PagedAttention applies virtual memory concepts to attention computation:

Key idea: Instead of storing KV cache contiguously, break it into fixed-size blocks. Each request gets a block table mapping logical positions to physical blocks.

Virtual Memory Analogy

Virtual Memory (OS)    PagedAttention (vLLM)
Page                   Block (e.g., 16 tokens)
Page Table             Block Table
Physical Memory        GPU Memory
Memory Allocator       Block Pool
Page Fault             Cache Miss
Copy-on-Write          Prefix Sharing

Block-Based Storage

Instead of contiguous allocation:

Request 1 (512 tokens, 32 blocks):
Blocks: [0, 3, 7, 12, ..., 89]
Scattered in GPU memory

Request 2 (1800 tokens, 113 blocks):
Blocks: [1, 2, 4, 5, ..., 105]
Can reuse freed blocks from other requests

Request 3 (200 tokens, 13 blocks):
Blocks: [6, 8, 9, ..., 18]
Minimal waste (only last block partially filled)

Advantages:

  1. No over-allocation: Allocate blocks as needed
  2. No fragmentation: Any free block can go to any request
  3. Near-zero waste: Only last block of each request may be partially filled
  4. Sharing: Common prefixes share physical blocks
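The block accounting behind "near-zero waste" is a pair of one-liners; only the partially filled last block wastes slots:

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Ceiling division: number of blocks for a sequence."""
    return (num_tokens + block_size - 1) // block_size

def wasted_slots(num_tokens: int, block_size: int = 16) -> int:
    """Unused token positions in the last block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens
```

These reproduce the figures above: 512 tokens → 32 blocks, 1800 → 113, 200 → 13 with only 8 wasted slots.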

Implementation Deep Dive

Block Structure

Location: vllm/v1/core/kv_cache_utils.py

class KVCacheBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id  # Physical block ID
        self.ref_cnt = 0  # Reference count
        self.block_hash: int | None = None  # Hash for prefix caching
        self.is_null = False  # Special null block

Each block represents a fixed number of token positions (block_size, typically 16).

Physical storage:

# In GPU memory, shape: [num_blocks, 2 (K and V), num_heads, block_size, head_dim]
kv_cache = torch.empty(
    (num_blocks, 2, num_heads, block_size, head_dim),
    dtype=torch.float16,
    device='cuda'
)

Block Pool

Location: vllm/v1/core/block_pool.py

The BlockPool manages all KV cache blocks:

class BlockPool:
    def __init__(self, num_gpu_blocks: int, enable_caching: bool, ...):
        # All blocks in the pool
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]

        # Free block queue (doubly-linked list)
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # Cache: block_hash -> KVCacheBlock
        self.cached_block_hash_to_block: BlockHashToBlockMap = {}

Free Block Queue: Implements LRU eviction for cached blocks:

free_block_queue:
┌─────┐      ┌─────┐      ┌─────┐      ┌─────┐
│ 42  │◄───►│ 17  │◄───►│ 99  │◄───►│  3  │
└─────┘      └─────┘      └─────┘      └─────┘
   ▲                                     ▲
   │                                     │
LRU (evict first)            Most recently freed

Allocation:

def allocate(self) -> KVCacheBlock:
    """Allocate a block from the free queue."""
    if len(self.free_block_queue) == 0:
        # No free blocks - need to evict a cached block
        block = self._evict_cached_block()
    else:
        # Pop from free queue
        block = self.free_block_queue.popleft()

    block.ref_cnt += 1
    return block

Deallocation:

def free(self, block: KVCacheBlock) -> None:
    """Free a block back to the pool."""
    block.ref_cnt -= 1

    if block.ref_cnt == 0:
        if block.block_hash is not None:
            # Cached block - add to eviction queue
            self.free_block_queue.append(block)
        else:
            # Uncached block - immediately available
            self.free_block_queue.appendleft(block)

Block Table

Each request maintains a block table mapping logical to physical blocks:

# Example: Request with 50 tokens, block_size=16
# Needs 4 blocks: [0-15], [16-31], [32-47], [48-49]
# (each inner list holds one entry per KV cache group; group 0 here)

block_table = [
    [7],    # Logical block 0 → Physical block 7
    [23],   # Logical block 1 → Physical block 23
    [102],  # Logical block 2 → Physical block 102
    [45]    # Logical block 3 → Physical block 45 (partial)
]

Lookup during attention:

def get_physical_block(logical_block_idx: int) -> int:
    return block_table[logical_block_idx][0]  # [0] = first KV cache group

# Token 37 is in logical block 2 (37 // 16)
# Physical block: block_table[2][0] = 102
# Position in block: 37 % 16 = 5

KVCacheManager

Location: vllm/v1/core/kv_cache_manager.py

The KVCacheManager orchestrates block allocation for requests:

class KVCacheManager:
    def __init__(self, kv_cache_config, max_model_len, ...):
        self.block_pool = BlockPool(...)
        self.enable_caching = enable_caching

    def allocate_slots(
        self,
        request: Request,
        num_tokens: int,
        num_computed_tokens: int = 0
    ) -> tuple[KVCacheBlocks, int]:
        """Allocate KV cache blocks for a request."""

        # 1. Check prefix cache for hits
        cached_blocks, num_cached_tokens = \
            self.get_computed_blocks(request)

        # 2. Calculate how many new blocks are needed
        num_new_tokens = num_tokens - num_cached_tokens
        num_new_blocks = (num_new_tokens + block_size - 1) // block_size

        # 3. Allocate new blocks
        new_blocks = [
            self.block_pool.allocate()
            for _ in range(num_new_blocks)
        ]

        # 4. Combine cached + new blocks
        total_blocks = cached_blocks + new_blocks

        return total_blocks, num_cached_tokens

Prefix Caching: Sharing Across Requests

One of PagedAttention’s killer features is prefix caching - sharing KV cache blocks across requests with common prefixes.

Block Hashing

Each full block gets a hash based on its token content:

def compute_block_hash(
    token_ids: list[int],
    start: int,
    block_size: int
) -> int:
    """Hash a block of tokens."""
    block_tokens = tuple(token_ids[start:start + block_size])

    # extra_keys mixes in multimodal features, cache salt, etc.
    return hash((block_tokens, extra_keys))

Example:

Prompt A: "Translate to French: Hello world"
Tokens: [1, 4521, 284, 1515, 25, 23748, 995]
Block 0: hash([1, 4521, 284, 1515, 25, 23748, 995, ...])

Prompt B: "Translate to French: How are you"
Tokens: [1, 4521, 284, 1515, 25, 1374, 389, 345]
Block 0: hash([1, 4521, 284, 1515, 25, 1374, 389, ...])

# Different hashes - no sharing

But with a common system prompt:

System: "You are a helpful assistant."
Tokens: [1, 2, 3, ..., 50] # 50 tokens

Request A: System + "What is AI?"
Request B: System + "What is ML?"

# Both share blocks 0-2 (first 48 tokens of system prompt)
# Only block 3+ differs
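Since sharing only applies to *full* blocks whose tokens match exactly, the number of shareable blocks can be counted with a small sketch (not vLLM's hashing code):

```python
def shared_full_blocks(a: list[int], b: list[int], block_size: int = 16) -> int:
    """Count leading full blocks with identical token contents."""
    shared = 0
    for start in range(0, min(len(a), len(b)), block_size):
        blk_a = a[start:start + block_size]
        if len(blk_a) == block_size and blk_a == b[start:start + block_size]:
            shared += 1
        else:
            break
    return shared

system_prompt = list(range(50))           # 50 shared system-prompt tokens
req_a = system_prompt + [100, 101, 102]
req_b = system_prompt + [200, 201]
# Blocks 0-2 (48 tokens) are shareable; the partial block 3 is not
```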

Cache Lookup

When a new request arrives:

def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    """Find cached blocks matching the request's prefix."""

    cached_blocks = []
    num_cached_tokens = 0

    for block_idx, block_hash in enumerate(request.block_hashes):
        # Try to find a cached block with this hash
        cached_block = self.block_pool.get_cached_block(block_hash)

        if cached_block is None:
            # Cache miss - stop searching
            break

        # Cache hit!
        cached_blocks.append(cached_block)
        cached_block.ref_cnt += 1  # Increment reference
        num_cached_tokens += block_size

    return cached_blocks, num_cached_tokens

Reference Counting

Blocks use reference counting for safe sharing:

# Request A uses blocks [7, 23, 102]
block_7.ref_cnt = 1
block_23.ref_cnt = 1
block_102.ref_cnt = 1

# Request B shares block 7 (same prefix)
block_7.ref_cnt = 2 # Now shared!

# Request A finishes - decrement refs
block_7.ref_cnt = 1 # Still in use by B
block_23.ref_cnt = 0 # Can be freed
block_102.ref_cnt = 0 # Can be freed

# Request B finishes
block_7.ref_cnt = 0 # Now can be freed

Safety: A block is only freed when ref_cnt == 0.
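That safety rule in miniature: a shared block survives until the last holder releases it. This is an illustrative sketch, not vLLM's `KVCacheBlock`.

```python
class SharedBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_cnt = 0

    def acquire(self) -> None:
        self.ref_cnt += 1

    def release(self) -> bool:
        """Return True once the block is free to reuse."""
        self.ref_cnt -= 1
        return self.ref_cnt == 0

b = SharedBlock(7)
b.acquire()               # request A maps the block
b.acquire()               # request B shares the same prefix block
first_free = b.release()  # A finishes: still held by B
second_free = b.release() # B finishes: block can be freed
```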

Cache Eviction

When the cache is full:

def evict_cached_block(self) -> KVCacheBlock:
    """Evict the LRU cached block with ref_cnt == 0."""

    # Scan free queue from the left (LRU end)
    for block in self.free_block_queue:
        if block.ref_cnt == 0:
            # Found an eviction candidate
            self.free_block_queue.remove(block)

            # Remove from cache lookup
            del self.cached_block_hash_to_block[block.block_hash]
            block.block_hash = None

            return block

    raise OutOfMemoryError("No evictable blocks")

Attention Computation with Blocks

Prefill Phase

During prefill, we compute attention for the entire prompt:

def paged_attention_prefill(
    query: Tensor,        # [batch, seq_len, num_heads, head_dim]
    kv_cache: Tensor,     # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor, # [batch, max_num_blocks]
    context_lens: Tensor  # [batch]
) -> Tensor:
    """
    Compute prefill attention using the paged KV cache.
    """

    batch_size, seq_len, num_heads, head_dim = query.shape
    output = torch.empty_like(query)

    for b in range(batch_size):
        context_len = context_lens[b]
        num_blocks = (context_len + block_size - 1) // block_size

        # Gather K, V from blocks
        k_cache, v_cache = [], []
        for block_idx in range(num_blocks):
            physical_block = block_tables[b, block_idx]
            k, v = kv_cache[physical_block]  # each: [num_heads, block_size, head_dim]
            k_cache.append(k)
            v_cache.append(v)

        # Concatenate blocks along the token dimension
        k = torch.cat(k_cache, dim=1)  # [num_heads, context_len, head_dim]
        v = torch.cat(v_cache, dim=1)

        # Standard attention
        scores = query[b] @ k.transpose(-2, -1) / sqrt(head_dim)
        attn = softmax(scores, dim=-1)
        output[b] = attn @ v

    return output

In practice, vLLM uses optimized kernels (FlashAttention, FlashInfer) that work directly with block tables.

Decode Phase

During decode, each request generates one token:

def paged_attention_decode(
    query: Tensor,        # [batch, 1, num_heads, head_dim]
    kv_cache: Tensor,     # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor, # [batch, max_num_blocks]
    context_lens: Tensor  # [batch]
) -> Tensor:
    """
    Optimized decode attention - each query attends to the full context.
    """

    # Use a fused kernel that directly indexes into blocks
    return flash_attn_decode(
        q=query,
        k_cache=kv_cache[:, 0],  # K cache
        v_cache=kv_cache[:, 1],  # V cache
        block_table=block_tables,
        seq_lens=context_lens,
        block_size=block_size
    )

The decode kernel is highly optimized:

  • Fused block gathering and attention
  • Warp-level parallelism
  • Shared memory optimization

Multi-Group KV Cache

vLLM supports models with different cache requirements per layer (MQA, GQA, Mamba):

kv_cache_config = KVCacheConfig(
    kv_cache_groups=[
        KVCacheGroup(  # Group 0: Attention layers
            kv_cache_spec=AttentionSpec(
                num_heads=32,
                head_dim=128,
                block_size=16
            )
        ),
        KVCacheGroup(  # Group 1: Mamba layers
            kv_cache_spec=MambaSpec(
                state_size=128,
                block_size=16
            )
        )
    ]
)

Each request gets blocks from each group:

# Request with 50 tokens
blocks = [
    [7, 23, 102, 45],  # Group 0: Attention cache
    [8, 24, 103, 46]   # Group 1: Mamba cache
]

Performance Impact

Memory Efficiency

Before PagedAttention (contiguous allocation):

10 requests × 2048 max length × 512 KB/token = 10 GB
Actual usage: 10 requests × 500 avg tokens × 512 KB/token ≈ 2.5 GB
Efficiency: ~25%

With PagedAttention (block allocation):

500 blocks allocated across all requests
Waste: ~10 blocks (partial last blocks)
Efficiency: 98%

Result: 4x more concurrent requests!

Throughput Gains

Real benchmarks (Llama-3-8B on H100):

Metric                 Without PagedAttention   With PagedAttention   Improvement
Concurrent Requests    12                       64                    5.3x
Throughput (tok/s)     1,500                    8,000                 5.3x
Memory Usage           60 GB                    60 GB                 Same
Latency (TTFT)         45 ms                    42 ms                 -7%

Prefix Caching Benefits

With a common system prompt (500 tokens):

Request   Without Caching    With Caching   Speedup
1st       10 ms (prefill)    10 ms          1x
2nd       10 ms              1 ms           10x
100th     10 ms              1 ms           10x

Cache hit rate: Typically 60-80% for chatbots with system prompts.

Advanced Topics

Sliding Window Attention

For models like Mistral with sliding windows:

# Only keep the last 4096 tokens in cache
if num_tokens > sliding_window_size:
    # Free old blocks that fall entirely outside the window
    tokens_outside_window = num_tokens - sliding_window_size
    old_blocks = blocks[:tokens_outside_window // block_size]
    for block in old_blocks:
        self.block_pool.free(block)

Copy-on-Write for Speculative Decoding

When using speculative decoding:

# Draft model proposes N tokens
draft_tokens = draft_model.generate(n=5)

# Share blocks with the draft (copy-on-write)
draft_blocks = request.blocks  # Same physical blocks!
draft_blocks[-1].ref_cnt += 1  # Increment last block

# Verify with main model
verified = main_model.verify(draft_tokens)

if not all_verified:
    # Rollback - free draft blocks
    for block in draft_blocks[verified_idx:]:
        block.ref_cnt -= 1

Implementation Gotchas

Hash Collisions

Block hashes can collide:

# Different token sequences can (rarely!) hash to the same value
block_a_hash = hash([1, 2, 3, ..., 16])   # e.g., 0x123456
block_b_hash = hash([4, 5, 6, ..., 19])   # e.g., 0x123456 - collision!

# Solution: map each hash to a list of candidate blocks
cached_blocks[block_hash] = [block_a, block_b]

Partial Blocks

Last block is often partial:

# 50 tokens with block_size=16
# Block 3 only has 2 tokens (50 % 16 = 2)

# Must track: num_tokens_in_last_block
# For attention: mask out unused positions
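The tracking mentioned above is simple modular arithmetic, sketched here:

```python
def tokens_in_last_block(num_tokens: int, block_size: int = 16) -> int:
    """How many positions of the last block are actually used."""
    rem = num_tokens % block_size
    return rem if rem else block_size
```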

Thread Safety

Block allocation must be thread-safe:

class BlockPool:
    def allocate(self):
        with self.lock:  # Critical section
            return self._allocate_unsafe()

Key Takeaways

  1. PagedAttention breaks KV cache into fixed-size blocks, eliminating fragmentation and over-allocation

  2. Block tables map logical to physical blocks, similar to virtual memory page tables

  3. Prefix caching shares blocks across requests, dramatically reducing redundant computation

  4. Reference counting ensures safe sharing, preventing premature deallocation

  5. Near-zero memory waste enables 4-5x higher throughput in practice

  6. Attention kernels are optimized to work directly with block-indexed storage

What’s Next

In Part 3, we’ll explore the Scheduler - vLLM’s brain that decides which requests to process, when to preempt, and how to maximize GPU utilization using continuous batching.


PagedAttention is the foundation that makes vLLM’s exceptional performance possible. In the next post, we’ll see how the Scheduler builds on top of this to orchestrate request execution.

Part 1: vLLM Architecture Overview

Introduction

Before diving into specific components, we need to understand vLLM’s overall architecture. This post maps out the major components and how they interact, providing the foundation for deeper exploration in subsequent parts.

The 30,000-Foot View

vLLM is designed around a few key principles:

  1. Separation of Concerns: Different aspects (scheduling, execution, serving) are handled by distinct components
  2. Process Isolation: V1 architecture uses multiple processes for robustness and CPU efficiency
  3. Asynchronous Processing: Requests flow through pipelines without blocking
  4. Distributed by Default: Built from the ground up to support multi-GPU execution

High-Level Data Flow

Here’s what happens when you send a request to vLLM:

User Request (HTTP/gRPC)
        ↓
API Server Process
        ↓ (ZMQ Socket)
Engine Core Process
        ↓
Scheduler
        ↓
GPU Worker Processes
        ↓ (NCCL/Collective Ops)
Model Execution
        ↓
Output Processing
        ↓ (ZMQ Socket)
API Server Process
        ↓
Streaming Response to User

V1 Multi-Process Architecture

The V1 architecture (introduced in late 2024) uses a multi-process design for better CPU utilization and isolation. Let’s understand each process type:

1. API Server Process

Purpose: Handle HTTP/gRPC requests and I/O operations

Responsibilities:

  • Receive and validate HTTP requests
  • Tokenize input text
  • Load multi-modal data (images, audio)
  • Stream outputs back to clients
  • Handle API authentication and rate limiting

Key Implementation: vllm/entrypoints/openai/api_server.py

The API server is stateless with respect to model execution. It doesn’t know about GPU memory, KV caches, or model weights. It simply:

  1. Converts user requests to EngineCoreRequest objects
  2. Sends them to engine cores via ZMQ
  3. Receives EngineCoreOutput objects back
  4. Converts them to API responses

Process Count: By default, 1 API server. With data parallelism (--data-parallel-size N), automatically scales to N API servers.

CPU Threads: Uses VLLM_MEDIA_LOADING_THREAD_COUNT threads (default 8) for parallel media loading.

2. Engine Core Process

Purpose: Schedule requests and coordinate model execution

Responsibilities:

  • Maintain the request queue
  • Run the scheduler to decide what to compute
  • Manage KV cache allocation
  • Coordinate GPU workers
  • Handle request preemption and swapping

Key Implementation: vllm/v1/engine/core.py

The engine core runs a tight loop:

while True:
    # 1. Receive new requests from API servers
    new_requests = receive_requests()

    # 2. Schedule which requests to process
    scheduler_output = scheduler.schedule()

    # 3. Dispatch work to GPU workers
    dispatch_to_workers(scheduler_output)

    # 4. Collect outputs and send back to API servers
    outputs = collect_outputs()
    send_outputs(outputs)

Process Count: One per data parallel rank. For --data-parallel-size 4, you get 4 engine cores.

CPU Usage: Runs a busy loop for low-latency scheduling decisions.

3. GPU Worker Processes

Purpose: Execute model forward passes on GPUs

Responsibilities:

  • Load model weights onto GPU
  • Execute forward passes
  • Manage GPU memory
  • Run CUDA kernels (attention, FFN, etc.)
  • Participate in collective operations (for tensor/pipeline parallelism)

Key Implementation: vllm/v1/worker/gpu_worker.py

Each GPU gets its own worker process. The worker:

  1. Loads its shard of model weights
  2. Receives execution requests from its engine core
  3. Runs the model forward pass
  4. Returns outputs (logits, KV cache updates)

Process Count: One per GPU. For 8 GPUs with --tensor-parallel-size 4 --data-parallel-size 2:

  • 8 GPU workers total
  • 2 groups of 4 (tensor parallel groups)
  • 2 engine cores (one per data parallel rank)

4. DP Coordinator Process (Conditional)

Purpose: Coordinate data parallel engines

Responsibilities:

  • Load balance across data parallel ranks
  • Synchronize MoE model execution
  • Route requests to least-loaded engine

Key Implementation: vllm/v1/engine/coordinator.py

Process Count: 1 if --data-parallel-size > 1, otherwise 0.

Process Count Examples

Let’s see some concrete examples:

Example 1: Single GPU

vllm serve meta-llama/Llama-3-8B

Processes:

  • 1 API Server
  • 1 Engine Core
  • 1 GPU Worker
  • Total: 3 processes

Example 2: Tensor Parallelism (4 GPUs)

vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4

Processes:

  • 1 API Server
  • 1 Engine Core
  • 4 GPU Workers (one per GPU)
  • Total: 6 processes

Example 3: Data Parallelism (4 GPUs)

vllm serve meta-llama/Llama-3-8B --data-parallel-size 4

Processes:

  • 4 API Servers (scales with DP)
  • 4 Engine Cores (one per DP rank)
  • 4 GPU Workers (one per GPU)
  • 1 DP Coordinator
  • Total: 13 processes

Example 4: Mixed Parallelism (8 GPUs)

vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 2 --data-parallel-size 4

Processes:

  • 4 API Servers
  • 4 Engine Cores (one per DP rank)
  • 8 GPU Workers (2 per DP rank, one per GPU)
  • 1 DP Coordinator
  • Total: 17 processes
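The per-example arithmetic follows directly from the rules above; a small helper (illustrative only, not part of vLLM) reproduces each total:

```python
def process_counts(tp: int = 1, dp: int = 1) -> dict:
    """Process breakdown for a vLLM deployment with the given TP/DP sizes."""
    counts = {
        "api_servers": dp,                     # scales with data parallelism
        "engine_cores": dp,                    # one per DP rank
        "gpu_workers": tp * dp,                # one per GPU
        "dp_coordinator": 1 if dp > 1 else 0,  # only when DP is enabled
    }
    counts["total"] = sum(counts.values())
    return counts

print(process_counts()["total"])            # → 3  (single GPU)
print(process_counts(tp=4)["total"])        # → 6  (TP=4)
print(process_counts(dp=4)["total"])        # → 13 (DP=4)
print(process_counts(tp=2, dp=4)["total"])  # → 17 (TP=2, DP=4)
```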

Core Components Deep Dive

LLMEngine

The LLMEngine class is the main entry point for offline inference (using the Python API directly).

Location: vllm/v1/engine/llm_engine.py

Key responsibilities:

  • Create and manage the engine core
  • Process inputs via InputProcessor
  • Convert outputs via OutputProcessor
  • Handle request lifecycle

Usage example:

from vllm import LLM, SamplingParams

# Initialize engine
llm = LLM(model="meta-llama/Llama-3-8B")

# Generate
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, top_p=0.95)
)

Under the hood, LLM creates an LLMEngine, which creates an EngineCore, which coordinates the workers.

Scheduler

The scheduler is the brain of vLLM. It decides:

  • Which requests to process in each iteration
  • How many tokens to compute for each request
  • When to preempt requests (and which ones)
  • How to allocate KV cache blocks

Location: vllm/v1/core/sched/scheduler.py

The scheduler maintains three queues:

  1. Waiting: New requests waiting to start
  2. Running: Requests being actively processed
  3. Skipped Waiting: Requests temporarily skipped due to dependencies

Scheduling algorithm (simplified):

def schedule(self) -> SchedulerOutput:
    scheduled_requests = []
    token_budget = max_num_scheduled_tokens

    # First, schedule running requests (priority)
    for req in running:
        if token_budget > 0:
            num_tokens = min(req.remaining_tokens, token_budget)
            scheduled_requests.append((req, num_tokens))
            token_budget -= num_tokens

    # Then, schedule waiting requests
    for req in waiting:
        if token_budget >= req.num_prompt_tokens:
            scheduled_requests.append((req, req.num_prompt_tokens))
            token_budget -= req.num_prompt_tokens
        else:
            break  # Not enough budget

    return SchedulerOutput(scheduled_requests)

We’ll explore the scheduler in depth in Part 3.

KV Cache Manager

Manages memory for attention key-value caches using PagedAttention.

Location: vllm/v1/core/kv_cache_manager.py

Key concepts:

  • Block: Fixed-size chunk of KV cache (e.g., 16 tokens)
  • Block Pool: Pre-allocated pool of blocks
  • Block Table: Mapping from logical positions to physical blocks
  • Prefix Cache: Shared blocks for common prompt prefixes

Example: If block size is 16 and we have 1000 blocks, we can serve:

  • 1 request with 16,000 tokens, or
  • 16 requests with 1,000 tokens each, or
  • Any combination that fits in 1000 blocks

We’ll dive deep into PagedAttention in Part 2.

Worker and ModelRunner

The worker process loads the model and executes forward passes.

Worker (vllm/v1/worker/gpu_worker.py):

  • Manages GPU device
  • Initializes model weights
  • Coordinates with other workers (for TP/PP)

ModelRunner (varies by backend):

  • Prepares input tensors
  • Executes model forward pass
  • Applies CUDA graph optimization
  • Returns output logits

Request Lifecycle

Let’s trace a complete request through the system:

1. Request Arrives

POST /v1/completions
{
  "model": "meta-llama/Llama-3-8B",
  "prompt": "The capital of France is",
  "max_tokens": 50
}

2. API Server Processing

  • Validate request
  • Tokenize prompt: [791, 3139, 315, 9822, 374]
  • Create EngineCoreRequest object
  • Send to engine core via ZMQ

3. Engine Core Receives

  • Add request to scheduler’s waiting queue
  • Assign request ID: "req_abc123"
  • Initialize request state

4. First Scheduling Iteration

  • Scheduler sees new request in waiting queue
  • Check KV cache: 5 tokens need computation
  • Allocate KV cache blocks (1 block for 16-token block size)
  • Schedule for prefill: compute all 5 tokens

5. Worker Execution

  • Receive schedule from engine core
  • Prepare input tensors
  • Run model forward pass
  • Generate first token (e.g., token ID 330 = “Paris”)
  • Return logits to engine core

6. Output Processing

  • Engine core receives logits
  • Sample next token: 330 (“Paris”)
  • Update request state: add token to output
  • Send output to API server via ZMQ

7. API Server Streaming

  • Receive token from engine core
  • Detokenize: “Paris”
  • Stream to client: data: {"text": "Paris", "finish_reason": null}

8. Subsequent Iterations

  • Request moves to running queue
  • Scheduler continues allocating tokens
  • Each iteration: compute 1 new token (decode phase)
  • Stream each token to client

9. Request Completion

  • Either max_tokens reached or EOS generated
  • Free KV cache blocks (or keep in prefix cache)
  • Send final completion to client
  • Remove request from scheduler

Configuration and Initialization

vLLM’s configuration is centralized in VllmConfig:

from vllm.config import VllmConfig

vllm_config = VllmConfig(
    model_config=ModelConfig(...),
    cache_config=CacheConfig(...),
    scheduler_config=SchedulerConfig(...),
    parallel_config=ParallelConfig(...),
    # ... more configs
)

All components receive the same VllmConfig object, ensuring consistency.

Key configs:

  • ModelConfig: Model name, dtype, tokenizer
  • CacheConfig: KV cache size, block size, prefix caching
  • SchedulerConfig: Max batch size, scheduling policy
  • ParallelConfig: TP/PP/DP sizes, distributed backend

Class Hierarchy

The class hierarchy follows a consistent pattern:

LLMEngine
├── InputProcessor (tokenization, preprocessing)
├── EngineCore
│   ├── Scheduler
│   │   ├── KVCacheManager
│   │   ├── EncoderCacheManager
│   │   └── StructuredOutputManager
│   └── Executor
│       └── Workers (one per GPU)
│           └── ModelRunner
│               └── Model (nn.Module)
└── OutputProcessor (detokenization, streaming)

Every class accepts VllmConfig in its constructor, providing access to all configuration options.

Inter-Process Communication

ZMQ Sockets

API servers and engine cores communicate via ZMQ:

# API Server side
zmq_socket.send(pickle.dumps(request))

# Engine Core side
request = pickle.loads(zmq_socket.recv())

Why ZMQ?

  • High performance (microsecond latency)
  • Flexible patterns (req-rep, pub-sub, push-pull)
  • Built-in queue management
  • Language agnostic

NCCL for GPU Communication

GPU workers use NCCL for collective operations:

# Tensor parallelism: all-reduce across GPUs
torch.distributed.all_reduce(
    tensor,
    op=torch.distributed.ReduceOp.SUM,
    group=tensor_parallel_group
)

Communication patterns:

  • Tensor Parallel: All-reduce after attention/FFN
  • Pipeline Parallel: Send/recv between stages
  • Data Parallel: No communication during forward pass

Memory Layout

Understanding memory is crucial for vLLM:

GPU Memory Breakdown

Total GPU Memory: 80GB (H100)
├── Model Weights: 16GB (Llama-3-8B in FP16)
├── KV Cache: 60GB (dynamically allocated blocks)
├── Activation Memory: 2GB (for batch processing)
└── Framework Overhead: 2GB (PyTorch, CUDA)

KV Cache Memory

For a 16-token block size with 8B model:

  • Each block stores K and V for 32 attention layers
  • Per block: 16 tokens × 32 layers × 2 (K,V) × 4096 dim × 2 bytes = 8.4 MB
  • With 60GB for KV cache: ~7,100 blocks
  • Total capacity: ~113,600 tokens across all requests
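The arithmetic above can be checked directly (the 4096 here is Llama-3-8B's hidden size, and FP16 gives 2 bytes per element):

```python
def kv_block_bytes(block_size=16, num_layers=32, hidden_dim=4096, dtype_bytes=2):
    # tokens × layers × 2 (K and V) × hidden dim × bytes per element
    return block_size * num_layers * 2 * hidden_dim * dtype_bytes

per_block = kv_block_bytes()          # 8,388,608 bytes ≈ 8.4 MB
num_blocks = 60 * 10**9 // per_block  # ≈ 7,100 blocks fit in 60 GB
capacity = num_blocks * 16            # ≈ 114,000 tokens of total capacity

print(per_block, num_blocks, capacity)
```

The exact results (8,388,608 bytes per block, 7,152 blocks) round to the figures quoted above.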

Performance Characteristics

Typical performance on H100 GPU:

Metric               Value
Throughput           ~8,000 tokens/sec (Llama-3-8B)
Latency (TTFT)       ~20-50ms
Latency (TPOT)       ~10-15ms
Max Batch Size       ~256 concurrent requests
Memory Efficiency    ~95% (vs ~60% without PagedAttention)

What’s Next

Now that we understand the overall architecture, we can dive deeper into specific components:

  • Part 2: PagedAttention and KV cache management
  • Part 3: The scheduler’s decision-making process
  • Part 4: Request processing and state management
  • Part 5: Distributed execution and parallelism

Key Takeaways

  1. V1 uses multi-process architecture for better CPU utilization and fault isolation
  2. Scheduler is the central coordinator, deciding what to compute each iteration
  3. KV cache management (PagedAttention) is key to memory efficiency
  4. Process count scales with parallelism: typically API servers + Engine cores + GPU workers + (optional DP coordinator)
  5. ZMQ enables efficient inter-process communication
  6. All components share a unified VllmConfig for consistency


In the next post, we’ll explore PagedAttention in detail, understanding how vLLM achieves near-zero memory waste through clever memory management.

vLLM Deep Dive Series: Understanding Modern LLM Serving

Series Overview

This series provides a comprehensive technical deep-dive into vLLM, one of the most important open-source projects for LLM inference and serving. Originally developed at UC Berkeley’s Sky Computing Lab, vLLM has become the de facto standard for high-performance LLM serving in production environments.


Introduction

The prow-images repository is a core component of a sophisticated CI/CD infrastructure built on Kubernetes Prow. It serves as a centralized collection of purpose-built container images that power every aspect of the continuous integration and delivery pipelines. This post takes a deep look at the architecture, components, and workflows of the prow-images ecosystem, with particular attention to how it relates to the prow-configs repository and the manual-trigger service.

What is prow-images?

The prow-images repository is a monorepo containing more than 35 specialized container images, each designed to handle a specific task within Prow jobs. These images range from basic utilities (such as Git operations) to sophisticated tools such as E2E test frameworks, Kubernetes cluster provisioning, and automated security PR generation.

Repository Structure

Each component in the repository follows a consistent structure:

  • A Dockerfile for building the container image
  • A VERSION file tracking the current release (e.g. v0.0.1)
  • An entrypoint directory containing the Go-based main application
  • Component-specific README documentation

A root-level Makefile orchestrates building all images and pushing them to the central registry at hub.tess.io/prowimages/.

Core Components

Let's take a closer look at some of the key components that make up this ecosystem:

1. CI Generator - The Configuration Automation Engine

The CI Generator is one of the most critical components in the ecosystem. It automatically generates Prow job specifications from simplified manifest files.

Key features:

  • Reads .manifest files from the prow-configs repository
  • Supports multiple job generation types: Build, UnitTest
  • Automatically generates presubmit and postsubmit job configurations
  • Handles complex build scenarios integrating with Kaniko

How it works:

  1. Developers create a ci.manifest file in the repository's job directory in prow-configs
  2. The CI Generator reads these manifests and produces complete Prow job YAML specifications
  3. Generated files are stamped with a header: "This file is auto-generated by the ci generator; do not edit manually"
  4. Jobs can be configured with different triggers: on PR (onPr) or on tag (onTag), with regex pattern support

Example manifest snippet:

tess/maintenance:
- jobGenType: Build
  name: build-maintenance-controller
  branch: master
  dockerFile: Dockerfile
  versionFile: VERSION
  buildTime: kaniko
  targets:
  - onPr:
      imageTags:
      - hub.tess.io/maintenance/maintenance-controller:pr-${PULL_NUMBER}
  - onTag:
      tagRegex:
      - ^v(\d)+\.(\d)+\.(\d)+
      imageTags:
      - hub.tess.io/maintenance/maintenance-controller:${PULL_BASE_REF}

2. Kaniko - Secure Container Builds

The Kaniko image wrapper provides a way to build container images securely inside Kubernetes Pods, with no access to a Docker daemon required.

Capabilities:

  • Git repository cloning with configurable depth
  • Support for git-crypt encrypted repositories
  • Multiple destination registries
  • Build arguments and labels
  • Registry mirror support
  • TLS verification options
  • Automatic build-result comments on GitHub PRs
  • Post-build command execution

Usage pattern in a Prow job:

spec:
  containers:
  - image: hub.tess.io/prowimages/kaniko:latest
    args:
    - --dockerfile=Dockerfile
    - --context=/workspace/repo
    - --destination=hub.tess.io/myapp:${PULL_NUMBER}
    - --build-arg=VERSION=${VERSION}

3. Kind - Kubernetes-in-Docker Test Environments

The Kind image makes it possible to spin up ephemeral Kubernetes clusters inside CI pipelines for E2E testing.

Features:

  • Creates isolated Kubernetes clusters
  • Supports multiple Kubernetes versions (1.20, 1.32, 1.34)
  • Integrates with upstream Kubernetes patches
  • Scriptable cluster configuration
  • Automatic cleanup

4. Auto Security PR - Automated RBAC Management

This specialized tool automates the creation of pull requests for security-related RBAC resources across multiple clusters.

Workflow:

  1. Define RBAC resources (ClusterRoles, ServiceAccounts, ClusterRoleBindings) in YAML
  2. Specify a scope: fcp, cluster, tessAppsAZ, tessNetAZ, or tessMasterAZ
  3. Target specific clusters, or use all: true to target every cluster in the scope
  4. The tool generates and submits a PR to the sig-security repository

Usage example:

clusterRoles:
- metadata:
    name: cluster-role-1
  rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
scope:
  cluster:
    all: true

5. Auto Approval/Validation - PR Automation

These components handle automatic approval and validation of pull requests:

  • Auto Approval: automatically approves PRs when specific files contain certain key-value pairs
  • Auto Validation: validates the immutability of PR changes and the correctness of Kubernetes objects

6. E2E Test Suites

Several E2E test images provide comprehensive testing capabilities:

  • e2e: the base image for general E2E testing
  • fd-e2e: dedicated to feature-development E2E testing
  • tessci: cluster acquisition and management tooling for E2E tests

7. Build Tool Collection

A variety of specialized build tools:

  • Bazel: multiple versions (3.4.1, 4.2.2, 7.3.1, 7.7.1) for Bazel builds
  • Go: a standardized Golang build environment
  • Ko: a container image builder for Go
  • Buildctl: a BuildKit CLI wrapper
  • Python: Python runtime environments (3.8.15, 3.14.0)

8. Git Operations

Git-related utilities:

  • git: a wrapper for core Git operations
  • git-sync-k8s-patches: syncs Kubernetes patches with upstream
  • make-commit: automated commit creation

9. Development Tools

  • helm-bot: automated Helm chart management and PR creation
  • autotag: automated version tagging
  • clone-and-do: clones a repository and runs a command (Bash, Make)
  • gotestcover: Go test coverage analysis and reporting

10. Specialized Tools

  • canirun: image vulnerability scanning
  • prow-config-validator: validates Prow configuration files
  • prow-image-builder: builds the images defined in this repository
  • create-release: automated release creation
  • release-notes: release note generation

The Three-Repository Ecosystem

Repository 1: prow-images

Purpose: container image definitions and build logic

Location: /Users/tashen/prow-images

Contents:

  • 35+ specialized container image definitions
  • Dockerfiles and VERSION files
  • Go-based entrypoint applications
  • Build orchestration via the Makefile
  • Shared Go libraries (git utilities, manifest processing)

Build process:

# Build all images
make image

# Push all images to the registry
make push

# Build a specific image
make image-kaniko

Registry: all images are pushed to hub.tess.io/prowimages/

Repository 2: prow-configs

Purpose: Prow job configurations and CI/CD pipeline definitions

Location: /Users/tashen/prow-configs

Structure:

prow-configs/
├── jobs/
│   ├── ebaytess/        # org-specific jobs
│   ├── ebayistio/
│   ├── ESTOOLS/
│   └── ...
├── prow-configs/        # Prow configuration files
├── hack/                # helper scripts
└── Makefile

Key files:

  • ci.manifest: simplified job definitions processed by the CI Generator
  • *.yaml: auto-generated Prow job specifications
  • Configuration for presubmit, postsubmit, and periodic jobs

Workflow:

  1. A developer creates or modifies a ci.manifest file
  2. Running make jobgen generates the job specifications
  3. CI validates the generated configuration
  4. After merge, Prow loads the new configuration
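To make the manifest-to-job translation concrete, here is a toy generator in Python (the field names follow the ci.manifest example earlier; the output shape is our own simplification, not the real CI Generator's format):

```python
def generate_presubmit_jobs(manifest: dict) -> list:
    """Toy sketch: expand Build entries that have an onPr target into presubmit jobs."""
    jobs = []
    for repo, entries in manifest.items():
        for entry in entries:
            for target in entry.get("targets", []):
                if "onPr" in target:
                    jobs.append({
                        "name": entry["name"],
                        "repo": repo,
                        "branch": entry["branch"],
                        "image_tags": target["onPr"]["imageTags"],
                    })
    return jobs

manifest = {
    "myorg/myrepo": [{
        "jobGenType": "Build",
        "name": "build-myapp",
        "branch": "main",
        "targets": [{"onPr": {"imageTags": ["hub.tess.io/myorg/myapp:pr-${PULL_NUMBER}"]}}],
    }]
}
print(generate_presubmit_jobs(manifest)[0]["name"])  # → build-myapp
```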

Repository 3: test-infra/prow/cmd/manual-trigger

Purpose: manual job triggering service

Location: /Users/tashen/test-infra/prow/cmd/manual-trigger

Function: an HTTP service that allows Prow jobs to be triggered without a GitHub event

End-to-End Workflows

Scenario 1: Adding a New Build Job

  1. Developer action (prow-configs)

# In prow-configs/jobs/myorg/myrepo/ci.manifest
myorg/myrepo:
- jobGenType: Build
  name: build-myapp
  branch: main
  dockerFile: Dockerfile
  versionFile: VERSION
  buildTime: kaniko
  targets:
  - onPr:
      imageTags:
      - hub.tess.io/myorg/myapp:pr-${PULL_NUMBER}
  2. CI generation

    • The CI Generator (from prow-images) reads the manifest
    • It generates the complete Prow job YAML specification
    • It creates a presubmit job that runs on PRs
    • It configures the Kaniko image from prow-images as the job container
  3. Job execution

    • When a PR is created, Prow triggers the presubmit job
    • The Kaniko image clones the repository
    • It builds the container using the specified Dockerfile
    • It pushes the image to the registry tagged pr-${PULL_NUMBER}
    • It posts the build status back to the GitHub PR

Scenario 2: Manually Running E2E Tests

  1. A developer needs to test a specific commit

curl -X POST "http://manual-trigger.tessprow/manual-trigger" \
  -H "Content-Type: application/json" \
  -d '{
    "org": "tess",
    "repo": "tessops",
    "base_ref": "feature-branch",
    "prowtype": "postsubmit",
    "prowjob": "sddz-e2e-k8s-1.32",
    "user": "developer-name"
  }'
  2. The manual-trigger service

    • Validates the request parameters
    • Looks up the job in the Prow configuration from prow-configs
    • Creates a ProwJob custom resource in Kubernetes
    • Sets the AUTHOR=developer-name environment variable
  3. Job execution

    • The Prow scheduler picks up the ProwJob
    • It uses the e2e image from prow-images
    • It provisions a Kind cluster (using the kind image from prow-images)
    • It runs the E2E tests
    • It reports the results (but does not post to GitHub, since the run was manually triggered)

Scenario 3: Auto-Generated Security PRs

  1. Security team action

    • Define RBAC resources in YAML files
    • Specify the target clusters (cluster 11, 22, or all clusters)
  2. Automated workflow

    • A Prow periodic job triggers the auto-security-pr image
    • The image scans the cluster configurations
    • It generates the appropriate RBAC manifests for each cluster
    • It creates a PR to the sig-security repository
    • The auto-validation job validates the PR
    • If the conditions are met, the auto-approval job approves it

Scenario 4: Building and Updating prow-images Itself

  1. A developer updates the Kaniko image

    • Modify /Users/tashen/prow-images/kaniko/entrypoint/main.go
    • Bump the version to v0.0.2 in /Users/tashen/prow-images/kaniko/VERSION
  2. Local build

cd /Users/tashen/prow-images
make image-kaniko
make push-kaniko
  3. CI integration

    • A PR to prow-images triggers presubmit jobs
    • The prow-image-builder job builds all modified images
    • Tests validate the new images
    • After merge, postsubmit jobs build and push to the registry
  4. prow-configs update

    • Jobs in prow-configs that reference the Kaniko image can now use the new version
    • Update the image tag in the manifest files, or use the latest tag for automatic updates

Key Integration Points

1. The Registry as the Central Hub

All components communicate through the central registry at hub.tess.io:

  • prow-images builds and pushes to hub.tess.io/prowimages/
  • prow-configs references images from this registry
  • Application images are built and pushed to org-specific namespaces

2. Manifest-Driven Configuration

The CI Generator bridges simple manifests and full Prow configuration:

  • Developers write simple .manifest files
  • The CI Generator (from prow-images) converts them into complete job specifications
  • Prow (configured by prow-configs) executes those jobs
  • The jobs run images from prow-images

3. Git as the Source of Truth

All three repositories use Git for versioning and triggering:

  • Changes to prow-images trigger image rebuilds
  • Changes to prow-configs trigger configuration validation
  • Changes to application repositories trigger the jobs defined in prow-configs
  • Manual-trigger provides out-of-band triggering when needed

4. Kubernetes-Native Architecture

Everything runs on Kubernetes:

  • Prow components run as Kubernetes services
  • ProwJobs are Kubernetes custom resources
  • All job execution happens in Kubernetes Pods
  • Images from prow-images provide the Pod containers

Advanced Features

Version Management

Every image in prow-images maintains a VERSION file:

v0.0.1

This enables:

  • Semantic versioning of the tools
  • Reproducible builds
  • Rollback capability
  • Image tag generation in CI jobs

Multi-Trigger Support

Jobs can be configured to run on different triggers:

  • onPr: runs on pull requests
  • onTag: runs when a matching tag pattern is pushed
  • Manual: triggered via the manual-trigger service
  • Periodic: executed on a schedule

Security and Authentication

  • Git operations use token-based authentication
  • Images support git-crypt for encrypted repositories
  • Registry authentication via Kubernetes Secrets
  • RBAC governs Prow job execution permissions

Observability

  • Prometheus metrics exposed on port 9090
  • Detailed logging in all images
  • GitHub status reporting for PR jobs
  • ProwJob status visible in the Prow UI (deck)

Best Practices

prow-images Development

  1. Version bumps: always update the VERSION file when making changes
  2. Testing: test images locally before pushing
  3. Documentation: update the component README for significant changes
  4. Backward compatibility: consider existing users when changing interfaces

prow-configs Management

  1. Use manifests: prefer .manifest files over hand-written job YAML
  2. Validate locally: run make jobgen before submitting a PR
  3. Test jobs: use manual-trigger to exercise new jobs before merging
  4. Avoid manual edits: never edit auto-generated files directly

Manual Triggering

  1. Use the right type: choose presubmit for PR testing, postsubmit for branch testing
  2. Set the user: always supply the user parameter for audit trails
  3. Monitor jobs: check job status in the Prow UI or via kubectl
  4. Clean up: failed jobs should be investigated and cleaned up

Common Workflow Summary

Adding a New Tool to prow-images

1. Create a directory: prow-images/mytool/
2. Add a Dockerfile
3. Add a VERSION file
4. Implement entrypoint/main.go
5. Update the Makefile with build targets
6. Build: make image-mytool
7. Push: make push-mytool
8. Reference it in prow-configs job definitions

Adding a New Job to prow-configs

1. Create or edit the ci.manifest file
2. Define the job with jobGenType, name, and triggers
3. Run: make jobgen
4. Verify the generated YAML
5. Open a PR against prow-configs
6. CI validates the configuration
7. Merge → Prow loads the new job

Manually Triggering a Job

1. Find the job name in prow-configs
2. Identify the org, repo, and branch
3. POST to the manual-trigger service
4. Receive the job_name in the response
5. Monitor: kubectl get prowjobs -n tessprow | grep <job_name>
6. View logs in the Prow UI

Architecture Diagram (Conceptual)

┌─────────────────────────────────────────────────────────┐
│                     GitHub Events                       │
│              (PR created, commit pushed)                │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                   Prow Controllers                      │
│        (read config from the prow-configs repo)         │
└─────────┬──────────────────────────────────────┬────────┘
          │                                      │
          │           trigger ProwJob            │
          ▼                                      ▼
┌──────────────────────┐            ┌──────────────────────┐
│   Manual Trigger     │            │    Periodic Jobs     │
│   (HTTP service)     │            │                      │
└─────────┬────────────┘            └──────────┬───────────┘
          │                                    │
          │           create ProwJob           │
          ▼                                    ▼
┌─────────────────────────────────────────────────────────┐
│             Kubernetes ProwJob Resources                │
└─────────┬───────────────────────────────────────────────┘
          │
          │ spawn Pods
          ▼
┌─────────────────────────────────────────────────────────┐
│                  Container Execution                    │
│        (using images from the prow-images repo)         │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │   Kaniko    │  │    Kind     │  │     E2E     │      │
│  │   image     │  │    image    │  │    image    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │     Git     │  │   CI-Gen    │  │  Auto-Sec   │      │
│  │    image    │  │    image    │  │    image    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────┬───────────────────────────────────────────────┘
          │
          │ results
          ▼
┌─────────────────────────────────────────────────────────┐
│                GitHub Status / Prow UI                  │
└─────────────────────────────────────────────────────────┘

Troubleshooting Guide

Image Build Failures

Problem: make image-kaniko fails

  • Check the Dockerfile syntax
  • Verify that the base image is available
  • Ensure dependencies are vendored
  • Check that the Docker daemon is running

Jobs Not Running

Problem: a Prow job does not trigger on a PR

  • Verify that the trigger plugin is enabled in prow-configs
  • Check that the job name matches the configuration
  • Ensure the prow-configs release object status is Succeeded
  • Verify that the manifest file was processed correctly

Manual Trigger Errors

Problem: "job X not found in configuration"

  • Verify the exact job name (it is case-sensitive)
  • Check that the org/repo combination is correct
  • Ensure the latest prow-configs has been deployed
  • Verify that the job type matches (presubmit/postsubmit)

Image Version Mismatches

Problem: a job runs an old image version

  • Check the image tag in the job specification
  • Verify that the new image was pushed to the registry
  • Force a fresh pull by avoiding the :latest tag
  • Check that imagePullPolicy is set correctly

Conclusion

The prow-images ecosystem is a comprehensive, production-grade CI/CD infrastructure built on Kubernetes and Prow. The three-repository architecture gives a clean separation of concerns:

  • prow-images: tools and utilities (the "how")
  • prow-configs: job definitions and pipelines (the "what")
  • manual-trigger: an out-of-band control plane (the "when")

Together, these components deliver:

  • Automated builds and testing
  • Secure container image creation
  • Ephemeral test environments
  • Automated security management
  • Flexible job triggering
  • CI/CD that scales to enterprise workloads

The manifest-driven approach of the CI Generator dramatically reduces the complexity of Prow configuration, keeping the system approachable for developers while preserving the full power of Kubernetes-native CI/CD.

Whether you are building new tools, adding test jobs, or manually triggering deployments, understanding how these three repositories interact is the key to using this CI/CD platform effectively.

Last updated: April 8, 2026

Linux Storage and Filesystem Deep Dive Series (Part 1): Overview and Architecture

About This Series

Welcome to the Linux Storage and Filesystem Deep Dive series! This series digs into the Linux kernel's storage subsystem and filesystem implementations at the source-code level, based on the Linux 6.4-rc1 kernel.

Series Outline

  1. Overview and architecture (this post)
  2. The VFS (virtual filesystem) layer in depth
  3. The block layer in detail
  4. The page cache and buffer cache
  5. Ext4 filesystem source analysis
  6. Btrfs filesystem core principles
  7. XFS filesystem implementation details
  8. I/O schedulers in depth
  9. Direct I/O and asynchronous I/O implementation
  10. Filesystem performance tuning and debugging

The Linux Storage Stack Architecture

The Linux storage stack is a layered architecture running from user space down to physical hardware, with the following main layers:

┌─────────────────────────────────────────────┐
│           User-space applications           │
│    (read, write, open, close, mmap...)      │
└─────────────────┬───────────────────────────┘
                  │ System Call Interface
┌─────────────────▼───────────────────────────┐
│      VFS (Virtual Filesystem Switch)        │
│   - struct inode, dentry, file, super       │
│   - generic file-operation interface        │
└─────────────────┬───────────────────────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────▼────┐ ┌───▼────┐ ┌───▼────┐
│   Ext4   │ │ Btrfs  │ │  XFS   │ ... (concrete filesystems)
└─────┬────┘ └───┬────┘ └───┬────┘
      │          │          │
      └──────────┼──────────┘
                 │
┌────────────────▼────────────────────────────┐
│        Page Cache / Buffer Cache            │
│   (struct address_space, struct page)       │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│               Block Layer                   │
│   - struct bio, struct request              │
│   - generic block-device operations         │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│               IO Scheduler                  │
│   - mq-deadline, BFQ, Kyber                 │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│          Block Device Drivers               │
│   - SCSI, NVMe, virtio-blk, ...             │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│            Physical Storage                 │
│   - HDD, SSD, NVMe, RAID, ...               │
└─────────────────────────────────────────────┘

Core Data Structures Overview

1. VFS Layer Core Structures

The Linux kernel's VFS is built around the following core data structures:

// include/linux/fs.h

// inode: represents an object in the filesystem (file, directory, etc.)
struct inode {
    umode_t                 i_mode;       // file type and permissions
    unsigned short          i_opflags;
    kuid_t                  i_uid;        // owner UID
    kgid_t                  i_gid;        // owner GID
    unsigned int            i_flags;

    const struct inode_operations *i_op;  // inode operation methods
    struct super_block      *i_sb;        // owning superblock
    struct address_space    *i_mapping;   // page cache mapping

    loff_t                  i_size;       // file size in bytes
    struct timespec64       i_atime;      // access time
    struct timespec64       i_mtime;      // modification time
    struct timespec64       i_ctime;      // status change time

    unsigned long           i_ino;        // inode number
    dev_t                   i_rdev;       // device number (device files)

    // ... more fields
};

// dentry: directory entry, linking a path name to an inode
struct dentry {
    unsigned int            d_flags;
    seqcount_spinlock_t     d_seq;
    struct hlist_bl_node    d_hash;
    struct dentry           *d_parent;    // parent directory entry
    struct qstr             d_name;       // file name
    struct inode            *d_inode;     // associated inode

    const struct dentry_operations *d_op; // dentry operation methods
    struct super_block      *d_sb;        // owning superblock

    // ... more fields
};

// file: represents an open file
struct file {
    struct path             f_path;       // holds the dentry and vfsmount
    struct inode            *f_inode;     // cached inode pointer
    const struct file_operations *f_op;   // file operation methods

    spinlock_t              f_lock;
    atomic_long_t           f_count;      // reference count
    unsigned int            f_flags;      // open flags
    fmode_t                 f_mode;       // file mode
    loff_t                  f_pos;        // file position pointer

    struct address_space    *f_mapping;   // page cache mapping

    // ... more fields
};

// super_block: the superblock, representing a mounted filesystem
struct super_block {
    struct list_head        s_list;
    dev_t                   s_dev;        // device number
    unsigned char           s_blocksize_bits;
    unsigned long           s_blocksize;  // block size
    loff_t                  s_maxbytes;   // maximum file size
    struct file_system_type *s_type;      // filesystem type
    const struct super_operations *s_op;  // superblock operation methods

    unsigned long           s_flags;
    unsigned long           s_magic;      // magic number
    struct dentry           *s_root;      // root dentry

    // ... more fields
};

2. Block Layer Core Structures

// include/linux/blk_types.h

// bio: Block I/O, describing one block I/O operation
struct bio {
    struct bio              *bi_next;     // request queue list
    struct block_device     *bi_bdev;     // target block device
    unsigned int            bi_opf;       // operation flags (read/write, etc.)

    unsigned short          bi_vcnt;      // size of the bio_vec array
    unsigned short          bi_max_vecs;  // maximum number of bio_vecs

    atomic_t                __bi_cnt;     // reference count
    struct bio_vec          *bi_io_vec;   // pointer to the page vector array

    bio_end_io_t            *bi_end_io;   // I/O completion callback
    void                    *bi_private;  // private data

    // ... more fields
};

// request: a request composed of one or more bios
struct request {
    struct request_queue    *q;           // owning request queue
    struct blk_mq_ctx       *mq_ctx;      // multi-queue context
    struct blk_mq_hw_ctx    *mq_hctx;     // hardware queue context

    unsigned int            cmd_flags;    // command flags
    enum rq_qos_id          rq_qos_id;    // QoS ID

    sector_t                __sector;     // starting sector
    unsigned int            __data_len;   // data length

    struct bio              *bio;         // head of the bio list
    struct bio              *biotail;     // tail of the bio list

    // ... more fields
};

Key Source File Locations

Based on the Linux 6.4-rc1 kernel, the main source files live here:

VFS Layer

  • fs/ - core filesystem code
    • fs/namei.c - path lookup and resolution
    • fs/open.c - file open operations
    • fs/read_write.c - read and write operations
    • fs/inode.c - inode management
    • fs/dcache.c - the dentry cache
    • fs/super.c - superblock management

Block Layer

  • block/ - block layer code
    • block/blk-core.c - core functionality
    • block/blk-mq.c - the multi-queue implementation
    • block/bio.c - bio handling
    • block/blk-merge.c - request merging
    • block/blk-settings.c - block device settings

IO Schedulers

  • block/mq-deadline.c - the deadline scheduler
  • block/bfq-iosched.c - the BFQ (Budget Fair Queueing) scheduler
  • block/kyber-iosched.c - the Kyber scheduler

Concrete Filesystems

  • fs/ext4/ - the Ext4 filesystem
  • fs/btrfs/ - the Btrfs filesystem
  • fs/xfs/ - the XFS filesystem

Page Cache

  • mm/filemap.c - file mapping and the page cache
  • mm/readahead.c - the readahead mechanism
  • fs/buffer.c - the buffer cache

The Complete Path of a File Read

Let's follow a simple read() system call and watch the data flow from disk to user space:

User program: read(fd, buffer, size)
    |
    v
Syscall entry: sys_read() (fs/read_write.c)
    |
    v
VFS layer: vfs_read()
    | - checks permissions and file state
    | - calls file->f_op->read_iter()
    v
Filesystem layer: ext4_file_read_iter() (using Ext4 as the example)
    | - calls generic_file_read_iter()
    v
Page cache: filemap_read() (mm/filemap.c)
    | - looks up the data in the page cache
    | - on a cache hit, returns immediately
    | - on a cache miss, continues...
    v
Filesystem readpage: ext4_readpage()
    | - prepares to read the data from disk
    | - builds a bio structure
    v
Block layer: submit_bio() (block/bio.c)
    | - submits the bio to the block layer
    | - requests may be merged
    v
IO scheduler: the scheduler processes the request
    | - mq-deadline / BFQ / Kyber
    | - optimizes I/O ordering
    v
Block driver: SCSI/NVMe/... driver
    | - issues commands to the hardware
    v
Hardware: the physical storage device performs the read
    |
    v (interrupt return)
Block driver: interrupt handler
    |
    v
Block layer: bio_endio()
    | - invokes the bio->bi_end_io() callback
    v
Page cache: the data is marked up to date
    |
    v
VFS: copies the data into the user-space buffer
    |
    v
User program: read() returns

Core Concepts

1. The Page Cache

The page cache is one of the core components of Linux memory management. It keeps file contents cached in memory, avoiding repeated disk accesses.

Key properties:

  • Caches file data in page-sized units (typically 4KB)
  • Uses struct address_space to map files to physical pages
  • Supports readahead to speed up sequential reads
  • Supports writeback to defer writes to disk

Code location: mm/filemap.c
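The hit/miss behavior plus readahead can be mimicked with a toy model (our own construction, vastly simpler than the kernel's): pages are cached by (file, page index), and a miss faults in the next few pages as well.

```python
# Illustrative page-cache model with 4 KB pages and simple sequential
# readahead. All names here are ours, not the kernel's.
PAGE_SIZE = 4096

class PageCache:
    def __init__(self, backing: dict):
        self.backing = backing   # path -> bytes, standing in for the disk
        self.pages = {}          # (path, page_idx) -> bytes
        self.hits = self.misses = 0

    def read_page(self, path, idx, readahead=2):
        key = (path, idx)
        if key in self.pages:
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        # Cache miss: fault in this page plus `readahead` following pages.
        for i in range(idx, idx + 1 + readahead):
            data = self.backing[path][i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
            if data:
                self.pages[(path, i)] = data
        return self.pages[key]
```

Sequential reads after the first fault mostly hit the cache, which is exactly why readahead pays off for streaming workloads.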

2. 块 I/O 层(Block Layer)

块设备层是连接文件系统和设备驱动的中间层。

核心功能:

  • 将文件系统的 I/O 请求转换为块设备请求
  • I/O 请求的合并和重排序
  • 支持多队列(blk-mq)架构
  • 提供 I/O 统计和 QoS 控制

代码位置: block/

3. VFS (Virtual Filesystem Switch)

VFS is the abstraction layer over all filesystems, providing a unified interface.

Design philosophy:

  • Presents a unified system-call interface upward
  • Defines standard filesystem operation interfaces downward
  • Uses an object-oriented design (polymorphism via function pointers)

Code location: fs/

Code Example: A Minimal Filesystem

To better understand VFS, here is a minimal filesystem example (modeled on ramfs):

// Superblock operations
static const struct super_operations simple_super_ops = {
	.statfs		= simple_statfs,
	.drop_inode	= generic_delete_inode,
};

// inode operations
static const struct inode_operations simple_dir_inode_operations = {
	.create	= simple_create,
	.lookup	= simple_lookup,
	.link	= simple_link,
	.unlink	= simple_unlink,
	.mkdir	= simple_mkdir,
	.rmdir	= simple_rmdir,
	.mknod	= simple_mknod,
	.rename	= simple_rename,
};

// File operations
static const struct file_operations simple_file_operations = {
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.fsync		= noop_fsync,
	.llseek		= generic_file_llseek,
};

// Fill in the superblock
static int simple_fill_super(struct super_block *sb, void *data, int silent)
{
	struct inode *inode;

	sb->s_blocksize = PAGE_SIZE;
	sb->s_blocksize_bits = PAGE_SHIFT;
	sb->s_magic = 0x12345678;
	sb->s_op = &simple_super_ops;

	// Create the root inode
	inode = new_inode(sb);
	if (!inode)
		return -ENOMEM;

	inode->i_ino = 1;
	inode->i_mode = S_IFDIR | 0755;
	inode->i_op = &simple_dir_inode_operations;
	inode->i_fop = &simple_file_operations;

	// Create the root dentry
	sb->s_root = d_make_root(inode);
	if (!sb->s_root)
		return -ENOMEM;

	return 0;
}

Performance-Critical Paths

In the Linux storage stack, these are the key performance paths:

1. Fast Paths

  • Page cache hits
  • Zero-copy operations (sendfile, splice)
  • Direct I/O (bypassing the page cache)

2. Slow Paths

  • Disk I/O caused by page cache misses
  • Synchronous writes and fsync operations
  • Metadata operations (creating and deleting files)

Debugging Tools

The following tools are invaluable when studying and debugging the storage system:

System Tools

# Block device I/O statistics
iostat -x 1

# Trace block I/O
blktrace -d /dev/sda -o - | blkparse -i -

# Filesystem cache statistics
cat /proc/meminfo | grep -i cache

# VFS statistics
cat /proc/sys/fs/dentry-state
cat /proc/sys/fs/inode-state

Kernel Tools

# Dynamic tracing (requires CONFIG_DYNAMIC_FTRACE)
trace-cmd record -e block:* -e vfs:* command

# BPF tools
biolatency   # I/O latency distribution
biosnoop     # trace block I/O events

Coming Up Next

In the next article, we will dig into the VFS layer, covering:

  • Lifecycle management of inodes, dentries, and files
  • Implementation details of the path lookup algorithm
  • Design and optimization of the dcache (dentry cache)
  • Source-level analysis of how VFS handles various file operations

References

  1. Linux kernel source (v6.4-rc1): https://git.kernel.org/
  2. Linux kernel documentation: Documentation/filesystems/
  3. Professional Linux Kernel Architecture
  4. Linux Kernel Development

Author's note: All source analysis in this series is based on Linux kernel v6.4-rc1. Implementation details may change as the kernel evolves, but the core design ideas remain relatively stable.

Understanding CIMaster: Intelligent CI Cluster Coordination at Scale

In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.

The Problem: Shared CI Infrastructure at Scale

In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:

  1. Resource Contention: Multiple CI jobs competing for limited test clusters
  2. Cluster State Management: Tracking which clusters are available, occupied, or held for debugging
  3. Manual Intervention: Developers needing to hold clusters for investigation without blocking others
  4. Dynamic Provisioning: Creating new clusters on-demand when capacity is insufficient
  5. Lifecycle Management: Automatically releasing clusters after use or expiration

CIMaster addresses all these challenges through a centralized coordination service.

Architecture Overview

CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:

Core Components

┌────────────────────────────────────────────────────────┐
│                    CIMaster Service                    │
│                                                        │
│  ┌──────────────────┐       ┌────────────────────────┐│
│  │ HTTP API Server  │       │    Cluster Manager     ││
│  │   (Port 8080)    │◄─────►│ - State Management     ││
│  │                  │       │ - Allocation Logic     ││
│  │ /getvacant       │       │ - Hold Expiration      ││
│  │ /holdcluster     │       │                        ││
│  │ /releasecluster  │       └───────────┬────────────┘│
│  │ /createcluster   │                   │             │
│  │ ...              │                   │             │
│  └──────────────────┘                   │             │
│                                         │             │
│  ┌──────────────────┐       ┌───────────▼────────────┐│
│  │  Metrics Server  │       │ Kubernetes ConfigMap   ││
│  │   (Port 8090)    │       │ - cluster.json         ││
│  │                  │       │ - Optimistic Locking   ││
│  └──────────────────┘       └────────────────────────┘│
│                                                        │
└────────────────┬───────────────────────────────────────┘
                 │ HTTP POST
                 ▼
      ┌───────────────────────┐
      │ Prow Manual Trigger   │
      │   /manual-trigger     │
      └───────────────────────┘

1. Cluster Manager (cluster-manager.go)

The heart of the system, responsible for:

  • Cluster Allocation: Finding and assigning vacant clusters to CI jobs
  • Hold Management: Allowing developers to reserve clusters for debugging (with 6-hour expiration)
  • Automatic Cleanup: Periodically releasing expired holds
  • Integration with Prow: Triggering cluster creation through Prow’s manual-trigger endpoint

2. Cluster Operations (cluster-ops.go)

Implements the ClusterInterface with operations like:

  • OccupyVacantCluster: Atomically allocate an available cluster
  • FinishOccupiedCluster: Return a cluster to the available pool
  • HoldCluster/ReleaseCluster: Manual hold management
  • AddCluster/DeleteCluster: Cluster inventory management

3. State Persistence

All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace):

[
  {
    "name": "cluster-01",
    "region": "us-west",
    "status": "testing",
    "lastJob": "e2e-conformance",
    "lastBuild": "12345",
    "lastTriggerName": "john",
    "hold": false,
    "disabled": false,
    "purpose": "tess-ci",
    "osimage": "centos-atomic-7.6.1810-qcow2"
  }
]

Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.

Integration with Prow’s Manual Trigger

One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.

What is Prow Manual Trigger?

Prow is Kubernetes' CI/CD system. The manual-trigger component (prow/cmd/manual-trigger in the test-infra repository) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.

Key Capabilities:

  • Accepts HTTP POST requests with job specifications
  • Creates ProwJob custom resources in Kubernetes
  • Supports presubmit, postsubmit, and periodic job types
  • Injects environment variables (like AUTHOR) into jobs

How CIMaster Uses Manual Trigger

When a user calls CIMaster's /createcluster endpoint:

curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"

CIMaster performs the following flow:

// 1. Construct Prow request
prowRequest := types.ProwManualTriggerRequest{
	Org:      "tess",
	Repo:     "tessops",
	BaseRef:  "master",       // from branch parameter
	ProwType: "postsubmit",
	ProwJob:  "e2e-k8s-1.32", // cluster creation job
	User:     "john",         // sets AUTHOR env var
}

// 2. Send to Prow manual-trigger endpoint
resp, err := http.Post(
	"https://prow.tess.io/manual-trigger",
	"application/json",
	requestBody,
)

// 3. Return status to caller
On the Prow side, the manual-trigger service:

// 1. Receives the request
func (s *server) handleManualTrigger(w http.ResponseWriter, r *http.Request) {
	var req triggerRequest
	json.NewDecoder(r.Body).Decode(&req)

	// 2. Looks up the job definition from config
	postsubmits := cfg.PostsubmitsStatic[req.Org+"/"+req.Repo]
	for _, p := range postsubmits {
		if p.Name == req.ProwJob {
			prowJob = createProwJobFromPostsubmit(p, req)
			break
		}
	}

	// 3. Injects AUTHOR environment variable
	if req.User != "" {
		addAuthorEnvToProwJob(prowJob, req.User)
	}

	// 4. Creates ProwJob in Kubernetes
	prowJobClient.Create(ctx, prowJob, metav1.CreateOptions{})

	// 5. Waits for BuildID and returns status link
	statusLink := fmt.Sprintf("https://prow.tess.io/prowjob?prowjob=%s", prowJob.Name)
	logLink := fmt.Sprintf("https://prow.tess.io/log?job=%s&id=%s", req.ProwJob, buildID)
}

Request-Response Flow

┌──────────┐         ┌──────────┐         ┌──────────────┐         ┌────────────┐
│   User   │         │ CIMaster │         │ Prow Manual  │         │ Kubernetes │
│          │         │          │         │   Trigger    │         │            │
└────┬─────┘         └────┬─────┘         └──────┬───────┘         └─────┬──────┘
     │                    │                      │                       │
     │ POST /createcluster│                      │                       │
     │ user=john          │                      │                       │
     ├───────────────────►│                      │                       │
     │                    │                      │                       │
     │                    │ POST /manual-trigger │                       │
     │                    │ {org, repo, prowjob} │                       │
     │                    ├─────────────────────►│                       │
     │                    │                      │                       │
     │                    │                      │ Create ProwJob        │
     │                    │                      │ with AUTHOR=john      │
     │                    │                      ├──────────────────────►│
     │                    │                      │                       │
     │                    │                      │◄──────────────────────┤
     │                    │                      │  ProwJob Created      │
     │                    │                      │                       │
     │                    │◄─────────────────────┤                       │
     │                    │ {success, job_name,  │                       │
     │◄───────────────────┤  status_link}        │                       │
     │ cluster creation   │                      │                       │
     │ triggered          │                      │                       │
     │                    │                      │                       │

The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.

Cluster Lifecycle State Machine

Clusters transition through several states:

      ┌───────────┐
      │ finished  │ ◄───────────────────┐
      │ (vacant)  │                     │
      └─────┬─────┘                     │
            │                           │
            │ /getvacant                │
            │                           │
            ▼                           │
      ┌──────────┐                      │
      │ testing  │                      │
      │(occupied)│                      │
      └─────┬────┘                      │
            │                           │
            │ /finishtest               │
            │                           │
            └───────────────────────────┘

Hold State (overlay):
      ┌─────────┐
      │ hold=   │
      │ false   │◄──── /releasecluster ────┐
      └────┬────┘                          │
           │                               │
           │ /holdcluster                  │
           │                               │
           ▼                               │
      ┌─────────┐                          │
      │ hold=   │                          │
      │ true    │──────────────────────────┘
      └─────────┘   (auto-expires in 6h)

Key Features and Implementation Details

1. Intelligent Allocation with Retry Logic

CIMaster implements exponential backoff with jitter to handle concurrent allocation:

type RandomBackoff struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	rng        *rand.Rand
}

func (rb *RandomBackoff) GetRetryInterval() time.Duration {
	delta := rb.MaxBackoff - rb.MinBackoff
	return rb.MinBackoff + time.Duration(rb.rng.Int63n(int64(delta)+1))
}

Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.

2. Automatic Hold Expiration

A background goroutine continuously checks for expired holds:

func (cm *ClusterManager) runCronReleaseHeldEnvs() {
	for {
		durationUntilNextExpire, err := cm.clearExpiredHolds()
		timer.Reset(durationUntilNextExpire)
		<-timer.C
	}
}

This ensures clusters don’t remain locked indefinitely if developers forget to release them.

3. Multi-Purpose Cluster Support

CIMaster supports different cluster types:

  • tess-ci: Standard CI test clusters
  • tnet-ci: Network-specific test clusters with OS image selection

Allocation respects purpose and OS image requirements:

if cluster.Purpose != purpose {
	continue // skip incompatible clusters
}
if cluster.Purpose == TnetCI && cluster.OSImage != osimage {
	continue // skip wrong OS image
}

4. Admin Authorization

Protected endpoints use a simple file-based authorization:

func checkUser(h http.HandlerFunc, users []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		userName := r.URL.Query().Get("name")
		if !contains(users, userName) {
			fmt.Fprintf(w, "user %s is not authorized", userName)
			return
		}
		h(w, r)
	}
}

Admin users are loaded from /botadmin/users file (semicolon-separated).

5. Observability

  • Prometheus Metrics: Exposed on port 8090 (/metrics)
  • Structured Logging: All operations logged with correlation IDs
  • Graceful Shutdown: 120-second grace period to handle in-flight requests

API Examples

Allocating a Cluster for CI

# Get a vacant cluster for build #123
CLUSTER=$(curl -s "http://cimaster:8080/getvacant?build=123&job=e2e-test&email=ci-bot@ebay.com")
echo "Using cluster: $CLUSTER"

# Run tests...

# Return cluster to pool
curl "http://cimaster:8080/finishtest?cluster=$CLUSTER"

Debugging Workflow

# Hold cluster for investigation
curl "http://cimaster:8080/holdcluster?cluster=cluster-05&name=alice&desc=debugging+network+issue"

# Investigate...
kubectl get pods -n test-namespace

# Release when done
curl "http://cimaster:8080/releasecluster?cluster=cluster-05&name=alice"

Creating a New Cluster

# Trigger cluster creation via Prow
curl "http://cimaster:8080/createcluster?user=alice&branch=master&job=e2e-k8s-1.32"
# Response: cluster creation triggered successfully: {...}

# Monitor Prow job
# https://prow.tess.io/prowjob?prowjob=<job-name>

# Once ready, admin adds it to the pool
curl "http://cimaster:8080/addcluster?cluster=cluster-20&region=eu-central&name=admin"

JSON API Support

For programmatic access:

curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"

Response:

{
  "name": "cluster-07",
  "region": "us-west",
  "status": "testing",
  "lastBuild": "123",
  "lastJob": "test",
  "lastTriggerName": "john",
  "purpose": "tess-ci"
}

Deployment

CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cimaster
  namespace: ci
spec:
  replicas: 3
  minReadySeconds: 90
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      containers:
      - name: cimaster
        image: hub.tess.io/tess/cimaster:v0.0.37
        command:
        - cimaster
        - --manageCluster=true
        - --cluster-config-map=clusters
        - --botAdminDir=/botadmin
        - --prow-url=https://prow.tess.io/manual-trigger
        - --default-prow-job=e2e-k8s-1.32
        ports:
        - containerPort: 8080  # API
        - containerPort: 8090  # Metrics

Performance Characteristics

  • Allocation Latency: ~100-300ms (includes ConfigMap read-write cycle)
  • Retry Overhead: 50-200ms per retry (max 3 attempts)
  • Hold Expiration Check: Every 10 minutes (default)
  • Concurrency: Safe for multiple replicas via optimistic locking

Real-World Impact

At eBay’s TESS platform, CIMaster manages:

  • 20+ shared test clusters across multiple regions
  • Hundreds of CI jobs daily from various teams
  • 6-hour automatic hold expiration preventing resource lock-ups
  • Sub-second allocation for most requests
  • Dynamic scaling through Prow integration

Future Enhancements

Potential improvements being considered:

  1. Priority Queues: Allow critical jobs to jump the allocation queue
  2. Cluster Health Checks: Automatic disabling of unhealthy clusters
  3. Usage Analytics: Track allocation patterns and optimize capacity
  4. Webhook Notifications: Slack/email alerts for hold expirations
  5. Multi-Cluster Federation: Coordinate across multiple Kubernetes clusters

Conclusion

CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:

  • Stateful cluster tracking in Kubernetes ConfigMaps
  • Optimistic locking for safe concurrent access
  • Automatic expiration for abandoned holds
  • Prow integration for dynamic provisioning
  • REST API for easy integration

…it provides a robust foundation for shared test infrastructure at scale.

The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.

Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.


This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.
