During device shutdown, dev_user_process_cleanup() spins at ~2 million
iterations per second, pins one CPU core, and triggers the kernel soft-lockup
detector within seconds. The root cause is a command stuck in ucmd_hash
with ref=1 because sgv_pool_free() caches the scatter-gather buffer on
the pool LRU instead of freeing it — so the allocator’s ucmd_put() callback
never fires and the reference count never reaches zero.
The fix: two sgv_pool_flush() calls added after the unjam loop in dev_user_unjam_dev().
SCST is a high-performance storage target subsystem for Linux. The scst_user
module allows user-space applications to implement SCSI target devices via a
character device interface.
Key data structures:
- ucmd_hash: hash table tracking all active scst_user_cmd objects
- ready_cmd_list: queue of commands ready for user-space processing
- cleanup_cmpl: completion for device cleanup synchronization
- ucmd_ref: per-command reference count; dev_user_free_ucmd() → cmd_remove_hash() fires only when atomic_dec_and_test() returns true

Normal command lifecycle:
```
dev_user_alloc_ucmd()    ucmd_ref = 1
```
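The reference-counting rule described above can be sketched in user-space C. This is a minimal single-threaded model under stated assumptions, not SCST code: every `model_` name is hypothetical, a plain `int` stands in for the atomic ucmd_ref, and a boolean stands in for membership in ucmd_hash.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct scst_user_cmd (illustration only). */
struct model_ucmd {
    int ref;       /* models ucmd_ref */
    bool in_hash;  /* models membership in ucmd_hash */
};

/* Allocation: the command starts with one reference and is tracked. */
static void model_ucmd_init(struct model_ucmd *u)
{
    u->ref = 1;
    u->in_hash = true;
}

/* Take an extra reference on a live command. */
static void model_ucmd_get(struct model_ucmd *u)
{
    assert(u->ref > 0);
    u->ref++;
}

/*
 * Drop one reference. Only the put that reaches zero removes the
 * command from the hash, mirroring the "fires only when
 * atomic_dec_and_test() returns true" rule.
 */
static bool model_ucmd_put(struct model_ucmd *u)
{
    assert(u->ref > 0);
    if (--u->ref == 0) {
        u->in_hash = false;
        return true;
    }
    return false;
}
```

In this model, a single unbalanced `model_ucmd_get()` is enough to keep `in_hash` true forever, which is the shape of the failure analyzed in the rest of the article.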
Multiple scst_usr_release threads stuck in D state:
```
[Thu Jan 23 02:37:11 2025] task:scst_usr_releas state:D stack: 0 pid:334614
```
The threads block on wait_for_completion(&dev->cleanup_cmpl), which is never
signaled because the cleanup thread is spinning in an infinite loop and never
reaches complete_all(&dev->cleanup_cmpl).
When an SCST user device is torn down:

1. dev_user_exit_dev() unregisters the device, sets dev->cleanup_done = 1, then blocks on wait_for_completion(&dev->cleanup_cmpl).
2. dev_user_process_cleanup() runs in a separate thread, drains remaining commands, and calls complete_all(&dev->cleanup_cmpl) to unblock step 1.

The exit condition requires rc1 == 0 (hash empty) and rc == -EAGAIN
(ready list empty) and cleanup_done:
```c
while (1) {
```
If any command remains in ucmd_hash but is not in ready_cmd_list, then rc1 > 0 and rc == -EAGAIN hold simultaneously, and the loop has no exit.
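The exit test can be written as a small predicate to make the dead end explicit. The sketch below is a user-space model, not the kernel loop: the `model_` names are hypothetical, and it assumes rc1 is the count returned by dev_user_unjam_dev() and rc the result of dev_user_get_next_cmd(), as the surrounding text suggests.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Exit requires: hash empty, ready list empty, and cleanup requested. */
static bool model_cleanup_may_exit(int rc1, int rc, bool cleanup_done)
{
    return rc1 == 0 && rc == -EAGAIN && cleanup_done;
}

/*
 * Run the modeled loop with a fixed number of commands left in the hash
 * and an empty ready list. Returns the pass on which the loop exits, or
 * -1 if it is still spinning after max_passes.
 */
static long model_cleanup_passes(int hashed_cmds, long max_passes)
{
    assert(max_passes > 0);
    for (long pass = 1; pass <= max_passes; pass++) {
        int rc1 = hashed_cmds; /* nonzero while anything stays hashed */
        int rc = -EAGAIN;      /* nothing is queued for user space */

        if (model_cleanup_may_exit(rc1, rc, true))
            return pass;
    }
    return -1;
}
```

With `hashed_cmds == 0` the model exits on the first pass; with any positive value it never exits, because nothing inside the loop changes rc1.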
The command stuck in ucmd_hash has:
```
state = UCMD_STATE_ON_FREE_SKIPPED (7)
```
State 7 is set in dev_user_on_free_cmd() when on_free_cmd_type is SCST_USER_ON_FREE_CMD_IGNORE:
```c
if (ucmd->dev->on_free_cmd_type == SCST_USER_ON_FREE_CMD_IGNORE) {
```
dev_user_process_reply_on_free() frees the SGV buffer and drops a reference:
```c
static int dev_user_process_reply_on_free(struct scst_user_cmd *ucmd)
```
This looks correct. The problem is what dev_user_free_sgv() actually does.
sgv_pool_free() is a cache return, not a free

```c
static void dev_user_free_sgv(struct scst_user_cmd *ucmd)
```
The SGV (scatter-gather vector) pool is a performance cache: it holds
recently freed SG buffers so future commands can reuse them without hitting
the page allocator. When sgv_pool_free() is called:
- dev_user_free_sg_entries() is not called.
- The ucmd_get() reference taken in dev_user_alloc_pages() is not released.

dev_user_free_sg_entries() (and its ucmd_put()) fires only when the pool evicts a cached object, via an explicit sgv_pool_flush().
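The "free parks, flush evicts" behavior can be illustrated with a toy cache. This is a hedged user-space sketch, not the SGV pool implementation: all `model_` names are hypothetical, the cache is a plain singly linked list with no size limit, and the eviction callback stands in for dev_user_free_sg_entries().

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_evict_cb)(void *priv);

/* Hypothetical cached buffer: carries the callback to run on eviction. */
struct model_sgv_obj {
    struct model_sgv_obj *next;
    model_evict_cb on_evict;
    void *priv;
};

struct model_sgv_pool {
    struct model_sgv_obj *cached; /* head of the cache list */
};

/* Models sgv_pool_free(): park the object; no callback runs here. */
static void model_pool_free(struct model_sgv_pool *pool,
                            struct model_sgv_obj *obj)
{
    assert(obj->on_evict != NULL);
    obj->next = pool->cached;
    pool->cached = obj;
}

/* Models sgv_pool_flush(): evict everything, running each callback inline. */
static size_t model_pool_flush(struct model_sgv_pool *pool)
{
    size_t evicted = 0;

    while (pool->cached != NULL) {
        struct model_sgv_obj *obj = pool->cached;

        pool->cached = obj->next;
        obj->next = NULL;
        obj->on_evict(obj->priv);
        evicted++;
    }
    return evicted;
}

/* Example callback: drops one reference held on behalf of the buffer. */
static void model_drop_ref(void *priv)
{
    int *ref = priv;

    (*ref)--;
}
```

A caller that watches a counter through `model_drop_ref()` sees it unchanged after `model_pool_free()` and decremented only after `model_pool_flush()`, which is the decoupling the text describes.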
| Event | Operation | ucmd_ref |
|---|---|---|
| dev_user_alloc_ucmd() | atomic_set(&ucmd_ref, 1) | 1 |
| dev_user_alloc_pages() | ucmd_get() for first SG page | 2 |
| dev_user_unjam_dev(): ucmd_get_check() | bump to verify not zombie | 3 |
| dev_user_unjam_cmd() → scst_cmd_done() → dev_user_on_free_cmd() → dev_user_free_sgv() → sgv_pool_free() | SGV goes to pool LRU; dev_user_free_sg_entries() not called; alloc_pages ref not released | 3 |
| dev_user_process_reply_on_free(): ucmd_put() | 3 → 2 | 2 |
| dev_user_unjam_dev(): ucmd_put() for ucmd_get_check ref | 2 → 1 | 1 |

cmd_remove_hash() fires only when atomic_dec_and_test() returns true (ref
reaches 0). It never does: the alloc_pages reference is never released because
dev_user_free_sg_entries() never fires. The ucmd stays in ucmd_hash
indefinitely.
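The trace in the table can be replayed mechanically to check the arithmetic. The following is a sketch under stated assumptions: the event names and `model_replay()` are hypothetical, each event applies exactly the ucmd_ref change listed in the table, and an extra flush event models eviction of the cached SGV.

```c
#include <assert.h>
#include <stddef.h>

/* One event per table row, plus a flush that evicts a cached SGV. */
enum model_event {
    EV_ALLOC_UCMD,        /* atomic_set(&ucmd_ref, 1) */
    EV_ALLOC_PAGES_GET,   /* ucmd_get() for the first SG page */
    EV_GET_CHECK,         /* ucmd_get_check() in the unjam loop */
    EV_SGV_POOL_FREE,     /* SGV parked in the cache; ref unchanged */
    EV_REPLY_ON_FREE_PUT, /* ucmd_put() in dev_user_process_reply_on_free() */
    EV_UNJAM_PUT,         /* ucmd_put() balancing ucmd_get_check() */
    EV_SGV_POOL_FLUSH     /* eviction runs the deferred ucmd_put() */
};

/* Replays the events in order and returns the final reference count. */
static int model_replay(const enum model_event *events, size_t n)
{
    int ref = 0;
    int sgv_cached = 0;

    for (size_t i = 0; i < n; i++) {
        switch (events[i]) {
        case EV_ALLOC_UCMD:
            ref = 1;
            break;
        case EV_ALLOC_PAGES_GET:
        case EV_GET_CHECK:
            ref++;
            break;
        case EV_SGV_POOL_FREE:
            sgv_cached = 1;
            break;
        case EV_REPLY_ON_FREE_PUT:
        case EV_UNJAM_PUT:
            assert(ref > 0);
            ref--;
            break;
        case EV_SGV_POOL_FLUSH:
            if (sgv_cached) {
                sgv_cached = 0;
                assert(ref > 0);
                ref--;
            }
            break;
        }
    }
    return ref;
}

/* The six rows of the table, in order. */
static const enum model_event model_table_trace[] = {
    EV_ALLOC_UCMD, EV_ALLOC_PAGES_GET, EV_GET_CHECK,
    EV_SGV_POOL_FREE, EV_REPLY_ON_FREE_PUT, EV_UNJAM_PUT
};
```

Replaying `model_table_trace` ends with a count of 1; appending `EV_SGV_POOL_FLUSH` to the same sequence brings it to 0.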
After unjamming, the stuck ucmd has sent_to_user = 0 and is not in ready_cmd_list. On every subsequent pass:
```c
list_for_each_entry(ucmd, head, hash_list_entry) {
```
- res is non-zero (rc1 > 0) but no command is unjammed.
- dev_user_get_next_cmd() returns -EAGAIN (ucmd not in ready_cmd_list).

Both functions acquire and release a spinlock in under a microsecond.
Result: ~2 million iterations per second, 100% CPU on one core, soft-lockup
detector fires within seconds.
dev_user_unjam_dev() already calls sgv_pool_flush() before the unjam
loop:
```c
static int dev_user_unjam_dev(struct scst_user_dev *dev)
```
SGV objects are placed into the pool cache during unjamming, when
scst_cmd_done → dev_user_on_free_cmd → dev_user_free_sgv → sgv_pool_free
executes inside the unjam loop. A flush that precedes the loop
cannot evict objects that do not yet exist in the cache.
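The ordering argument can be checked with a toy model of a single unjam pass. This is a sketch under stated assumptions, not the SCST function: the `model_` names are hypothetical, each unjammed command is assumed to park exactly one SG buffer in the cache, and each evicted buffer is assumed to release the last reference of one command.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device counters for one unjam pass. */
struct model_dev {
    int hashed;     /* commands still present in the hash */
    int to_unjam;   /* commands the loop will unjam in this pass */
    int cached_sgv; /* SG buffers parked in the pool cache */
};

/* Evict every cached buffer; each eviction lets one command leave the hash. */
static void model_flush(struct model_dev *d)
{
    assert(d->cached_sgv <= d->hashed);
    d->hashed -= d->cached_sgv;
    d->cached_sgv = 0;
}

/*
 * One pass: optional flush, the unjam loop, optional flush.
 * Returns the number of commands still hashed afterwards.
 */
static int model_unjam_pass(struct model_dev *d, bool flush_before,
                            bool flush_after)
{
    if (flush_before)
        model_flush(d); /* cache is still empty: nothing to evict */

    while (d->to_unjam > 0) {
        d->to_unjam--;
        d->cached_sgv++; /* the free path parks the buffer in the cache */
    }

    if (flush_after)
        model_flush(d); /* the buffers exist now, so eviction can run */

    return d->hashed;
}
```

Starting from three hashed commands, a pass that flushes only before the loop still reports three, while a pass that also flushes after the loop reports zero. The model covers one pass only and makes no claim about later passes.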
```c
static int dev_user_unjam_dev(struct scst_user_dev *dev)
```
sgv_pool_flush() is fully synchronous — it calls sgv_dtor_and_free()
inline in a while loop, so by the time it returns all eviction callbacks have
already fired. The call chain on eviction:
```
sgv_pool_flush()
```
On the next iteration dev_user_unjam_dev() returns res = 0, and dev_user_process_cleanup() breaks normally, within 2 to 3 iterations.
| Item | Detail |
|---|---|
| Symptom | dev_user_process_cleanup() loops at ~2M iter/s; soft lockup |
| Stuck ucmd | state=7 (ON_FREE_SKIPPED), ref=1, not in ready list |
| Why ref stays at 1 | sgv_pool_free() caches the SGV on the pool LRU; dev_user_free_sg_entries() never fires; the ucmd_get() from dev_user_alloc_pages() is never balanced |
| Why pre-unjam flush failed | Runs before unjamming; SGV objects are cached during unjamming |
| Fix | sgv_pool_flush() for both pools after the unjam loop |
| Fix size | 2 function calls |
The SGV pool decouples sgv_pool_free() from the actual page release. Code
that relies on “free → callback → ucmd_put” must account for the callback
firing on eviction, not on free. At teardown time, an explicit sgv_pool_flush() is required to force eviction and drain all outstanding
references before checking whether the hash is empty.
Tags: #kernel #scst #storage #debugging #linux #memory-management #sgv-pool #reference-counting