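Overview

At the core of SCST (SCSI Target Subsystem for Linux) is a carefully designed command-processing engine. This article examines in depth how SCST handles a SCSI command from receipt through completed response, including state-machine transitions, the threading model, execution-context switching, and other key mechanisms.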


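Overview

SCST (SCSI Target Subsystem for Linux) is a high-performance framework in the Linux kernel for building SCSI target devices. It lets a Linux system export local storage resources to remote hosts over various SCSI transport protocols (such as iSCSI, FC, and SRP). This article takes a close look at SCST's core concepts, architecture, and key data structures.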


SCST Infinite Loop: Root Cause Analysis and Fix

Summary

During device shutdown, dev_user_process_cleanup() spins at ~2 million
iterations per second, pins one CPU core, and triggers the kernel soft-lockup
detector within seconds. The root cause is a command stuck in ucmd_hash
with ref=1 because sgv_pool_free() caches the scatter-gather buffer on
the pool LRU instead of freeing it — so the allocator’s ucmd_put() callback
never fires and the reference count never reaches zero.

The fix: two sgv_pool_flush() calls added after the unjam loop in
dev_user_unjam_dev().


Background: The SCST User-Space Device Handler

SCST is a high-performance storage target subsystem for Linux. The scst_user
module allows user-space applications to implement SCSI target devices via a
character device interface.

Key data structures:

  • ucmd_hash: hash table tracking all active scst_user_cmd objects
  • ready_cmd_list: queue of commands ready for user-space processing
  • cleanup_cmpl: completion for device cleanup synchronization
  • ucmd_ref: per-command reference count; dev_user_free_ucmd() →
    cmd_remove_hash() runs only when atomic_dec_and_test() returns true

Normal command lifecycle:

dev_user_alloc_ucmd()     ucmd_ref = 1
dev_user_alloc_pages()    ucmd_ref++ (ucmd_get) for each SG allocation
sent to user space        sent_to_user = 1
reply from user space     processing begins
dev_user_on_free_cmd()    SGV freed, one ucmd_put()
dev_user_free_ucmd()      ref reaches 0 → cmd_remove_hash()
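
The lifecycle above can be sketched as a small user-space model. This is not the scst_user code: the model_* names are hypothetical, C11 atomics stand in for the kernel's atomic_t, and the pairing of the two puts (one balancing the page-allocation reference, one dropping the initial reference) is an assumption read off the table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct scst_user_cmd; only the fields discussed. */
struct model_ucmd {
    atomic_int ref;   /* models ucmd_ref */
    bool in_hash;     /* models membership in ucmd_hash */
};

static void model_alloc(struct model_ucmd *u)
{
    atomic_init(&u->ref, 1);   /* allocation starts with one reference */
    u->in_hash = true;
}

static void model_get(struct model_ucmd *u)
{
    atomic_fetch_add(&u->ref, 1);
}

/* Returns true when this put dropped the last reference. */
static bool model_put(struct model_ucmd *u)
{
    int old = atomic_fetch_sub(&u->ref, 1);

    assert(old > 0);
    if (old == 1) {            /* same condition as atomic_dec_and_test() */
        u->in_hash = false;    /* models cmd_remove_hash() */
        return true;
    }
    return false;
}

/* Walks the normal lifecycle and returns the final reference count. */
static int model_normal_lifecycle(struct model_ucmd *u)
{
    model_alloc(u);   /* ref = 1 */
    model_get(u);     /* page allocation takes a reference: ref = 2 */
    model_put(u);     /* put balancing the page-allocation reference: ref = 1 */
    model_put(u);     /* last put: ref = 0, command leaves the hash */
    return atomic_load(&u->ref);
}
```

The point of the model is that the command leaves the hash only on the put that takes the count from 1 to 0; any get without a matching put keeps it hashed.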

The Symptom

Multiple scst_usr_release threads stuck in D state:

[Thu Jan 23 02:37:11 2025] task:scst_usr_releas state:D stack:    0 pid:334614
[Thu Jan 23 02:37:11 2025] Call Trace:
[Thu Jan 23 02:37:11 2025] __schedule+0x23d/0x590
[Thu Jan 23 02:37:11 2025] schedule+0x4e/0xb0
[Thu Jan 23 02:37:11 2025] schedule_timeout+0xfb/0x140
[Thu Jan 23 02:37:11 2025] wait_for_completion+0x24/0x30
[Thu Jan 23 02:37:11 2025] dev_user_exit_dev.isra.0+0x16a/0x1e0 [scst_user]

The threads block on wait_for_completion(&dev->cleanup_cmpl), which is never
signaled because the cleanup thread is spinning in an infinite loop and never
reaches complete_all(&dev->cleanup_cmpl).


The Cleanup Flow

When an SCST user device is torn down:

  1. dev_user_exit_dev() — unregisters the device, sets
    dev->cleanup_done = 1, then blocks on
    wait_for_completion(&dev->cleanup_cmpl).

  2. dev_user_process_cleanup() — runs in a separate thread, drains
    remaining commands, and calls complete_all(&dev->cleanup_cmpl) to
    unblock step 1.

The exit condition requires rc1 == 0 (hash empty) and rc == -EAGAIN
(ready list empty) and cleanup_done:

while (1) {
    rc1 = dev_user_unjam_dev(dev);  /* returns number of cmds in hash */

    if (rc1 == 0 && rc == -EAGAIN && dev->cleanup_done)
        break;                      /* normal exit */

    spin_lock_irq(&dev->udev_cmd_threads.cmd_list_lock);
    rc = dev_user_get_next_cmd(dev, &ucmd, false);
    if (rc == 0)
        dev_user_unjam_cmd(ucmd, 1, NULL);
    spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);
}
complete_all(&dev->cleanup_cmpl);   /* never reached */

If any command remains in ucmd_hash but is not in ready_cmd_list,
rc1 > 0 and rc == -EAGAIN simultaneously, and the loop has no exit.
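
To make the no-exit case concrete, here is a minimal user-space sketch of the break condition. The functions cleanup_may_exit() and simulate_cleanup() are hypothetical helpers, not part of scst_user; the simulation assumes the number of stranded commands never changes and the ready list is always empty.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical predicate mirroring the break condition in the loop above. */
static bool cleanup_may_exit(int rc1, int rc, bool cleanup_done)
{
    return rc1 == 0 && rc == -EAGAIN && cleanup_done;
}

/*
 * Runs at most max_iters passes with a fixed number of commands that stay
 * in the hash but never show up on the ready list. Returns the index of
 * the pass on which the loop would break, or -1 if the budget runs out.
 */
static long simulate_cleanup(int stranded_in_hash, long max_iters)
{
    int rc = 0;

    for (long i = 0; i < max_iters; i++) {
        int rc1 = stranded_in_hash;   /* stand-in for dev_user_unjam_dev() */

        if (cleanup_may_exit(rc1, rc, true))
            return i;
        rc = -EAGAIN;                 /* ready list is empty on every pass */
    }
    return -1;
}
```

With no stranded commands the model breaks on the second pass (the first pass only establishes rc == -EAGAIN); with a single stranded command it never breaks, whatever the iteration budget.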


Root Cause: SGV Pool Caching Strands a Reference

The stuck command

The command stuck in ucmd_hash has:

state        = UCMD_STATE_ON_FREE_SKIPPED (7)
scst_cmd     = NULL
ucmd_ref     = 1
sent_to_user = 0

State 7 is set in dev_user_on_free_cmd() when on_free_cmd_type is
SCST_USER_ON_FREE_CMD_IGNORE:

    if (ucmd->dev->on_free_cmd_type == SCST_USER_ON_FREE_CMD_IGNORE) {
        ucmd->state = UCMD_STATE_ON_FREE_SKIPPED;
        goto out_reply;
    }
    ...
out_reply:
    dev_user_process_reply_on_free(ucmd);

dev_user_process_reply_on_free() frees the SGV buffer and drops a reference:

static int dev_user_process_reply_on_free(struct scst_user_cmd *ucmd)
{
    dev_user_free_sgv(ucmd);    /* free the scatter-gather buffer */
    ucmd_put(ucmd);             /* drop one reference */
    return 0;
}

This looks correct. The problem is what dev_user_free_sgv() actually does.

sgv_pool_free() is a cache return, not a free

static void dev_user_free_sgv(struct scst_user_cmd *ucmd)
{
    if (ucmd->sgv) {
        sgv_pool_free(ucmd->sgv, &ucmd->dev->udev_mem_lim);
        ucmd->sgv = NULL;
    } else if (ucmd->data_pages) {
        ucmd_get(ucmd);
        __dev_user_free_sg_entries(ucmd);
    }
}

The SGV (scatter-gather vector) pool is a performance cache: it holds
recently freed SG buffers so future commands can reuse them without hitting
the page allocator. When sgv_pool_free() is called:

  • The SGV object is placed on the pool’s LRU cache.
  • The allocator’s free callback — dev_user_free_sg_entries() — is not called.
  • The ucmd_get() reference taken in dev_user_alloc_pages() is not released.

dev_user_free_sg_entries() (and its ucmd_put()) only fires when the pool
evicts a cached object — via an explicit sgv_pool_flush().
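
The deferred callback can be illustrated with a toy pool. This is a simplified sketch, not the SGV pool implementation: all model_* names are hypothetical, the pool always caches (it has no size limit or direct-free path beyond a fixed array), and a plain int stands in for ucmd_ref.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CACHE_MAX 8

struct model_obj {
    int *owner_ref;   /* reference count of the command that owns the buffer */
};

struct model_pool {
    struct model_obj *cache[MODEL_CACHE_MAX];
    size_t cached;
};

/* Models the allocator's free callback: drops the owner's reference. */
static void model_free_cb(struct model_obj *obj)
{
    assert(*obj->owner_ref > 0);
    (*obj->owner_ref)--;
}

/* Models a cache return: the object is parked and the callback is NOT run. */
static void model_pool_free(struct model_pool *p, struct model_obj *obj)
{
    assert(p->cached < MODEL_CACHE_MAX);
    p->cache[p->cached++] = obj;
}

/* Models a flush: every cached object is evicted, callback runs inline. */
static void model_pool_flush(struct model_pool *p)
{
    while (p->cached > 0)
        model_free_cb(p->cache[--p->cached]);
}

/*
 * Frees one buffer, drops the on-free reference, then flushes. Stores the
 * owner's count as seen before the flush and returns the count after it.
 */
static int model_free_then_flush(int *before_flush)
{
    int ref = 2;   /* 1 initial + 1 taken at page allocation */
    struct model_pool pool = { .cached = 0 };
    struct model_obj obj = { .owner_ref = &ref };

    model_pool_free(&pool, &obj);   /* buffer is cached, callback deferred */
    ref--;                          /* the put on the on-free path */
    *before_flush = ref;
    model_pool_flush(&pool);        /* eviction finally runs the callback */
    return ref;
}
```

In this model the owner's count sits at 1 between the free and the flush, and reaches 0 only once the flush has run the callback.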

The complete reference count trace

| Event | Operation | ucmd_ref |
| --- | --- | --- |
| dev_user_alloc_ucmd() | atomic_set(&ucmd_ref, 1) | 1 |
| dev_user_alloc_pages() | ucmd_get() for first SG page | 2 |
| dev_user_unjam_dev(): ucmd_get_check() | bump to verify not zombie | 3 |
| dev_user_unjam_cmd() → scst_cmd_done() → dev_user_on_free_cmd() → dev_user_free_sgv() → sgv_pool_free() | SGV goes to pool LRU; dev_user_free_sg_entries() not called; alloc_pages ref not released | 3 |
| dev_user_process_reply_on_free(): ucmd_put() | 3 → 2 | 2 |
| dev_user_unjam_dev(): ucmd_put() for ucmd_get_check ref | 2 → 1 | 1 |

cmd_remove_hash() fires only when atomic_dec_and_test() returns true (ref
reaches 0). It never does — the alloc_pages reference is never released because
dev_user_free_sg_entries() never fires. The ucmd stays in ucmd_hash
indefinitely.

Why the loop spins at 2 million iterations per second

After unjamming, the stuck ucmd has sent_to_user = 0 and is not in
ready_cmd_list. On every subsequent pass:

list_for_each_entry(ucmd, head, hash_list_entry) {
    res++;          /* always incremented: hash is not empty */
    if (!ucmd->sent_to_user)
        continue;   /* always taken: sent_to_user == 0 */
}

res is non-zero (rc1 > 0) but no command is unjammed.
dev_user_get_next_cmd() returns -EAGAIN (ucmd not in ready_cmd_list).
Both functions acquire and release a spinlock in under a microsecond.
Result: ~2 million iterations per second, 100% CPU on one core, soft-lockup
detector fires within seconds.
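
A single pass of that scan can be modeled in user space as follows. This is only a sketch: struct model_cmd and model_scan() are hypothetical, an array replaces the kernel hash lists, and "unjamming" is reduced to counting the commands that would qualify.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hashed command; only the flag the scan reads. */
struct model_cmd {
    bool sent_to_user;
};

/*
 * Models one pass over the hash. Returns the number of commands seen (the
 * value the caller treats as rc1) and stores in *unjammed how many of them
 * were eligible for unjamming.
 */
static int model_scan(const struct model_cmd *cmds, size_t n, int *unjammed)
{
    int res = 0;

    assert(unjammed != NULL);
    *unjammed = 0;
    for (size_t i = 0; i < n; i++) {
        res++;                       /* counted whether or not it can be unjammed */
        if (!cmds[i].sent_to_user)
            continue;                /* skipped: nothing to unjam */
        (*unjammed)++;
    }
    return res;
}
```

For a hash holding one command with sent_to_user == 0, the scan reports one command and unjams none, so the caller sees a non-empty hash and no progress on every pass.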


The Fix: Post-Unjam SGV Pool Flush

Why the existing pre-unjam flush was not enough

dev_user_unjam_dev() already calls sgv_pool_flush() before the unjam
loop:

static int dev_user_unjam_dev(struct scst_user_dev *dev)
{
    sgv_pool_flush(dev->pool);          /* before unjamming */
    sgv_pool_flush(dev->pool_clust);

    spin_lock_irq(&dev->udev_cmd_threads.cmd_list_lock);
    /* ... unjam loop ... */
    spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);

    return res;
}

SGV objects are placed into the pool cache during unjamming, when
scst_cmd_done() → dev_user_on_free_cmd() → dev_user_free_sgv() →
sgv_pool_free() executes inside the unjam loop. A flush that precedes the loop
cannot evict objects that do not yet exist in the cache.

The fix

static int dev_user_unjam_dev(struct scst_user_dev *dev)
{
    sgv_pool_flush(dev->pool);          /* existing flush, before unjamming */
    sgv_pool_flush(dev->pool_clust);

    spin_lock_irq(&dev->udev_cmd_threads.cmd_list_lock);
    /* ... unjam loop ... */
    spin_unlock_irq(&dev->udev_cmd_threads.cmd_list_lock);

    /*
     * Flush again after unjamming. Unjamming calls sgv_pool_free(), which
     * caches the SGV object on the pool LRU instead of freeing it directly.
     * The pre-unjam flush above misses these objects. Without this second
     * flush, dev_user_free_sg_entries() never fires, the alloc_pages
     * ucmd_get() ref is never balanced, and the ucmd stays in ucmd_hash
     * indefinitely, causing dev_user_process_cleanup() to loop forever.
     */
    sgv_pool_flush(dev->pool);
    sgv_pool_flush(dev->pool_clust);

    return res;
}
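
The ordering argument for a single call can be checked with a counter-based sketch. It is a deliberately reduced model, not the driver: struct model_pool, model_flush() and model_unjam_once() are hypothetical, each unjammed command is assumed to park exactly one SGV object, and only what is left in the cache at return is tracked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool state: objects parked in the cache, objects evicted. */
struct model_pool {
    int cached;
    int evicted;
};

/* Models a flush: everything currently cached is evicted. */
static void model_flush(struct model_pool *p)
{
    p->evicted += p->cached;
    p->cached = 0;
}

/*
 * Models one call of the unjam function. Returns how many objects are
 * still cached (each one pinning a command reference) when it returns.
 */
static int model_unjam_once(struct model_pool *p, int cmds_to_unjam,
                            bool flush_after)
{
    assert(cmds_to_unjam >= 0);

    model_flush(p);               /* pre-unjam flush: sees only older objects */
    p->cached += cmds_to_unjam;   /* unjam loop parks one object per command */
    if (flush_after)
        model_flush(p);           /* post-unjam flush evicts what was just parked */
    return p->cached;
}
```

With three commands unjammed, the model leaves three objects cached at return when only the pre-unjam flush runs, and none when the post-unjam flush is added.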

sgv_pool_flush() is fully synchronous — it calls sgv_dtor_and_free()
inline in a while loop, so by the time it returns all eviction callbacks have
already fired. The call chain on eviction:

sgv_pool_flush()
  → dev_user_free_sg_entries()
    → __dev_user_free_sg_entries()
      → ucmd_put()                        ← releases the alloc_pages ref
        → atomic_dec_and_test() → 0 → dev_user_free_ucmd()
          → cmd_remove_hash()             ← ucmd removed from hash

On the next iteration dev_user_unjam_dev() returns res = 0, and
dev_user_process_cleanup() breaks normally — within 2–3 iterations.


Summary

| Item | Detail |
| --- | --- |
| Symptom | dev_user_process_cleanup() loops at ~2M iter/s; soft lockup |
| Stuck ucmd | state=7 (ON_FREE_SKIPPED), ref=1, not in ready list |
| Why ref stays at 1 | sgv_pool_free() caches the SGV on the pool LRU; dev_user_free_sg_entries() never fires; the ucmd_get() from dev_user_alloc_pages() is never balanced |
| Why pre-unjam flush failed | Runs before unjamming; SGV objects are cached during unjamming |
| Fix | sgv_pool_flush() for both pools after the unjam loop |
| Fix size | 2 function calls |

Lesson: Pool Caches Decouple Free from Callback

The SGV pool decouples sgv_pool_free() from the actual page release. Code
that relies on “free → callback → ucmd_put” must account for the callback
firing on eviction, not on free. At teardown time, an explicit
sgv_pool_flush() is required to force eviction and drain all outstanding
references before checking whether the hash is empty.


Tags: #kernel #scst #storage #debugging #linux #memory-management #sgv-pool #reference-counting
