LVM Hangs Under High Concurrency: Root Cause and Fix for Local CSI Volume Drivers

When many pods using local CSI inline volumes are created and deleted concurrently on a single node, LVM commands hang and disk operations become unavailable across the entire node. This post analyzes the root cause (unbounded goroutine concurrency) and describes the fix: a FIFO queue with a bounded worker pool.

Problem

On a node with many pods being created and deleted simultaneously using local CSI inline volumes, LVM commands hang. Inspecting the kernel stack of the blocked process reveals:

# cat /proc/<pid>/stack
[<0>] submit_bio_wait+0x7f/0xc0
[<0>] blkdev_issue_discard+0x7e/0xd0
[<0>] blk_ioctl_discard+0x110/0x140
[<0>] blkdev_common_ioctl+0x3fc/0x890
[<0>] blkdev_ioctl+0xf6/0x270
[<0>] block_ioctl+0x46/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x5c/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

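The stack shows the process blocked in submit_bio_wait, waiting for a discard request issued through an ioctl on the block device to complete. While in this state, LVM is unavailable node-wide, blocking debug operations, metrics export, and any other disk-dependent activity.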
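To find the blocked process in the first place, one option is to scan /proc for processes in uninterruptible sleep and dump their kernel stacks. The sketch below is only an illustration: it assumes the hung LVM command shows up in state D, and the helper name parseState is made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// parseState extracts the process name and state letter from the contents of
// /proc/<pid>/stat. The name is wrapped in parentheses and may itself contain
// spaces or parentheses, so the split point is the last ')'.
func parseState(stat string) (comm string, state byte, ok bool) {
	open := strings.IndexByte(stat, '(')
	end := strings.LastIndexByte(stat, ')')
	if open < 0 || end < open || end+2 >= len(stat) {
		return "", 0, false
	}
	return stat[open+1 : end], stat[end+2], true
}

func main() {
	statFiles, _ := filepath.Glob("/proc/[0-9]*/stat")
	for _, statFile := range statFiles {
		data, err := os.ReadFile(statFile)
		if err != nil {
			continue // the process exited while we were scanning
		}
		comm, state, ok := parseState(string(data))
		if !ok || state != 'D' {
			continue
		}
		dir := filepath.Dir(statFile)
		fmt.Printf("pid %s (%s) is in uninterruptible sleep\n", filepath.Base(dir), comm)
		// Reading the kernel stack normally requires root.
		if stack, err := os.ReadFile(filepath.Join(dir, "stack")); err == nil {
			fmt.Print(string(stack))
		}
	}
}
```

Run as root on the affected node, this should list the hung command together with a stack like the one above.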


Root Cause

Disk operations per pod lifecycle

Each pod creation requires approximately 5 disk operations through the local CSI driver:

  • LVM volume creation (lvcreate)
  • Volume metadata lookup (lvs)
  • Filesystem formatting (mkfs)
  • Data wiping (wipefs)
  • Discard/TRIM issuing

Each pod deletion similarly requires about 5 disk operations. All of these operations acquire a lock in the Linux block layer.

Unbounded goroutine concurrency

The original CSI driver framework spawned a new goroutine for every incoming CSI RPC request and immediately executed the disk operation without any concurrency limit. With 100 pods being created simultaneously:

100 pods × 5 disk operations (create) = 500 concurrent disk operations

All 500 operations compete for the same block layer lock. The pressure is transferred directly into the Linux kernel, causing every operation to queue up. The result:

  • All pods complete nearly simultaneously at the tail of the queue (not evenly spread)
  • LVM becomes completely unavailable during the entire period
  • Debug, metrics, and other operations are blocked
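The goroutine-per-request pattern described above can be reproduced in a few lines. This is a minimal sketch, not the driver's actual code: peakUnbounded and slowOp are hypothetical names, and slowOp models disk operations slow enough that all of them overlap.

```go
package main

import (
	"fmt"
	"sync"
)

// peakUnbounded mimics a goroutine-per-request server: every request starts
// its disk operation immediately. It returns the largest number of operations
// that were in flight at the same moment.
func peakUnbounded(requests int, diskOp func()) int {
	var (
		mu             sync.Mutex
		wg             sync.WaitGroup
		inFlight, peak int
	)
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()

			diskOp() // nothing limits how many goroutines reach this point

			mu.Lock()
			inFlight--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

// slowOp returns a placeholder operation that does not finish until n callers
// have started it, modelling operations slow enough to overlap completely.
func slowOp(n int) func() {
	var started sync.WaitGroup
	started.Add(n)
	return func() {
		started.Done()
		started.Wait()
	}
}

func main() {
	const pods, opsPerPod = 100, 5
	total := pods * opsPerPod
	fmt.Println("peak concurrent disk operations:", peakUnbounded(total, slowOp(total)))
}
```

With the numbers from the example above, the reported peak is 500: every operation is in flight at once, and the only queue left is the one inside the kernel.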

Solution

Optimization 1: Reduce redundant disk operations

The driver was refactored to:

  • Merge multiple lvs queries into a single batch lookup
  • Remove duplicate cleanup calls during volume deletion

This reduces per-pod disk operations from 5 to approximately 4 for both creation and deletion.
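As an illustration of the batch lookup, the sketch below runs lvs once for a whole volume group and indexes the result in memory, so later lookups need no further disk operations. It assumes the JSON layout printed by `lvs --reportformat json`; the function names, the volume group vg0, and the sample volumes are invented for this example and are not taken from the driver.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// lvsReport mirrors the JSON layout printed by `lvs --reportformat json`.
type lvsReport struct {
	Report []struct {
		LV []struct {
			Name string `json:"lv_name"`
			VG   string `json:"vg_name"`
			Size string `json:"lv_size"`
		} `json:"lv"`
	} `json:"report"`
}

// parseLVs indexes every reported logical volume as "vg/lv" -> size.
func parseLVs(out []byte) (map[string]string, error) {
	var r lvsReport
	if err := json.Unmarshal(out, &r); err != nil {
		return nil, err
	}
	sizes := make(map[string]string)
	for _, rep := range r.Report {
		for _, lv := range rep.LV {
			sizes[lv.VG+"/"+lv.Name] = lv.Size
		}
	}
	return sizes, nil
}

// listLVs issues a single lvs call for the whole volume group instead of one
// call per volume. Sizes are reported in bytes without a unit suffix.
func listLVs(vg string) (map[string]string, error) {
	out, err := exec.Command("lvs", "--reportformat", "json",
		"--units", "b", "--nosuffix", "-o", "lv_name,vg_name,lv_size", vg).Output()
	if err != nil {
		return nil, err
	}
	return parseLVs(out)
}

func main() {
	// Sample output so the example runs without LVM installed; on a real
	// node, call listLVs("vg0") instead.
	sample := []byte(`{"report":[{"lv":[
		{"lv_name":"pod-a","vg_name":"vg0","lv_size":"1073741824"},
		{"lv_name":"pod-b","vg_name":"vg0","lv_size":"2147483648"}]}]}`)
	sizes, err := parseLVs(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("volumes found:", len(sizes), "size of vg0/pod-a:", sizes["vg0/pod-a"])
}
```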

Optimization 2: FIFO queue with bounded worker pool (primary fix)

A FIFO queue drained by a bounded worker pool was added in front of all disk operations:

CSI RPC request
        │
        ▼
FIFO disk operation queue (unbounded enqueue, ordered dequeue)
        │
        ├── Worker 1 ──► disk operation
        ├── Worker 2 ──► disk operation
        ├── ...
        └── Worker N ──► disk operation (N = 10)

  • All disk operations are enqueued and processed in order
  • A fixed number of workers (N=10 in testing) consume from the queue
  • Pod creation/deletion transitions from “concurrent burst” to “smooth pipeline”
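A minimal sketch of such a queue is shown below. It is one possible shape, not the driver's actual implementation: opQueue, Submit, Close, and measurePeak are hypothetical names, and a 1 ms sleep stands in for a real disk operation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// opQueue is a FIFO of pending operations drained by a fixed set of workers.
type opQueue struct {
	mu      sync.Mutex
	ready   *sync.Cond
	pending []func()
	closed  bool
	workers sync.WaitGroup
}

func newOpQueue(workers int) *opQueue {
	q := &opQueue{}
	q.ready = sync.NewCond(&q.mu)
	q.workers.Add(workers)
	for i := 0; i < workers; i++ {
		go q.work()
	}
	return q
}

// Submit appends op to the tail of the queue without blocking the caller and
// returns a channel that receives op's result. Do not call it after Close.
func (q *opQueue) Submit(op func() error) <-chan error {
	done := make(chan error, 1)
	q.mu.Lock()
	q.pending = append(q.pending, func() { done <- op() })
	q.mu.Unlock()
	q.ready.Signal()
	return done
}

// Close lets the workers drain what is already queued, then waits for them.
func (q *opQueue) Close() {
	q.mu.Lock()
	q.closed = true
	q.mu.Unlock()
	q.ready.Broadcast()
	q.workers.Wait()
}

func (q *opQueue) work() {
	defer q.workers.Done()
	for {
		q.mu.Lock()
		for len(q.pending) == 0 && !q.closed {
			q.ready.Wait()
		}
		if len(q.pending) == 0 { // closed and fully drained
			q.mu.Unlock()
			return
		}
		run := q.pending[0] // always take the oldest operation first
		q.pending = q.pending[1:]
		q.mu.Unlock()
		run()
	}
}

// measurePeak pushes ops placeholder operations through a pool of the given
// size and returns the highest number that ran at the same time.
func measurePeak(workers, ops int) int {
	q := newOpQueue(workers)
	var mu sync.Mutex
	inFlight, peak := 0, 0
	for i := 0; i < ops; i++ {
		q.Submit(func() error {
			mu.Lock()
			inFlight++
			if inFlight > peak {
				peak = inFlight
			}
			mu.Unlock()
			time.Sleep(time.Millisecond) // stand-in for lvcreate, mkfs, ...
			mu.Lock()
			inFlight--
			mu.Unlock()
			return nil
		})
	}
	q.Close() // returns once every queued operation has finished
	return peak
}

func main() {
	peak := measurePeak(10, 500)
	fmt.Println("peak in-flight operations:", peak, "(never more than 10)")
}
```

An RPC handler would call Submit and then wait on the returned channel, so each request still receives its own result while the pool size caps how many operations reach the kernel at once. The pending slice is unbounded, matching the "unbounded enqueue" in the diagram; a production version would likely also want cancellation and a guard against Submit after Close.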

Benchmark Results

Test scenario: 100 pods created and deleted concurrently, each using a local CSI inline volume on a single node.

Metric                    Before fix            After fix (workers=10)
All pods Running          17 min                14 min 40 s
All pods Terminated       17 min                8 min
Completion distribution   Concentrated at end   Evenly spread
LVM availability          Unavailable (hung)    Available throughout

After the fix:

  • Time until all pods are Running reduced by about 14% (17 min → 14 min 40 s)
  • Pod termination time reduced by 53% (17 min → 8 min)
  • LVM remains available at all times — no impact on debugging or metrics
  • Pod completion times spread evenly rather than clustering at the end

Key Design Insight

Limiting concurrency matters more than reducing per-operation count.

Cutting from 5 to 4 disk operations per pod is a marginal improvement. The real problem is lock contention in the kernel block layer, which grows roughly as O(N²): with N operations in flight, each one waits behind the others, so the wait per operation grows with N and the total wait grows with N². Each additional concurrent operation degrades all the others.

Introducing the FIFO queue with a bounded worker pool shifts the queuing from kernel space to user space. Instead of hundreds of goroutines hammering the block layer at once, at most N operations are in flight at any time. This converts an unpredictable burst into steady, predictable throughput, and in the benchmark above it also shortened both total creation time and total termination time.