LVM Hangs Under High Concurrency: Root Cause and Fix for Local CSI Volume Drivers
When a large number of pods using local CSI inline volumes are created and deleted concurrently on a single node, LVM commands hang and disk operations on the entire node become unavailable. This post analyzes the root cause, unbounded goroutine concurrency, and describes the fix: a FIFO queue with a bounded worker pool.
Problem
On a node with many pods being created and deleted simultaneously using local CSI inline volumes, LVM commands hang. Inspecting the kernel stack of the blocked process reveals:
    # cat /proc/<pid>/stack
While in this state, LVM is unavailable node-wide, blocking debug operations, metrics export, and any other disk-dependent activity.
Root Cause
Disk operations per pod lifecycle
Each pod creation requires approximately 5 disk operations through the local CSI driver:
- LVM volume creation (lvcreate)
- Volume metadata lookup (lvs)
- Filesystem formatting (mkfs)
- Data wiping (wipefs)
- Discard/TRIM issuing
Each pod deletion similarly requires about 5 disk operations. All of these operations acquire a lock in the Linux block layer.
Unbounded goroutine concurrency
The original CSI driver framework spawned a new goroutine for every incoming CSI RPC request and immediately executed the disk operation without any concurrency limit. With 100 pods being created simultaneously:
    100 pods × 5 disk operations (create) = 500 concurrent disk operations
All 500 operations compete for the same block layer lock. The pressure is transferred directly into the Linux kernel, causing every operation to queue up. The result:
- All pods complete nearly simultaneously at the tail of the queue (not evenly spread)
- LVM becomes completely unavailable during the entire period
- Debug, metrics, and other operations are blocked
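The pattern described above can be reproduced in a small simulation. The following is a minimal sketch, not the driver's actual code: `tracker`, `simulateDiskOp`, and `runUnbounded` are hypothetical names, and a short sleep stands in for a real disk operation such as lvcreate. It starts one goroutine per operation with no limit and reports how many were in flight at the peak.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker counts in-flight operations and remembers the highest value seen.
type tracker struct {
	mu       sync.Mutex
	inFlight int
	peak     int
}

func (t *tracker) enter() {
	t.mu.Lock()
	t.inFlight++
	if t.inFlight > t.peak {
		t.peak = t.inFlight
	}
	t.mu.Unlock()
}

func (t *tracker) leave() {
	t.mu.Lock()
	t.inFlight--
	t.mu.Unlock()
}

// simulateDiskOp is a stand-in for a real disk operation (assumption:
// the sleep only models "time spent holding the device").
func simulateDiskOp(t *tracker) {
	t.enter()
	defer t.leave()
	time.Sleep(20 * time.Millisecond)
}

// runUnbounded launches one goroutine per operation with no concurrency
// limit and returns the peak number of simultaneous operations.
func runUnbounded(pods, opsPerPod int) int {
	t := &tracker{}
	var wg sync.WaitGroup
	for i := 0; i < pods*opsPerPod; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			simulateDiskOp(t)
		}()
	}
	wg.Wait()
	return t.peak
}

func main() {
	fmt.Println("peak concurrent operations:", runUnbounded(100, 5))
}
```

Because nothing throttles the goroutines, the peak tends to approach the total number of operations, which is the burst that reaches the kernel in the real driver.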
Solution
Optimization 1: Reduce redundant disk operations
The driver was refactored to:
- Merge multiple lvs queries into a single batch lookup
- Remove duplicate cleanup calls during volume deletion
This reduces per-pod disk operations from 5 to approximately 4 for both creation and deletion.
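One way to picture the batch lookup is to run lvs once and index its output in memory, so later lookups do not touch the disk again. The sketch below is an illustration under stated assumptions rather than the driver's real implementation: `lvInfo` and `parseLVS` are hypothetical names, and the input is assumed to come from a single call such as `lvs --noheadings --separator , --units b --nosuffix -o vg_name,lv_name,lv_size`.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// lvInfo holds the fields this sketch keeps for one logical volume.
type lvInfo struct {
	VG        string
	Name      string
	SizeBytes uint64
}

// parseLVS parses comma-separated lvs output (vg_name,lv_name,lv_size in
// bytes, no headings) and indexes the volumes by "vg/lv".
func parseLVS(out string) (map[string]lvInfo, error) {
	vols := make(map[string]lvInfo)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text()) // lvs indents its rows
		if line == "" {
			continue
		}
		f := strings.Split(line, ",")
		if len(f) != 3 {
			return nil, fmt.Errorf("unexpected lvs line %q", line)
		}
		size, err := strconv.ParseUint(f[2], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad size in %q: %w", line, err)
		}
		vols[f[0]+"/"+f[1]] = lvInfo{VG: f[0], Name: f[1], SizeBytes: size}
	}
	return vols, sc.Err()
}

func main() {
	// Sample text shaped like the assumed lvs output; not captured from a real node.
	sample := "  vg0,pod-a,1073741824\n  vg0,pod-b,2147483648\n"
	vols, err := parseLVS(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vols), vols["vg0/pod-b"].SizeBytes) // prints: 2 2147483648
}
```

A caller would look up each volume in the returned map instead of issuing a separate lvs command per volume.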
Optimization 2: FIFO queue with bounded worker pool (primary fix)
A bounded FIFO queue was added in front of all disk operations:
    CSI RPC request
- All disk operations are enqueued and processed in order
- A fixed number of workers (N=10 in testing) consume from the queue
- Pod creation/deletion transitions from “concurrent burst” to “smooth pipeline”
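The queue and worker pool described above can be sketched with a buffered channel and a fixed set of goroutines. This is a minimal illustration, not the driver's actual code: `diskOp`, `opQueue`, `Submit`, and `runBounded` are hypothetical names, and a short sleep stands in for the real disk work.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// diskOp is a hypothetical unit of work, e.g. one lvcreate or mkfs call.
type diskOp func()

// opQueue is a FIFO queue drained by a fixed number of workers.
type opQueue struct {
	ops chan diskOp
	wg  sync.WaitGroup
}

// newOpQueue starts `workers` goroutines that take operations off the
// channel in the order they were enqueued. `depth` is the queue capacity.
func newOpQueue(workers, depth int) *opQueue {
	q := &opQueue{ops: make(chan diskOp, depth)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for op := range q.ops {
				op()
			}
		}()
	}
	return q
}

// Submit enqueues op and blocks until a worker has finished running it,
// so an RPC handler can still return its result synchronously.
func (q *opQueue) Submit(op diskOp) {
	done := make(chan struct{})
	q.ops <- func() {
		defer close(done)
		op()
	}
	<-done
}

// Close stops accepting work and waits for the workers to exit.
func (q *opQueue) Close() {
	close(q.ops)
	q.wg.Wait()
}

// runBounded submits `total` operations from `total` goroutines (one per
// simulated RPC) and returns the peak number running at the same time.
func runBounded(total, workers int) int {
	q := newOpQueue(workers, total)
	var mu sync.Mutex
	inFlight, peak := 0, 0
	var callers sync.WaitGroup
	for i := 0; i < total; i++ {
		callers.Add(1)
		go func() {
			defer callers.Done()
			q.Submit(func() {
				mu.Lock()
				inFlight++
				if inFlight > peak {
					peak = inFlight
				}
				mu.Unlock()
				time.Sleep(2 * time.Millisecond) // stand-in for disk work
				mu.Lock()
				inFlight--
				mu.Unlock()
			})
		}()
	}
	callers.Wait()
	q.Close()
	return peak
}

func main() {
	fmt.Println("peak concurrent operations:", runBounded(500, 10))
}
```

However many callers submit work, the peak cannot exceed the worker count. Operations are dequeued in FIFO order, although with more than one worker they may finish slightly out of order.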
Benchmark Results
Test scenario: 100 pods created and deleted concurrently, each using a local CSI inline volume on a single node.
| Metric | Before fix | After fix (workers=10) |
|---|---|---|
| All pods Running | 17 min | 14 min 40s |
| All pods Terminated | 17 min | 8 min |
| Completion distribution | Concentrated at end | Evenly spread |
| LVM availability | Unavailable (hung) | Available throughout |
After the fix:
- Pod termination time reduced by 53% (17 min → 8 min)
- LVM remains available at all times — no impact on debugging or metrics
- Pod completion times spread evenly rather than clustering at the end
Key Design Insight
Limiting concurrency matters more than reducing per-operation count.
Cutting from 5 to 4 disk operations per pod is a marginal improvement. The real problem is lock contention in the kernel block layer, whose cost grows roughly as O(N²) in the number of concurrent operations: the more operations are in flight, the longer each one waits, so every additional concurrent operation degrades all the others and lengthens the total time.
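A back-of-the-envelope model shows where the quadratic growth comes from. It assumes, purely for illustration, that all $N$ operations arrive at once, that the lock serializes them strictly, and that each holds the lock for the same time $t$. Neither assumption is measured in this post.

```latex
% The k-th operation to acquire the lock waits for the k-1 ahead of it.
w_k = (k-1)\,t

% Summed over all N operations, the total time spent waiting is quadratic in N.
W_{\text{total}} = \sum_{k=1}^{N} (k-1)\,t = \frac{N(N-1)}{2}\,t

% For the burst of N = 500 operations in the test scenario:
W_{\text{total}} = \frac{500 \cdot 499}{2}\,t = 124{,}750\,t
```

Under the same assumptions, an unrelated command issued in the middle of the burst (a debugging lvs, for example) could wait behind as many as $N-1 = 499$ operations. With at most 10 operations admitted to the kernel at a time, it waits behind at most 10.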
Introducing the FIFO queue with a bounded worker pool shifts the queuing from kernel space to user space. Instead of hundreds of goroutines all hammering the block layer simultaneously, at most N operations are in flight at any time. This converts an unpredictable burst pattern into a steady, predictable throughput, with substantially better overall latency.