Cgroup V2 IO Controller Benchmarks: io.max, io.weight, and iocost in Practice

Posted on 2026-06-20 Edited on 2026-06-23 In Linux Kernel

Systematic benchmarks of the Cgroup V2 IO controller on production-grade Kubernetes nodes, covering io.max (hard bandwidth/IOPS limits), io.weight (proportional scheduling), and io.cost.qos (latency-based QoS), across Direct IO vs Buffer IO, raw disk vs LVM, and ext4 vs xfs combinations.

Test Environment

OS: Ubuntu 22.04 LTS (kernel 5.15)
Kubernetes: v1.23
Storage: 3 × 1.7T SSDs (sda/sdb/sdc), LVM VG spanning all three disks
Tools: fio + custom IO statistics scripts

Disk layout:

sda  1.7T
 └─sda4  1.3T  → LVM PV
sdb  1.7T
 └─sdb1  1.7T  → LVM PV
sdc  1.7T
 └─sdc1  1.7T  → LVM PV

VG: vg10000 (composed of sda4 + sdb1 + sdc1)

Base fio command:

# Buffered (direct=0) 4K sequential write against the XFS mount
fio --filename=/mnt/test/xfs/testfile \
    --direct=0 --iodepth=64 --thread --rw=write \
    --ioengine=libaio --bs=4k --size=20G \
    --numjobs=1 --group_reporting --runtime=100 \
    --name=mytest

Baseline Performance (No Throttling)

FS / IO mode (4K sequential)	Read BW	Read IOPS	Write BW	Write IOPS
Ext4 Direct	359 MB/s	89,887	391 MB/s	97,825
Ext4 Buffer	336 MB/s	84,110	728 MB/s	182,086
XFS Direct	360 MB/s	89,997	400 MB/s	100,005
XFS Buffer	300 MB/s	74,910	780 MB/s	194,982

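Buffer IO write bandwidth is significantly higher than Direct IO (728 vs 391 MB/s on ext4, 780 vs 400 MB/s on XFS) due to page cache write aggregation. Read bandwidth shows no such gain and is slightly lower with Buffer IO.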

io.max Throttling Tests (Case 1)

Single volume, single thread, io.max applied to a single device.

Case	IO Type	FS	Layer	Config	Actual Write BW	Actual Write IOPS	Disk Util
1-1	Direct	ext4	raw	wbps=100MB/s	100 MB/s	21,113	31%
1-2	Direct	xfs	raw	wbps=100MB/s	100 MB/s	13,382	28%
1-3	Buffer	ext4	raw	wbps=100MB/s	100 MB/s	90	25%
1-4	Buffer	xfs	raw	wbps=100MB/s	100 MB/s	90	25%
1-5	Direct	xfs	raw	wiops=1000	4 MB/s	1,000	100%
1-6	Buffer	xfs	raw	wiops=10	12.6 MB/s	14.5	3%
1-7	Direct	xfs	LVM	wiops=1000	0.4 MB/s	100	4%
1-8	Buffer	xfs	LVM	wiops=100	475 MB/s	2	1.6%
1-9	Direct	xfs	LVM	wbps=100MB/s	100 MB/s	25,570	28%
1-10	Buffer	xfs	LVM	wbps=100MB/s	100 MB/s	2	20%
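
For reference, the limits in the Config column correspond to lines of the form `MAJ:MIN key=value` written to a cgroup's io.max file. The sketch below only builds such a line and does not touch the system. The helper name `io_max_line`, the device number 8:0 (typical for sda, but worth checking with `lsblk -d -n -o MAJ:MIN /dev/sda`), and the reading of 100MB/s as 100 MiB/s are assumptions, not details from the test setup.

```shell
# Hypothetical helper: print one io.max line for a device.
# $1 = MAJ:MIN, $2 = key (rbps|wbps|riops|wiops), $3 = value (bytes/s, IO/s, or "max")
io_max_line() {
    printf '%s %s=%s\n' "$1" "$2" "$3"
}

# wbps limit as in cases 1-1 to 1-4, assuming 100MB/s means 100 MiB/s
io_max_line 8:0 wbps $((100 * 1024 * 1024))
# wiops limit as in case 1-5
io_max_line 8:0 wiops 1000
```

To apply a limit, the printed line would be redirected into the target cgroup's io.max file as root, with the io controller enabled for that cgroup in its parent's cgroup.subtree_control.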

Key Finding: Buffer IO IOPS throttling is inaccurate

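Case 1-8 (Buffer IO, LVM, wiops=100): the measured rate was only 2 IOPS, yet bandwidth reached 475 MB/s. At the 4K block size set in fio, a 100 IOPS limit would allow roughly 0.4 MB/s, so the limit did not constrain throughput in any meaningful way. Case 1-6 is similar: with wiops=10, the disk saw 14.5 IOPS and 12.6 MB/s.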

Reason: Buffer IO reaches the disk through dirty page writeback, and the size of each writeback request is determined by the kernel’s writeback aggregation policy, not by the 4K block size set in fio. With sufficient memory, many small writes are coalesced into a few large ones, so device-level IOPS can sit well below the configured limit while bandwidth far exceeds what that limit would permit at 4K. With less memory, writeback runs more often and the requests are smaller.
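
The coalescing can be made visible by dividing measured bandwidth by measured IOPS, which gives the implied average request size at the device. This is a rough sketch: the helper name `avg_req_kb` is hypothetical, and MB is treated as 10^6 bytes, which may not match the unit the measurement tools reported.

```shell
# Hypothetical helper: implied average request size in KB.
# $1 = bandwidth in MB/s, $2 = IOPS; MB is treated as 10^6 bytes (an assumption).
avg_req_kb() {
    awk -v bw="$1" -v iops="$2" 'BEGIN { printf "%.0f\n", bw * 1000 / iops }'
}

avg_req_kb 100 90      # case 1-3 (Buffer, ext4, wbps=100MB/s)
avg_req_kb 12.6 14.5   # case 1-6 (Buffer, xfs, wiops=10)
```

Both cases come out at roughly 1 MB per request (about 1111 KB and 869 KB), against the 4 KB writes fio issued.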

Conclusion: io.max BPS throttling (wbps) is accurate for Buffer IO. IOPS throttling (wiops) is unreliable for Buffer IO.

Cross-Device and Multi-Volume Tests (Case 2)

Case	Config	Device 1 Actual	Device 2 Actual	LVM Total
2-1	dev1=100MB/s, dev2=unlimited	104 MB/s, 83 IOPS	489 MB/s, 446 IOPS	IOPS fluctuates [0,100]
2-2	both devices 100MB/s	100 MB/s, 100 IOPS	100 MB/s, 100 IOPS	100 MB/s, [5,70] IOPS

Throttling on LVM volumes spanning multiple devices is applied at the underlying device granularity. LVM-level IOPS fluctuates due to striping across devices.
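
Because limits apply per underlying device, a logical volume needs one io.max line for each disk beneath it. The sketch below filters `MAJ:MIN TYPE` rows, such as those printed by `lsblk -s -r -n -o MAJ:MIN,TYPE` for a logical volume, down to whole disks and emits a wbps line for each. The helper name `per_disk_limits` and the sample device numbers are illustrative assumptions, not values from the test nodes.

```shell
# Hypothetical helper: read "MAJ:MIN TYPE" rows on stdin and print one
# io.max wbps line per whole disk. $1 = limit in bytes per second.
per_disk_limits() {
    awk -v lim="$1" '$2 == "disk" { printf "%s wbps=%s\n", $1, lim }'
}

# Sample rows standing in for lsblk output on a two-disk logical volume.
printf '%s\n' '253:0 lvm' '8:4 part' '8:0 disk' '8:17 part' '8:16 disk' |
    per_disk_limits 104857600
```

Each printed line would then be written to io.max separately, one device per write.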

io.weight Proportional Scheduling Tests (Case 3)

Two LVM volumes, two fio threads, testing io.weight proportional scheduling.

Case	Scenario	Volume 1	Volume 2	Total
3-1.1	Buffer, no limit	250 MB/s	250 MB/s	500 MB/s, 100% util
3-1.2	Direct, no limit	220 MB/s	220 MB/s	440 MB/s, 100% util
3-2	Direct, weight 100:300, user qos	206 MB/s	209 MB/s	420 MB/s
3-4	Buffer, weight 100:300, user qos	129 MB/s	368 MB/s	500 MB/s
3-6	Direct, weight 100:900, user qos	153 MB/s	155 MB/s	304 MB/s
3-8	Buffer, weight 100:900, user qos	57 MB/s	442 MB/s	500 MB/s

Key Finding: Direct IO io.weight is ineffective under default io.cost.qos

Cases 3-2 and 3-6: weights set to 100:300 and 100:900, but actual bandwidth is nearly equal (~206 vs ~209, ~153 vs ~155).

Reason: The default io.cost.qos automatically adjusts weights based on latency (ctrl=auto). When Direct IO saturates the disk at 100% utilization, the kernel overrides user-configured weights to meet latency targets.

Buffer IO weight scheduling works correctly (3-4: 129 vs 368 ≈ 1:2.9; 3-8: 57 vs 442 ≈ 1:7.8), close to the expected 1:3 and 1:9 ratios.
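
The achieved ratios can be recomputed directly from the table. A minimal sketch, with `bw_ratio` as a hypothetical helper name:

```shell
# Hypothetical helper: achieved ratio of volume 2 to volume 1 bandwidth.
# $1 = volume 1 MB/s, $2 = volume 2 MB/s
bw_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

bw_ratio 129 368   # case 3-4, Buffer, configured 1:3
bw_ratio 57 442    # case 3-8, Buffer, configured 1:9
bw_ratio 206 209   # case 3-2, Direct, configured 1:3
```

The Buffer IO cases land at 2.85 and 7.75, while the Direct IO case stays at about 1.01, which is the near-equal split described above.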

Conclusion: io.weight works well for Buffer IO. For Direct IO at full disk saturation, iocost automatic adjustment must be disabled (set min=max=100) for weights to take effect.
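
As a sketch of what min=max=100 looks like in practice, the helper below prints an io.cost.qos line that pins the scaling range to 100%. The helper name `io_cost_qos_line`, the device number 8:0, and the explicit `ctrl=user` are assumptions; the exact line used in these tests is not given.

```shell
# Hypothetical helper: print an io.cost.qos line with the scaling range
# pinned to 100%. $1 = MAJ:MIN. ctrl=user is an assumption, chosen so the
# user-supplied min/max are the ones in effect.
io_cost_qos_line() {
    printf '%s enable=1 ctrl=user min=100.00 max=100.00\n' "$1"
}

io_cost_qos_line 8:0
```

The io.cost.qos file exists only in the root cgroup, so the line would be written to /sys/fs/cgroup/io.cost.qos as root, once per device.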

iocost QoS Precision Control Tests (Case 6)

iocost auto-adjustment is disabled (min=max=100), and the two volumes run at weight 100:900 unless the case states otherwise.

Case	Additional Config	Volume 1	Volume 2	Total
6-1	No io.max	22 MB/s, 83% util	309 MB/s, 85% util	322 MB/s
6-2	No io.max, weight 100:300	127 MB/s	291 MB/s	405 MB/s
6-3	io.max 150MB/s each	18 MB/s	145 MB/s	168 MB/s
6-4	io.max 100MB/s each	67 MB/s	102 MB/s	164 MB/s
6-5	v2 limit 10MB/s	48 MB/s	10 MB/s	58 MB/s

Case 6-1 with weight 100:900 shows an actual ratio of ~1:14 (22 vs 309), overshooting the expected 1:9. Disk utilization is ~85% in this run, and iocost still skews the allocation somewhat at that load. Even so, with auto-adjustment disabled (min=max=100) the split follows the configured weights far more closely than the near-equal split of Case 3-6 (153 vs 155 MB/s).

Three-Volume Weight Tests (Case 7)

Three LVM volumes with different weights, io.cost.qos set to min=max=100 (auto-adjustment disabled).

Case	Weight Config	Vol 1	Vol 2	Vol 3	Total
7-1	100:900:900	38 MB/s	6 MB/s	191 MB/s	235 MB/s
7-3	100:900:900, v2 limit 10MB/s	10 MB/s	9.6 MB/s	153 MB/s	172 MB/s
7-4	100:900:900, ctrl=auto	20 MB/s	112 MB/s	122 MB/s	254 MB/s
7-5	100:300:900	9.7 MB/s	30 MB/s	155 MB/s	194 MB/s
7-6	100:300:600	14 MB/s	43 MB/s	118 MB/s	175 MB/s
7-9	No weights (baseline)	121 MB/s	119 MB/s	119 MB/s	359 MB/s

Findings:

Case 7-4 with ctrl=auto: vol1 (weight 100) got 20 MB/s against 112 MB/s and 122 MB/s for vol2 and vol3 (weight 900 each), roughly 1:5.6 and 1:6.1 rather than the configured 1:9. Auto-adjustment overrides configured weights.
Enabling io.weight reduces total bandwidth (235 MB/s in Case 7-1 vs 359 MB/s in the unweighted baseline, Case 7-9). io.cost.qos can partially mitigate but not fully eliminate this overhead.
Case 7-2 (ctrl=auto, no min=max) caused a CPU soft lockup on the node, terminating the test.
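
To judge how closely a run follows its weights, it helps to compute the split that the weights alone would predict from the measured total. A minimal sketch, with `expected_split` as a hypothetical helper name:

```shell
# Hypothetical helper: split a total bandwidth in proportion to weights.
# $1 = total MB/s, remaining arguments = weights.
expected_split() {
    total=$1
    shift
    echo "$@" | awk -v total="$total" '{
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i
        for (i = 1; i <= NF; i++)
            printf "%.1f%s", total * $i / sum, (i < NF ? " " : "\n")
    }'
}

expected_split 175 100 300 600   # case 7-6; measured 14 / 43 / 118 MB/s
expected_split 194 100 300 900   # case 7-5; measured 9.7 / 30 / 155 MB/s
```

For Case 7-6 the prediction is 17.5 / 52.5 / 105.0 MB/s, so the measured values lean toward the heaviest volume in the same way as Case 6-1.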

Large-Scale Pod Tests (Case 5 & 9)

100 pods, 10 fio threads each, testing the impact of IO scheduler and iocost on node-wide IO behavior.

Case 5: IO Scheduler Comparison

Scenario	Load Avg	CPU	D-state Procs	Node IO PSI	Total IO BW
Unlimited (mq-deadline)	900	100%	30	910ms	300 MB/s
io.max throttled (mq-deadline)	548	68.4%	0	450ms	200 MB/s
io.max throttled (bfq)	700	93.4%	0	791ms	150 MB/s

Case 9: Container Count vs IO Throughput

Write throughput as container count increases, across different schedulers and control strategies:

Scenario	1 container	2 containers	6 containers	18 containers	50 containers
SSD xfs, bfq	500 MB/s	500 MB/s	220 MB/s	220 MB/s	300 MB/s
SSD xfs, mq-deadline	500 MB/s	500 MB/s	500 MB/s	500 MB/s	500 MB/s
SSD xfs, bfq+iocost	250 MB/s	300 MB/s	340 MB/s	16~490 MB/s	35~491 MB/s
SSD xfs, bfq+ioweight	40 MB/s	42 MB/s	65 MB/s	500 MB/s	500 MB/s
NVMe xfs, bfq	830 MB/s	900 MB/s	810 MB/s	850 MB/s	750 MB/s

Key Findings:

BFQ throughput degrades significantly under multiple containers: On SSD, throughput drops from 500 MB/s to 220 MB/s at 6 containers, while mq-deadline holds a stable 500 MB/s throughout.
iocost causes IO instability: With bfq+iocost, throughput swings between 16 and 490 MB/s at 18 containers (35 to 491 MB/s at 50), with prolonged periods of very low IO before recovering.
BFQ is relatively stable on NVMe: Throughput stays between 750 and 900 MB/s from 1 to 50 containers. NVMe’s high IOPS characteristics reduce BFQ’s scheduling overhead, making throughput degradation under multiple containers more acceptable.
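
When reproducing these comparisons, it is worth confirming which scheduler a disk is actually using. The kernel lists the available schedulers in /sys/block/<dev>/queue/scheduler and marks the active one with brackets. The sketch below extracts that entry; the helper name `active_scheduler` and the sample line are illustrative assumptions.

```shell
# Hypothetical helper: extract the active scheduler (the bracketed entry)
# from the contents of /sys/block/<dev>/queue/scheduler read on stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Sample content; on a real node the input would come from the sysfs file,
# e.g. active_scheduler < /sys/block/sda/queue/scheduler
echo '[mq-deadline] kyber bfq none' | active_scheduler
```

Switching schedulers is done by writing the desired name, such as bfq or mq-deadline, to the same sysfs file as root.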

Summary and Recommendations

Feature	Scenario	Recommendation
`io.max` BPS throttling	Direct IO / Buffer IO	Reliable — use directly
`io.max` IOPS throttling	Direct IO	Reliable
`io.max` IOPS throttling	Buffer IO	Unreliable — avoid
`io.weight`	Buffer IO	Works well
`io.weight`	Direct IO at full disk load	Requires `min=max=100` to disable iocost auto-adjustment
`io.cost.qos ctrl=auto`	Any scenario	Can cause CPU soft lockups under high-concurrency IO — use with caution in production
IO scheduler	High-concurrency multi-container	mq-deadline more stable than bfq; bfq suitable when weight scheduling is required

Core conclusion: Cgroup V2’s IO controller is powerful, but the mechanisms interact with each other (iocost can override io.weight; BFQ degrades under high concurrency). For production environments, start with io.max BPS throttling, validate thoroughly, then incrementally introduce weight scheduling and iocost.