Question: wa usage

When you run top, you can see the wa value in the CPU summary line. What does it mean and how is it calculated?

%Cpu(s):  0.1 us,  2.4 sy,  0.0 ni, 93.6 id,  3.9 wa,  0.0 hi,  0.0 si,  0.0 st

Values related to processor utilization are displayed on the third line. They provide insight into exactly what the CPUs are doing.

  • us is the percent of time spent running user processes.
  • sy is the percent of time spent running the kernel.
  • ni is the percent of time spent running processes with manually configured nice values.
  • id is the percent of time idle (if low, the CPU may be overworked).
  • wa is the percent of wait time (if high, CPU is waiting for I/O access).
  • hi is the percent of time managing hardware interrupts.
  • si is the percent of time managing software interrupts.
  • st is the percent of virtual CPU time waiting for access to physical CPU.

In the context of the top command in Unix-like operating systems, the “wa” field represents the percentage of time the CPU spends waiting for I/O operations to complete. Specifically, it stands for “waiting for I/O.”

The calculation includes the time the CPU is idle while waiting for data to be read from or written to storage devices such as hard drives or SSDs. High values in the “wa” field may indicate that the system is spending a significant amount of time waiting for I/O operations to complete, which could be a bottleneck if not addressed.

The formula for calculating the “wa” value is:

wa % = (time spent waiting for I/O / total CPU time) × 100
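As a minimal sketch of this formula (my own example, not how top itself computes it), you can sample the aggregate cpu line of /proc/stat twice and take the iowait share of the elapsed ticks; field 6 of that line is iowait, and the sum of all fields is the total CPU time in ticks:

// rough bash sketch: print the iowait percentage over a 5 second interval
$ snap() { awk '/^cpu /{io=$6; for(i=2;i<=NF;i++) tot+=$i; print io, tot}' /proc/stat; }
$ read io1 tot1 < <(snap); sleep 5; read io2 tot2 < <(snap)
$ echo "wa% over the last 5s ~ $(( 100 * (io2 - io1) / (tot2 - tot1) ))"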

We can also use vmstat 1 to watch the wa state of the system; its CPU columns are likewise percentages of total CPU time, and the b column additionally shows the number of processes blocked in uninterruptible sleep (a count, not a percentage):

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 9 0 25994684 24 64404172 0 0 1 16 0 0 4 3 93 0 0
0 10 0 25994436 24 64404172 0 0 342776 341104 63906 52162 2 10 83 4 3
0 10 0 25994188 24 64404172 0 0 350392 350968 65593 52742 2 12 80 3 3
1 9 0 25994188 24 64404172 0 0 342656 341128 63793 51367 2 11 80 3 4
0 10 0 25994188 24 64404172 0 0 350760 350216 65439 52811 2 10 81 3 4
5 8 0 25993940 24 64404172 0 0 355224 358952 66762 52421 2 10 80 3 5
0 10 0 25993444 24 64404172 0 0 350536 350576 64998 52346 1 10 81 3 4
1 10 0 25992948 24 64404172 0 0 348344 347912 64709 51990 1 11 81 3 4
2 10 0 25992700 24 64404172 0 0 351944 349616 65443 51371 2 11 80 3 5
0 10 0 26000764 24 64404172 0 0 343368 345000 64589 51969 2 12 80 3 4
1 9 0 26000516 24 64404172 0 0 348976 352352 65800 53079 1 9 84 3 3
1 9 0 26000516 24 64404172 0 0 345520 341800 63737 51454 1 8 84 3 3
1 9 0 26000516 24 64404172 0 0 349192 352336 65719 52241 2 9 82 3 4
1 9 0 26000516 24 64404172 0 0 352384 355696 66567 52232 2 11 80 3 4
8 5 0 26000268 24 64404172 0 0 346368 345496 64541 52114 2 12 80 3 3
0 10 0 26000020 24 64404172 0 0 340064 341296 63603 51670 2 11 81 3 2
0 0 0 26007496 24 64399156 0 0 306736 306040 57770 46495 2 10 82 3 3
0 0 0 26007588 24 64398828 0 0 0 0 853 1480 0 0 100 0 0
0 0 0 26007588 24 64398828 0 0 0 0 871 1553 0 0 100 0 0
0 0 0 26007588 24 64398828 0 0 0 0 748 1430 0 0 100 0 0
0 0 0 26007588 24 64398828 0 0 0 2 731 1399 0 0 100 0 0
0 0 0 26007588 24 64398828 0 0 0 0 695 1319 0 0 100 0 0
0 0 0 26007588 24 64398828 0 0 0 0 743 1418 0 0 100 0 0
Procs
r: The number of processes waiting for run time.
b: The number of processes in uninterruptible sleep.
Memory
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)
Swap
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).
IO
bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).
System
in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.
CPU
These are percentages of total CPU time.
us: Time spent running non-kernel code. (user time, including nice time)
sy: Time spent running kernel code. (system time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.

In summary, a high “wa” value in the output of top suggests that your system is spending a considerable amount of time waiting for I/O operations to complete, which can hurt overall system performance. Investigating and optimizing disk I/O can help in such cases, for example by improving disk performance, tuning the file system, or identifying and addressing specific I/O-intensive processes. We can use iotop to find the processes doing heavy or slow I/O on the system:
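For example, a common invocation looks like the following (flag availability depends on your iotop version, so treat these options as assumptions about your build):

// -o: only show tasks that are actually doing I/O
// -P: aggregate per process instead of per thread
// -d 5: refresh every 5 seconds
# iotop -o -P -d 5

The capture below (taken without those filters) shows ten fio jobs doing 8k random read/write on the device: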

Current DISK READ:     341.78 M/s | Current DISK WRITE:     340.53 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
3487445 be/4 root 34.44 M/s 33.36 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487446 be/4 root 33.40 M/s 34.27 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487447 be/4 root 34.10 M/s 33.78 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487448 be/4 root 32.21 M/s 31.64 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487449 be/4 root 34.22 M/s 34.44 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487450 be/4 root 34.46 M/s 34.33 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487451 be/4 root 34.17 M/s 34.21 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487452 be/4 root 33.05 M/s 32.95 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487453 be/4 root 34.23 M/s 34.74 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
3487454 be/4 root 37.51 M/s 36.79 M/s ?unavailable? fio -filename=/mnt/hostroot/var/mnt/testfile -direct=1 -iodepth 64 -thread -rw=randrw -ioengine=libaio -bs=8k -size=20G -numjobs=10 -group_reporting --runtime=100 -name=mytest
1 be/4 root 0.00 B/s 0.00 B/s ?unavailable? systemd --switched-root --system --deserialize 29
2 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [rcu_gp]
4 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [rcu_par_gp]
6 be/4 root 0.00 B/s 0.00 B/s ?unavailable? [kworker/0:0H-events_highpri]

Code tracing

Running top under strace shows that it reads these values from /proc/stat. /proc/stat is produced by the proc ops below, which obtain the iowait time from kernel_cpustat.cpustat[CPUTIME_IOWAIT], according to the code in fs/proc/stat.c.
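One way to see this for yourself is to run a single batch iteration of top under strace and keep only the /proc/stat accesses (a sketch; whether the syscall is open or openat, and the exact output format, depend on your strace and procps versions):

# strace -e trace=openat top -b -n 1 2>&1 >/dev/null | grep '/proc/stat'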

static const struct proc_ops stat_proc_ops = {
    .proc_flags     = PROC_ENTRY_PERMANENT,
    .proc_open      = stat_open,
    .proc_read_iter = seq_read_iter,
    .proc_lseek     = seq_lseek,
    .proc_release   = single_release,
};

static int __init proc_stat_init(void)
{
    proc_create("stat", 0, NULL, &stat_proc_ops);
    return 0;
}

static int stat_open(struct inode *inode, struct file *file)
{
    unsigned int size = 1024 + 128 * num_online_cpus();

    /* minimum size to display an interrupt count : 2 bytes */
    size += 2 * nr_irqs;
    return single_open_size(file, show_stat, NULL, size);
}

static int show_stat(struct seq_file *p, void *v)
{
    ...
    iowait += get_iowait_time(&kcpustat, i);
    ...
}

static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
{
    u64 iowait, iowait_usecs = -1ULL;

    if (cpu_online(cpu))
        iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);

    if (iowait_usecs == -1ULL)
        /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
        iowait = kcs->cpustat[CPUTIME_IOWAIT];
    else
        iowait = iowait_usecs * NSEC_PER_USEC;

    return iowait;
}

In the Linux kernel, the calculation of the “wa” (waiting for I/O) value that is reported by tools like top is handled within the kernel’s scheduler. The specific code responsible for updating the CPU usage statistics can be found in the kernel/sched/cputime.c file.

One of the key functions related to this is account_idle_time().

/*
 * Account for idle time.
 * @cputime: the CPU time spent in idle wait
 */
void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    else
        cpustat[CPUTIME_IDLE] += cputime;
}

atomic_t nr_iowait; /* in struct rq: number of tasks on this runqueue blocked waiting for I/O */


This function is part of the kernel’s scheduler and is responsible for accounting CPU time spent in the idle state. If the runqueue has threads blocked waiting for I/O (rq->nr_iowait > 0), the idle time is accumulated into CPUTIME_IOWAIT; otherwise it goes to CPUTIME_IDLE.

Here is a basic outline of how the “wa” time is accounted for in the Linux kernel:

  1. Idle time accounting: The kernel keeps track of the time the CPU spends in an idle state, waiting for tasks to execute.
  2. I/O wait time accounting: When a process is waiting for I/O operations (such as reading or writing to a disk), the kernel accounts this time to the “wa” state. The snippets below follow that path for a direct block I/O: the submitter sleeps in blk_io_schedule(), io_schedule_prepare() marks the task with in_iowait, __schedule() increments rq->nr_iowait while the task is blocked, and try_to_wake_up() decrements it again when the I/O completes.
# block/fops.c
static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
                                  unsigned int nr_pages)
{
    ...
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (!READ_ONCE(dio->waiter))
            break;
        blk_io_schedule();
    }
    __set_current_state(TASK_RUNNING);
    ...
}


# block/blk-core.c
void blk_io_schedule(void)
{
    /* Prevent hang_check timer from firing at us during very long I/O */
    unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2;

    if (timeout)
        io_schedule_timeout(timeout);
    else
        io_schedule();
}
EXPORT_SYMBOL_GPL(blk_io_schedule);
# kernel/sched/core.c
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
*/
long __sched io_schedule_timeout(long timeout)
{
int token;
long ret;

token = io_schedule_prepare();
ret = schedule_timeout(timeout);
io_schedule_finish(token);

return ret;
}
EXPORT_SYMBOL(io_schedule_timeout);

int io_schedule_prepare(void)
{
int old_iowait = current->in_iowait;

current->in_iowait = 1;
blk_flush_plug(current->plug, true);
return old_iowait;
}

void io_schedule_finish(int token)
{
current->in_iowait = token;
}

asmlinkage __visible void __sched schedule(void)
{
struct task_struct *tsk = current;

sched_submit_work(tsk);
do {
preempt_disable();
__schedule(SM_NONE);
sched_preempt_enable_no_resched();
} while (need_resched());
sched_update_worker(tsk);
}
EXPORT_SYMBOL(schedule);

static void __sched notrace __schedule(unsigned int sched_mode)
{
...
if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
...
}

# kernel/time/timer.c
signed long __sched schedule_timeout(signed long timeout)
{
struct process_timer timer;
unsigned long expire;

switch (timeout)
{
case MAX_SCHEDULE_TIMEOUT:
/*
* These two special cases are useful to be comfortable
* in the caller. Nothing more. We could take
* MAX_SCHEDULE_TIMEOUT from one of the negative value
* but I' d like to return a valid offset (>=0) to allow
* the caller to do everything it want with the retval.
*/
schedule();
goto out;
...
}
}
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
    ...
    if (task_cpu(p) != cpu) {
        if (p->in_iowait) {
            delayacct_blkio_end(p);
            atomic_dec(&task_rq(p)->nr_iowait);
        }

        wake_flags |= WF_MIGRATED;
        psi_ttwu_dequeue(p);
        set_task_cpu(p, cpu);
    }
    ...
}
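As a side note, the delayacct_blkio_start()/delayacct_blkio_end() calls seen above feed per-task block I/O delay accounting, which the kernel exports as field 42 (delayacct_blkio_ticks) of /proc/<pid>/stat. A rough way to read it for one process (replace <pid> with a real pid; the field position assumes the comm field contains no spaces):

// aggregated block I/O delay of the task, in clock ticks
# awk '{print "blkio delay (ticks):", $42}' /proc/<pid>/stat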

Question: trace the life of IO by blktrace/blkparse

blktrace

blkparse

Investigate the blktrace implementation to understand the trace hooks of the block I/O stack.

Using blktrace to parse block I/O

blktrace uses the Linux kernel tracepoint infrastructure to track requests in flight through the block I/O stack. It traces everything that goes through to block devices, while recording timing information.

First we use the dd command to generate write I/O (read I/O behaves the same).

// write to a file with direct io by dd command
# dd if=/dev/zero of=testfile bs=16k count=1024000 oflag=direct
1024000+0 records in
1024000+0 records out
16777216000 bytes (17 GB, 16 GiB) copied, 111.433 s, 151 MB/s

Use blktrace and blkparse to analyze the results:

# blktrace -d /dev/vda -o res
# blkparse -i res
device
252,0 10 3 0.000238760 0 C WS 49724200 + 32 [0]
252,0 0 17 0.000261770 3418552 A WS 48720712 + 32 <- (253,0) 48718664
252,0 0 18 0.000262000 3418552 A WS 49724232 + 32 <- (252,3) 48720712
252,0 0 19 0.000262192 3418552 Q WS 49724232 + 32 [dd]
252,0 0 20 0.000263132 3418552 G WS 49724232 + 32 [dd]
252,0 0 21 0.000263335 3418552 P N [dd]
252,0 0 22 0.000263494 3418552 U N [dd] 1
252,0 0 23 0.000263719 3418552 I WS 49724232 + 32 [dd]
252,0 0 24 0.000264296 3418552 D WS 49724232 + 32 [dd]
252,0 10 4 0.000341566 0 C WS 49724232 + 32 [0]
252,0 0 25 0.000366914 3418552 A WS 48720744 + 32 <- (253,0) 48718696
252,0 0 26 0.000367235 3418552 A WS 49724264 + 32 <- (252,3) 48720744
252,0 0 27 0.000367410 3418552 Q WS 49724264 + 32 [dd]
252,0 0 28 0.000368423 3418552 G WS 49724264 + 32 [dd]
252,0 0 29 0.000368612 3418552 P N [dd]
252,0 0 30 0.000368797 3418552 U N [dd] 1
252,0 0 31 0.000369036 3418552 I WS 49724264 + 32 [dd]
252,0 0 32 0.000369730 3418552 D WS 49724264 + 32 [dd]
252,0 10 5 0.000453876 0 C WS 49724264 + 32 [0]
...
252,0 10 91033 2.279081418 3405085 C WS 50399976 + 32 [0]
252,0 10 232460 5.618988454 3405085 C WS 51349952 + 32 [0]
252,0 10 318363 8.424729857 3405085 C WM 39888904 + 16 [0]
252,0 10 318364 8.424732534 3405085 C WM 267886593 + 1 [0]
252,0 10 318365 8.424734981 3405085 C WM 267886600 + 16 [0]
252,0 10 318366 8.424737948 3405085 C WM 267925376 + 32 [0]
252,0 10 318367 8.424750755 3405085 C WM 267924512 + 32 [0]

...
Total (res):
Reads Queued: 0, 0KiB Writes Queued: 70567, 1128MiB
Read Dispatches: 3, 0KiB Write Dispatches: 70559, 1128MiB
Reads Requeued: 0 Writes Requeued: 0
Reads Completed: 3, 0KiB Writes Completed: 70563, 1128MiB
Read Merges: 0, 0KiB Write Merges: 6, 24KiB
IO unplugs: 70545 Timer unplugs: 0


# tracing 267886608
252,0 6 71 8.424115456 585 A WM 267886608 + 8 <- (252,3) 266883088
252,0 6 72 8.424115630 585 Q WM 267886608 + 8 [xfsaild/dm-0]
252,0 6 73 8.424115962 585 M WM 267886608 + 8 [xfsaild/dm-0]

# tracing 4555007
252,0 9 2 8.423915195 3374227 A WFSM 4555007 + 31 <- (252,3) 3551487
252,0 9 3 8.423915869 3374227 Q WFSM 4555007 + 31 [kworker/9:4]
252,0 9 4 8.423918485 3374227 G WFSM 4555007 + 31 [kworker/9:4]
252,0 9 5 8.423927401 240 D WSM 4555007 + 31 [kworker/9:1H]
252,0 10 318350 8.424051685 0 C WSM 4555007 + 31 [0]
252,0 10 318353 8.424169747 0 C WSM 4555007 [0]


ACTION IDENTIFIERS

The following table shows the various actions which may be
output:

A IO was remapped to a different device

B IO bounced

C IO completion

D IO issued to driver

F IO front merged with request on queue

G Get request

I IO inserted onto request queue

M IO back merged with request on queue

P Plug request

Q IO handled by request queue code

S Sleep request

T Unplug due to timeout

U Unplug request

X Split

The full output is long, so only a small part of it is pasted here. I traced a few different sectors and behaviors through the block layer.
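One convenient way to follow a single request end to end is to filter the parsed output on its sector number (the sector below is taken from the listing above):

# blkparse -i res | grep ' 49724232 '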

The raw blkparse output makes it hard to see overall metrics and statistics, so we can use btt to get a summary of the whole trace; this is part of the results:

# blkparse -q -i res   -d sda.blk.bin
# btt -i sda.blk.bin

==================== All Devices ====================

ALL MIN AVG MAX N
--------------- ------------- ------------- ------------- -----------

Q2Q 0.000001441 0.000119382 0.802839654 70564
Q2G 0.000000497 0.000000797 0.000025123 70559
G2I 0.000000406 0.000000654 0.000128249 70558
Q2M 0.000000133 0.000000271 0.000000631 6
I2D 0.000000411 0.000000663 0.000017267 70558
M2D 0.000031518 0.000071988 0.000116699 6
D2C 0.000064651 0.000079818 0.002640604 70565
Q2C 0.000066315 0.000081939 0.002643331 70565

==================== Device Overhead ====================

DEV | Q2G G2I Q2M I2D D2C
---------- | --------- --------- --------- --------- ---------
(252, 0) | 0.9723% 0.7983% 0.0000% 0.8096% 97.4121%
---------- | --------- --------- --------- --------- ---------
Overall | 0.9723% 0.7983% 0.0000% 0.8096% 97.4121%

==================== Device Merge Information ====================

DEV | #Q #D Ratio | BLKmin BLKavg BLKmax Total
---------- | -------- -------- ------- | -------- -------- -------- --------
(252, 0) | 70565 70559 1.0 | 1 31 32 2257605
Q------->G------------>I--------->M------------------->D----------------------------->C
|-Q time-|-Insert time-|
|--------- merge time ------------|-merge with other IO|
|----------------scheduler time------------------------|--driver, adapter, storage time--|

|----------------------- await time in iostat output ----------------------------------|

Q2Q — time between requests sent to the block layer

Q2G — time from a block I/O is queued to the time it gets a request allocated for it

G2I — time from a request is allocated to the time it is inserted into the device’s queue

Q2M — time from a block I/O is queued to the time it gets merged with an existing request

I2D — time from a request is inserted into the device’s queue to the time it is actually issued to the device (time the I/O is “idle” on the request queue)

M2D — time from a block I/O is merged with an existing request until the request is issued to the device

D2C — service time of the request by the device (time the I/O is “active” in the driver and on the device)

Q2C — total time spent in the block layer for a request

In fact, blktrace is implemented on top of various Linux kernel tracepoints that cover the different phases of an I/O; they can also be enabled directly through tracefs, as sketched after the list below. Here are some of the key tracepoints used by blktrace:

  1. block_rq_insert: This tracepoint is hit when a request is inserted into the request queue. (I)
  2. block_rq_issue: This tracepoint is hit when a request is issued to the device driver. (D)
  3. block_rq_complete: This tracepoint is hit when a request is completed. (C)
  4. block_bio_queue: This tracepoint is hit when a bio is queued by the request-queue code. (Q)
  5. block_bio_backmerge: This tracepoint is hit when a bio is merged into the back of an existing request. (M)
  6. block_bio_frontmerge: This tracepoint is hit when a bio is merged into the front of an existing request. (F)
  7. block_bio_bounce: This tracepoint is hit when a bio is bounced. (B)
  8. block_getrq: This tracepoint is hit when get_request() is called to allocate a request. (G)
  9. block_sleeprq: This tracepoint is hit when a process goes to sleep waiting for a request to become available. (S)
  10. block_plug: This tracepoint is hit when the request queue is plugged, i.e. I/O is held back so it can be merged. (P)
  11. block_unplug: This tracepoint is hit when the request queue is unplugged and the held I/O is released to the device. (U)
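Since these are ordinary ftrace tracepoints, a rough alternative to blktrace is to enable them directly (a sketch; paths assume tracefs is mounted at /sys/kernel/debug/tracing):

# cd /sys/kernel/debug/tracing
# echo 1 > events/block/block_rq_issue/enable
# echo 1 > events/block/block_rq_complete/enable
# head -n 20 trace_pipe        // watch issue (D) and completion (C) events stream by
# echo 0 > events/block/block_rq_issue/enable
# echo 0 > events/block/block_rq_complete/enable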

Main data structure in block layer

To dig deeper into the block layer, I read the code; the main data structures are as follows:

/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
*/
struct bio {
struct bio *bi_next; /* request queue link */
struct block_device *bi_bdev;
unsigned long bi_rw; /* read or write */
...
struct bio_vec *bi_io_vec; /* the actual vec list */
struct bvec_iter bi_iter;

};

/**
* struct bio_vec - a contiguous range of physical memory addresses
* @bv_page: First page associated with the address range.
* @bv_len: Number of bytes in the address range.
* @bv_offset: Start of the address range relative to the start of @bv_page.
*
* The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
*
* nth_page(@bv_page, n) == @bv_page + n
*
* This holds because page_is_mergeable() checks the above property.
*/
struct bio_vec {
struct page *bv_page;
unsigned int bv_len;
unsigned int bv_offset;
};

struct bvec_iter {
sector_t bi_sector; /* device address in 512 byte
sectors */
unsigned int bi_size; /* residual I/O count */

unsigned int bi_idx; /* current index into bvl_vec */

unsigned int bi_bvec_done; /* number of bytes completed in
current bvec */
} __packed;

struct block_device {
sector_t bd_start_sect;
sector_t bd_nr_sectors;
struct gendisk * bd_disk;
struct request_queue * bd_queue;
bool bd_has_submit_bio;
...
} __randomize_layout;

/*
* Try to put the fields that are referenced together in the same cacheline.
*
* If you modify this structure, make sure to update blk_rq_init() and
* especially blk_mq_rq_ctx_init() to take care of the added fields.
*/
struct request {
struct request_queue *q;
struct bio *bio;
struct bio *biotail;
union {
struct list_head queuelist;
struct request *rq_next;
};
enum mq_rq_state state;
/*
* The hash is used inside the scheduler, and killed once the
* request reaches the dispatch list. The ipi_list is only used
* to queue the request for softirq completion, which is long
* after the request has been unhashed (and even removed from
* the dispatch list).
*/
union {
struct hlist_node hash; /* merge hash */
struct llist_node ipi_list;
};

/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
* completion_data share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
struct bio_vec special_vec;
void *completion_data;
};
/*
* completion callback.
*/
rq_end_io_fn *end_io;
}

struct request_queue {
struct request *last_merge;
struct elevator_queue *elevator;
struct rq_qos *rq_qos;
const struct blk_mq_ops *mq_ops;
struct queue_limits limits;
/*
* for flush operations
*/
struct blk_flush_queue *fq;
}

/**
* struct blk_mq_hw_ctx - State for a hardware queue facing the hardware
* block device
*/
struct blk_mq_hw_ctx {
/**
* @queue: Pointer to the request queue that owns this hardware context.
*/
struct request_queue *queue;
/**
* @dispatch_busy: Number used by blk_mq_update_dispatch_busy() to
* decide if the hw_queue is busy using Exponential Weighted Moving
* Average algorithm.
*/
unsigned int dispatch_busy;
}

Logical structure of a bio

code flow in block layer

img

Bio -> Request -> plug request list -> staging request queue in scheduler -> hardware request queue

(figures: code and function flow from the filesystem through the block layer down to the driver, with the tracepoints mapped to the blktrace actions)

These pictures record the main code and function flow from the filesystem through the block layer down to the driver layer. They also show which tracepoint produces each action in the blktrace output. With these tracepoints, we can understand the block I/O processing flow much more clearly.

Reproduce

environment:

5.15 kernel

Steps:

  1. Enable cgroup v2 and bfq, and set bfq as the I/O scheduler on the target device.
  2. Enable io.cost via /sys/fs/cgroup/io.cost.qos and set the cgroups' io.weight.
  3. Start two or more fio processes reading or writing the target device, with at least one direct-I/O and one buffered-I/O job, each in a different cgroup (a rough setup sketch follows this list).
  4. Wait 10-30 minutes.
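A rough sketch of such a setup (device names, cgroup names, weights and fio parameters here are placeholders, not the exact reproduction environment):

// put bfq on the target device and enable io.cost for it (MAJ:MIN are the device's numbers)
# echo bfq > /sys/block/sdX/queue/scheduler
# echo "MAJ:MIN enable=1" > /sys/fs/cgroup/io.cost.qos
// two cgroups with different io.weight
# echo "+io" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir -p /sys/fs/cgroup/direct /sys/fs/cgroup/buffered
# echo "default 100" > /sys/fs/cgroup/direct/io.weight
# echo "default 50" > /sys/fs/cgroup/buffered/io.weight
// one direct and one buffered fio job, each started from its own cgroup
# echo $$ > /sys/fs/cgroup/direct/cgroup.procs
# fio --name=direct --filename=/mnt/test/direct.file --rw=randwrite --direct=1 --bs=8k --iodepth=64 --size=10G --time_based --runtime=1800 &
# echo $$ > /sys/fs/cgroup/buffered/cgroup.procs
# fio --name=buffered --filename=/mnt/test/buffered.file --rw=write --direct=0 --bs=1M --size=10G --time_based --runtime=1800 &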

Debug

Kernel Dump 1 (Set hardlockup_panic=1 by sysctl):

[ 1408.828715] Call Trace:
[ 1408.828716] <TASK>
[ 1408.828718] _raw_spin_lock_irq+0x26/0x30
[ 1408.828722] bfq_bio_merge+0x9f/0x160 [bfq]
[ 1408.828727] __blk_mq_sched_bio_merge+0x36/0x190
[ 1408.828731] blk_mq_submit_bio+0xd2/0x600
[ 1408.828735] __submit_bio+0x1dc/0x210
[ 1408.828738] submit_bio_noacct+0x263/0x2c0
[ 1408.828741] submit_bio+0x48/0x130
[ 1408.828743] iomap_submit_ioend.isra.0+0x52/0x80
[ 1408.828745] iomap_writepage_map+0x490/0x630
[ 1408.828748] iomap_do_writepage+0x75/0x230
[ 1408.828750] ? clear_page_dirty_for_io+0xdf/0x1d0
[ 1408.828754] write_cache_pages+0x184/0x450
[ 1408.828756] ? iomap_truncate_page+0x60/0x60
[ 1408.828760] iomap_writepages+0x20/0x40
[ 1408.828762] xfs_vm_writepages+0x84/0xb0 [xfs]
[ 1408.828837] do_writepages+0xc5/0x1c0
[ 1408.828839] ? __wb_calc_thresh+0x3e/0x120
[ 1408.828842] __writeback_single_inode+0x44/0x290
[ 1408.828845] writeback_sb_inodes+0x22d/0x4b0
[ 1408.828849] __writeback_inodes_wb+0x56/0xf0
[ 1408.828852] wb_writeback+0x1ce/0x290
[ 1408.828855] wb_workfn+0x31b/0x490
[ 1408.828858] ? psi_task_switch+0xc3/0x240
[ 1408.828862] process_one_work+0x22b/0x3d0
[ 1408.828865] worker_thread+0x4d/0x3f0
[ 1408.828868] ? process_one_work+0x3d0/0x3d0
[ 1408.828870] kthread+0x12a/0x150
[ 1408.828872] ? set_kthread_struct+0x40/0x40
[ 1408.828874] ret_from_fork+0x22/0x30
[ 1408.828879] </TASK>
[ 1408.829600] Kernel panic - not syncing: Hard LOCKUP
[ 1408.829602] CPU: 26 PID: 0 Comm: swapper/26 Kdump: loaded Tainted: P IOE K 5.15.0-26-generic #26
[ 1408.829604] Hardware name: Dell Inc. PowerEdge C6420/0K2TT6, BIOS 2.4.8 11/27/2019
[ 1408.829605] Call Trace:
[ 1408.829606] <NMI>
[ 1408.829608] dump_stack_lvl+0x4a/0x5f
[ 1408.829615] dump_stack+0x10/0x12
[ 1408.829617] panic+0x149/0x321
[ 1408.829623] nmi_panic.cold+0xc/0xc
[ 1408.829626] watchdog_overflow_callback.cold+0x5c/0x70
[ 1408.829631] __perf_event_overflow+0x57/0x100
[ 1408.829634] perf_event_overflow+0x14/0x20
[ 1408.829637] handle_pmi_common+0x1f3/0x2f0
[ 1408.829643] ? flush_tlb_one_kernel+0xe/0x20
[ 1408.829647] ? __set_pte_vaddr+0x37/0x40
[ 1408.829650] ? set_pte_vaddr_p4d+0x3d/0x50
[ 1408.829652] ? set_pte_vaddr+0x81/0xb0
[ 1408.829653] ? __native_set_fixmap+0x28/0x40
[ 1408.829655] ? native_set_fixmap+0x66/0x70
[ 1408.829657] ? ghes_copy_tofrom_phys+0x7c/0x120
[ 1408.829663] ? __ghes_peek_estatus.isra.0+0x4e/0xb0
[ 1408.829666] intel_pmu_handle_irq+0xf4/0x210
[ 1408.829670] perf_event_nmi_handler+0x2d/0x50
[ 1408.829672] nmi_handle+0x66/0x120
[ 1408.829678] default_do_nmi+0x45/0x110
[ 1408.829681] exc_nmi+0x175/0x190
[ 1408.829683] end_repeat_nmi+0x16/0x55
[ 1408.829685] RIP: 0010:native_queued_spin_lock_slowpath+0x14f/0x220
[ 1408.827472] ? adjust_inuse_and_calc_cost+0x13a/0x250
[ 1408.827475] ioc_rqos_merge+0xef/0x260
[ 1408.827478] __rq_qos_merge+0x31/0x50
[ 1408.827481] bio_attempt_back_merge+0x43/0xd0
[ 1408.827484] blk_mq_sched_try_merge+0x147/0x1d0
[ 1408.827486] bfq_bio_merge+0xe2/0x160 [bfq]
[ 1408.827491] __blk_mq_sched_bio_merge+0x36/0x190
[ 1408.827495] blk_mq_submit_bio+0xd2/0x600
[ 1408.827499] __submit_bio+0x1dc/0x210
[ 1408.827502] ? get_user_pages_fast+0x24/0x40
[ 1408.827507] ? iov_iter_get_pages+0xc6/0x3f0
[ 1408.827513] submit_bio_noacct+0xac/0x2c0
[ 1408.827516] submit_bio+0x48/0x130
[ 1408.827518] iomap_dio_submit_bio+0x79/0x80
[ 1408.827524] iomap_dio_bio_iter+0x2d7/0x4a0
[ 1408.827528] __iomap_dio_rw+0x531/0x7c0
[ 1408.827532] iomap_dio_rw+0xe/0x30
[ 1408.827535] xfs_file_dio_write_aligned+0xa0/0x110 [xfs]
[ 1408.827647] xfs_file_write_iter+0x101/0x1a0 [xfs]
[ 1408.827736] aio_write+0x116/0x210
[ 1408.827740] ? update_load_avg+0x7c/0x640
[ 1408.827744] ? set_next_entity+0xb7/0x200
[ 1408.827747] __io_submit_one.constprop.0+0x40e/0x720
[ 1408.827749] ? __io_submit_one.constprop.0+0x40e/0x720
[ 1408.827752] ? __check_object_size+0x5d/0x150
[ 1408.827755] ? _copy_to_user+0x20/0x30
[ 1408.827760] ? aio_read_events+0x210/0x320
[ 1408.827762] io_submit_one+0xe3/0x550
[ 1408.827764] ? io_submit_one+0xe3/0x550
[ 1408.827766] ? audit_filter_rules.constprop.0+0x3b1/0x12e0
[ 1408.827772] __x64_sys_io_submit+0x8d/0x180
[ 1408.827775] ? syscall_trace_enter.isra.0+0x146/0x1c0
[ 1408.827779] do_syscall_64+0x5c/0xc0
[ 1408.827782] ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 1408.827785] ? syscall_exit_to_user_mode+0x27/0x50
[ 1408.827788] ? do_syscall_64+0x69/0xc0
[ 1408.827790] ? irqentry_exit+0x19/0x30
[ 1408.827793] ? sysvec_call_function_single+0x4e/0x90
[ 1408.827795] ? asm_sysvec_call_function_single+0xa/0x20
[ 1408.827799] entry_SYSCALL_64_after_hwframe+0x44/0xae

Kernel dump 2 :

[60977.577690] BUG: spinlock recursion on CPU#21, fio/11770
[60977.583007] lock: 0xffff92600e348c20, .magic: dead4ead, .owner: fio/11770, .owner_cpu: 21
[60977.591275] CPU: 21 PID: 11770 Comm: fio Kdump: loaded Not tainted 5.15.0-26-generic #26
[60977.599362] Hardware name: Supermicro SYS-6029TP-H-EI012/X11DPT-PS-EI012, BIOS 1.0.9.0 01/31/2018
[60977.608229] Call Trace:
[60977.610682] <IRQ>
[60977.612700] dump_stack_lvl+0x58/0x7d
[60977.616367] dump_stack+0x10/0x12
[60977.619687] spin_dump.cold+0x24/0x39
[60977.623359] do_raw_spin_lock+0x9a/0xd0
[60977.627198] _raw_spin_lock_irqsave+0x7d/0x80
[60977.631561] bfq_finish_requeue_request+0x55/0x540 [bfq]
[60977.636874] blk_mq_free_request+0x3e/0x160
[60977.641058] __blk_mq_end_request+0x102/0x110
[60977.645419] scsi_end_request+0xce/0x190
[60977.649345] scsi_io_completion+0x7e/0x5d0
[60977.653442] scsi_finish_command+0xca/0x100
[60977.657629] scsi_complete+0x74/0xe0
[60977.661212] blk_complete_reqs+0x3b/0x50
[60977.665135] blk_done_softirq+0x1d/0x20
[60977.668975] __do_softirq+0xf7/0x31e
[60977.672554] irq_exit_rcu+0x79/0xa0
[60977.676044] sysvec_call_function_single+0x7c/0x90
[60977.680837] </IRQ>
[60977.682945] <TASK>
[60977.685051] asm_sysvec_call_function_single+0x12/0x20
[60977.690189] RIP: 0010:_raw_spin_unlock_irq+0x2a/0x30
[60977.695155] Code: 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 48 8d 7f 18 48 8b 75 08 e8 75 6f 3d ff 4c 89 e7 e8 8d b0 3d ff fb 66 0f 1f 44 00 00 <41> 5c 5d c3 66 90 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 48 8d
[60977.713902] RSP: 0018:ffffa0781992b670 EFLAGS: 00000246
[60977.719128] RAX: 0000000000000015 RBX: 00000000001d9281 RCX: 0000000036d7555b
[60977.726260] RDX: 000000007a3bfa68 RSI: ffff9260bca80d48 RDI: ffff92600fd9f0e0
[60977.733392] RBP: ffffa0781992b678 R08: 00000000ffffffff R09: 00000000ffffffff
[60977.740527] R10: 0000000000000001 R11: ffff9231c6d2cc00 R12: ffff92600fd9f0e0
[60977.747661] R13: ffffa0781992b710 R14: 0000000000000001 R15: 000000000019007e
[60977.754794] adjust_inuse_and_calc_cost+0x164/0x250
[60977.759673] ioc_rqos_merge+0xef/0x260
[60977.763425] __rq_qos_merge+0x31/0x50
[60977.767091] bio_attempt_back_merge+0x64/0xf0
[60977.771451] blk_mq_sched_try_merge+0x147/0x1d0
[60977.775982] bfq_bio_merge+0xe2/0x150 [bfq]
[60977.780168] __blk_mq_sched_bio_merge+0x36/0x1b0
[60977.784790] blk_mq_submit_bio+0xd5/0x690
[60977.788803] __submit_bio+0x23d/0x270
[60977.792465] ? wake_up_page_bit+0xd3/0x100
[60977.796568] submit_bio_noacct+0x26c/0x2d0
[60977.800664] ? wake_up_page_bit+0xd3/0x100
[60977.804764] submit_bio+0x48/0x140
[60977.808171] iomap_submit_ioend.isra.0+0x52/0x80
[60977.812790] iomap_writepage_map+0x490/0x630
[60977.817065] iomap_do_writepage+0x8c/0x260
[60977.821161] ? clear_page_dirty_for_io+0x113/0x220
[60977.825955] write_cache_pages+0x19e/0x460
[60977.830054] ? iomap_truncate_page+0x60/0x60
[60977.834330] iomap_writepages+0x20/0x40
[60977.838167] xfs_vm_writepages+0x85/0xb0 [xfs]
[60977.842718] do_writepages+0xc6/0x1c0
[60977.846384] ? _raw_spin_unlock+0x23/0x30
[60977.850398] filemap_fdatawrite_wbc+0x7d/0xc0
[60977.854756] __filemap_fdatawrite_range+0x54/0x70
[60977.859463] generic_fadvise+0x299/0x2c0
[60977.863387] xfs_file_fadvise+0x23/0x80 [xfs]
[60977.867807] vfs_fadvise+0x1e/0x30
[60977.871214] ksys_fadvise64_64+0x41/0x80
[60977.875140] ? syscall_trace_enter.isra.0+0x16b/0x1e0
[60977.880194] __x64_sys_fadvise64+0x1e/0x30
[60977.884291] do_syscall_64+0x5c/0xc0
[60977.887870] ? irqentry_exit_to_user_mode+0x9/0x20
[60977.892665] ? irqentry_exit+0x19/0x30
[60977.896416] ? exc_page_fault+0x94/0x1b0
[60977.900343] ? asm_exc_page_fault+0x8/0x30
[60977.904444] entry_SYSCALL_64_after_hwframe+0x44/0xae
[60977.909496] RIP: 0033:0x7f16f2c07d3e
[60977.913076] Code: 10 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb b3 e8 28 d8 01 00 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 41 89 ca b8 dd 00 00 00 0f 05 <89> c2 f7 da 3d 00 f0 ff ff b8 00 00 00 00 0f 47 c2 c3 41 57 41 56
[60977.931819] RSP: 002b:00007ffd83ec8338 EFLAGS: 00000246 ORIG_RAX: 00000000000000dd
[60977.939388] RAX: ffffffffffffffda RBX: 00007f16effca890 RCX: 00007f16f2c07d3e
[60977.946518] RDX: 0000000280000000 RSI: 0000000000000000 RDI: 0000000000000006
[60977.953652] RBP: 0000000000000000 R08: 00007f16e8f2b028 R09: 0000000000000000
[60977.960784] R10: 0000000000000004 R11: 0000000000000246 R12: 0000000280000000
[60977.967918] R13: 00007f16e8c16c80 R14: 00007f16effca890 R15: 00007f16e8c16c80
[60977.975053] </TASK>

Kernel dump 3 (enable lockdep):

[ 4398.422037] WARNING: inconsistent lock state
[ 4398.426311] 5.15.0-26-generic #26 Not tainted
[ 4398.430666] --------------------------------
[ 4398.434939] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 4398.440947] fio/156503 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 4398.445997] ffff8b5bd4771c38 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x9f/0x150 [bfq]
[ 4398.454265] {IN-HARDIRQ-W} state was registered at:
[ 4398.459148] lock_acquire+0xd8/0x300
[ 4398.462823] _raw_spin_lock_irqsave+0x52/0xa0
[ 4398.467279] bfq_idle_slice_timer+0x2d/0xc0 [bfq]
[ 4398.472079] __hrtimer_run_queues+0x8b/0x420
[ 4398.476438] hrtimer_interrupt+0x109/0x230
[ 4398.480621] __sysvec_apic_timer_interrupt+0x78/0x1f0
[ 4398.485761] sysvec_apic_timer_interrupt+0x77/0x90
[ 4398.490640] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 4398.495868] lock_is_held_type+0x115/0x150
[ 4398.500054] rcu_read_lock_sched_held+0x5a/0x90
[ 4398.504673] lock_acquire+0x19f/0x300
[ 4398.508426] process_one_work+0x28d/0x5d0
[ 4398.512525] worker_thread+0x4a/0x3d0
[ 4398.516276] kthread+0x141/0x160
[ 4398.519597] ret_from_fork+0x22/0x30
[ 4398.523289] irq event stamp: 1141506
[ 4398.526869] hardirqs last enabled at (1141505): [<ffffffff880849b1>] _raw_spin_unlock_irqrestore+0x51/0x70
[ 4398.536601] hardirqs last disabled at (1141506): [<ffffffff88084744>] _raw_spin_lock_irq+0x74/0x90
[ 4398.545556] softirqs last enabled at (1139712): [<ffffffff884002f5>] __do_softirq+0x2f5/0x4d3
[ 4398.554160] softirqs last disabled at (1139699): [<ffffffff872c9556>] irq_exit_rcu+0x96/0xc0
[ 4398.562594]
other info that might help us debug this:
[ 4398.569121] Possible unsafe locking scenario:

[ 4398.575041] CPU0
[ 4398.577494] ----
[ 4398.579947] lock(&bfqd->lock);
[ 4398.583188] <Interrupt>
[ 4398.585813] lock(&bfqd->lock);
[ 4398.589219]
*** DEADLOCK ***

[ 4398.595137] 2 locks held by fio/156503:
[ 4398.598978] #0: ffff8b5c653c4e28 (&sb->s_type->i_mutex_key#14){++++}-{3:3}, at: xfs_ilock+0x104/0x1b0 [xfs]
[ 4398.608937] #1: ffff8b5bd4771c38 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x9f/0x150 [bfq]
[ 4398.617637]
stack backtrace:
[ 4398.621999] CPU: 19 PID: 156503 Comm: fio Kdump: loaded Not tainted 5.15.0-26-generic #26
[ 4398.630170] Hardware name: Supermicro SYS-6029TP-H-EI012/X11DPT-PS-EI012, BIOS 1.0.9.0 01/31/2018
[ 4398.639036] Call Trace:
[ 4398.641488] <TASK>
[ 4398.643596] dump_stack_lvl+0x6e/0x9c
[ 4398.647284] dump_stack+0x10/0x12
[ 4398.650605] print_usage_bug.part.0+0x18e/0x19d
[ 4398.655138] mark_lock_irq.cold+0x11/0x2c
[ 4398.659151] ? stack_trace_save+0x4c/0x70
[ 4398.663164] ? save_trace+0x45/0x2f0
[ 4398.666742] mark_lock.part.0+0x11f/0x220
[ 4398.670756] mark_held_locks+0x54/0x80
[ 4398.674510] ? adjust_inuse_and_calc_cost+0x164/0x2c0
[ 4398.679563] lockdep_hardirqs_on_prepare+0x7f/0x1c0
[ 4398.680859] patch_est_timer: disagrees about version of symbol module_layout
[ 4398.684449] ? _raw_spin_unlock_irq+0x28/0x40
[ 4398.695856] trace_hardirqs_on+0x21/0xe0
[ 4398.699791] _raw_spin_unlock_irq+0x28/0x40
[ 4398.703974] adjust_inuse_and_calc_cost+0x164/0x2c0
[ 4398.708858] ioc_rqos_merge+0xef/0x280
[ 4398.712608] __rq_qos_merge+0x31/0x50
[ 4398.716272] bio_attempt_back_merge+0x63/0x160
[ 4398.720722] blk_mq_sched_try_merge+0x147/0x1d0
[ 4398.725256] bfq_bio_merge+0xe2/0x150 [bfq]
[ 4398.729448] __blk_mq_sched_bio_merge+0x36/0x1b0
[ 4398.734069] blk_mq_submit_bio+0xd5/0x900
[ 4398.738082] __submit_bio+0x307/0x340
[ 4398.741746] ? get_user_pages_fast+0x24/0x40
[ 4398.746018] ? iov_iter_get_pages+0xc6/0x400
[ 4398.750292] submit_bio_noacct+0x26c/0x2d0
[ 4398.754393] submit_bio+0x48/0x140
[ 4398.757797] iomap_dio_submit_bio+0x7c/0x90
[ 4398.761983] iomap_dio_bio_iter+0x2da/0x4b0
[ 4398.766172] __iomap_dio_rw+0x514/0x830
[ 4398.770012] iomap_dio_rw+0xe/0x30
[ 4398.773413] xfs_file_dio_write_aligned+0xb9/0x1a0 [xfs]
[ 4398.778821] ? sched_clock_cpu+0x12/0xe0
[ 4398.782749] xfs_file_write_iter+0x101/0x1a0 [xfs]
[ 4398.787613] aio_write+0x131/0x310
[ 4398.791019] ? __fget_files+0xcf/0x1c0
[ 4398.794771] __io_submit_one.constprop.0+0x43a/0x570
[ 4398.799740] ? __io_submit_one.constprop.0+0x43a/0x570
[ 4398.804885] ? sched_clock_cpu+0x12/0xe0
[ 4398.808819] ? io_submit_one+0xd9/0x870
[ 4398.812671] io_submit_one+0x142/0x870
[ 4398.816429] ? io_submit_one+0x142/0x870
[ 4398.820373] __x64_sys_io_submit+0x97/0x2a0
[ 4398.824569] ? syscall_trace_enter.isra.0+0x186/0x250
[ 4398.829632] do_syscall_64+0x5c/0xc0
[ 4398.833217] ? __x64_sys_io_cancel+0x250/0x250
[ 4398.837660] ? do_syscall_64+0x5c/0xc0
[ 4398.841412] ? do_syscall_64+0x69/0xc0
[ 4398.845165] ? do_syscall_64+0x69/0xc0
[ 4398.848921] ? sysvec_call_function_single+0x4e/0x90
[ 4398.853884] ? asm_sysvec_call_function_single+0xa/0x20
[ 4398.859110] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 4398.864165] RIP: 0033:0x7fd43e139697
[ 4398.867746] Code: 00 75 10 8b 47 0c 39 47 08 74 10 0f 1f 84 00 00 00 00 00 e9 bb ff ff ff 0f 1f 00 31 c0 c3 0f 1f 44 00 00 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84 00 00
[ 4398.886488] RSP: 002b:00007fd439a49cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
[ 4398.894057] RAX: ffffffffffffffda RBX: 00007fd43a24c000 RCX: 00007fd43e139697
[ 4398.901187] RDX: 00007fd43400d530 RSI: 0000000000000001 RDI: 00007fd43eb24000
[ 4398.908323] RBP: 00007fd43a24c000 R08: 00007fd43a7051f0 R09: 00000000000001f0
[ 4398.915454] R10: 0000000004455000 R11: 0000000000000202 R12: 00007fd43400ccf0
[ 4398.922587] R13: 00007fd43400d740 R14: 00007fd43400d530 R15: 00007fd43a251210
[ 4398.929725] </TASK>

Lock chain:

  1. bfq_bio_merge:
       spin_lock_irq(&bfqd->lock);
       blk_mq_sched_try_merge:
         adjust_inuse_and_calc_cost:
           spin_lock_irq(&ioc->lock);
           spin_unlock_irq(&ioc->lock);

  2. bfq_finish_requeue_request:
       spin_lock_irqsave(&bfqd->lock, flags);
       ...
       spin_unlock_irqrestore(&bfqd->lock, flags);

img

A process running on a CPU holds the first lock via spin_lock_irq(&bfqd->lock), then takes a second lock via spin_lock_irq(&ioc->lock). After finishing its calculation it releases the second lock with spin_unlock_irq(&ioc->lock), which re-enables interrupts on that CPU. An interrupt then preempts the CPU, and when its completion path reaches bfq_finish_requeue_request it tries to take the first lock via spin_lock_irqsave(&bfqd->lock, flags). That lock is still held by the interrupted process, so the CPU deadlocks. The lockdep module can detect this pattern and report the risk in dmesg (stack 3).

Solution

This kernel patch fixes the issue in kernel 6.3.13:

https://lwn.net/Articles/937933/

https://www.spinics.net/lists/stable/msg669695.html

In this patch, the fix replaces spin_lock_irq with spin_lock_irqsave in adjust_inuse_and_calc_cost, so that interrupts stay blocked when ioc->lock is released while bfqd->lock is still held.

We will cherry-pick the patch into our 5.15 kernel.

RAID

RAID is a technology used to increase the performance and reliability of data storage. The full name is Redundant Array of Inexpensive Disks.

  • RAID 0 - striping
  • RAID 1 - mirroring
  • RAID 5 - striping with parity
  • RAID 6 - striping with double parity
  • RAID 10 - combining mirroring and striping

Preparation

mdadm is currently the standard RAID management tool on Linux. It supports multiple modes: it can create and manage RAID arrays, add and remove devices, and display detailed information about RAID devices. It is a very powerful tool.

$yum install mdadm -y
$git clone git://neil.brown.name/mdadm

devicemapper can also create RAID devices.

Use dd to create 4 image files and attach them as loop devices:

$dd if=/dev/zero of=disk1.img bs=512 count=4096000
$losetup /dev/loop2 disk1.img
...

RAID level 0 – Striping

image

RAID 0 stripes data and writes it to multiple disks (at least 2) in parallel. There is no data redundancy; ideally, each disk is managed by its own controller.

Advantages

  • Good read and write performance; no capacity is spent on redundancy.

  • All disks are fully used.

  • Easy to implement.

    Disadvantages

  • No fault tolerance.

    mdadm

mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=2 /dev/loop2 /dev/loop3
mdadm --manage --stop /dev/md0

Check the result:

$fdisk -l
...
Disk /dev/md0: 4192 MB, 4192206848 bytes, 8187904 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes

$lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
loop2 7:2 0 2G 0 loop
└─md0 9:0 0 3.9G 0 raid0
loop3 7:3 0 2G 0 loop
└─md0 9:0 0 3.9G 0 raid0
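The md arrays and their sync state can also be checked at any time through /proc/mdstat:

$cat /proc/mdstat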

devicemapper

// create a raid0 device from 4 devices; 16384000 is the total sector count of the 4 devices, and 0 is the starting offset within each device
$dmsetup create test-raid0 --table '0 16384000 striped 4 128 /dev/loop4 0 /dev/loop5 0 /dev/loop6 0 /dev/loop7 0'

Check the result:

$lsblk
loop4 7:4 0 2G 0 loop
└─test-raid0 253:23 0 7.8G 0 dm
loop5 7:5 0 2G 0 loop
└─test-raid0 53:23 0 7.8G 0 dm
loop6 7:6 0 2G 0 loop
└─test-raid0 253:23 0 7.8G 0 dm
loop7 7:7 0 2G 0 loop
└─test-raid0 253:23 0 7.8G 0 dm

$fdisk -l
...
Disk /dev/mapper/test-raid0: 8388 MB, 8388608000 bytes, 16384000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 65536 bytes / 262144 bytes

$dmsetup status test-raid0
0 16384000 striped 4 7:4 7:5 7:6 7:7 1 AAAA

RAID level 1 – Mirroring

image

With RAID 1 the data is stored twice. If one drive fails, the controller keeps serving from the other drive, or simply copies that drive's data onto the replacement.

The pros and cons of RAID 1 are obvious. It is worth mentioning that hot-swapping of disks is not guaranteed; that is, when a disk fails, the machine may need to be shut down before the disk can be replaced.

mdadm

$mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/loop2 /dev/loop3

devicemapper

In devicemapper, the mirror target is not the same thing as the raid1 target:

// 4096000 is the sector count of a single device (mirroring does not increase capacity); 2 is the number of devices; - means there is no metadata device
$dmsetup create test-raid1 --table '0 4096000 raid raid1 3 0 region_size 1024 2 - /dev/loop4 - /dev/loop5'
// the following creates a raid1 that does have metadata devices
$dmsetup create test-raid1 --table '0 4096000 raid raid1 3 0 region_size 1024 2 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7'

$dmsetup status test-raid1
0 4096000 raid raid1 2 AA 4096000/4096000 idle 0

RAID level 5

This is the most common RAID level. It requires at least 3 drives: within each stripe one block holds parity and the other blocks hold striped data, and the parity blocks are distributed across all drives. When a drive fails, its data can be rebuilt from the remaining blocks plus the parity; but if two devices of the same stripe fail, the whole RAID is lost. In other words, RAID 5 tolerates a single drive failure.
image

The advantage of RAID 5 is that it balances read/write performance and safety, and it survives a single drive failure. The disadvantages are that it is relatively complex to implement and rebuilds are slow; if another drive fails during the rebuild, the whole RAID is lost.

mdadm

// --spare-devices specifies a hot-spare disk; create a RAID5 from 3 disks
$mdadm --create --verbose /dev/md1 --level=5 --raid-devices=3 /dev/loop4 /dev/loop5 /dev/loop6 --spare-devices=1 /dev/loop7

Check the details:

$mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar  7 16:57:39 2017
     Raid Level : raid5
     Array Size : 4093952 (3.90 GiB 4.19 GB)
  Used Dev Size : 2046976 (1999.34 MiB 2096.10 MB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Tue Mar  7 16:57:57 2017
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : st-integrat-node00:1  (local to host st-integrat-node00)
           UUID : 071d2e75:2029e9a2:2e427dd9:d2ab804f
         Events : 18

    Number   Major   Minor   RaidDevice State
       0       7        4        0      active sync   /dev/loop4
       1       7        5        1      active sync   /dev/loop5
       4       7        6        2      active sync   /dev/loop6
       3       7        7        -      spare         /dev/loop7

device-mapper

// create a raid5 from 4 devices without metadata devices; the usable size is 3x the per-device sector count, since one device's worth of space holds parity
$dmsetup create test-raid5 --table '0 12288000 raid raid5_ls 3 64 region_size 1024 4 - /dev/loop4 - /dev/loop5 - /dev/loop6 - /dev/loop7'
// creating with metadata devices works like the RAID1 example above and is omitted here
// devicemapper can also create a degraded RAID5: just set the last device to -
$dmsetup create test-raid5 --table '0 12288000 raid raid5_ls 3 64 region_size 1024 4 - /dev/loop4 - /dev/loop5 - - -'

Check the result:

// resync means synchronization is still in progress; 4096000/4096000 means it has fully completed
$dmsetup status test-raid5
0 12288000 raid raid5_ls 4 aaaa 2048064/4096000 resync 0
0 12288000 raid raid5_ls 4 AAAA 4096000/4096000 idle 0
// devices can also be removed while the resync is in progress

RAID level 6 – Striping with double parity

image
RAID 6 improves on RAID 5 by using two parity blocks per stripe.
The creation process for RAID 6 is similar to RAID 5.
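For example, a minimal sketch (RAID 6 needs at least four member devices; the loop devices follow the ones used above):

$mdadm --create --verbose /dev/md2 --level=6 --raid-devices=4 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7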

RAID level 10 – combining RAID 1 & RAID 0

RAID 10 can be seen as the combination of RAID 1 and RAID 0.
image

The advantage of RAID 10 is that data recovery is very fast, but compared with RAID 5 and 6 it costs more devices for the same capacity.

mdadm

$mdadm --create --verbose /dev/md1 --level=10 --raid-devices=4 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7
mdadm: layout defaults to n2
mdadm: layout defaults to n2
mdadm: chunk size defaults to 512K
mdadm: size set to 2046976K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.

Check the result:

$mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Tue Mar 7 17:44:43 2017
Raid Level : raid10
Array Size : 4093952 (3.90 GiB 4.19 GB)
Used Dev Size : 2046976 (1999.34 MiB 2096.10 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent

Update Time : Tue Mar 7 17:45:04 2017
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 512K

Name : st-integrat-node00:1 (local to host st-integrat-node00)
UUID : 643b371a:eeadc82d:7d4effee:cf00411c
Events : 17

Number Major Minor RaidDevice State
0 7 4 0 active sync set-A /dev/loop4
1 7 5 1 active sync set-B /dev/loop5
2 7 6 2 active sync set-A /dev/loop6
3 7 7 3 active sync set-B /dev/loop7

device-mapper

// create a RAID10 from 4 drives; half of the drives mirror the other half, so the size is 2x the per-drive sector count
$dmsetup create test-raid10 --table '0 8192000 raid raid10 3 64 region_size 1024 4 - /dev/loop4 - /dev/loop5 - /dev/loop6 - /dev/loop7'

Check the result:

$dmsetup status test-raid10
0 8192000 raid raid10 4 AAAA 8192000/8192000 idle 0


Repair thin-pool document

Use lvm repair

If you have the latest lvm2 tools, you can try the following:

  1. Deactivate the pool if it is active. Before deactivating the pool you must deactivate the volumes created from it: if a volume was created through lvm, just umount it and use lvchange; if it was created directly through devicemapper, you have to remove the volume manually and record its device id, device name and table so it can be activated again later. Deactivation is essentially the removal of the links /dev/vg/volume and /dev/mapper/vg-volume.

    lvchange -an vg
  2. Repair the metadata:

    lvconvert --repair  vg/pool
  3. Activate the pool:

    lvchange -ay vg

    Below are the steps which happen while running the consistency check:

    1. Creates a new, repaired copy of the metadata.
      lvconvert runs the thin_repair command to read damaged metadata from
      the existing pool metadata LV, and writes a new repaired copy to the
      VG’s pmspare LV.
    2. Replaces the thin pool metadata LV.
      If step 1 is successful, the thin pool metadata LV is replaced with
      the pmspare LV containing the corrected metadata. The previous thin
      pool metadata LV, containing the damaged metadata, becomes visible
      with the new name ThinPoolLV_tmetaN (where N is 0,1,…).

    but in my lvm (version 2.02.166(2)-RHEL7 (2016-11-16)), repair does not create a new pmspare; it directly uses free space in the VG to create the new metadata LV:

    [root@tosqatest4 ~]# lvs -a silver_vg
    LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
    convoy_Linear_silver_data silver_vg twi-aotz-- 894.25g 0.04 0.01
    convoy_Linear_silver_data_meta0 silver_vg -wi-a----- 9.31g
    [convoy_Linear_silver_data_tdata] silver_vg Twi-ao---- 894.25g
    [convoy_Linear_silver_data_tmeta] silver_vg ewi-ao---- 9.31g
    thinvolume silver_vg Vwi-aotz-- 1.00g convoy_Linear_silver_data 40.04

    So the pmspare device is just reserved free space, not a mirror of the metadata; if it is missing, we can add a PV to the VG so the repair has space to work with.

    Use manual repair

    With older tools you need to follow these manual steps:

    1. create temporary small LV

      lvcreate -an -Zn -L10 --name temp vg
    2. replace pool’s metadata volume with this tempLV

      lvconvert --thinpool vg/pool  --poolmetadata temp

    (say ‘y’ to swap)

    3. activate and repair the metadata from the ‘temp’ volume; you will likely need another volume to store the repaired metadata,

      so create:

      lvcreate -L<at_least_as_big_as_temp> --name repaired vg
      lvchange -ay vg/temp
      thin_repair -i /dev/vg/temp -o /dev/vg/repaired

      if everything went fine, visually compare the ‘transaction_id’ of the repaired metadata (thin_dump /dev/vg/repaired) with the one lvm expects

        4. swap the deactivated repaired volume back into your thin-pool

        lvchange -an vg/repaired
        lvconvert --thinpool vg/pool --poolmetadata repaired

      then try to activate the pool; if it still doesn't work, investigate further.

      Metadata space exhaustion

      Metadata space exhaustion can lead to inconsistent thin pool metadata
      and inconsistent file systems, so the response requires offline
      checking and repair.

      1. Deactivate the thin pool LV, or reboot the system if this is not possible.
      2. Repair the thin pool with lvconvert --repair. See "Metadata check and repair".
      3. Extend pool metadata space with lvextend VG/ThinPoolLV_tmeta. See "Manually manage free metadata space of a thin pool LV".
      4. Check and repair the file system with fsck.

The KMP algorithm is used for string matching and comes up constantly in computer science. Most articles about it are hard to follow. This write-up follows Ruan Yifeng's blog post, which is easy to understand but does not explain in detail how the Next (partial match) table is actually computed, only the concept. Here I give a more complete implementation and explanation.

Basic idea

The basic idea of KMP is very simple. As an example, let's determine whether the string "ABCDABD" occurs in the string "BBC ABCDAB ABCDABCDABDE".

img

First, the first character of the text "BBC ABCDAB ABCDABCDABDE" is compared with the first character of the pattern "ABCDABD". Because B does not match A, the pattern is shifted one position to the right.

img

Because B still does not match A, the pattern is shifted again.

img

This continues until the text has a character that equals the first character of the pattern.

img

Then the next characters of the text and the pattern are compared; they are also equal.

img

This goes on until a character of the text differs from the corresponding character of the pattern.

img

At this point the most natural reaction is to shift the whole pattern one position to the right and compare again from its start. That works, but it is inefficient, because you move the "search position" back over characters that have already been compared and compare them again.

img

A basic fact is that when the space fails to match D, you already know the preceding six characters are "ABCDAB". The idea of KMP is to make use of this known information: never move the "search position" back over characters already compared, only keep moving it forward. That is what improves efficiency.

img

How is this achieved? For the pattern we compute a Partial Match Table. How this table is produced is explained later; for now it is enough to know how to use it.

img

When the space fails to match D, we know the preceding six characters "ABCDAB" all matched. Looking up the table, the "partial match value" of the last matched character B is 2, so the number of positions to shift is given by the formula:

  shift = number of characters matched - corresponding partial match value

Since 6 - 2 equals 4, the pattern is shifted 4 positions to the right.

img

Because the space does not match C, the pattern keeps moving. This time the number of matched characters is 2 ("AB") and the corresponding partial match value is 0, so the shift is 2 - 0 = 2, and the pattern moves 2 positions.

img

Because the space does not match A, shift one more position.

img

Compare character by character until C fails to match D. Then the shift is 6 - 2, so the pattern moves another 4 positions.

img

Compare character by character up to the last character of the pattern: everything matches, so the search is complete. If you want to keep searching (i.e. find all matches), the shift is 7 - 0, and the pattern moves another 7 positions; we will not repeat that here.

img

Now let's see how the Partial Match Table is produced.

First, two concepts: "prefix" and "suffix". The "prefixes" of a string are all of its leading substrings excluding the final character; the "suffixes" are all of its trailing substrings excluding the first character.

img

The "partial match value" is the length of the longest element shared by the prefixes and the suffixes. Take "ABCDABD" as an example:

  - The prefixes and suffixes of "A" are both empty, so the shared length is 0;

  - The prefixes of "AB" are [A] and the suffixes are [B]; the shared length is 0;

  - The prefixes of "ABC" are [A, AB] and the suffixes are [BC, C]; the shared length is 0;

  - The prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D]; the shared length is 0;

  - The prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the shared element is "A", so the length is 1;

  - The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the shared element is "AB", so the length is 2;

  - The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D]; the shared length is 0.

img

The essence of the "partial match" is that the head and tail of a string sometimes repeat. For example, "ABCDAB" contains two occurrences of "AB", so its partial match value is 2 (the length of "AB"). When the pattern is shifted, moving the first "AB" forward 4 positions (string length minus the partial match value) brings it exactly onto the position of the second "AB".

Implementation

Looking at the idea above, the core of KMP is computing the Next table (the partial match table); everything else is just shifting according to the formula. Let's see how to construct this table algorithmically. From the construction above we know that Next[q] is the "partial match value": the length of the longest element shared by the prefixes and the suffixes of the first q+1 characters.

Let's first write this in the most straightforward way, and then optimize it.

func Next(str string) []int {
	length := len(str)
	next := make([]int, length) // the next (partial match) array
	for q := 1; q < length; q++ {
		m := 0
		for k := 0; k < q; k++ {
			if str[:k+1] == str[q-k:q+1] && k+1 > m { // a prefix equals a suffix and is the longest so far
				m = k + 1
			}
		}
		next[q] = m // record the partial match value for position q
	}
	return next
}

This code computes the next array in the obvious way, but its complexity is at least O(n^2), which clearly goes against the spirit of KMP. Let's optimize it. The key observation is that, thanks to the structure of next, we never have to fall all the way back to 0 on a mismatch; we can fall back to the longest prefix that is already known to match, and only if that fails fall back again:

func Next(str string) []int {
	length := len(str)
	next := make([]int, length)
	next[0] = 0 // initialize to 0: a single character has no proper prefix/suffix match
	for q := 1; q < length; q++ {
		k := next[q-1] // start from next[q-1] instead of falling all the way back to 0
		for k > 0 && str[k] != str[q] { // mismatch: fall back further
			k = next[k-1] // fall back to the longest border of the first k characters, not straight to 0
		}
		if str[k] == str[q] {
			// the next character matches; the first k characters are already known to match, so extend by one
			next[q] = k + 1
		} else {
			next[q] = 0 // no border ends with str[q]
		}
	}
	return next
}

The improved algorithm runs in O(n) time (roughly at most 2n character comparisons).

This version is not the prettiest, but it is more readable. Some implementations initialize the value to -1 and fold k := next[q-1] and the fallback step together, and also fold next[q] = k + 1 and next[q] = 0 into a single expression. I personally dislike that kind of showing off: it looks concise but hides the idea of the algorithm and hurts readability. For the same reason I avoid clever lambda-style one-liners unless the language actually optimizes them; otherwise they only torment whoever reads the code.

With this table, writing the KMP search itself becomes very simple:

func kmp(str1, str2 string) bool {
	next := Next(str2)
	i := 0 // i is the start of the current alignment in str1
	j := 0 // j is the index into str2
	length := len(str1)
	for i+j < length { // stop once the compared position would run past the end of str1
		if str1[i+j] == str2[j] {
			j++
			if j == len(str2) {
				// full match
				return true
			}
			continue
		}
		if j == 0 { // nothing matched yet at this alignment, just advance i
			i++
		} else {
			i = i + j - next[j-1] // shift = number of matched characters - partial match value
			j = next[j-1]         // j falls back to the recorded border length
		}
	}
	return false
}

The complexity of KMP is O(M+N). Other versions can be written more concisely, but again, for readability I would rather be a bit more verbose.

func main() {
	str1 := "BBC ABCDAB ABCDABCDABDE"
	str := "ABCDABD"
	fmt.Println(kmp(str1, str))
}

Running it on the example above, the search loop executes 15 times in total, and computing the next array takes only 7 iterations.

Here is a record of a kernel bug.

One of our components calls lvconvert to create a storage pool, but QA reported that the process failed and the machine rebooted. Capturing a kdump produced the following log:

[  983.179047] BUG kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f) (Tainted: G    B          ------------ T): Objects remaining in kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f
[ 983.179890] -----------------------------------------------------------------------------

[ 983.180762] INFO: Slab 0xffffea000cae8840 objects=16 used=9 fp=0xffff88032ba21b00 flags=0x2fffff00000080
[ 983.181181] CPU: 3 PID: 25070 Comm: lvconvert Tainted: G B ------------ T 3.10.0-327.el7.x86_64 #1
[ 983.181182] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 983.181183] ffffea000cae8840 00000000c255adbf ffff88018d43baa8 ffffffff816351f1
[ 983.181186] ffff88018d43bb80 ffffffff811beac4 0000000000000020 ffff88018d43bb90
[ 983.181188] ffff88018d43bb40 656a624f00000000 616d657220737463 6e6920676e696e69
[ 983.181190] Call Trace:
[ 983.181195] [<ffffffff816351f1>] dump_stack+0x19/0x1b
[ 983.181203] [<ffffffff811beac4>] slab_err+0xb4/0xe0
[ 983.181206] [<ffffffff811c1f33>] ? __kmalloc+0x1f3/0x230
[ 983.181208] [<ffffffff811c316b>] ? kmem_cache_close+0x12b/0x2f0
[ 983.181210] [<ffffffff811c318c>] kmem_cache_close+0x14c/0x2f0
[ 983.181212] [<ffffffff811c35c4>] __kmem_cache_shutdown+0x14/0x80
[ 983.181215] [<ffffffff8118ceaf>] kmem_cache_destroy+0x3f/0xe0
[ 983.181217] [<ffffffff811d4949>] kmem_cache_destroy_memcg_children+0x89/0xb0
[ 983.181221] [<ffffffff8118ce84>] kmem_cache_destroy+0x14/0xe0
[ 983.181231] [<ffffffff8121705e>] bioset_free+0xce/0x110
[ 983.181250] [<ffffffffa0002fa0>] __dm_destroy+0x1b0/0x340 [dm_mod]
[ 983.181257] [<ffffffffa00047e3>] dm_destroy+0x13/0x20 [dm_mod]
[ 983.181263] [<ffffffffa000a34e>] dev_remove+0x11e/0x180 [dm_mod]
[ 983.181268] [<ffffffffa000a230>] ? dev_suspend+0x250/0x250 [dm_mod]
[ 983.181274] [<ffffffffa000aa25>] ctl_ioctl+0x255/0x500 [dm_mod]
[ 983.181278] [<ffffffff81271b94>] ? SYSC_semtimedop+0x264/0xd10
[ 983.181287] [<ffffffffa000ace3>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 983.181289] [<ffffffff811f1ef5>] do_vfs_ioctl+0x2e5/0x4c0
[ 983.181292] [<ffffffff811e05ee>] ? ____fput+0xe/0x10
[ 983.181294] [<ffffffff811f2171>] SyS_ioctl+0xa1/0xc0
[ 983.181298] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b

It looks like a slab-cache allocation error during lvconvert. At this point there was nothing more to dig into locally, so I turned to Google. The docker project has a similar report (https://github.com/moby/moby/issues/29879); the three affected reports were all on 3.10 kernels, and the replies point to a kernel bug. After upgrading to 4.16.7-1.el7.elrepo.x86_64 the problem went away.

During make build, this error suddenly appeared:

Sending build context to Docker daemon   220 MB
Step 1 : FROM warpdrive:tos-release-1-5
---> 769306738d96
Step 2 : COPY . /go/src/github.com/transwarp/warpdrive/
---> 07c99697b16e
Removing intermediate container 127c0e71a84b
Successfully built 07c99697b16e
/usr/bin/docker: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n".
FATA[0301] exit status 125
make: *** [build] Error 1

Trying to run 07c99697b16e gives the same error. The docker logs show: "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n"

8月 21 15:29:01 zhenghc-tos dockerd[2517]: time="2018-08-21T15:29:01.778212288+08:00" level=error msg="Handler for POST /v1.24/containers/79a0906242e0180dfd7aeff30e9fb179f7b7c37f0ba20533f83b4cd40b409d2a/start returned error: invali
d header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:245: running exec setns process for init caused \\\\\\\"exit status 6\\\\\\\"\\\"\\n\""
8月 21 15:51:08 zhenghc-tos dockerd[2517]: time="2018-08-21T15:51:08.075631868+08:00" level=error msg="containerd: start container" error="oci runtime error: container_linux.go:247: starting container process caused \"process_linux
.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n" id=b93539a02f27cc1ad0114552643d8e0c942fba053dc8cd3980cb8f24cf61ff25
8月 21 15:51:08 zhenghc-tos dockerd[2517]: time="2018-08-21T15:51:08.092002257+08:00" level=error msg="Create container failed with error: invalid header field value \"oci runtime error: container_linux.go:247: starting container p
rocess caused \\\"process_linux.go:245: running exec setns process for init caused \\\\\\\"exit status 6\\\\\\\"\\\"\\n\""

Docker felt sluggish, so I removed a large number of images, deleted all stopped containers, and dropped the system caches, and the problem went away. The root cause is still unclear: dropping the caches did not actually change memory usage much, and docker reported no resource problems. Another odd point: only this image could not be entered, while other images could still be started with run -it, and the COPY . /go/src/github.com/transwarp/warpdrive/ step took unusually long.

https://github.com/opencontainers/runc/issues/1130 tracks a related problem and points at runc.

Questions:

  1. In k8s, what do "0.5" and "100m" mean when set as a CPU request or limit?
  2. On a 4-core, 8-thread CPU, how much does a CPU limit of "1" actually allow?
  3. What is the practical difference between a CPU request and a CPU limit?
  4. How does the cgroup enforce these limits?

Linux Cgroup

As is well known, kubernetes and docker limit cpu and memory usage to isolate resource consumption, and the underlying mechanism is the linux cgroup. Memory is comparatively simple, so this article focuses on the cpu cgroup.

Inside cgroups, the CPU-related subsystems are cpuset, cpuacct, and cpu.

cpuset is mainly for CPU affinity: it can restrict the processes in a cgroup to run only on specified CPUs (or keep them off specified CPUs), and it can also set memory affinity. Affinity is only needed in fairly special situations, so it is not covered here.

cpuacct holds CPU usage statistics for the cgroup; there is not much to it, so see its documentation if you are interested. It is not covered here either.

This article covers only the cpu subsystem: how to cap a cgroup's CPU usage and how to set its weight relative to other cgroups.

Creating a child cgroup

On ubuntu, systemd has already mounted the cpu subsystem for us; we only need to create a subdirectory under the corresponding mount point.

# From the output we can see that cpuset is mounted at /sys/fs/cgroup/cpuset,
# while cpu and cpuacct are mounted together at /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:~$ mount|grep cpu
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)

# Enter /sys/fs/cgroup/cpu,cpuacct and create a child cgroup
dev@ubuntu:~$ cd /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ sudo mkdir test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ cd test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ ls
cgroup.clone_children cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.stat tasks
cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.shares notify_on_release

We only need to pay attention to the files whose names start with cpu.

cpu subsystem

The cpu subsystem schedules CPU time to cgroups. There are currently two scheduling policies:

  • Completely Fair Scheduler (CFS): divides CPU time into proportional shares and distributes them to cgroups by weight. CFS supports both relative weights and absolute caps; this is the policy k8s currently uses.
  • Real-Time scheduler (RT): similar to the absolute cap in CFS, but only for real-time tasks. Its parameters can be adjusted at runtime.
  1. Completely Fair Scheduler (CFS):
  • Hard (absolute) control parameters:

    cpu.cfs_period_us & cpu.cfs_quota_us

cfs_period_us sets the length of the period, and cfs_quota_us sets how much CPU time the cgroup may use within that period; together they cap CPU usage. Both are in microseconds (us). cfs_period_us ranges from 1 millisecond (ms) to 1 second (s); cfs_quota_us only needs to be greater than 1 ms, and a value of -1 (the default) means no CPU time limit. A few examples follow, with a small programmatic sketch after them:

1. Limit to 1 CPU (every 250ms, 250ms of CPU time may be used)
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit to 2 CPUs/cores (every 500ms, 1000ms of CPU time may be used, i.e. two cores)
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */

3. Limit to 20% of 1 CPU (every 50ms, 10ms of CPU time may be used, i.e. 20% of one core)
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
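
The same cap can be set programmatically. A minimal sketch (assuming the cgroup v1 cpu controller is mounted at /sys/fs/cgroup/cpu,cpuacct and the test cgroup created above exists); it just writes the two files, which is also what runc does further below:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// setCFSLimit writes cpu.cfs_quota_us and cpu.cfs_period_us under cgroupPath.
func setCFSLimit(cgroupPath string, quotaUs, periodUs int64) error {
    files := map[string]int64{
        "cpu.cfs_quota_us":  quotaUs,
        "cpu.cfs_period_us": periodUs,
    }
    for name, value := range files {
        p := filepath.Join(cgroupPath, name)
        if err := os.WriteFile(p, []byte(fmt.Sprintf("%d", value)), 0644); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // 10ms of CPU time every 50ms: 20% of one core, matching example 3 above.
    if err := setCFSLimit("/sys/fs/cgroup/cpu,cpuacct/test", 10000, 50000); err != nil {
        fmt.Println(err)
    }
}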

cpu.stat:

It contains the following three statistics:

  • nr_periods: how many cpu.cfs_period_us periods have elapsed
  • nr_throttled: in how many of those periods the cgroup was throttled (its processes used up their quota within the period)
  • throttled_time: the total time (in nanoseconds) the cgroup's processes have been throttled

  • Relative control parameter: cpu.shares

shares sets a relative CPU weight and applies across all CPUs (cores); the default value is 1024. Suppose there are two cgroups, A and B: A's shares are 1024 and B's are 512, so A gets 1024/(1024+512) ≈ 66% of the CPU and B gets about 33%. shares has two characteristics:

  • If A is not busy and does not use its 66% of CPU time, the leftover time is handed to B, so B's CPU usage can exceed 33%.
  • If a new cgroup C with shares of 1024 is added, A's share becomes 1024/(1024+512+1024) = 40% and B's becomes 20%.

These two characteristics show that:

  • When the CPU is idle, shares have essentially no effect; they only kick in when the CPU is busy, which is a plus.
  • Because a shares value only becomes a limit when compared against every other cgroup's value, and on a machine running many containers the set of cgroups keeps changing, the effective share keeps changing too: you may set a high value, but someone else may set an even higher one, so shares cannot control CPU usage precisely (see the sketch below).
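
To make the arithmetic concrete, here is a tiny sketch (the shares values are just the A/B/C example above, not read from any real cgroup) that computes each cgroup's fraction of the CPU when all of them are busy:

package main

import "fmt"

// shareFractions returns each cgroup's CPU fraction when every cgroup is busy:
// its own shares divided by the sum of all shares.
func shareFractions(shares map[string]int64) map[string]float64 {
    var total int64
    for _, s := range shares {
        total += s
    }
    fractions := make(map[string]float64, len(shares))
    for name, s := range shares {
        fractions[name] = float64(s) / float64(total)
    }
    return fractions
}

func main() {
    fmt.Println(shareFractions(map[string]int64{"A": 1024, "B": 512}))            // A ≈ 0.67, B ≈ 0.33
    fmt.Println(shareFractions(map[string]int64{"A": 1024, "B": 512, "C": 1024})) // A = 0.4, B = 0.2, C = 0.4
}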
  2. Real-Time scheduler (RT):

cpu.rt_period_us: the period length in us, same meaning as in cfs.

cpu.rt_runtime_us: the maximum runtime within one period, in us. For example, with rt_runtime_us = 200000 and rt_period_us = 1000000, a node with 2 CPUs can be occupied for 2 * 0.2 = 0.4s of real-time work per second. This is also an absolute limit.

A running example:

Using cfs as the example.

# Continue with the child cgroup created above: test
# Allow it only 20% of one CPU
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 50000 > cpu.cfs_period_us"
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 10000 > cpu.cfs_quota_us"

# Add the current bash process to the cgroup
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ echo $$
5456
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 5456 > cgroup.procs"

# Start a busy loop in this bash to burn CPU; unconstrained it would use 100% (one full core)
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ while :; do echo test > /dev/null; done

#-------------------------- open a new shell window ----------------------
# top shows PID 5456 at roughly 20% CPU, so the limit is in effect
# The system-wide %us+%sy is around 10% because my test machine has two cores,
# so overall CPU usage is about 10%
dev@ubuntu:~$ top
Tasks: 139 total, 2 running, 137 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.6 us, 6.2 sy, 0.0 ni, 88.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 499984 total, 15472 free, 81488 used, 403024 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 383332 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5456 dev 20 0 22640 5472 3524 R 20.3 1.1 0:04.62 bash

# The throttling statistics now show the effect
dev@ubuntu:~$ cat /sys/fs/cgroup/cpu,cpuacct/test/cpu.stat
nr_periods 1436
nr_throttled 1304
throttled_time 51542291833

The kubernetes resource control mechanism

kubernetes uses runc as its runtime, and runc configures the cpu subsystem's scheduling through CpuGroup. Set() does the initial configuration and Apply() is used to change settings dynamically. The principle is simply writing the right entries and numbers into the files listed above; nothing surprising.

Memory's Set():

func (s *MemoryGroup) Set(path string, cgroup *configs.Cgroup) error {
    if err := setMemoryAndSwap(path, cgroup); err != nil {
        return err
    }

    if cgroup.Resources.KernelMemory != 0 {
        if err := setKernelMemory(path, cgroup.Resources.KernelMemory); err != nil {
            return err
        }
    }

    if cgroup.Resources.MemoryReservation != 0 {
        if err := writeFile(path, "memory.soft_limit_in_bytes", strconv.FormatInt(cgroup.Resources.MemoryReservation, 10)); err != nil {
            return err
        }
    }

    if cgroup.Resources.KernelMemoryTCP != 0 {
        if err := writeFile(path, "memory.kmem.tcp.limit_in_bytes", strconv.FormatInt(cgroup.Resources.KernelMemoryTCP, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.OomKillDisable {
        if err := writeFile(path, "memory.oom_control", "1"); err != nil {
            return err
        }
    }
    if cgroup.Resources.MemorySwappiness == nil || int64(*cgroup.Resources.MemorySwappiness) == -1 {
        return nil
    } else if *cgroup.Resources.MemorySwappiness <= 100 {
        if err := writeFile(path, "memory.swappiness", strconv.FormatUint(*cgroup.Resources.MemorySwappiness, 10)); err != nil {
            return err
        }
    } else {
        return fmt.Errorf("invalid value:%d. valid memory swappiness range is 0-100", *cgroup.Resources.MemorySwappiness)
    }

    return nil
}

Memory's Apply():

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // only kernel memory accounting can be set up at this point
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    ...
}

func EnableKernelMemoryAccounting(path string) error {
    // Check if kernel memory is enabled
    // We have to limit the kernel memory here as it won't be accounted at all
    // until a limit is set on the cgroup and limit cannot be set once the
    // cgroup has children, or if there are already tasks in the cgroup.
    for _, i := range []int64{1, -1} {
        if err := setKernelMemory(path, i); err != nil {
            return err
        }
    }
    return nil
}

cpu's Set():

func (s *CpuGroup) Set(path string, cgroup *configs.Cgroup) error {
    // CFS settings
    if cgroup.Resources.CpuShares != 0 {
        if err := writeFile(path, "cpu.shares", strconv.FormatUint(cgroup.Resources.CpuShares, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuPeriod != 0 {
        if err := writeFile(path, "cpu.cfs_period_us", strconv.FormatUint(cgroup.Resources.CpuPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuQuota != 0 {
        if err := writeFile(path, "cpu.cfs_quota_us", strconv.FormatInt(cgroup.Resources.CpuQuota, 10)); err != nil {
            return err
        }
    }
    // RT settings
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }

    return nil
}

func (s *CpuGroup) SetRtSched(path string, cgroup *configs.Cgroup) error {
    if cgroup.Resources.CpuRtPeriod != 0 {
        if err := writeFile(path, "cpu.rt_period_us", strconv.FormatUint(cgroup.Resources.CpuRtPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuRtRuntime != 0 {
        if err := writeFile(path, "cpu.rt_runtime_us", strconv.FormatInt(cgroup.Resources.CpuRtRuntime, 10)); err != nil {
            return err
        }
    }
    return nil
}

cpu's Apply() can only set the RT parameters:

func (s *CpuGroup) Apply(d *cgroupData) error {
    // We always want to join the cpu group, to allow fair cpu scheduling
    // on a container basis
    path, err := d.path("cpu")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return s.ApplyDir(path, d.config, d.pid)
}

func (s *CpuGroup) ApplyDir(path string, cgroup *configs.Cgroup, pid int) error {
    // This might happen if we have no cpu cgroup mounted.
    // Just do nothing and don't fail.
    if path == "" {
        return nil
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return err
    }
    // We should set the real-Time group scheduling settings before moving
    // in the process because if the process is already in SCHED_RR mode
    // and no RT bandwidth is set, adding it will fail.
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }
    // because we are not using d.join we need to place the pid into the procs file
    // unlike the other subsystems
    if err := cgroups.WriteCgroupProc(path, pid); err != nil {
        return err
    }

    return nil
}

K8s Limits & Requests code analysis

The struct that manages cgroups in k8s lives in k8s.io/kubernetes/pkg/kubelet/cm/cgroup_manager_linux.go:

// cgroupManagerImpl implements the CgroupManager interface.
// Its a stateless object which can be used to
// update,create or delete any number of cgroups
// It uses the Libcontainer raw fs cgroup manager for cgroup management.
type cgroupManagerImpl struct {
    // subsystems holds information about all the
    // mounted cgroup subsystems on the node
    subsystems *CgroupSubsystems
    // simplifies interaction with libcontainer and its cgroup managers
    adapter *libcontainerAdapter
}

Let's look at the Update method:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    ...
    // extract the cgroup resource parameters
    resourceConfig := cgroupConfig.ResourceParameters
    resources := m.toResources(resourceConfig)

    cgroupPaths := m.buildCgroupPaths(cgroupConfig.Name)

    // work out the location in the cgroupfs
    abstractCgroupFsName := string(cgroupConfig.Name)
    abstractParent := CgroupName(path.Dir(abstractCgroupFsName))
    abstractName := CgroupName(path.Base(abstractCgroupFsName))

    driverParent := m.adapter.adaptName(abstractParent, false)
    driverName := m.adapter.adaptName(abstractName, false)

    // for the systemd driver, use the full adapted name
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        driverName = m.adapter.adaptName(cgroupConfig.Name, false)
    }

    // initialize the libcontainer cgroup config
    libcontainerCgroupConfig := &libcontainerconfigs.Cgroup{
        Name:      driverName,
        Parent:    driverParent,
        Resources: resources,
        Paths:     cgroupPaths,
    }

    // apply the cgroup config
    if err := setSupportedSubsystems(libcontainerCgroupConfig); err != nil {
        return fmt.Errorf("failed to set supported cgroup subsystems for cgroup %v: %v", cgroupConfig.Name, err)
    }
    return nil
}

First, how the parameters are extracted:

func (m *cgroupManagerImpl) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
    resources := &libcontainerconfigs.Resources{}
    if resourceConfig == nil {
        return resources
    }
    if resourceConfig.Memory != nil {
        resources.Memory = *resourceConfig.Memory
    }
    if resourceConfig.CpuShares != nil {
        resources.CpuShares = *resourceConfig.CpuShares
    }
    if resourceConfig.CpuQuota != nil {
        resources.CpuQuota = *resourceConfig.CpuQuota
    }
    if resourceConfig.CpuPeriod != nil {
        resources.CpuPeriod = *resourceConfig.CpuPeriod
    }

    // huge page parameter extraction, skipped
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.HugePages) {
        ...
    }
    return resources
}

Next, let's see where CpuShares, CpuQuota, and CpuPeriod come from:

// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod) *ResourceConfig {
    // sum requests and limits.
    reqs, limits := resource.PodRequestsAndLimits(pod)

    cpuRequests := int64(0)
    cpuLimits := int64(0)
    memoryLimits := int64(0)
    if request, found := reqs[v1.ResourceCPU]; found {
        cpuRequests = request.MilliValue()
    }
    if limit, found := limits[v1.ResourceCPU]; found {
        cpuLimits = limit.MilliValue()
    }
    if limit, found := limits[v1.ResourceMemory]; found {
        memoryLimits = limit.Value()
    }

    // convert to CFS values
    cpuShares := MilliCPUToShares(cpuRequests)
    cpuQuota, cpuPeriod := MilliCPUToQuota(cpuLimits)
    ...
}

// MilliCPUToShares converts the milliCPU to CFS shares.
func MilliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        // Docker converts zero milliCPU to unset, which maps to kernel default
        // for unset: 1024. Return 2 here to really match kernel default for
        // zero milliCPU.
        return MinShares
    }
    // Conceptually (milliCPU / milliCPUToCPU) * sharesPerCPU, but factored to improve rounding.
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return uint64(shares)
}

// MilliCPUToQuota converts milliCPU to CFS quota and period values.
func MilliCPUToQuota(milliCPU int64) (quota int64, period uint64) {
    // CFS quota is measured in two values:
    // - cfs_period_us=100ms (the amount of time to measure usage across)
    // - cfs_quota=20ms (the amount of cpu time allowed to be used across a period)
    // so in the above example, you are limited to 20% of a single CPU
    // for multi-cpu environments, you just scale equivalent amounts

    if milliCPU == 0 {
        return
    }

    // we set the period to 100ms by default
    period = QuotaPeriod

    // we then convert your milliCPU to a value normalized over a period
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU

    // quota needs to be a minimum of 1ms.
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }

    return
}

const (
    // Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc
    MinShares     = 2
    SharesPerCPU  = 1024
    MilliCPUToCPU = 1000

    // 100000 is equivalent to 100ms
    QuotaPeriod    = 100000
    MinQuotaPeriod = 1000
)

And the code that applies these settings:

func setSupportedSubsystems(cgroupConfig *libcontainerconfigs.Cgroup) error {
    for _, sys := range getSupportedSubsystems() {
        if _, ok := cgroupConfig.Paths[sys.Name()]; !ok {
            return fmt.Errorf("Failed to find subsystem mount for subsystem: %v", sys.Name())
        }
        // calls the cgroup Set() methods shown earlier
        if err := sys.Set(cgroupConfig.Paths[sys.Name()], cgroupConfig); err != nil {
            return fmt.Errorf("Failed to set config for supported subsystems : %v", err)
        }
    }
    return nil
}

The code makes the mapping clear:

Request -> cpu.shares

Limits -> cpu.quota (CFS or RT)

QuotaPeriod is 100ms

So a Request of 0.5 (0.5 * 1024 = 512) is equivalent to 500m (500 * 1024 / 1000 = 512): cpu.shares is 512.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.shares 
512

When the Limit is set to 1 (i.e. 1000m; 1000 * 100000 / 1000 = 100000), quota = 100000 and period = 100000.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_quota_us 
100000
[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_period_us
100000

At this point the whole path from the k8s spec down to the cgroup files is accounted for.
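
To double-check the arithmetic, here is a minimal standalone sketch (local copies of the conversion helpers quoted above, using the same constants) that reproduces 500m -> shares 512 and 1000m -> quota 100000 with period 100000:

package main

import "fmt"

const (
    MinShares      = 2
    SharesPerCPU   = 1024
    MilliCPUToCPU  = 1000
    QuotaPeriod    = 100000
    MinQuotaPeriod = 1000
)

// milliCPUToShares mirrors MilliCPUToShares quoted above.
func milliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        return MinShares
    }
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return uint64(shares)
}

// milliCPUToQuota mirrors MilliCPUToQuota quoted above.
func milliCPUToQuota(milliCPU int64) (quota int64, period int64) {
    if milliCPU == 0 {
        return
    }
    period = QuotaPeriod
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }
    return
}

func main() {
    fmt.Println(milliCPUToShares(500)) // request 0.5 / 500m -> 512
    quota, period := milliCPUToQuota(1000)
    fmt.Println(quota, period) // limit 1 / 1000m -> 100000 100000
}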

Points worth noting

The CPU resource is measured in cpu units. One cpu, in Kubernetes, is equivalent to:

1 AWS vCPU
1 GCP Core
1 Azure vCore
1 Hyperthread on a bare-metal Intel processor with Hyperthreading

According to the docs, then, a 4-core, 8-thread machine is 8 CPUs as far as k8s is concerned.

If only a request is set and no limit, the pod may use as much of the node's resources as it wants, provided no other pod claims them. If the cluster administrator has configured a LimitRange, a pod without an explicit limit gets the LimitRange's default value.

References

  1. https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/#what-if-you-specify-a-container-s-request-but-not-its-limit
  2. https://segmentfault.com/a/1190000008323952
  3. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu

Go's sort implementation lives in /src/go4.org/sort/sort.go.

func Sort(data Interface) {
    n := data.Len()
    if fs, ok := data.(*funcs); ok {
        quickSort_func(fs.lessSwap, 0, n, maxDepth(n))
    } else {
        quickSort(data, 0, n, maxDepth(n))
    }
}

func quickSort(data Interface, a, b, maxDepth int) {
    for b-a > 12 { // Use ShellSort for slices <= 12 elements
        if maxDepth == 0 {
            heapSort(data, a, b)
            return
        }
        maxDepth--
        mlo, mhi := doPivot(data, a, b)
        // Avoiding recursion on the larger subproblem guarantees
        // a stack depth of at most lg(b-a).
        if mlo-a < b-mhi {
            quickSort(data, a, mlo, maxDepth)
            a = mhi // i.e., quickSort(data, mhi, b)
        } else {
            quickSort(data, mhi, b, maxDepth)
            b = mlo // i.e., quickSort(data, a, mlo)
        }
    }
    if b-a > 1 {
        // Do ShellSort pass with gap 6
        // It could be written in this simplified form cause b-a <= 12
        for i := a + 6; i < b; i++ {
            if data.Less(i, i-6) {
                data.Swap(i, i-6)
            }
        }
        insertionSort(data, a, b)
    }
}

// Insertion sort
func insertionSort(data Interface, a, b int) {
    for i := a + 1; i < b; i++ {
        for j := i; j > a && data.Less(j, j-1); j-- {
            data.Swap(j, j-1)
        }
    }
}

quickSort and quickSort_func are essentially the same, except the latter uses a user-supplied comparison (lessSwap); I won't repeat it here.

The main body is quickSort, so stability is not guaranteed. When the slice has 12 or fewer elements it falls back to Shellsort, but the Shellsort does only a single pass with gap 6 and then falls back again to a plain insertion sort. That part is very simple: i walks forward, j walks backward from i, comparing j-1 and j each time and swapping the smaller element forward.
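
For reference, both code paths above are reachable through the familiar public API (a minimal sketch, assuming this package mirrors the standard library's sort entry points):

package main

import (
    "fmt"
    "sort"
)

func main() {
    // sort.Sort takes an Interface value and goes through quickSort.
    xs := []int{5, 2, 9, 1, 7}
    sort.Sort(sort.IntSlice(xs))
    fmt.Println(xs) // [1 2 5 7 9]

    // sort.Slice wraps the less function into a lessSwap and goes through quickSort_func.
    words := []string{"pear", "fig", "apple"}
    sort.Slice(words, func(i, j int) bool { return words[i] < words[j] })
    fmt.Println(words) // [apple fig pear]
}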

When the slice has more than 12 elements, the choice between heapsort and quicksort depends on whether maxDepth has dropped to 0. maxDepth is computed as:

// maxDepth returns a threshold at which quicksort should switch
// to heapsort. It returns 2*ceil(lg(n+1)).
func maxDepth(n int) int {
    var depth int
    for i := n; i > 0; i >>= 1 {
        depth++
    }
    return depth * 2
}

It returns twice the height of a balanced tree over n elements, i.e. 2*ceil(lg(n+1)). Each round of quicksort partitioning (doPivot) decrements this value; once it reaches 0 the recursion has gone too deep, which means the pivots are degenerating, and if the remaining range is still longer than 12 it is more efficient to switch to heapsort, which guarantees O(n log n).
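
A quick check of the formula (assuming the maxDepth function above and fmt are in scope):

func main() {
    // 13 -> 6 -> 3 -> 1 takes 4 halvings, so the threshold is 2*4 = 8.
    fmt.Println(maxDepth(13))  // 8
    fmt.Println(maxDepth(100)) // 14
}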

Let's look at the quicksort body first and come back to heapsort later. Partitioning is the core idea of quicksort, and it is implemented in doPivot:

func doPivot(data Interface, lo, hi int) (midlo, midhi int) {
    m := lo + (hi-lo)/2 // written this way to avoid integer overflow
    if hi-lo > 40 { // if the range is longer than 40, use Tukey's ninther
        // Tukey's ``Ninther,'' median of three medians of three.
        s := (hi - lo) / 8
        medianOfThree(data, lo, lo+s, lo+2*s)
        medianOfThree(data, m, m-s, m+s)
        medianOfThree(data, hi-1, hi-1-s, hi-1-2*s)
    }
    // medianOfThree moves the median of data[lo], data[m], data[hi-1] into data[lo], the pivot position
    medianOfThree(data, lo, m, hi-1)

    // Invariants are:
    // data[lo] = pivot (set up by ChoosePivot)
    // data[lo < i < a] < pivot
    // data[a <= i < b] <= pivot
    // data[b <= i < c] unexamined
    // data[c <= i < hi-1] > pivot
    // data[hi-1] >= pivot
    pivot := lo
    a, c := lo+1, hi-1

    // advance a to the first element that is not less than the pivot
    for ; a < c && data.Less(a, pivot); a++ {
    }
    b := a
    for {
        for ; b < c && !data.Less(pivot, b); b++ { // data[b] <= pivot
        }
        for ; b < c && data.Less(pivot, c-1); c-- { // data[c-1] > pivot
        }
        if b >= c {
            break
        }
        // data[b] > pivot; data[c-1] <= pivot
        data.Swap(b, c-1)
        b++
        c--
    }
    // If hi-c<3 then there are duplicates (by property of median of nine).
    // Let's be a bit more conservative, and set border to 5.
    protect := hi-c < 5
    if !protect && hi-c < (hi-lo)/4 {
        // Lets test some points for equality to pivot
        dups := 0
        if !data.Less(pivot, hi-1) { // data[hi-1] = pivot
            data.Swap(c, hi-1)
            c++
            dups++
        }
        if !data.Less(b-1, pivot) { // data[b-1] = pivot
            b--
            dups++
        }
        // m-lo = (hi-lo)/2 > 6
        // b-lo > (hi-lo)*3/4-1 > 8
        // ==> m < b ==> data[m] <= pivot
        if !data.Less(m, pivot) { // data[m] = pivot
            data.Swap(m, b-1)
            b--
            dups++
        }
        // if at least 2 points are equal to pivot, assume skewed distribution
        protect = dups > 1
    }
    if protect {
        // Protect against a lot of duplicates
        // Add invariant:
        // data[a <= i < b] unexamined
        // data[b <= i < c] = pivot
        for {
            for ; a < b && !data.Less(b-1, pivot); b-- { // data[b] == pivot
            }
            for ; a < b && data.Less(a, pivot); a++ { // data[a] < pivot
            }
            if a >= b {
                break
            }
            // data[a] == pivot; data[b-1] < pivot
            data.Swap(a, b-1)
            a++
            b--
        }
    }
    // Swap pivot into middle
    data.Swap(pivot, b-1)
    return b - 1, c
}

// medianOfThree moves the median of the three values data[m0], data[m1], data[m2] into data[m1].
func medianOfThree(data Interface, m1, m0, m2 int) {
    // sort 3 elements
    if data.Less(m1, m0) {
        data.Swap(m1, m0)
    }
    // data[m0] <= data[m1]
    if data.Less(m2, m1) {
        data.Swap(m2, m1)
        // data[m0] <= data[m2] && data[m1] < data[m2]
        if data.Less(m1, m0) {
            data.Swap(m1, m0)
        }
    }
    // now data[m0] <= data[m1] <= data[m2]
}