Linux Networking Deep Dive: Introduction and Roadmap
Complete guide to Linux kernel networking - from hardware interrupts to application delivery. Learn packet flow, debugging techniques, and performance optimization.
When you run top, you can see the wa value. What does it mean, and how is it calculated?
%Cpu(s): 0.1 us, 2.4 sy, 0.0 ni, 93.6 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
Values related to processor utilization are displayed on the third line. They provide insight into exactly what the CPUs are doing.
- us: percent of time spent running user processes.
- sy: percent of time spent running the kernel.
- ni: percent of time spent running processes with manually configured nice values.
- id: percent of time spent idle (if high, the CPUs are mostly unoccupied).
- wa: percent of time spent waiting for I/O (if high, the CPU is often blocked on I/O access).
- hi: percent of time spent handling hardware interrupts.
- si: percent of time spent handling software interrupts.
- st: percent of time a virtual CPU spends waiting for access to a physical CPU.

In the context of the top command on Unix-like operating systems, the wa field represents the percentage of time the CPU spends waiting for I/O operations to complete. It stands for "waiting for I/O."
The calculation includes the time the CPU is idle while waiting for data to be read from or written to storage devices such as hard drives or SSDs. High values in the “wa” field may indicate that the system is spending a significant amount of time waiting for I/O operations to complete, which could be a bottleneck if not addressed.
The formula for calculating the “wa” value is as follows:
wa% = (time spent waiting for I/O / total CPU time) × 100
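The arithmetic above can be sketched in Go. The sample /proc/stat lines below are invented and the function names are mine, not from top's source; the point is that top samples the cumulative counters twice and divides the iowait delta by the total delta:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPULine parses the cumulative jiffies from a /proc/stat "cpu" line:
// user nice system idle iowait irq softirq steal
func parseCPULine(line string) []uint64 {
	fields := strings.Fields(line)[1:] // drop the "cpu" label
	out := make([]uint64, len(fields))
	for i, f := range fields {
		v, _ := strconv.ParseUint(f, 10, 64)
		out[i] = v
	}
	return out
}

// waPercent computes the iowait percentage between two samples:
// wa% = (delta iowait / delta total) * 100
func waPercent(prev, cur []uint64) float64 {
	var prevTotal, curTotal uint64
	for _, v := range prev {
		prevTotal += v
	}
	for _, v := range cur {
		curTotal += v
	}
	dIOWait := cur[4] - prev[4] // iowait is the 5th field
	dTotal := curTotal - prevTotal
	if dTotal == 0 {
		return 0
	}
	return float64(dIOWait) / float64(dTotal) * 100
}

func main() {
	// Two hypothetical snapshots of the aggregate "cpu" line.
	prev := parseCPULine("cpu 100 0 2400 93600 3900 0 0 0")
	cur := parseCPULine("cpu 101 0 2424 94536 3939 0 0 0")
	fmt.Printf("wa = %.1f%%\n", waPercent(prev, cur))
}
```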
We can also use vmstat 1 to watch the I/O wait state of the system, though here the Procs columns report counts of processes (b is the number of processes blocked in uninterruptible sleep), not percentages:
# vmstat 1
Procs
In summary, a high "wa" value in the output of top suggests that your system is spending a considerable amount of time waiting for I/O operations to complete, which could impact overall system performance. Investigating and optimizing disk I/O can be beneficial in such cases, possibly by improving disk performance, optimizing file systems, or identifying and addressing specific I/O-intensive processes. We can use iotop to find the processes with high or slow I/O in the system:
Current DISK READ: 341.78 M/s | Current DISK WRITE: 340.53 M/s
Running strace on the top command shows that the wa value comes from /proc/stat. The /proc/stat file is produced by the proc ops, and the iowait time is read from kernel_cpustat.cpustat[CPUTIME_IOWAIT], according to the code in fs/proc/stat.c.
static const struct proc_ops stat_proc_ops = {
In the Linux kernel, the calculation of the “wa” (waiting for I/O) value that is reported by tools like top is handled within the kernel’s scheduler. The specific code responsible for updating the CPU usage statistics can be found in the kernel/sched/cputime.c file.
One of the key functions related to this is account_idle_time().
/*
This function is part of the kernel's scheduler and is responsible for updating the various CPU time statistics, including the time spent waiting for I/O. It takes into account different CPU states, such as idle time and time spent waiting for I/O. When threads are blocked waiting for I/O, the idle CPU time is accumulated into CPUTIME_IOWAIT.
Here is a basic outline of how the “wa” time is accounted for in the Linux kernel:
# block/fops.c
# block/blk-core.c
# kernel/sched/core.c
# kernel/time/timer.c
static int
Investigate the blktrace implementation to understand the trace hooks of the block I/O stack.
blktrace uses the Linux kernel tracepoint infrastructure to track requests in flight through the block I/O stack. It traces everything that goes through to block devices, while recording timing information.
First, we use the dd command to generate a write I/O (read I/O works the same way).
// write to a file with direct io by dd command
Use blktrace and blkparse to analyze the results:
# blktrace -d /dev/vda -o res

The following table shows the various actions which may be
The output is long, so only part of it is pasted here. I traced different sectors and behaviors of the block layer.
The blkparse output still makes it hard to see the overall metrics and statistics, so we can use btt to get statistics for the whole trace, as a table and a picture. This is part of the results:
# blkparse -q -i res -d sda.blk.bin
Q------->G------------>I--------->M------------------->D----------------------------->C
Q2Q — time between requests sent to the block layer
Q2G — time from a block I/O is queued to the time it gets a request allocated for it
G2I — time from a request is allocated to the time it is inserted into the device’s queue
Q2M — time from a block I/O is queued to the time it gets merged with an existing request
I2D — time from a request is inserted into the device’s queue to the time it is actually issued to the device (time the I/O is “idle” on the request queue)
M2D — time from a block I/O is merged with an existing request until the request is issued to the device
D2C — service time of the request by the device (time the I/O is “active” in the driver and on the device)
Q2C — total time spent in the block layer for a request
In fact, blktrace's implementation uses various Linux kernel tracepoints to trace the different phases of an I/O. Here are some of the key tracepoints used by blktrace:
- block_bio_queue: hit when a bio is queued. (Q)
- block_getrq: hit when get_request() is called to allocate a request. (G)
- block_sleeprq: hit when a process goes to sleep waiting for a request to become available. (S)
- block_plug: hit when a plug is inserted into the request queue. (P)
- block_unplug: hit when a plug is removed from the request queue. (U)
- block_rq_insert: hit when a request is inserted into the request queue. (I)
- block_rq_issue: hit when a request is issued to the device. (D)
- block_rq_complete: hit when a request is completed. (C)
- block_bio_backmerge: hit when a bio is merged with the last bio in a request. (M)
- block_bio_frontmerge: hit when a bio is merged with the first bio in a request. (F)
- block_bio_bounce: hit when a bio is bounced. (B)

To dig deeper into the block layer, I read the code and found the following:
/*


Bio -> Request -> plug request list -> staging request queue in scheduler -> hardware request queue


These pictures record the main code and function flow from the filesystem through the block layer down to the driver layer. They also show the mapping from each tracepoint to the action letter in the blktrace output. With these tracepoints, we can understand the block I/O flow much more clearly.
environment:
5.15 kernel
steps:
1. Enable cgroup v2 and bfq, and set bfq as the scheduler on the target device.
2. Enable io.cost via /sys/fs/cgroup/io.cost.qos, and set the cgroup io.weight.
3. Start two or more fio processes reading or writing the target device, including at least 1 direct and 1 buffered I/O, each in a different cgroup.
4. Wait 10-30 minutes.
Kernel Dump 1 (Set hardlockup_panic=1 by sysctl):
[ 1408.828715] Call Trace:
Kernel dump 2:
[60977.577690] BUG: spinlock recursion on CPU#21, fio/11770
Kernel dump 3 (enable lockdep):
[ 4398.422037] WARNING: inconsistent lock state
Lock chain:
spin_lock_irq(&bfqd->lock);
blk_mq_sched_try_merge:
adjust_inuse_and_calc_cost:
spin_lock_irq(&ioc->lock);
spin_unlock_irq(&ioc->lock);
spin_lock_irqsave(&bfqd->lock, flags);
…
spin_unlock_irqrestore(&bfqd->lock, flags);
A process running on a CPU holds spin_lock_irq(&bfqd->lock) and then takes a second lock with spin_lock_irq(&ioc->lock). After finishing the calculation, it releases the second lock with spin_unlock_irq(&ioc->lock); spin_unlock_irq unconditionally re-enables interrupts. An IRQ then preempts the CPU, and when the handler reaches bfq_finish_requeue_request it tries to take the first lock with spin_lock_irqsave(&bfqd->lock, flags). That lock is still held by the interrupted process on the same CPU, so this is a deadlock. The lockdep module can detect this case and report the risk in dmesg (stack 3).
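The scenario can be modeled very loosely with ordinary Go mutexes. This is a toy illustration, not kernel code: TryLock stands in for an IRQ handler that, on a real spinlock, would spin forever:

```go
package main

import (
	"fmt"
	"sync"
)

// Toy model of the bfqd->lock self-deadlock. bfqdLock is an ordinary
// Go mutex, not a kernel spinlock, and the "IRQ handler" is just a
// function called while the lock is held, as an interrupt would run
// after spin_unlock_irq re-enables interrupts on the same CPU.
var bfqdLock sync.Mutex

// irqHandler models bfq_finish_requeue_request() running in interrupt
// context: it needs bfqd->lock, which the interrupted task still owns.
func irqHandler() bool {
	if !bfqdLock.TryLock() {
		return false // a real spinlock would spin here forever: deadlock
	}
	bfqdLock.Unlock()
	return true
}

func main() {
	bfqdLock.Lock() // process context: spin_lock_irq(&bfqd->lock)
	// ... adjust_inuse_and_calc_cost() re-enabled IRQs here ...
	if !irqHandler() {
		fmt.Println("IRQ cannot take bfqd->lock held by the preempted task: deadlock")
	}
	bfqdLock.Unlock()
}
```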
This kernel patch fixes the issue in kernel 6.3.13:
https://lwn.net/Articles/937933/
https://www.spinics.net/lists/stable/msg669695.html
In this patch, the fix replaces spin_lock_irq with spin_lock_irqsave in the function adjust_inuse_and_calc_cost, so that unlocking restores the previous interrupt state instead of unconditionally re-enabling interrupts.
We will cherry-pick the patch into our 5.15 kernel.
RAID is a technology for improving the performance and reliability of data storage. The full name is Redundant Array of Inexpensive Disks.
mdadm is currently the standard RAID management tool on Linux. It supports multiple RAID levels and can create and manage arrays, add and remove devices, and show detailed information about RAID devices. It is very powerful.
$yum install mdadm -y
devicemapper can also create RAID devices.
Use dd to create four image files to serve as loop devices:
$dd if=/dev/zero of=disk1.img bs=512 count=4096000
RAID 0 stripes data, writing to multiple disks (at least 2) simultaneously. There is no data redundancy; ideally, each disk is managed by a separate controller.
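The striping rule can be sketched as a tiny address map. This is an illustrative model, not mdadm's code; stripeMap is a hypothetical helper:

```go
package main

import "fmt"

// stripeMap maps a logical chunk number to (disk, chunk offset on that
// disk) for a hypothetical RAID 0 layout: consecutive chunks rotate
// across the member disks.
func stripeMap(chunk, nDisks int) (disk, offset int) {
	return chunk % nDisks, chunk / nDisks
}

func main() {
	// With 2 disks, chunks alternate: disk0, disk1, disk0, ...
	for chunk := 0; chunk < 6; chunk++ {
		d, off := stripeMap(chunk, 2)
		fmt.Printf("logical chunk %d -> disk %d, offset %d\n", chunk, d, off)
	}
}
```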
mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=2 /dev/loop2 /dev/loop3
Check the result:
$fdisk -l
// Create a raid0 device from 4 devices; 16384000 is the total number of sectors across the 4 devices, and 0 is the offset within each device
Check the result:
$lsblk
RAID 1 stores every piece of data twice. If one drive fails, the controller uses the other drive, or simply copies the other drive's data onto the replacement.
The pros and cons of RAID 1 are obvious. It is worth noting that it does not guarantee hot-swappable disks; that is, when a disk fails, the machine may have to be shut down before the disk can be replaced.
$mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/loop2 /dev/loop3
In devicemapper, the mirror target differs from raid1.
// 4096000 is the sector count of a single device (mirroring does not increase capacity), 2 is the number of devices, and '-' means no metadata device
The most common RAID level. It requires at least 3 drives: within each stripe, one block holds parity and the other blocks hold striped data, and the parity blocks are distributed across all drives. When one disk fails, the remaining blocks plus the parity block can be used to recover its data; but if two devices fail within the same stripe, the whole RAID is lost. In other words, RAID 5 tolerates a single drive failure.
The advantage of RAID 5 is that it balances read/write performance and safety, surviving a single drive failure. The drawbacks are a relatively complex implementation and slow recovery; if another drive fails during recovery, the whole RAID is lost.
// 'spare device' designates a hot spare; create a RAID5 from 3 disks
Show the details:
$mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Tue Mar 7 16:57:39 2017
     Raid Level : raid5
     Array Size : 4093952 (3.90 GiB 4.19 GB)
  Used Dev Size : 2046976 (1999.34 MiB 2096.10 MB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent
    Update Time : Tue Mar 7 16:57:57 2017
          State : clean
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 512K
           Name : st-integrat-node00:1 (local to host st-integrat-node00)
           UUID : 071d2e75:2029e9a2:2e427dd9:d2ab804f
         Events : 18
    Number   Major   Minor   RaidDevice State
       0       7        4        0      active sync   /dev/loop4
       1       7        5        1      active sync   /dev/loop5
       4       7        6        2      active sync   /dev/loop6
       3       7        7        -      spare         /dev/loop7
// Here a raid5 is created from 4 devices with no metadata device, so the size is 3x one disk's sector count, with one disk's worth used for parity.
Check the result:
// 'resync' means synchronization is in progress; 4096000/4096000 means it is fully synced.
RAID 6 improves on RAID 5 by using two parity blocks per stripe.
Creating RAID 6 is similar to RAID 5.
The advantage of RAID 10 is that data recovery is fast, but compared with RAID 5 and 6, it costs more in devices.
$mdadm --create --verbose /dev/md1 --level=10 --raid-devices=4 /dev/loop4 /dev/loop5 /dev/loop6 /dev/loop7
Check the result:
$mdadm --detail /dev/md1
// Create a RAID10 from 4 drives; half of them are mirrors, so the size is 2x one drive's sector count.
Check the result:
$dmsetup status test-raid10
If you have the latest lvm2 tools, you could try the following.
Deactivate the pool if it is active; before deactivating the pool, you must deactivate the volumes created from it. If a volume was created by lvm, just umount it and run lvchange; but if it was created via devicemapper, you need to remove the volume manually and record its deviceId, deviceName, and table so you can activate it again later. Deactivation is, in effect, the removal of the file links /dev/vg/volume and /dev/mapper/vg-volume.
lvchange -an vg
repair the metadata and activate the pool
lvconvert --repair vg/pool
activate the pool
lvchange -ay vg
Below are the steps which happen while running the consistency check:
But in my lvm (version 2.02.166(2)-RHEL7 (2016-11-16)), repair does not create a new pmspare; it directly uses free VG space to create the new metadata volume:
[root@tosqatest4 ~]# lvs -a silver_vg
So the pmspare device is just a free-space device, not a mirror of the metadata; if it is missing, we can add a PV to the VG for the repair.
With older tools, you need to go through these manual steps:
create a temporary small LV
lvcreate -an -Zn -L10 --name temp vg
replace the pool's metadata volume with this temp LV
lvconvert --thinpool vg/pool --poolmetadata temp
(say ‘y’ to swap)
activate & repair the metadata from the 'temp' volume. You will likely need another volume in which to store the repaired metadata,
so create:
lvcreate -Lat_least_as_big_as_temp --name repaired vg
if everything went fine, visually compare the 'transaction_id' of the repaired metadata (thin_dump /dev/vg/repaired)
swap the deactivated repaired volume back into your thin pool
lvchange -an vg/repaired
try to activate the pool; if it doesn't work, report the further problems.
Metadata space exhaustion can lead to inconsistent thin pool metadata and inconsistent file systems, so the response requires offline checking and repair.
1. Deactivate the thin pool LV, or reboot the system if this is not possible.
2. Repair the thin pool with lvconvert --repair. See "Metadata check and repair".
3. Extend the pool metadata space with lvextend VG/ThinPoolLV_tmeta. See "Manually manage free metadata space of a thin pool LV".
4. Check and repair the file system with fsck.
The KMP algorithm is used for string matching and comes up constantly in computing. Most articles about it are hard to follow. This post draws on Ruan Yifeng's blog post, which is fairly accessible but does not explain concretely how the Next table is computed, only the concept. Here, a more complete implementation and explanation are given.
The basic idea of KMP is very simple. As an example, let us determine whether the string "ABCDABD" occurs in the string "BBC ABCDAB ABCDABCDABDE".
First, compare the first character of the string "BBC ABCDAB ABCDABCDABDE" with the first character of the pattern "ABCDABD". Since B does not match A, move the pattern one position to the right.
Because B does not match A, the pattern moves right again.
This continues until the string has a character equal to the pattern's first character.
Then compare the next character of the string and the pattern; it matches again.
Continue until a character of the string differs from the corresponding character of the pattern.
At this point, the most natural reaction is to move the whole pattern one position to the right and compare again from the beginning. That works, but it is inefficient, because you move the "search position" back to positions already compared and compare them again.
A basic fact: when the space fails to match D, you actually already know that the preceding six characters are "ABCDAB". The idea of KMP is to exploit this knowledge and avoid moving the "search position" back over characters already compared, only ever moving it forward; this is what makes it efficient.
How can this be done? For the pattern, compute a Partial Match Table. How this table is produced is explained later; for now, it is enough to know how to use it.
We know that when the space mismatches D, the preceding six characters "ABCDAB" have matched. The table shows that the partial match value of the last matched character B is 2, so the number of positions to move is given by the formula below:
shift = number of characters matched - corresponding partial match value
Since 6 - 2 = 4, move the pattern 4 positions to the right.
Because the space does not match C, the pattern keeps moving. Now the number of matched characters is 2 ("AB") and the corresponding partial match value is 0, so shift = 2 - 0 = 2, and the pattern moves 2 positions.
Because the space does not match A, move one more position.
Compare position by position until C mismatches D. Then shift = 6 - 2, so move the pattern another 4 positions.
Compare position by position until the last character of the pattern matches: a complete match, and the search is finished. To continue searching (i.e., to find all matches), shift = 7 - 0, so the pattern would move another 7 positions; we will not repeat the process here.
Next, how the Partial Match Table is produced.
First, two concepts: "prefix" and "suffix". The "prefixes" of a string are all of its leading combinations excluding the last character; the "suffixes" are all of its trailing combinations excluding the first character.
The "partial match value" is the length of the longest element shared by the prefixes and the suffixes. Take "ABCDABD" as an example:
- The prefixes and suffixes of "A" are both empty; the length of the common element is 0;
- "AB" has prefixes [A] and suffixes [B]; the common element has length 0;
- "ABC" has prefixes [A, AB] and suffixes [BC, C]; the common element has length 0;
- "ABCD" has prefixes [A, AB, ABC] and suffixes [BCD, CD, D]; the common element has length 0;
- "ABCDA" has prefixes [A, AB, ABC, ABCD] and suffixes [BCDA, CDA, DA, A]; the common element is "A", length 1;
- "ABCDAB" has prefixes [A, AB, ABC, ABCD, ABCDA] and suffixes [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", length 2;
- "ABCDABD" has prefixes [A, AB, ABC, ABCD, ABCDA, ABCDAB] and suffixes [BCDABD, CDABD, DABD, ABD, BD, D]; the common element has length 0.
The essence of "partial matching" is that the head and tail of a string sometimes repeat. For example, "ABCDAB" contains two occurrences of "AB", so its partial match value is 2 (the length of "AB"). When the pattern shifts, moving the first "AB" forward 4 positions (string length - partial match value) lands it on the position of the second "AB".
#### Implementation

From the idea above, the core of KMP is computing the Next table (the partial match table); everything else is just shifting by the formula. Let's see how to construct this table with an algorithm. As described above, Next is the partial match value: the length of the longest element common to the prefixes and suffixes.
Let's first write this code in the most straightforward way, then optimize it:
func Next(str string) []int {
This code computes the next array in a natural coding style, but its complexity is O(n^2), which clearly falls short of the KMP idea. Let's optimize it. By the properties of next, we need not backtrack to 0 each time; instead, we exploit next to backtrack to the already-matched substring, and if that fails, try the previous one. See the algorithm:
func Next(str string) []int {
The improved algorithm runs in O(2n) time.
This style may not look elegant, but it is more readable. Some versions initialize the value to -1 and then merge k := next[q - 1] with k = next[k], and merge next[q] = k + 1 with next[q] = 0. I strongly dislike this kind of showing off: it looks concise but hides the idea of the algorithm and hurts understandability. Likewise, I avoid lambda-style code unless the language optimizes it, because otherwise it only torments whoever reads the code.
With this idea in hand, writing the KMP search itself is very simple:
func kmp(str1, str2 string) bool {
The complexity of the KMP algorithm is O(M+N). Other styles can be more concise, but again, for readability, I would rather be a bit verbose.
func main() {
Running the example above, the total number of loop iterations is 15, of which computing the next array takes only 7.
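Putting the pieces together, here is a self-contained sketch of the optimized next computation plus the search loop in Go. The function names are mine, not the post's original code, but it follows the approach described above:

```go
package main

import "fmt"

// next computes the partial match table in O(n), reusing previously
// computed values instead of re-scanning (the optimization described
// above).
func next(pat string) []int {
	n := make([]int, len(pat))
	k := 0
	for q := 1; q < len(pat); q++ {
		for k > 0 && pat[k] != pat[q] {
			k = n[k-1] // fall back to the next shorter matched prefix
		}
		if pat[k] == pat[q] {
			k++
		}
		n[q] = k
	}
	return n
}

// kmpIndex returns the index of the first occurrence of pat in s, or -1.
func kmpIndex(s, pat string) int {
	if len(pat) == 0 {
		return 0
	}
	n := next(pat)
	q := 0 // number of pattern characters currently matched
	for i := 0; i < len(s); i++ {
		for q > 0 && pat[q] != s[i] {
			q = n[q-1] // shift = matched - partial match value
		}
		if pat[q] == s[i] {
			q++
		}
		if q == len(pat) {
			return i - len(pat) + 1
		}
	}
	return -1
}

func main() {
	fmt.Println(kmpIndex("BBC ABCDAB ABCDABCDABDE", "ABCDABD")) // 15
}
```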
Notes on a kernel bug.
A component called lvconvert to create a storage pool, but QA reported that the process failed and the machine rebooted. A kdump capture produced the following log:
[ 983.179047] BUG kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f) (Tainted: G B ------------ T): Objects remaining in kmalloc-256(4:099fef32f6df55082271a9ab572acec8251aa754a14a60444d7ce4cb110be56f
It looks like a slab cache allocation error during lvconvert. At this point, there was little to go on, so I searched Google. The docker project has a similar case: https://github.com/moby/moby/issues/29879. All three reported sites ran 3.10 kernels, pointing to a kernel problem; upgrading to 4.16.7-1.el7.elrepo.x86_64 made it go away.
An error suddenly appeared during make build:
Sending build context to Docker daemon 220 MB
Trying to run 07c99697b16e gives the same error. The docker log shows: "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 6\\\"\"\n"
8月 21 15:29:01 zhenghc-tos dockerd[2517]: time="2018-08-21T15:29:01.778212288+08:00" level=error msg="Handler for POST /v1.24/containers/79a0906242e0180dfd7aeff30e9fb179f7b7c37f0ba20533f83b4cd40b409d2a/start returned error: invali
Docker felt quite sluggish, so I cleaned up a large number of images, deleted all stopped containers, and dropped the system caches, and the problem went away. The deeper cause was never pinned down: dropping the caches made little actual difference, and docker reported no resource problem. Another odd point is that only this image could not be entered; other images could be started with run -it, and the COPY . /go/src/github.com/transwarp/warpdrive/ step took unusually long.
https://github.com/opencontainers/runc/issues/1130 records a related problem and attributes it to runc.
The Go sort implementation is in /src/go4.org/sort/sort.go.
func Sort(data Interface) {
quickSort and quickSort_func are basically the same, except that the latter takes a user-defined comparison function, so it is not repeated here.
As you can see, the main body is quickSort, so the sort is not guaranteed to be stable. When the slice length is at most 12, it degrades to Shellsort, but the Shellsort does only one pass with a gap of 6, after which it degrades again to simple insertion sort. That part is very simple: i moves from front to back, j moves from i backward, and each step compares j-1 and j, swapping the smaller element toward the front.
When the length exceeds 12, whether maxDepth has reached 0 decides between heapsort and quicksort. maxDepth is computed as:
// maxDepth returns a threshold at which quicksort should switch
It returns twice the current tree height. Each round of quicksort partitioning (doPivot) decrements this value by 1; when it reaches 0, quicksort is assumed to have hit a bad case, and if the remaining data is still longer than 12, switching to heapsort is more efficient.
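The calculation can be reproduced as a sketch; this mirrors the stdlib's loop, which returns twice the number of bits needed to represent n:

```go
package main

import "fmt"

// maxDepth returns 2 * ceil(lg(n+1)): the recursion-depth budget for
// quicksort. Each doPivot round decrements the budget; at 0 the sort
// switches to heapsort to bound the worst case.
func maxDepth(n int) int {
	depth := 0
	for i := n; i > 0; i >>= 1 {
		depth++
	}
	return depth * 2
}

func main() {
	fmt.Println(maxDepth(12))   // 8
	fmt.Println(maxDepth(1000)) // 20
}
```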
Let's first look at the main body of quicksort, and then at the heapsort implementation. partition is the core idea of quicksort; it is implemented in doPivot:
func doPivot(data Interface, lo, hi int) (midlo, midhi int) {