VM volume attacher enhancement

Volume attach failed

  1. Duplicated target device
2021-09-28T16:00:03.816682107-07:00 stderr F I0928 23:00:03.816558 1 cephvolume.go:73] attach disk &{XMLName:{Space: Local:} Device:disk RawIO: SGIO: Snapshot: Model: Driver:0xc00073b420 Auth:0xc000f49e40 Source:0xc000159860 BackingStore:<nil> Geometry:<nil> BlockIO:<nil> Mirror:<nil> Target:0xc000a58b40 IOTune:0xc00019b970 ReadOnly:<nil> Shareable:<nil> Transient:<nil> Serial:pvc-33003998-6624-4ac9-a923-d94f9401abdf WWN: Vendor: Product: Encryption:<nil> Boot:<nil> Alias:<nil> Address:0xc001420b40} error: virError(Code=27, Domain=20, Message='XML error: target 'vdf' duplicated for disk sources 'volume-0aab375c-1858-4f09-b276-ea297cd29a3d' and 'volume-63ef92c4-a027-476c-a2de-9fcf501dd4de'')

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <auth username='cinder'>
    <secret type='ceph' uuid='1bcf1a49-b42f-4bc2-6e70-9ea7a4006740'/>
  </auth>
  <source protocol='rbd' name='volumes/volume-0aab375c-1858-4f09-b276-ea297cd29a3d'>
    <host name='10.166.77.141' port='6789'/>
    <host name='10.78.225.48' port='6789'/>
    <host name='10.33.212.26' port='6789'/>
    <host name='10.212.255.52' port='6789'/>
    <host name='10.164.134.166' port='6789'/>
  </source>
  <target dev='vdf' bus='virtio'/>
  <iotune>
    <total_bytes_sec>157286400</total_bytes_sec>
    <total_iops_sec>300</total_iops_sec>
  </iotune>
  <serial>pvc-22650282-34fe-412c-a5a3-1df0bdb3cadd</serial>
  <alias name='virtio-disk5'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>

➜ ~ k. 130
Switched to context "130".
➜ ~ tk get pv pvc-22650282-34fe-412c-a5a3-1df0bdb3cadd
Error from server (NotFound): persistentvolumes "pvc-22650282-34fe-412c-a5a3-1df0bdb3cadd" not found

The live domain dump already contains a vdf device, but that device is not in the persistent XML, and the PV named in its serial no longer exists.
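One way to avoid this kind of collision is to choose the target name from the union of the live dump and the persistent XML, rather than from only one of them. The following is a minimal sketch under that assumption; the function names, the `vd` prefix default, and the idea of passing both XML documents as strings are illustrative and not taken from the attacher's code.

```python
import string
import xml.etree.ElementTree as ET


def used_targets(domain_xml):
    """Return the set of disk target names (e.g. 'vdf') in a domain XML string."""
    root = ET.fromstring(domain_xml)
    # Only look at <target> elements under <disk>, not interfaces or consoles.
    return {t.get("dev") for t in root.findall(".//disk/target") if t.get("dev")}


def next_free_target(live_xml, persistent_xml, prefix="vd"):
    """Pick the first target name that is unused in BOTH views of the domain."""
    used = used_targets(live_xml) | used_targets(persistent_xml)
    for letter in string.ascii_lowercase:
        dev = prefix + letter
        if dev not in used:
            return dev
    raise RuntimeError("no free target device name with prefix " + prefix)
```

With a live dump holding vda and vdf and a persistent file holding vda and vdb, this returns vdc, so a device that exists in only one of the two views is still skipped.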

  2. Backend volume does not exist
I0721 03:39:04.965504 1 cephvolume.go:73] attach disk &{XMLName:{Space: Local:} Device:disk RawIO: SGIO: Snapshot: Model: Driver:0xc0015061c0 Auth:0xc00294c880 Source:0xc0004db310 BackingStore:<nil> Geometry:<nil> BlockIO:<nil> Mirror:<nil> Target:0xc00685fd00 IOTune:0xc0014fc8f0 ReadOnly:<nil> Shareable:<nil> Transient:<nil> Serial:pvc-20e1dd78-f543-40ae-bdb0-3eb74f0ffb1c WWN: Vendor: Product: Encryption:<nil> Boot:<nil> Alias:<nil> Address:0xc001508cc0} error: virError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'device_add': Property 'virtio-blk-device.drive' can't find value 'drive-virtio-disk15'')

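The QEMU error above does not say that the backend volume is missing. The attacher should use rbd info to look up the volume and, if it is not found, report that explicitly.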
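The check could look roughly like the sketch below, which shells out to `rbd info <pool>/<image>` and turns a failure into a readable error. The function name and the injectable `run` parameter are hypothetical, authentication options such as the client id are omitted, and matching on the "No such file or directory" text in stderr is an assumption about how the CLI reports a missing image.

```python
import subprocess


def check_backend_volume(pool, image, run=subprocess.run):
    """Return None if `rbd info` succeeds, otherwise a readable error string.

    `run` is injectable so the check can be exercised without a Ceph cluster.
    """
    spec = "%s/%s" % (pool, image)
    result = run(["rbd", "info", spec], capture_output=True, text=True)
    if result.returncode == 0:
        return None
    stderr = (result.stderr or "").strip()
    # Assumption: a missing image is reported as ENOENT in stderr.
    if "No such file or directory" in stderr:
        return "backend volume %s not found" % spec
    # Anything else (cluster unreachable, auth failure) is reported separately
    # so that it is not mistaken for a missing volume.
    return "rbd info failed for %s: %s" % (spec, stderr)
```

A call such as `check_backend_volume("volumes", "volume-0aab375c-1858-4f09-b276-ea297cd29a3d")` before the attach would let the attacher report a missing volume directly instead of surfacing the device_add error.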

  3. List-watch events lost; recovered by restarting the container
2023-03-28T16:46:39.34066771-07:00 stderr F I0328 23:46:39.340535 1 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
2023-03-28T16:46:39.340722015-07:00 stderr F I0328 23:46:39.340600 1 reflector.go:371] tess.io/ebay/vm-volume/pkg/controller/attach/attach_controller.go:285: Watch close - *v1.VmVolumeAttachment total 2 items received
2023-03-28T16:47:39.094070215-07:00 stderr F I0328 23:47:39.093952 1 reflector.go:371] tess.io/ebay/vm-volume/pkg/controller/attach/attach_controller.go:286: Watch close - *v1.Node total 1007 items received
2023-03-28T16:49:39.350681116-07:00 stderr F I0328 23:49:39.350572 1 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
2023-03-28T16:49:39.35071423-07:00 stderr F I0328 23:49:39.350619 1 reflector.go:371] tess.io/ebay/vm-volume/pkg/controller/attach/attach_controller.go:285: Watch close - *v1.VmVolumeAttachment total 0 items received
2023-03-28T16:51:29.857528981-07:00 stderr F I0328 23:51:29.857397 1 attach_controller.go:453] VmVolumeAttachment pvc-891a1e4f-fe72-447d-ac29-efea9675bc51.tess-node-7fm8k is already attached phase
2023-03-28T16:51:29.857578201-07:00 stderr F I0328 23:51:29.857451 1 attach_controller.go:252] tess140/pvc-891a1e4f-fe72-447d-ac29-efea9675bc51.tess-node-7fm8k added to queue
2023-03-28T16:51:32.341806809-07:00 stderr F I0328 23:51:32.341711 1 attach_controller.go:453] VmVolumeAttachment pvc-891a1e4f-fe72-447d-ac29-efea9675bc51.tess-node-7fm8k is already attached phase
2023-03-28T16:51:32.341832149-07:00 stderr F I0328 23:51:32.341743 1 attach_controller.go:252] tess140/pvc-891a1e4f-fe72-447d-ac29-efea9675bc51.tess-node-7fm8k added to queue
2023-03-28T16:51:32.341835164-07:00 stderr F E0328 23:51:32.341758 1 attach_controller.go:323] vmvolumeattachment "tess140/pvc-891a1e4f-fe72-447d-ac29-efea9675bc51.tess-node-7fm8k" in work queue no longer exists
2023-03-28T16:54:35.827164294-07:00 stderr F I0328 23:54:35.827044 1 streamwatcher.go:103] Unexpected EOF during watch stream event decoding: unexpected EOF
2023-03-28T16:54:35.827197159-07:00 stderr F I0328 23:54:35.827105 1 reflector.go:371] tess.io/ebay/vm-volume/pkg/controller/attach/attach_controller.go:285: Watch close - *v1.VmVolumeAttachment total 3 items received
2023-03-28T16:54:37.095045006-07:00 stderr F I0328 23:54:37.094917 1 reflector.go:371] tess.io/ebay/vm-volume/pkg/controller/attach/attach_controller.go:286: Watch close - *v1.Node total 1056 items received
2023-03-28T16:54:40.55082912-07:00 stderr F I0328 23:54:40.550708 1 attach_controller.go:453] VmVolumeAttachment pvc-d0c67ff7-aeb0-4e13-b373-507b335a67fb.tess-node-2nb47 is already attached phase

All attacher instances hit the unexpected EOF at the same time.

  4. Double use of a PCI address
I0330 00:28:55.070331 1 cephvolume.go:73] attach disk &{XMLName:{Space: Local:} Device:disk RawIO: SGIO: Snapshot: Model: Driver:0xc000024380 Auth:0xc006a51600 Source:0xc000640cd0 BackingStore:<nil> Geometry:<nil> BlockIO:<nil> Mirror:<nil> Target:0xc00342db80 IOTune:0xc005071c30 ReadOnly:<nil> Shareable:<nil> Transient:<nil> Serial:pvc-39c80157-0862-433c-a1ec-49475db818cf WWN: Vendor: Product: Encryption:<nil> Boot:<nil> Alias:<nil> Address:0xc000f5e240} error: virError(Code=27, Domain=20, Message='XML error: Attempted double use of PCI Address 0000:00:0a.0')

The first occurrence recovered automatically: https://tessio.slack.com/archives/C03JP804D/p1670377925841209
The 0x0a address does not appear in dumpxml, but it is present in the persistent file and is used by a PV. That PV failed to detach because its disk is not in the dump, as in detach failure case 1.
The address 0000:00:0a.0 corresponds to this PCI address element:
<address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
So this is a slot conflict. The attacher needs to skip slots that are already in use.
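The mapping from the `0000:00:0a.0` form in the error message to the XML attributes, and the slot skipping, can be sketched as follows. This assumes that occupied slots are collected from both the live dump and the persistent file, since the conflicting address here was visible only in the persistent file. The function names and the default search range of slots 1 to 31 on bus 0 are illustrative choices, not the attacher's actual logic.

```python
import re
import xml.etree.ElementTree as ET

# domain:bus:slot.function, e.g. 0000:00:0a.0
PCI_RE = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


def pci_to_xml_attrs(addr):
    """Convert '0000:00:0a.0' into the attributes of a libvirt <address> element."""
    m = PCI_RE.match(addr)
    if m is None:
        raise ValueError("not a PCI address: %r" % addr)
    domain, bus, slot, function = m.groups()
    return {"type": "pci", "domain": "0x" + domain, "bus": "0x" + bus,
            "slot": "0x" + slot, "function": "0x" + function}


def used_slots(domain_xml, bus=0):
    """Return the PCI slot numbers on `bus` used by any device in the XML."""
    slots = set()
    for a in ET.fromstring(domain_xml).iter("address"):
        if a.get("type") != "pci" or a.get("slot") is None:
            continue
        if int(a.get("bus", "0x00"), 16) == bus:
            slots.add(int(a.get("slot"), 16))
    return slots


def next_free_slot(live_xml, persistent_xml, first=1, last=31):
    """Pick the lowest slot that is free in both the live and persistent XML."""
    used = used_slots(live_xml) | used_slots(persistent_xml)
    for slot in range(first, last + 1):
        if slot not in used:
            return slot
    raise RuntimeError("no free PCI slot on the bus")
```

If the live dump uses slots 0x01 to 0x09 and the persistent file also holds 0x0a, `next_free_slot` returns 11 (0x0b) instead of reusing 0x0a.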

Volume detach failed

  1. Volume exists in the persistent file but not in the dump
I0330 00:38:54.615254 1 cephvolume.go:227] detach disk &{XMLName:{Space: Local:disk} Device:disk RawIO: SGIO: Snapshot: Model: Driver:0xc000736540 Auth:0xc0027b68a0 Source:0xc000641040 BackingStore:<nil> Geometry:<nil> BlockIO:<nil> Mirror:<nil> Target:0xc004557640 IOTune:0xc0006e4370 ReadOnly:<nil> Shareable:<nil> Transient:<nil> Serial:pvc-febae406-15ad-4d05-9c93-b2d09c197840 WWN: Vendor: Product: Encryption:<nil> Boot:<nil> Alias:<nil> Address:0xc00709cba0} error: virError(Code=9, Domain=10, Message='operation failed: disk vde not found')

The attacher cannot simply return nil in this case, because the device still exists in the VM. Reporting success too early could produce complicated follow-on conditions. Instead, the attacher retries the check and returns success automatically once the disk is actually detached.
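The retry decision can be sketched as a check on both views of the domain: success is reported only when the disk is gone from the live dump and from the persistent file. This is an illustration under the assumption that disks are matched by their `<serial>` value; the function names and the "success"/"retry" return values are hypothetical.

```python
import xml.etree.ElementTree as ET


def has_disk(domain_xml, serial):
    """True if any <disk> in the domain XML carries the given <serial>."""
    root = ET.fromstring(domain_xml)
    return any((d.findtext("serial") or "").strip() == serial
               for d in root.iter("disk"))


def detach_outcome(serial, live_xml, persistent_xml):
    """Decide what to do after libvirt reports 'disk ... not found' on detach."""
    if has_disk(live_xml, serial) or has_disk(persistent_xml, serial):
        # The disk is still defined somewhere; keep the item in the queue.
        return "retry"
    # Gone from both views: the detach has really happened.
    return "success"
```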

  2. Volume exists in the dump and on the host, but the VM is not running
I0417 07:57:33.764883 1 iscsivolume.go:176] detach disk &{XMLName:{Space: Local:disk} Device:disk RawIO: SGIO: Snapshot: Model: Driver:0xc004d3b0a0 Auth:<nil> Source:0xc000a66b90 BackingStore:<nil> Geometry:<nil> BlockIO:<nil> Mirror:<nil> Target:0xc009aefdc0 IOTune:<nil> ReadOnly:<nil> Shareable:<nil> Transient:<nil> Serial:pvc-9ad2511a-d8d9-4b14-b285-7acb4ef33800 WWN: Vendor: Product: Encryption:<nil> Boot:<nil> Alias:<nil> Address:0xc0048ce780} error: virError(Code=55, Domain=20, Message='Requested operation is not valid: domain is not running')

root@tess-node-bf7rf-tess93:/# virsh list --all
 Id   Name                                    State
-----------------------------------------------------
 -    virtlet-19c7472c-40a1-tess-node-69wjv   shut off
