CSI Inline Volume Becomes Orphan After Kubelet Restart During Pod Termination

When a pod is terminating and the kubelet restarts or shuts down at the same time, a CSI inline (ephemeral) volume can be left as an orphan — kubelet skips both the unmount and the cleanup of the volume after it restarts. This post walks through the root cause and the fix.

Problem

When a pod is in Terminating state and the kubelet is restarted, the following errors appear in the kubelet log:

kubelet: I reconciler.go:388] "Could not construct volume information, cleaning up mounts"
podName=1517b38e-fa84-4138-b6c0-06663741e385
volumeSpecName="data"
err="failed to GetVolumeName from volumePlugin for volumeSpec \"data\"
err=kubernetes.io/csi: plugin.GetVolumeName failed to extract volume source
from spec: unexpected api.CSIVolumeSource found in volume.Spec"

kubelet: E operation_generator.go:952] UnmountVolume.MarkVolumeMountAsUncertain failed
for volume "" (UniqueName: "data")
pod "1517b38e-fa84-4138-b6c0-06663741e385"
Error: UnmountVolume.TearDown failed: rpc error: code = Aborted
desc = NodeUnpublish operation for volume csi-c6b0a910... still ongoing

After kubelet restarts, it receives the SyncLoop DELETE event for the pod, but the underlying LVM logical volume is never removed — it becomes an orphan resource on the node.


Root Cause Analysis

1. reconstructVolume takes the wrong plugin lookup path

After kubelet restarts, it calls reconstructVolume to rebuild the in-memory volume state from existing mount paths on disk. The flow is:

kubelet restart
  → reconstructVolume
    → FindAttachablePluginByName        ← wrong lookup here
    → FindDeviceMountablePluginByName
    → GetUniqueVolumeNameFromSpec       ← only handles CSIPersistentVolumeSource
      → getPVSourceFromSpec
        → ERROR: "unexpected api.CSIVolumeSource found in volume.Spec"

A CSI inline (ephemeral) volume uses api.CSIVolumeSource, while a CSI PV uses api.CSIPersistentVolumeSource. The getPVSourceFromSpec function only handles the latter and errors out on the former.
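
The distinction between the two source types can be illustrated with a minimal Go sketch. The types and the pvSourceFromSpec function below are simplified stand-ins written for this post, not the real Kubernetes definitions; only the error message for the inline case is taken from the kubelet log above, and the driver name is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// CSIVolumeSource is a simplified stand-in for the inline (ephemeral) source.
type CSIVolumeSource struct{ Driver string }

// CSIPersistentVolumeSource is a simplified stand-in for the PV-backed source.
type CSIPersistentVolumeSource struct{ Driver, VolumeHandle string }

// Spec mimics volume.Spec for this sketch: at most one source is set.
type Spec struct {
	Name     string
	Inline   *CSIVolumeSource
	PVSource *CSIPersistentVolumeSource
}

// pvSourceFromSpec models the behavior described above: it accepts only
// PV-backed specs and rejects inline ones.
func pvSourceFromSpec(spec *Spec) (*CSIPersistentVolumeSource, error) {
	if spec.Inline != nil {
		return nil, errors.New("unexpected api.CSIVolumeSource found in volume.Spec")
	}
	if spec.PVSource == nil {
		// Hypothetical message for the empty case, not taken from kubelet.
		return nil, errors.New("no CSI persistent volume source in spec")
	}
	return spec.PVSource, nil
}

func main() {
	inline := &Spec{Name: "data", Inline: &CSIVolumeSource{Driver: "example.csi.driver"}}
	if _, err := pvSourceFromSpec(inline); err != nil {
		fmt.Println("inline spec rejected:", err)
	}

	pv := &Spec{Name: "pv-1", PVSource: &CSIPersistentVolumeSource{Driver: "example.csi.driver", VolumeHandle: "vol-1"}}
	src, err := pvSourceFromSpec(pv)
	fmt.Println("pv spec accepted:", src.VolumeHandle, err)
}
```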

2. Inline volumes should go through GetUniqueVolumeNameFromSpecWithPod

For ephemeral volumes, the unique name must incorporate the pod UID, so reconstruction should call GetUniqueVolumeNameFromSpecWithPod. The branching logic is:

if attachablePlugin != nil || deviceMountablePlugin != nil {
    uniqueVolumeName, err = util.GetUniqueVolumeNameFromSpec(plugin, volumeSpec)
} else {
    uniqueVolumeName = util.GetUniqueVolumeNameFromSpecWithPod(volume.podName, plugin, volumeSpec)
}

A CSI inline volume should have neither an attachablePlugin nor a deviceMountablePlugin. However, the code looks these plugins up by plugin name (FindAttachablePluginByName, FindDeviceMountablePluginByName) rather than by spec (FindDeviceMountablePluginBySpec). A by-name lookup returns the CSI plugin regardless of the volume source type, so the inline volume incorrectly enters the if branch and calls GetUniqueVolumeNameFromSpec, which fails.
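
To make the difference between the two lookup styles concrete, here is a small self-contained Go model. Everything in it (volSpec, csiPlugin, findByName, findBySpec, uniqueName) is a hypothetical simplification for illustration, and the name formats are assumptions rather than the exact strings kubelet produces. It only shows why a lookup that ignores the spec sends an inline volume down the failing branch.

```go
package main

import (
	"errors"
	"fmt"
)

// volSpec is a simplified stand-in for volume.Spec.
type volSpec struct {
	name     string
	isInline bool // true for api.CSIVolumeSource, false for a CSI PV
}

// csiPlugin is a hypothetical plugin that can device-mount PVs only.
type csiPlugin struct{ name string }

func (p *csiPlugin) canDeviceMount(s *volSpec) bool { return !s.isInline }

// findByName models a by-name lookup: it never inspects the spec, so the
// plugin is returned for inline volumes too.
func findByName(p *csiPlugin, pluginName string) *csiPlugin {
	if p.name == pluginName {
		return p
	}
	return nil
}

// findBySpec models a by-spec lookup: the plugin is returned only when it
// can device-mount this particular spec.
func findBySpec(p *csiPlugin, s *volSpec) *csiPlugin {
	if p.canDeviceMount(s) {
		return p
	}
	return nil
}

// uniqueName applies the branching shown above to the lookup result.
func uniqueName(deviceMountable, p *csiPlugin, podUID string, s *volSpec) (string, error) {
	if deviceMountable != nil {
		// PV path: fails for inline specs, as in the kubelet log.
		if s.isInline {
			return "", errors.New("unexpected api.CSIVolumeSource found in volume.Spec")
		}
		return p.name + "/" + s.name, nil // illustrative format only
	}
	// Pod-scoped path: the name includes the pod UID (assumed format).
	return fmt.Sprintf("%s/%s-%s", p.name, podUID, s.name), nil
}

func main() {
	p := &csiPlugin{name: "kubernetes.io/csi"}
	inline := &volSpec{name: "data", isInline: true}
	podUID := "1517b38e-fa84-4138-b6c0-06663741e385"

	_, err := uniqueName(findByName(p, "kubernetes.io/csi"), p, podUID, inline)
	fmt.Println("by name:", err)

	name, err := uniqueName(findBySpec(p, inline), p, podUID, inline)
	fmt.Println("by spec:", name, err)
}
```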

3. Consequence of reconstructVolume failure

reconstructedVolume, err := rc.reconstructVolume(volume)
if err != nil {
    if volumeInDSW {
        // Some pod still needs the volume, skip cleanup
        continue
    }
    // No pod needs the volume, attempt cleanup
    rc.cleanupMounts(volume)
    continue
}

Because the pod is Terminating, it is not added to the Desired State of World (DSW), so volumeInDSW is false and kubelet calls cleanupMounts, which in turn calls UnmountVolume. At this point the container has not yet fully exited, so the CSI driver returns Aborted (NodeUnpublish is still in progress). This is the only cleanup opportunity: after the failure, kubelet does not retry, and the volume is left as an orphan.

Trigger conditions

All three must be true simultaneously:

  1. Pod is in Terminating state
  2. Kubelet restarts or crashes during pod termination
  3. reconstructVolume fails, and the pod is not in DSW

CSI Driver Side

While kubelet attempts cleanup, the local CSI driver also tries to execute NodeUnpublishVolume but fails because the LV is still in use by the container:

CSI local driver: NodeUnpublishVolume
volume_id:"csi-17ef0134..." target_path:"...mount"
CSI local driver: Removing volume with id=vg-data_csi-17ef0134...
mke2fs: /dev/vg-data/vg-data_csi-17ef0134... is apparently in use by the system
wipefs: error: probing initialization failed: Device or resource busy
lvremove: Logical volume vg-data/vg-data_csi-17ef0134... contains a filesystem in use.
NodeUnpublishVolume failed: Failed to lvremove: filesystem in use

The LV cannot be removed because the container is still using the filesystem. NodeUnpublish returns an Internal error.


Fix

Upstream fix

This bug was fixed in Kubernetes 1.25: kubernetes/kubernetes#108997

The core change: in reconstructVolume, the plugin lookup is switched from by-name (FindAttachablePluginByName) to by-spec (FindDeviceMountablePluginBySpec). With a by-spec lookup, a CSI inline volume no longer yields an attachablePlugin or deviceMountablePlugin, so it is routed to GetUniqueVolumeNameFromSpecWithPod and can be rebuilt in ActualStateOfWorld (ASW), which lets the normal unmount flow proceed.

Workaround for older versions

Before the fix is rolled out, orphan LVs must be cleaned up manually:

# 1. List all LVs to find orphans (those with no active mount)
lvs vg-data

# 2. Check if the LV is still mounted
lsblk /dev/vg-data/<lv-name>

# 3. Force unmount and remove
umount -f /dev/vg-data/<lv-name>
lvremove -f vg-data/<lv-name>

Summary

Stage          | Bug                                                           | Impact
Plugin lookup  | FindAttachablePluginByName incorrectly matches inline volumes | reconstructVolume errors out
Path decision  | Inline volume incorrectly enters GetUniqueVolumeNameFromSpec  | Volume cannot be added to ASW
Cleanup window | cleanupMounts is called while container is still running      | Guaranteed to fail

The root cause is that CSI inline volumes share the same reconstructVolume code path as CSI PVs but require different plugin lookup semantics. Kubernetes 1.25 fixes this by differentiating the plugin lookup method based on the volume spec type.