CSI Driver Can Leak Access Points #1467
Labels: kind/bug, lifecycle/stale
/kind bug
What happened?
We have the EFS CSI driver installed with dynamic provisioning and a reclaim policy of Delete.
A flood of requests came in, which eventually caused the CSI controller to fail its health check and get restarted. After the restart, the controller kept trying to provision access points (APs) for new PVCs, but each attempt failed with `AccessPointAlreadyExists`. The controller kept retrying and kept failing. Eventually the PVCs were deleted, and the APs were leaked. Most likely, the APs were created but never recorded as provisioned in K8s, so the restarted controller kept trying to recreate them.
What you expected to happen?
Upon restart, the controller should recognize that the Access Point was already created and "adopt" it. Alternatively, the controller should recognize that this isn't a retriable error and not retry. Finally, when the PVC is deleted, the AP should be deleted according to the reclaim policy. This shouldn't require enabling `reuseAccessPoint`.
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
This code here (aws-efs-csi-driver/pkg/cloud/cloud.go, lines 190 to 195 in fe845cc) doesn't check whether the error code is `AccessPointAlreadyExists`, in which case it should return `ErrAlreadyExists` and thus reach the code path in aws-efs-csi-driver/pkg/driver/controller.go (lines 346 to 348 in fe845cc).
Environment
Kubernetes version (use `kubectl version`): 1.29
Please also attach debug logs to help us better diagnose