CSI Driver Can Leak Access Points #1467

Open
joelthompson opened this issue Oct 7, 2024 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@joelthompson

/kind bug

What happened?

We have the EFS CSI driver installed with dynamic provisioning and a reclaim policy of Delete.

A flood of requests came in, which eventually caused the CSI controller to fail its health check and get restarted. After the restart, the controller kept trying to provision APs (access points) for new PVCs, but every attempt failed with AccessPointAlreadyExists, and every retry failed the same way. Eventually the PVCs were deleted and the APs were leaked. Most likely, the APs were created in EFS but never recorded as provisioned in K8s, so the restarted controller kept trying to recreate them.

What you expected to happen?

Upon restart, the controller should recognize that the Access Point was already created and "adopt" it. Alternatively, the controller should recognize that this isn't a retriable error and not retry. Finally, when the PVC is deleted, the AP should be deleted according to the reclaim policy. This shouldn't require enabling reuseAccessPoint.

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

This code here:

	if err != nil {
		if isAccessDenied(err) {
			return nil, ErrAccessDenied
		}
		return nil, fmt.Errorf("Failed to create access point: %v", err)
	}

doesn't check whether the error code is AccessPointAlreadyExists, in which case it should return ErrAlreadyExists. As a result, the code path in

	if err == cloud.ErrAlreadyExists {
		return nil, status.Errorf(codes.AlreadyExists, "Access Point already exists")
	}
is never hit.

Environment

  • Kubernetes version (use kubectl version): 1.29
  • Driver version: 2.0.5

Please also attach debug logs to help us better diagnose

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 7, 2024
@jrakas-dev
Contributor

Hi Joel, thanks for opening this issue. The team is looking into it. In the meantime, can you please provide any debug logs you might have so that we can better understand the issue? Thanks!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2025