
Upgrade from 1.7.7 to 2.1.0 uses ~6X More Memory #1523

Open
sherifabdlnaby opened this issue Dec 6, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sherifabdlnaby
Contributor

sherifabdlnaby commented Dec 6, 2024

/kind bug

What happened?

After upgrading from v1.7.7 to v2.1.0 we noticed OOMs in the daemonset's efs-csi-node pods.
Before the upgrade, we had set 150Mi memory requests/limits and never hit them. After the upgrade, we consistently hit the memory limit, and the OOMs didn't stop until we raised the requests/limits to 500Mi.

Our load and the distribution of pods with EFS mounts across nodes didn't change. We use EFS mounts with encryption enabled. There are no configuration overrides; everything uses the default configuration as installed by the chart.

Given that this is a daemonset pod, any increase in memory is multiplied by the node count, which in our case makes this a significant increase in total requested memory.
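
For reference, raising the daemonset's memory requests/limits is a small Helm values override. The sketch below assumes the chart exposes a node.resources value, so verify the key against your chart version's values.yaml:

  # values.yaml override for the aws-efs-csi-driver Helm chart (sketch only;
  # confirm that node.resources exists in your chart version)
  node:
    resources:
      requests:
        memory: 500Mi   # raised from 150Mi after the v2.1.0 upgrade
      limits:
        memory: 500Mi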

What you expected to happen?

Average memory consumption should not increase several-fold after the upgrade.

How to reproduce it (as minimally and precisely as possible)?

In our case, simply upgrading reproduces it. The increase is consistent across the 9 clusters my team operates.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): v1.29.10-eks-7f9249a
  • Driver version: v2.1.0

Below is the average memory usage per daemonset pod across the 9 clusters we run (about ~600 pods).
The load and density of pods that write to EFS didn't change. The graph shows that upgrading to v2 with the new efs-utils uses at least 600% more memory.

[Graph: average memory usage per efs-csi-node daemonset pod across the 9 clusters, before and after the upgrade]

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 6, 2024
@sherifabdlnaby sherifabdlnaby changed the title Upgrade from 1.7.7 to 2.1.0 use significantly more memory Upgrade from 1.7.7 to 2.1.0 uses Significantly More Memory Dec 11, 2024
@sherifabdlnaby sherifabdlnaby changed the title Upgrade from 1.7.7 to 2.1.0 uses Significantly More Memory Upgrade from 1.7.7 to 2.1.0 uses ~6X More Memory Dec 11, 2024
@tdachille-dev
Contributor

Hi, this increased memory footprint is the result of replacing stunnel with an in-house AWS component, efs-proxy, in efs-csi-driver v2.0+. efs-proxy is designed to optimize throughput through more aggressive caching, and at higher throughput levels it employs TCP multiplexing, both of which can increase memory usage.

@sherifabdlnaby
Contributor Author

Excuse my ignorance of the system internals here, but is there an option to toggle this multiplexing and caching on/off so that we can still run with minimal memory overhead? Not every app needs to achieve maximum throughput; in our case, our team wouldn't have made this tradeoff because none of our apps are pushing throughput limits.

Going from a 60Mi daemonset to 500Mi is a significant jump, especially for big clusters with thousands of nodes, and especially for general-purpose clusters where most pods don't mount an EFS volume yet still pay a 500Mi-per-node tax for a daemonset that is used relatively little.

@sherifabdlnaby
Contributor Author

Of course you can do sophisticated scheduling with more than one node group, deploying pods with EFS mounts only to a node group that runs the daemonset... or run two separate daemonsets, with one getting more resources on a specific node group (sketched below)... but all of these solutions introduce too much complexity.
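
For illustration, pinning the efs-csi-node daemonset to a dedicated node group would look roughly like the values override below (a sketch: it assumes the chart exposes node.nodeSelector, and the node label is made up). Every pod that mounts EFS then also needs a matching nodeSelector, which is exactly the complexity we'd rather avoid:

  # Helm values sketch: confine the efs-csi-node daemonset to a labeled node group.
  # Assumes the chart exposes node.nodeSelector; the label below is hypothetical.
  node:
    nodeSelector:
      node-group: efs-workloads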

@tdachille-dev
Contributor

Understood! So efs-proxy unlocks higher (3x) throughput than stunnel, but if throughput is not a concern for you and you'd like to avoid the additional memory overhead, you may use

  mountOptions:
    - stunnel

in your PV (for static provisioning) or StorageClass (for dynamic provisioning) k8s definition file. Hope that helps!

@tdachille-dev
Contributor

For explicit examples:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  storageClassName: efs-sc
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-123
  mountOptions:
    - stunnel

or

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-123
  directoryPerms: "700"
  gidRangeStart: "1750" # optional
  gidRangeEnd: "1751" # optional
  basePath: "/dynamic_provisioning" # optional
  subPathPattern: "${.PVC.namespace}/${.PVC.name}" # optional
  ensureUniqueDirectory: "false" # optional
  reuseAccessPoint: "false" # optional
mountOptions:
  - stunnel
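
A PVC consuming that StorageClass is then the usual (names are illustrative; the mountOptions from the class are carried onto the dynamically provisioned PV):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi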

@azagarelz

I have the same issue: I'm seeing the EFS CSI container grow consistently past 800 MB of RAM.
Is there a way to limit the memory reserved for caching, or to configure the eviction strategy?
I don't think caching is useful at all for the apps I'm running, so I'd like to at least be able to limit it.
