
Upgrade from 1.7.7 to 2.1.0 uses ~6X More Memory #1523

Open
sherifabdlnaby opened this issue Dec 6, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sherifabdlnaby
Contributor

sherifabdlnaby commented Dec 6, 2024

/kind bug

What happened?

After upgrading from v1.7.7 to v2.1.0 we noticed OOMs in the daemonset's efs-csi-node pods.
Before the upgrade, we had set 150Mi memory requests/limits and never hit them. After the upgrade, we consistently hit the memory limit, and the OOMs didn't stop until we raised the requests/limits to 500Mi.

Our load and the distribution of pods with EFS mounts across nodes didn't change. We use EFS mounts with encryption enabled. There are no configuration overrides; everything uses the default configuration as installed by the chart.

Given that this is a daemonset pod, any increase in memory is multiplied by the node count, which in our case makes this a significant increase in total requested memory.
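
For reference, raising the daemonset's memory requests/limits is a small Helm values override. The sketch below assumes the chart exposes a node.resources value, so verify the key against your chart version's values.yaml:

  # values.yaml override for the aws-efs-csi-driver Helm chart (sketch only;
  # confirm that node.resources exists in your chart version)
  node:
    resources:
      requests:
        memory: 500Mi   # raised from 150Mi after the v2.1.0 upgrade
      limits:
        memory: 500Mi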

What you expected to happen?

Average memory consumption should not increase several-fold after the upgrade.

How to reproduce it (as minimally and precisely as possible)?

In our case, simply upgrading reproduces it. The increase is consistent across the 9 clusters my team operates.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): v1.29.10-eks-7f9249a
  • Driver version: v2.1.0

Below is the average memory usage per daemonset pod across the 9 clusters we run (about ~600 pods).
The load and density of pods that write to EFS didn't change. The graph shows that upgrading to v2 with the new efs-utils uses at least 600% more memory.

[Graph: average memory usage per efs-csi-node daemonset pod across the 9 clusters, before and after the upgrade]

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 6, 2024
@sherifabdlnaby sherifabdlnaby changed the title Upgrade from 1.7.7 to 2.1.0 use significantly more memory Upgrade from 1.7.7 to 2.1.0 uses Significantly More Memory Dec 11, 2024
@sherifabdlnaby sherifabdlnaby changed the title Upgrade from 1.7.7 to 2.1.0 uses Significantly More Memory Upgrade from 1.7.7 to 2.1.0 uses ~6X More Memory Dec 11, 2024
@tdachille-dev
Contributor

Hi, this increased memory footprint is the result of replacing stunnel with an in-house AWS component, efs-proxy, in efs-csi-driver v2.0+. efs-proxy is designed to optimize throughput through more aggressive caching, and at higher throughput levels it employs TCP multiplexing, both of which can increase memory usage.

@sherifabdlnaby
Contributor Author

Excuse my ignorance of the system internals here, but is there an option to toggle this multiplexing and caching on/off so that we can still run with minimal memory overhead? Not every app needs to achieve maximum throughput; in our case, our team wouldn't have made this tradeoff because none of our apps are pushing throughput limits.

Going from a 60Mi daemonset to 500Mi is a significant jump, especially for big clusters with thousands of nodes, and especially for general-purpose clusters where most pods don't mount an EFS volume yet still pay a 500Mi-per-node tax for a daemonset that is used relatively little.

@sherifabdlnaby
Contributor Author

Of course you can do sophisticated scheduling with more than one node group, deploying pods with EFS mounts only to a node group that runs the daemonset... or run two separate daemonsets, with one getting more resources on a specific node group (sketched below)... but all of these solutions introduce too much complexity.
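
For illustration, pinning the efs-csi-node daemonset to a dedicated node group would look roughly like the values override below (a sketch: it assumes the chart exposes node.nodeSelector, and the node label is made up). Every pod that mounts EFS then also needs a matching nodeSelector, which is exactly the complexity we'd rather avoid:

  # Helm values sketch: confine the efs-csi-node daemonset to a labeled node group.
  # Assumes the chart exposes node.nodeSelector; the label below is hypothetical.
  node:
    nodeSelector:
      node-group: efs-workloads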

@tdachille-dev
Contributor

Understood! So efs-proxy unlocks higher (3x) throughput than stunnel, but if throughput is not a concern for you and you'd like to avoid the additional memory overhead, you may use

  mountOptions:
    - stunnel

in your PV (for static provisioning) or StorageClass (for dynamic provisioning) k8s definition file. Hope that helps!

@tdachille-dev
Contributor

For explicit examples:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  storageClassName: efs-sc
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-123
  mountOptions:
    - stunnel

or

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-123
  directoryPerms: "700"
  gidRangeStart: "1750" # optional
  gidRangeEnd: "1751" # optional
  basePath: "/dynamic_provisioning" # optional
  subPathPattern: "${.PVC.namespace}/${.PVC.name}" # optional
  ensureUniqueDirectory: "false" # optional
  reuseAccessPoint: "false" # optional
mountOptions:
  - stunnel
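
A PVC consuming that StorageClass is then the usual (names are illustrative; the mountOptions from the class are carried onto the dynamically provisioned PV):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi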

@azagarelz

I have the same issue: I'm seeing the EFS CSI container grow consistently past 800 MB of RAM.
Is there a way to limit the memory reserved for caching, or to configure the eviction strategy?
I don't think caching is useful at all for the apps I'm running, so I'd like to at least be able to limit it.
