Runners using local NVMe storage don't clean up after themselves — possible solutions? #1696
-
Controller Version: 0.25.2
Helm Chart Version: 0.20.2
CertManager Version: 1.9.1
Deployment Method: Helm
cert-manager installation: It's all good
Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners
  namespace: github-runners
spec:
  template:
    spec:
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      labels:
        - x64
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 300
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    name: fabricam-github-runners
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'
      scaleDownThreshold: '0.25'
      scaleUpFactor: '3'
      scaleDownFactor: '0.5'
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners-arm
  namespace: github-runners
spec:
  template:
    spec:
      labels:
        - arm64
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      nodeSelector:
        kubernetes.io/arch: arm64
      tolerations:
        - key: "nvme-ssd-enabled"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      dockerVolumeMounts:
        - mountPath: /var/lib/docker
          name: docker
          subPathExpr: $(POD_NAME)-docker
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      volumeMounts:
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
        # - mountPath: /tmp
        #   name: tmp
        #   subPathExpr: $(POD_NAME)-tmp
      dockerEnv:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumes:
        - hostPath:
            path: /nvme/disk
          name: docker
        - hostPath:
            path: /nvme/disk
          name: work
        # - hostPath:
        #     path: /nvme/disk
        #   name: tmp
      ephemeral: true
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler-arm
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  minReplicas: 1
  maxReplicas: 24
  scaleTargetRef:
    name: fabricam-github-runners-arm
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "7m"
  scheduledOverrides:
    - startTime: "2022-07-31T04:50:00Z"
      endTime: "2022-07-31T05:10:00Z"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners-x64-fast
  namespace: github-runners
spec:
  template:
    spec:
      labels:
        - x64-fast
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      nodeSelector:
        github-actions-runners-x64-fast: "true"
      tolerations:
        - key: "x64-fast"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      dockerVolumeMounts:
        - mountPath: /var/lib/docker
          name: docker
          subPathExpr: $(POD_NAME)-docker
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      volumeMounts:
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
        # - mountPath: /tmp
        #   name: tmp
        #   subPathExpr: $(POD_NAME)-tmp
      dockerEnv:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumes:
        - hostPath:
            path: /nvme/disk
          name: docker
        - hostPath:
            path: /nvme/disk
          name: work
        # - hostPath:
        #     path: /nvme/disk
        #   name: tmp
      ephemeral: true
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler-x64-fast
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  minReplicas: 1
  maxReplicas: 24
  scaleTargetRef:
    name: fabricam-github-runners-x64-fast
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "7m"
  scheduledOverrides:
    - startTime: "2022-07-31T04:50:00Z"
      endTime: "2022-07-31T05:10:00Z"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0

To Reproduce
Set up local NVMe storage-based machines for the runners, watch them slowly run out of space, bang head against wall.

Describe the bug
If you set up runners on machines with local NVMe-backed storage (which is great for performance), those machines will eventually run out of free space, because the runners don't clean up after themselves, even if you set ephemeral: true. This is documented, albeit the wording is somewhat confusing. Is there a way to improve this? We run this configuration on EKS, and right now we solved this problem mostly by:
#!/bin/sh
# Remove any directory under /nvme/disk that hasn't been modified for 24 hours
# (1440 minutes); errors from entries removed during the walk are discarded.
find /nvme/disk/* -type d -mmin +1440 -exec rm -rf {} \; 2> /dev/null
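(For reference, a cleanup like this has to run on every NVMe node. One way to schedule it is a small DaemonSet that mounts the host path and reuses the nvme-ssd-enabled toleration from the config above; the sketch below is illustrative only — the object names, image, and one-hour interval are assumptions, not the exact setup described in this issue.)

# Hypothetical cleanup DaemonSet; names, image, and interval are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvme-runner-cleanup
  namespace: github-runners
spec:
  selector:
    matchLabels:
      app: nvme-runner-cleanup
  template:
    metadata:
      labels:
        app: nvme-runner-cleanup
    spec:
      # Same toleration as the runner pods, so the cleanup pod can land on
      # the tainted NVMe nodes.
      tolerations:
        - key: "nvme-ssd-enabled"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: cleanup
          image: busybox:1.36
          # Delete leftover per-pod docker/work directories older than 24 hours,
          # then sleep for an hour and repeat.
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                find /nvme/disk/* -type d -mmin +1440 -exec rm -rf {} \; 2> /dev/null
                sleep 3600
              done
          volumeMounts:
            - name: nvme
              mountPath: /nvme/disk
      volumes:
        - name: nvme
          hostPath:
            path: /nvme/disk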
The cleanup script is mostly fine, but it'd be great if we didn't need it.

Describe the expected behavior
We don't need to do anything special to have runners clean up after themselves when using local NVMe-backed storage.

Controller Logs
https://gist.github.com/KTamas/aaeea1de4b5ea9f18fa0b78f223d6cc9

Runner Pod Logs
https://gist.github.com/KTamas/1c79ed632095ad978728a6816abe6924

Additional Context
No response
Replies: 4 comments 1 reply
-
We deploy with:

- name: actions-runner-controller
  version: 0.20.2
  namespace: actions-runner-system
  chart: actions-runner-controller/actions-runner-controller
  createNamespace: true
  installed: {{ .Values.action_runners_controller.installed | default "false" }}
  set:
    - name: image.actionsRunnerRepositoryAndTag
      value: "redacted.dkr.ecr.us-east-1.amazonaws.com/github-runner-image:latest"
    - name: githubWebhookServer.enabled
      value: true
    - name: githubWebhookServer.service.type
      value: LoadBalancer
    - name: authSecret.create
      value: true
    - name: authSecret.github_app_id
      value: "redacted"
    - name: authSecret.github_app_installation_id
      value: "redacted"
    - name: authSecret.github_app_private_key
      value: |
        -----BEGIN RSA PRIVATE KEY-----
        lmao
        -----END RSA PRIVATE KEY-----

Anyways, huge fan of this project!
-
@KTamas Hey! As it's just a volume and volumeMount pair, I think it's only natural for K8s and ARC NOT to clean up the volume in this case. What the config says is "use this host path as a volume", and in K8s that doesn't mean "clean this up after pod termination", right? You'd probably be better off trying K8s dynamic volume provisioning via our RunnerSet API: https://github.com/actions-runner-controller/actions-runner-controller#pv-backed-runner-work-directory and https://github.com/actions-runner-controller/actions-runner-controller#docker-image-layers-caching. I believe there are a few CSI plugins and PV/PVC configurations that let K8s dynamically provision PVs on your NVMe devices, as explained in https://blog.mayadata.io/why-use-localpv-with-nvme-for-your-workload. Also note that the vanilla K8s StatefulSet does not yet support automated clean-up of freed PVs, but our RunnerSet does support it.
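A minimal sketch of what that could look like, assuming a StorageClass named nvme-local that dynamically provisions local-NVMe PVs already exists in the cluster (the class name, sizes, and object names below are assumptions, not something taken from this thread):

# Hypothetical RunnerSet with a PV-backed work directory; StorageClass name
# and storage size are assumptions.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: fabricam-github-runners-nvme
  namespace: github-runners
spec:
  ephemeral: true
  organization: fabricam
  labels:
    - x64-fast
  # StatefulSet-style fields required by the RunnerSet API
  selector:
    matchLabels:
      app: fabricam-github-runners-nvme
  serviceName: fabricam-github-runners-nvme
  template:
    metadata:
      labels:
        app: fabricam-github-runners-nvme
    spec:
      containers:
        - name: runner
          volumeMounts:
            - name: work
              mountPath: /runner/_work
  # Each runner pod gets its own dynamically provisioned PV; RunnerSet can
  # clean up freed PVs so they can be reused, unlike a vanilla StatefulSet.
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: nvme-local
        resources:
          requests:
            storage: 50Gi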
-
Also see #1605. It's essentially about how to correctly use the vanilla K8s feature. My take is that this isn't a bug.
-
Please let me move this to Discussions, as that's the place for Q&A. (It's unfortunate that this project is not funded enough to allow me to answer every question, but that does not justify using Issues for Q&A!)