Runners using local NVMe storage don't clean up after themselves — possible solutions? #1696
-
Controller Version: 0.25.2
Helm Chart Version: 0.20.2
CertManager Version: 1.9.1
Deployment Method: Helm
cert-manager installation: It's all good
Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners
  namespace: github-runners
spec:
  template:
    spec:
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      labels:
        - x64
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 300
  minReplicas: 1
  maxReplicas: 12
  scaleTargetRef:
    name: fabricam-github-runners
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'
      scaleDownThreshold: '0.25'
      scaleUpFactor: '3'
      scaleDownFactor: '0.5'
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners-arm
  namespace: github-runners
spec:
  template:
    spec:
      labels:
        - arm64
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      nodeSelector:
        kubernetes.io/arch: arm64
      tolerations:
        - key: "nvme-ssd-enabled"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      dockerVolumeMounts:
        - mountPath: /var/lib/docker
          name: docker
          subPathExpr: $(POD_NAME)-docker
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      volumeMounts:
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
        # - mountPath: /tmp
        #   name: tmp
        #   subPathExpr: $(POD_NAME)-tmp
      dockerEnv:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumes:
        - hostPath:
            path: /nvme/disk
          name: docker
        - hostPath:
            path: /nvme/disk
          name: work
        # - hostPath:
        #     path: /nvme/disk
        #   name: tmp
      ephemeral: true
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler-arm
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  minReplicas: 1
  maxReplicas: 24
  scaleTargetRef:
    name: fabricam-github-runners-arm
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "7m"
  scheduledOverrides:
    - startTime: "2022-07-31T04:50:00Z"
      endTime: "2022-07-31T05:10:00Z"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: fabricam-github-runners-x64-fast
  namespace: github-runners
spec:
  template:
    spec:
      labels:
        - x64-fast
      env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      organization: fabricam
      dockerRegistryMirror: https://mirror.gcr.io/
      nodeSelector:
        github-actions-runners-x64-fast: "true"
      tolerations:
        - key: "x64-fast"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      dockerVolumeMounts:
        - mountPath: /var/lib/docker
          name: docker
          subPathExpr: $(POD_NAME)-docker
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
      volumeMounts:
        - mountPath: /runner/_work
          name: work
          subPathExpr: $(POD_NAME)-work
        # - mountPath: /tmp
        #   name: tmp
        #   subPathExpr: $(POD_NAME)-tmp
      dockerEnv:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      volumes:
        - hostPath:
            path: /nvme/disk
          name: docker
        - hostPath:
            path: /nvme/disk
          name: work
        # - hostPath:
        #     path: /nvme/disk
        #   name: tmp
      ephemeral: true
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: fabricam-github-runners-autoscaler-x64-fast
  namespace: github-runners
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  minReplicas: 1
  maxReplicas: 24
  scaleTargetRef:
    name: fabricam-github-runners-x64-fast
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "7m"
  scheduledOverrides:
    - startTime: "2022-07-31T04:50:00Z"
      endTime: "2022-07-31T05:10:00Z"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0

To Reproduce
Set up local NVMe storage-based machines for the runners, watch them slowly run out of space, bang head against wall.

Describe the bug
If you set up runners on machines with local NVMe-backed storage (which is great for performance), those machines will eventually run out of free space, because the runners don't clean up after themselves, even if you set ephemeral: true. This is documented, albeit the wording is somewhat confusing. Is there a way to improve this? We run this configuration on EKS, and right now we solved this problem mostly by:
#!/bin/sh
# Remove any directory under /nvme/disk that hasn't been modified for 24 hours
# (1440 minutes); errors from entries removed during the walk are discarded.
find /nvme/disk/* -type d -mmin +1440 -exec rm -rf {} \; 2> /dev/null
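(For reference, a cleanup like this has to run on every NVMe node. One way to schedule it is a small DaemonSet that mounts the host path and reuses the nvme-ssd-enabled toleration from the config above; the sketch below is illustrative only — the object names, image, and one-hour interval are assumptions, not the exact setup described in this issue.)

# Hypothetical cleanup DaemonSet; names, image, and interval are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvme-runner-cleanup
  namespace: github-runners
spec:
  selector:
    matchLabels:
      app: nvme-runner-cleanup
  template:
    metadata:
      labels:
        app: nvme-runner-cleanup
    spec:
      # Same toleration as the runner pods, so the cleanup pod can land on
      # the tainted NVMe nodes.
      tolerations:
        - key: "nvme-ssd-enabled"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: cleanup
          image: busybox:1.36
          # Delete leftover per-pod docker/work directories older than 24 hours,
          # then sleep for an hour and repeat.
          command:
            - /bin/sh
            - -c
            - |
              while true; do
                find /nvme/disk/* -type d -mmin +1440 -exec rm -rf {} \; 2> /dev/null
                sleep 3600
              done
          volumeMounts:
            - name: nvme
              mountPath: /nvme/disk
      volumes:
        - name: nvme
          hostPath:
            path: /nvme/disk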
The cleanup script is mostly fine, but it'd be great if we didn't need it.

Describe the expected behavior
We don't need to do anything special to have runners clean up after themselves when using local NVMe-backed storage.

Controller Logs
https://gist.github.com/KTamas/aaeea1de4b5ea9f18fa0b78f223d6cc9

Runner Pod Logs
https://gist.github.com/KTamas/1c79ed632095ad978728a6816abe6924

Additional Context
No response
Replies: 4 comments 1 reply
-
We deploy with:

- name: actions-runner-controller
  version: 0.20.2
  namespace: actions-runner-system
  chart: actions-runner-controller/actions-runner-controller
  createNamespace: true
  installed: {{ .Values.action_runners_controller.installed | default "false" }}
  set:
    - name: image.actionsRunnerRepositoryAndTag
      value: "redacted.dkr.ecr.us-east-1.amazonaws.com/github-runner-image:latest"
    - name: githubWebhookServer.enabled
      value: true
    - name: githubWebhookServer.service.type
      value: LoadBalancer
    - name: authSecret.create
      value: true
    - name: authSecret.github_app_id
      value: "redacted"
    - name: authSecret.github_app_installation_id
      value: "redacted"
    - name: authSecret.github_app_private_key
      value: |
        -----BEGIN RSA PRIVATE KEY-----
        lmao
        -----END RSA PRIVATE KEY-----

Anyways, huge fan of this project!
-
@KTamas Hey! As it's just a volume and volumeMount pair, I think it's only natural for K8s and ARC NOT to clean up the volume in this case. What the config says is "use this host path as a volume", and in K8s that doesn't mean "clean this up after pod termination", right? You'd probably be better off trying K8s dynamic volume provisioning via our RunnerSet API: https://github.com/actions-runner-controller/actions-runner-controller#pv-backed-runner-work-directory and https://github.com/actions-runner-controller/actions-runner-controller#docker-image-layers-caching. I believe there are a few CSI plugins and PV/PVC configurations that let K8s dynamically provision PVs on your NVMe devices, as explained in https://blog.mayadata.io/why-use-localpv-with-nvme-for-your-workload. Also note that the vanilla K8s StatefulSet does not yet support automated clean-up of freed PVs, but our RunnerSet does support it.
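A minimal sketch of what that could look like, assuming a StorageClass named nvme-local that dynamically provisions local-NVMe PVs already exists in the cluster (the class name, sizes, and object names below are assumptions, not something taken from this thread):

# Hypothetical RunnerSet with a PV-backed work directory; StorageClass name
# and storage size are assumptions.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: fabricam-github-runners-nvme
  namespace: github-runners
spec:
  ephemeral: true
  organization: fabricam
  labels:
    - x64-fast
  # StatefulSet-style fields required by the RunnerSet API
  selector:
    matchLabels:
      app: fabricam-github-runners-nvme
  serviceName: fabricam-github-runners-nvme
  template:
    metadata:
      labels:
        app: fabricam-github-runners-nvme
    spec:
      containers:
        - name: runner
          volumeMounts:
            - name: work
              mountPath: /runner/_work
  # Each runner pod gets its own dynamically provisioned PV; RunnerSet can
  # clean up freed PVs so they can be reused, unlike a vanilla StatefulSet.
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: nvme-local
        resources:
          requests:
            storage: 50Gi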
-
Also see #1605. It's essentially about how to correctly use the vanilla K8s feature. My take is that this isn't a bug.
-
Please let me move this to Discussions, as that's the place for Q&A. (It's unfortunate that this project is not funded enough to allow me to answer every question, but that does not justify using Issues for Q&A!)