gpu number cannot be used #31

Open
Trainbow opened this issue Jan 18, 2023 · 11 comments

Comments

@Trainbow

No description provided.

@Trainbow changed the title from "gpu" to "gpu number cannot be used" on Jan 18, 2023
@Trainbow
Author

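Hello, I am trying out volcano gpu number scheduling. After installing by following the steps in the volcano tutorial, every GPU node correctly shows how many GPUs it has. However, when I create a pod, the container does not have the volcano-gpu-number environment variable, and running nvidia-smi inside it shows all of the node's GPUs. Do I need to change the yaml file?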

@Thor-wl
Contributor

Thor-wl commented Jan 19, 2023

Hey, which version are you using?

@Trainbow
Author

volcano-1.6.0

@Thor-wl
Contributor

Thor-wl commented Jan 29, 2023

/cc @wangyang0616 Can you help take a look?

@wangyang0616
Member

ok, let me take a look

@wangyang0616
Member

@Trainbow Could you post the yaml file you used to create the test task?
By the way, can the pod be scheduled successfully with the default k8s scheduler?

@Trainbow
Author

I used the sample yaml from the volcano-gpu-number readme.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
  namespace: model
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/gpu-number: 1 # requesting 1 GPU card
          # nvidia.com/gpu: 1

I also installed nvidia's k8s-device-plugin for testing. When the limits field uses nvidia.com/gpu, the pod's container works as expected and has one GPU device. When I use volcano.sh/gpu-number, the container's environment does not have the variable VOLCANO_GPU_ALLOCATED, and NVIDIA_VISIBLE_DEVICES is all.
I also tried gpu-sharing with volcano by following the official tutorial, and in that case I can find the corresponding environment variables in the pod.
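
For anyone reproducing this, a minimal sketch of the check described above, meant to be run inside the container. The variable names VOLCANO_GPU_ALLOCATED and NVIDIA_VISIBLE_DEVICES come from this comment; the function name and the output wording are illustrative only.

```shell
#!/usr/bin/env bash
# Report which GPU-related variables the container actually received.
# Variable names are taken from the comment above; check_gpu_env is a
# hypothetical helper name.
check_gpu_env() {
  if [ -z "${VOLCANO_GPU_ALLOCATED:-}" ]; then
    echo "VOLCANO_GPU_ALLOCATED: not set"
  else
    echo "VOLCANO_GPU_ALLOCATED: ${VOLCANO_GPU_ALLOCATED}"
  fi
  case "${NVIDIA_VISIBLE_DEVICES:-}" in
    "")  echo "NVIDIA_VISIBLE_DEVICES: not set" ;;
    all) echo "NVIDIA_VISIBLE_DEVICES: all (every GPU on the node is visible)" ;;
    *)   echo "NVIDIA_VISIBLE_DEVICES: ${NVIDIA_VISIBLE_DEVICES}" ;;
  esac
}
```

A quicker one-off alternative from outside the pod would be something like `kubectl exec -n model gpu-pod1 -- env`, filtered for the two names.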

@wangyang0616
Member

The Volcano device plugin's GPU strategy defaults to share mode, which means you can use volcano.sh/gpu-memory.
If you want to use volcano.sh/gpu-number, you need to set the GPU strategy to number; for details, see config-the-volcano-device-plugin-binary.

Hope the above information is helpful to you.
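
One way to see which strategy a running plugin was started with is to inspect its container arguments. The sketch below assumes the option is passed as a command-line flag spelled `--gpu-strategy=<share|number>`; that spelling is an assumption not stated in this thread, so check it against the device plugin's own documentation. The helper name is hypothetical.

```shell
# Extract the GPU strategy from device-plugin arguments read on stdin.
# ASSUMPTION: the flag is spelled --gpu-strategy=<share|number>.
# Falls back to "share", the default mentioned in the comment above.
gpu_strategy_from_args() {
  local strategy
  strategy="$(grep -o -e '--gpu-strategy=[a-z]*' | head -n 1 | cut -d= -f2 || true)"
  echo "${strategy:-share}"
}
```

The function could be fed with the args of the plugin's DaemonSet, for example the output of `kubectl get daemonset ... -o jsonpath='{.spec.template.spec.containers[0].args}'`. It will not detect a strategy configured through an environment variable or a config file.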

@Hugh-yw

Hugh-yw commented Nov 6, 2024

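@wangyang0616 Hello, I am using volcano v1.8.1 and k8s v1.23.17. I installed volcano-device-plugin for the first time, and after testing I uninstalled the volcano-device-plugin component and then uninstalled volcano itself. The nodes in the cluster still carry the volcano.sh resource labels, and pods can still request volcano.sh/gpu-number and be scheduled by the native k8s scheduler. Besides volcano-device-plugin.yml, are there other special resources that were not cleaned up, or is there some other cause?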

Capacity:
  cpu:                    128
  ephemeral-storage:      824646552Ki
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 1056477056Ki
  nvidia.com/gpu:         8
  pods:                   520
  volcano.sh/gpu-memory:  0   # why does the volcano.sh resource label still exist?
  volcano.sh/gpu-number:  8   # why does the volcano.sh resource label still exist?
Allocatable:
  cpu:                    127600m
  ephemeral-storage:      824646552Ki
  hugepages-1Gi:          0
  hugepages-2Mi:          0
  memory:                 1027216591271
  nvidia.com/gpu:         8
  pods:                   520
  volcano.sh/gpu-memory:  0   # why does the volcano.sh resource label still exist?
  volcano.sh/gpu-number:  8   # why does the volcano.sh resource label still exist?
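
To check several nodes for leftovers like these, a small filter over `kubectl describe node <name>` output can help. This is only a sketch: the function name is hypothetical, and it assumes the `name:  value` layout shown above.

```shell
# List volcano.sh/* resources found in `kubectl describe node` output (stdin).
# Prints one "name=value" line per distinct entry, so identical Capacity and
# Allocatable values collapse into a single line.
list_volcano_resources() {
  awk '$1 ~ /^volcano\.sh\// { sub(/:$/, "", $1); print $1 "=" $2 }' | sort -u
}
```

An empty result would mean the node no longer advertises any volcano.sh resources.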

@Trainbow
Author

Trainbow commented Nov 6, 2024 via email

@Hugh-yw

Hugh-yw commented Nov 7, 2024

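This issue is resolved. You need to call the apiserver API to delete the extended resources. I hope this logic can be integrated into the device-plugin, so that the resource entries are cleaned up automatically when the plugin is uninstalled.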

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "remove", "path": "/status/capacity/volcano.sh~1gpu-number"}]' \
  http://localhost:8001/api/v1/nodes/ser-inspur-01/status
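
The `~1` in the path above is the JSON Pointer (RFC 6901) escape for `/` inside a key name. The sketch below builds the same patch body for one or more resource names, so the second leftover resource (volcano.sh/gpu-memory) can be removed in the same request. The function names are hypothetical, and like the command above it only touches `/status/capacity`.

```shell
# Escape a key for use in a JSON Pointer (RFC 6901):
# "~" becomes "~0" first, then "/" becomes "~1".
escape_json_pointer() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

# Build a JSON Patch body that removes each named extended resource
# from the node's status capacity.
build_remove_patch() {
  local sep="" name
  printf '['
  for name in "$@"; do
    printf '%s{"op": "remove", "path": "/status/capacity/%s"}' \
      "$sep" "$(escape_json_pointer "$name")"
    sep=", "
  done
  printf ']\n'
}
```

The output of `build_remove_patch volcano.sh/gpu-number volcano.sh/gpu-memory` can be passed as the `--data` argument of the curl command above. The `localhost:8001` address in that command looks like a local `kubectl proxy` endpoint, which is an assumption here since the comment does not say so.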
