gpu number cannot be used #31
Comments
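Hi, I am trying out Volcano's gpu-number scheduling. After installing by following the Volcano tutorial, every GPU node correctly reports how many GPUs it has. However, when I create a pod, the container does not have the volcano-gpu-number environment variable, and running nvidia-smi inside it shows all GPUs on the node. Do I need to change the YAML file? |
Hey, which version are you using? |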
volcano-1.6.0 |
/cc @wangyang0616 Can you help take a look? |
ok, let me take a look |
@Trainbow Could you post the YAML file you used to create the test task? |
I used the sample YAML from the volcano-gpu-number README.
I also installed NVIDIA's k8s-device-plugin for testing. For example, when the limits field uses nvidia.com/gpu, the pod's container works well and gets one GPU device. When I use volcano.sh/gpu-number, the container's environment does not have the variable. |
Volcano Device Plugin Hope the above information is helpful to you. |
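@wangyang0616 Hi, I am using Volcano v1.8.1 on Kubernetes v1.23.17. I installed volcano-device-plugin for the first time, and after testing I uninstalled the volcano-device-plugin component and then Volcano itself. The nodes in the cluster still advertise the volcano.sh resources, and pods requesting volcano.sh/gpu-number can still be scheduled by the native Kubernetes scheduler. Apart from volcano-device-plugin.yml, are there other special resources that were not cleaned up, or is something else causing this?
Capacity: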
cpu: 128
ephemeral-storage: 824646552Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1056477056Ki
nvidia.com/gpu: 8
pods: 520
volcano.sh/gpu-memory: 0 # why does the volcano.sh resource still exist?
volcano.sh/gpu-number: 8 # why does the volcano.sh resource still exist?
Allocatable:
cpu: 127600m
ephemeral-storage: 824646552Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1027216591271
nvidia.com/gpu: 8
pods: 520
volcano.sh/gpu-memory: 0 # why does the volcano.sh resource still exist?
volcano.sh/gpu-number: 8 # why does the volcano.sh resource still exist? |
Received, thank you!
|
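This issue is resolved. The extended resources have to be removed by calling the apiserver API. It would be good to integrate this logic into the device plugin, so that uninstalling the plugin automatically clears the resource entries.
curl --header "Content-Type: application/json-patch+json" \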
--request PATCH \
--data '[{"op": "remove", "path": "/status/capacity/volcano.sh~1gpu-number"}]' \
http://localhost:8001/api/v1/nodes/ser-inspur-01/status |
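The `~1` in the patch path above is the JSON Pointer (RFC 6901) escape for the `/` in `volcano.sh/gpu-number`. The following is a minimal sketch, not part of the original thread, of how the same payload could be built for several resources at once. The helper names `escape_json_pointer`, `build_remove_patch`, and `node_status_url` are hypothetical, and the base URL assumes a local proxy to the apiserver as in the curl command.

```python
import json


def escape_json_pointer(token: str) -> str:
    """Escape one JSON Pointer reference token (RFC 6901).

    "~" must be replaced first, otherwise the "~" introduced by "~1"
    would be escaped a second time.
    """
    return token.replace("~", "~0").replace("/", "~1")


def build_remove_patch(resource_names):
    """Build a JSON Patch that removes each resource from node capacity."""
    return [
        {"op": "remove", "path": "/status/capacity/" + escape_json_pointer(name)}
        for name in resource_names
    ]


def node_status_url(api_base: str, node_name: str) -> str:
    """Return the node status endpoint (api_base is an assumed proxy address)."""
    return f"{api_base.rstrip('/')}/api/v1/nodes/{node_name}/status"


if __name__ == "__main__":
    # Resource names and node name are taken from the thread above.
    patch = build_remove_patch(["volcano.sh/gpu-number", "volcano.sh/gpu-memory"])
    print(node_status_url("http://localhost:8001", "ser-inspur-01"))
    print(json.dumps(patch, indent=2))
```

The printed JSON can be passed to `curl --data` in place of the inline payload. Note that a JSON Patch is applied atomically, so if one of the listed resources is already absent from the node, the whole request is rejected; in that case it may be simpler to send one patch per resource.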