Cannot fine-tune LLM without GPU - CUDA error and DDP initialization #2371
Comments
Thank you for creating this!
|
Hi, this is the output:
|
So, as you can see, no GPU has been allocated to your PyTorch pod:
    resources:
      limits:
        cpu: 20
        memory: 20G
      requests:
        cpu: 20
        memory: 20G
Locally on Kind using macOS, I was able to run the example on CPU using
Where do you run your Kubernetes cluster? |
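The check described in the comment above (looking for a GPU entry in the pod's `resources`) can be sketched in Python. This is only an illustration: `requests_gpu` and `GPU_RESOURCE_KEYS` are hypothetical names, not part of the Kubeflow SDK, and the sketch assumes GPUs are exposed under the `nvidia.com/gpu` resource name used by the NVIDIA device plugin.

```python
# Hypothetical helper, not part of any Kubeflow or Kubernetes client library.
# Assumption: GPUs are advertised as "nvidia.com/gpu" (NVIDIA device plugin).
GPU_RESOURCE_KEYS = ("nvidia.com/gpu",)


def requests_gpu(resources: dict) -> bool:
    """Return True if limits or requests name a GPU resource with a non-zero count."""
    for section in ("limits", "requests"):
        for key, value in (resources.get(section) or {}).items():
            if key in GPU_RESOURCE_KEYS and int(value) > 0:
                return True
    return False


# Resources copied from the pod spec quoted in the comment above.
pod_resources = {
    "limits": {"cpu": 20, "memory": "20G"},
    "requests": {"cpu": 20, "memory": "20G"},
}

print(requests_gpu(pod_resources))  # False: no GPU entry in limits or requests
```

A pod spec that did carry a GPU would have an extra entry such as `nvidia.com/gpu: 1` under `limits`, and the same check would return `True`.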
Are you using a public cloud or on-prem to deploy the Kubeflow Control Plane? |
I used Jarvice to create a Kubeflow instance. |
Do you know which instances they run for the Kubernetes nodes? |
Sorry, I don't know. |
@thuytrang32 Can you also try to set the trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
) |
When I ran with both ddp_backend and use_cpu, it still produced the old error. Then I tried running with use_cpu=True only, and the code passed. Then I checked with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com, and it showed these errors again. For kubectl describe pod fine-tune-bert-worker-0 -n kubeflow-user-example-com, because the worker has a GPU, it didn't have the CUDA error, but it still had this |
I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected. I guess the CUDA error occurs because the trainer automatically detects the available device and tries to use CUDA if it's available. Can you try explicitly setting both the flags
Regarding the error |
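The device auto-detection described in this comment can be sketched as a small decision function. This is a simplified illustration, not the actual `transformers` or Kubeflow trainer code: `pick_device` and `pick_ddp_backend` are hypothetical names, and the sketch assumes only two facts, that `use_cpu=True` forces the CPU device and that the `nccl` backend needs GPUs while `gloo` also works on CPU.

```python
# Simplified sketch of the behaviour discussed above; hypothetical helpers,
# not the real transformers.TrainingArguments implementation.


def pick_device(cuda_available: bool, use_cpu: bool = False) -> str:
    """CUDA is chosen whenever it is available, unless use_cpu forces the CPU."""
    if use_cpu or not cuda_available:
        return "cpu"
    return "cuda"


def pick_ddp_backend(device: str) -> str:
    """Assumption: NCCL for GPU training, Gloo for CPU-only training."""
    return "nccl" if device == "cuda" else "gloo"


# Walk through the four combinations of "CUDA visible" and "use_cpu set".
for cuda_available in (True, False):
    for use_cpu in (True, False):
        device = pick_device(cuda_available, use_cpu)
        print(cuda_available, use_cpu, device, pick_ddp_backend(device))
```

Under these assumptions, a node where CUDA is visible only ends up on the CPU with the `gloo` backend when `use_cpu` is set explicitly, which is why the comment asks for the flags to be set by hand rather than left to detection.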
But target_module is just a parameter of LoraConfig. Why does the error still occur even when I don't use lora_config = LoraConfig(....)? |
It didn't work even though I set both no_cuda=True and use_cpu=True. Also, I saw that the size of the BERT model is only 1.3GB. Why is it still not enough when I already set the memory per worker to 20GB? |
Oh I see. It seems that when
As a result, the trainer still attempts to configure the PEFT model as indicated in the script: training-operator/sdk/python/kubeflow/trainer/hf_llm_training.py Lines 118 to 130 in 25c760c
It seems that even without specifying Meanwhile, @andreyvelich do you think this might be a bug? |
For the CUDA issue, sorry I don’t have a solution at the moment. It might be related to the base image used in the trainer. @andreyvelich @deepanker13 @johnugeorge @saileshd1402 Do you have any insights on this? Regarding the memory issue, the trainer image is quite large, so please ensure the device has at least 10GB of available memory. It could be a memory constraint on the device you’re using. Could you confirm if the device meets this requirement? |
What happened?
I am trying to fine-tune an LLM using Kubeflow without GPU devices. However, I encountered two issues during the process:
When I removed the gpu key from resources_per_worker, the training job still attempted to allocate GPUs, resulting in the CUDA error: invalid device ordinal.
To address this, I tried adding ddp_backend="gloo" to training_parameters. However, this led to another error:
I followed these instructions: https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/ . This is the code I ran:
`import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    # BERT model URI and type of Transformer to train it.
)`
What did you expect to happen?
The training job should correctly initialize without attempting to allocate GPUs.
Environment
Kubernetes version:
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest
Training Operator Python SDK version:
Impacted by this bug?
👍