Replace the use of a ReFrame template config file for a manually created one #850
base: 2023.06-software.eessi.io
…ted one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app
@laraPPr I think you set
The only thing I would be curious about is whether the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the cgroup for this allocation) or 48. Maybe you can have a look at the generated topology file. Anyway, let me know :)
Hmmm, so I tested this myself. I had the following config file:
Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.
In a way that's understandable: you don't know which socket you'll land on, so what should it put for the …? A way out is of course to define the full thing manually. It means we don't have the core layout, but that piece of information is unreliable anyway, since we don't know a priori on which core set our build job (which allocates 1/4 of a node) will land. But I could quite easily define:
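The snippet originally posted here is not preserved. As a rough illustration only, a manually defined `processor` section in a ReFrame site configuration could look like the sketch below. The system, partition and environment names are placeholders, and the socket and threads-per-core values are assumptions; only the 12-of-48 CPU count comes from the discussion above.

```python
# Hypothetical ReFrame site configuration with processor info set by hand
# instead of relying on CPU autodetection. All names are placeholders.
site_configuration = {
    'systems': [
        {
            'name': 'example_system',          # placeholder, not a real bot system
            'descr': 'Example build/test system',
            'hostnames': ['.*'],
            'partitions': [
                {
                    'name': 'example_partition',  # placeholder
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['default'],
                    # Describe only the slice of the node the job is allocated:
                    # 12 CPUs, i.e. 1/4 of a 48-CPU node.
                    'processor': {
                        'num_cpus': 12,
                        'num_sockets': 1,           # assumption
                        'num_cpus_per_socket': 12,  # assumption
                        'num_cpus_per_core': 1,     # assumption: no SMT
                    },
                },
            ],
        },
    ],
    'environments': [
        {'name': 'default'},
    ],
}
```

Since the core layout is left out, any test that depends on detailed topology information would still need to be checked against such a config.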
The PyTorch tests don't run when processor information is set in the config file.
And I'm afraid that we will be in the queue forever waiting for a free node.
I do already need this: https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130
What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?
I'm confused how that's related to the change in this PR :D You mean your bot job doesn't get allocated because the cluster is busy, i.e. you have trouble testing?
Yes, it takes very long to get an allocation. Maybe in production we should just request a full node, but then it could take 24 hours or more to get an allocation. Right now it starts quickly because I'm only asking for 1 GPU for half an hour.
This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app.
@laraPPr I'll send you an example config file that should work with this PR. It'd be great if you could test it for me and let me know whether it works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files, and see if I can get it to work there...
WARNING: merging this PR will break any bot instance that has not manually set up a ReFrame config file and set the RFM_CONFIG_FILES environment variable to point to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.
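To check a bot instance before merging, something along these lines could be run in the session that starts the bot app. This is a minimal sketch, not part of the PR: the helper name `missing_rfm_config_files` is made up for illustration, and it assumes RFM_CONFIG_FILES holds a colon-separated list of config file paths.

```python
import os


def missing_rfm_config_files(environ=os.environ):
    """Return the entries of RFM_CONFIG_FILES that are not existing files.

    Hypothetical helper: raises RuntimeError if the variable is unset or empty.
    """
    value = environ.get('RFM_CONFIG_FILES', '')
    if not value:
        raise RuntimeError('RFM_CONFIG_FILES is not set in this session')
    # Assumption: multiple config files are separated by colons.
    paths = [p for p in value.split(':') if p]
    return [p for p in paths if not os.path.isfile(p)]
```

An empty list means every listed config file exists; a non-empty list or a `RuntimeError` means the instance would likely break once this PR is merged.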