
Replace the use of a ReFrame template config file with a manually created one #850

Open · wants to merge 1 commit into base: 2023.06-software.eessi.io
Conversation

@casparvl (Collaborator) commented Jan 13, 2025

This means the user deploying a bot to build for software-layer will have to create the ReFrame config file manually and set the RFM_CONFIG_FILES environment variable in the session running the bot app.

@laraPPr I'll send you an example config file that should work with this PR. It'd be great if you could test it for me and let me know if it works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files and see if I can get it to work there...

WARNING: merging this PR will break any bot instance that has not manually set up a ReFrame config file and pointed the RFM_CONFIG_FILES environment variable to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.

Replace the use of a ReFrame template config file with a manually created one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app

eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@casparvl (Collaborator, Author) commented Jan 13, 2025

@laraPPr I think if you set RFM_CONFIG_FILES to point to the file below (in the shell session running the bot app), this should work for you:

# reframe_config_bot.py

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_amd_zen3_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'access': ['--export=None', '--nodes=1', '--cluster=accelgor', '--ntasks-per-node=12', '--gpus-per-node=1'],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': __MEM_PER_NODE__,  # template placeholder: replace with the actual memory available per node (in MiB)
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}
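
As a side note: since this file is plain Python, importing it is a quick way to catch typos (like a missing comma) before pointing RFM_CONFIG_FILES at it. A minimal sketch, assuming the EESSI test suite is installed so the imports at the top of the config resolve:

import importlib.util

# Load the config file as a module; any syntax error will surface immediately
spec = importlib.util.spec_from_file_location('botcfg', 'reframe_config_bot.py')
cfg = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cfg)

# Print the partition name as a basic sanity check of the structure
print(cfg.site_configuration['systems'][0]['partitions'][0]['name'])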

The only thing I would be curious about is whether the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the cgroup for this allocation) or 48. Maybe you can have a look at the generated topology file.
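
In case it helps, a small sketch for peeking at that file; the path is an assumption based on where ReFrame normally stores auto-detected topology (~/.reframe/topology/<system>-<partition>/processor.json):

import json
from pathlib import Path

# Assumed default location of the auto-detected topology for this system/partition
topo_file = (Path.home() / '.reframe' / 'topology'
             / 'BotBuildTests-x86_64_amd_zen3_nvidia_cc80' / 'processor.json')
topo = json.loads(topo_file.read_text())
print(topo['num_cpus'], topo['num_cpus_per_socket'], topo['num_sockets'])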

Anyway, let me know :)

@casparvl (Collaborator, Author) commented Jan 14, 2025

Hmmm, so I tested this myself. I had the following config file:

$ cat example_reframe_config.py
# WARNING: this file is intended as template and the __X__ template variables need to be replaced
# before it can act as a configuration file
# Once replaced, this is a config file for running tests after the build phase, by the bot

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_intel_icelake_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    # Suppose that we have configured the bot with
                    # slurm_params = --hold --nodes=1 --export=None --time=0:30:0
                    # arch_target_map = {
                    #     "linux/x86_64/amd/zen3" : "--partition=gpu --ntasks-per-node=12 --gpus-per-node 1" }
                    # We would specify the relevant parameters as access flags:
                    'access': ['--export=None', '--nodes=1', '--partition=gpu_a100', '--ntasks-per-node=18', '--gpus-per-node=1' ],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': 491520,
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}

Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.

...
    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],
...
...
  "num_cpus": 72,
  "num_cpus_per_core": 1,
  "num_cpus_per_socket": 36,
  "num_sockets": 2
}

In a way that's understandable: you don't know which socket you'll land on, so what should ReFrame put in the sockets field, "0x000000000fffffffff" or "0xfffffffff000000000"? That depends on which part of the node your job happens to land on.

A way out is of course to define the full thing manually. It means we don't have the core layout, but that piece of information is unreliable anyway, since we don't know a priori on which core set our build job (which allocates 1/4 of a node) will land. But I could quite easily define:

{
    "num_cpus": 18,
    "num_cpus_per_core": 1,
    "num_cpus_per_socket": 18,
    "num_sockets": 1
}
manually. We'd have to check that the tests don't request any information beyond this, but I think (at least for now) they don't.

Anyway, unless your bot is allocating full nodes, we should probably turn off CPU autodetection and specify CPU topology manually in the ReFrame config file...
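
For example, a minimal, untested sketch of what that could look like, reusing the 18-core numbers from above (the dict would go into the partition-level 'processor' setting; all values are assumptions to adjust per bot instance):

# Untested sketch: topology given by hand instead of auto-detected.
# Numbers describe what the bot job allocates, not the full node.
manual_processor_info = {
    'num_cpus': 18,           # CPUs available to the build/test job
    'num_cpus_per_core': 1,   # no SMT inside the allocation
    'num_cpus_per_socket': 18,
    'num_sockets': 1,
}

# inside the partition definition above:
#     'processor': manual_processor_info,

With 'processor' set for the partition, ReFrame shouldn't need auto-detection for it anymore, so 'remote_detect' in the 'general' section could then be dropped or set to False.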

@laraPPr (Collaborator) commented Jan 14, 2025

The PyTorch tests don't run when processor information is set in the config file.

@laraPPr (Collaborator) commented Jan 14, 2025

And I'm afraid that we will be in the queue forever waiting for a free node.

@laraPPr (Collaborator) commented Jan 14, 2025

I already need this: https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130
So maybe we should do something similar for the reframe command?

@casparvl (Collaborator, Author)

> The PyTorch tests don't run when processor information is set in the config file.

What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

> And I'm afraid that we will be in the queue forever waiting for a free node.

I'm confused how that's related to the change in this PR :D You mean your bot job doesn't get allocated because the cluster is busy, i.e. you have trouble testing?

@laraPPr (Collaborator) commented Jan 14, 2025

> I'm confused how that's related to the change in this PR :D You mean your bot job doesn't get allocated because the cluster is busy, i.e. you have trouble testing?

Yes, it takes very long to get an allocation. Maybe in production we should just request a full node, but then it could take 24 hours or more to get an allocation. Right now it starts quickly because I'm only asking for 1 GPU for half an hour.

@laraPPr (Collaborator) commented Jan 14, 2025

> What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

Reason: attribute error: ../../../../../../../scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).

    raise AttributeError(msg)
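
If I read this error right, num_cores_per_numa_node is not one of the four numbers suggested above; as far as I can tell, ReFrame derives it from the 'topology' part of the processor info, so a manual specification would need to include that too. A hedged, untested sketch, where the cpuset masks are assumptions for an 18-core, single-socket, single-NUMA-node allocation (double-check the required fields against the ReFrame config reference):

# Untested sketch: extend the manual processor info with a 'topology' section
# so ReFrame can derive NUMA-related values such as num_cores_per_numa_node.
manual_processor_info = {
    'num_cpus': 18,
    'num_cpus_per_core': 1,
    'num_cpus_per_socket': 18,
    'num_sockets': 1,
    'topology': {
        'numa_nodes': ['0x3ffff'],  # one NUMA node containing all 18 cores
        'sockets': ['0x3ffff'],     # one socket containing the same 18 cores
        'cores': [hex(1 << i) for i in range(18)],  # one cpuset string per core
    },
}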
