
Replace the use of a ReFrame template config file with a manually created one #850

Open · wants to merge 1 commit into base: 2023.06-software.eessi.io
Conversation

@casparvl (Collaborator) commented Jan 13, 2025

This means the user deploying a bot to build for software-layer will have to create the ReFrame config file manually and set the RFM_CONFIG_FILES environment variable in the session running the bot app.

@laraPPr I'll send you an example config file that should work with this PR. It'd be great if you could test it for me and let me know if it works. I'll also see if I can find someone with bot access on the AWS MC cluster to deploy the necessary config files and see if I can get it to work there...

WARNING: merging this PR will break any bot instance that has not manually set up a ReFrame config file and pointed the RFM_CONFIG_FILES environment variable to it. Ideally, we should first fix that for all bot instances, and only then merge this PR.

Replace the use of a ReFrame template config file with a manually created one. This means the user deploying a bot to build for software-layer will have to create those ReFrame config files and set the RFM_CONFIG_FILES environment variable in the session running the bot app

eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat


eessi-bot bot commented Jan 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-compat, eessi.io-2023.06-software

@casparvl (Collaborator, Author) commented Jan 13, 2025

@laraPPr I think if you set RFM_CONFIG_FILES to point to the file below (in the shell session running the bot app), this should work for you:

# reframe_config_bot.py

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_amd_zen3_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    'access': ['--export=None', '--nodes=1', '--cluster=accelgor', '--ntasks-per-node=12', '--gpus-per-node=1'],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': __MEM_PER_NODE__,  # template placeholder: replace with the actual memory available per node (in MiB)
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}
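
As a side note: since this file is plain Python, importing it is a quick way to catch typos (like a missing comma) before pointing RFM_CONFIG_FILES at it. A minimal sketch, assuming the EESSI test suite is installed so the imports at the top of the config resolve:

import importlib.util

# Load the config file as a module; any syntax error will surface immediately
spec = importlib.util.spec_from_file_location('botcfg', 'reframe_config_bot.py')
cfg = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cfg)

# Print the partition name as a basic sanity check of the structure
print(cfg.site_configuration['systems'][0]['partitions'][0]['name'])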

The only thing I would be curious about is whether the autodetected CPU topology shows 12 CPUs (i.e. the part that is in the cgroup for this allocation) or 48. Maybe you can have a look at the generated topology file.
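
In case it helps, a small sketch for peeking at that file; the path is an assumption based on where ReFrame normally stores auto-detected topology (~/.reframe/topology/<system>-<partition>/processor.json):

import json
from pathlib import Path

# Assumed default location of the auto-detected topology for this system/partition
topo_file = (Path.home() / '.reframe' / 'topology'
             / 'BotBuildTests-x86_64_amd_zen3_nvidia_cc80' / 'processor.json')
topo = json.loads(topo_file.read_text())
print(topo['num_cpus'], topo['num_cpus_per_socket'], topo['num_sockets'])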

Anyway, let me know :)

@casparvl (Collaborator, Author) commented Jan 14, 2025

Hmmm, so I tested this myself. I had the following config file:

$ cat example_reframe_config.py
# WARNING: this file is intended as template and the __X__ template variables need to be replaced
# before it can act as a configuration file
# Once replaced, this is a config file for running tests after the build phase, by the bot

from eessi.testsuite.common_config import common_logging_config
from eessi.testsuite.constants import *  # noqa: F403


site_configuration = {
    'systems': [
        {
            'name': 'BotBuildTests',  # The system HAS to have this name, do NOT change it
            'descr': 'Software-layer bot',
            'hostnames': ['.*'],
            'modules_system': 'lmod',
            'partitions': [
                {
                    'name': 'x86_64_intel_icelake_nvidia_cc80',
                    'scheduler': 'local',
                    'launcher': 'mpirun',
                    # Suppose that we have configured the bot with
                    # slurm_params = --hold --nodes=1 --export=None --time=0:30:0
                    # arch_target_map = {
                    #     "linux/x86_64/amd/zen3" : "--partition=gpu --ntasks-per-node=12 --gpus-per-node 1" }
                    # We would specify the relevant parameters as access flags:
                    'access': ['--export=None', '--nodes=1', '--partition=gpu_a100', '--ntasks-per-node=18', '--gpus-per-node=1' ],
                    'environs': ['default'],
                    'features': [
                        FEATURES[GPU]
                    ] + list(SCALES.keys()),
                    'resources': [
                        {
                            'name': '_rfm_gpu',
                            'options': ['--gpus-per-node={num_gpus_per_node}'],
                        },
                        {
                            'name': 'memory',
                            'options': ['--mem={size}'],
                        }
                    ],
                    'extras': {
                        # Make sure to round down, otherwise a job might ask for more mem than is available
                        # per node
                        'mem_per_node': 491520,
                        GPU_VENDOR: GPU_VENDORS[NVIDIA],
                    },
                    'devices': [
                        {
                            'type': DEVICE_TYPES[GPU],
                            # Since we specified --gpus-per-node 1, we pretend this virtual partition only has 1 GPU
                            # per node
                            'num_devices': 1,
                        }
                    ],
                    'max_jobs': 1
                    }
                ]
            }
        ],
    'environments': [
        {
            'name': 'default',
            'cc': 'cc',
            'cxx': '',
            'ftn': ''
            }
        ],
    'general': [
        {
            'purge_environment': True,
            'resolve_module_conflicts': False,  # avoid loading the module before submitting the job
            'remote_detect': True,
        }
    ],
    'logging': common_logging_config(),
}

Disappointingly enough, the CPU autodetection still gives the numbers for a full node, e.g.

...
    "sockets": [
      "0x000000000fffffffff",
      "0xfffffffff000000000"
    ],
...
...
  "num_cpus": 72,
  "num_cpus_per_core": 1,
  "num_cpus_per_socket": 36,
  "num_sockets": 2
}

In a way that's understandable: you don't know which socket you'll land on, so what should ReFrame put in the sockets field, "0x000000000fffffffff" or "0xfffffffff000000000"? That depends on which part of the node your job happens to land on.

A way out is of course to define the full thing manually. It means we don't have the core layout, but that piece of information is unreliable anyway, since we don't know a priori on which core set our build job (which allocates 1/4 of a node) will land. But I could quite easily define:

{
    "num_cpus": 18,
    "num_cpus_per_core": 1,
    "num_cpus_per_socket": 18,
    "num_sockets": 1
}
manually. We'd have to check that the tests don't request any information beyond this, but I think (at least for now) they don't.

Anyway, unless your bot is allocating full nodes, we should probably turn off CPU autodetection and specify CPU topology manually in the ReFrame config file...
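
For example, a minimal, untested sketch of what that could look like, reusing the 18-core numbers from above (the dict would go into the partition-level 'processor' setting; all values are assumptions to adjust per bot instance):

# Untested sketch: topology given by hand instead of auto-detected.
# Numbers describe what the bot job allocates, not the full node.
manual_processor_info = {
    'num_cpus': 18,           # CPUs available to the build/test job
    'num_cpus_per_core': 1,   # no SMT inside the allocation
    'num_cpus_per_socket': 18,
    'num_sockets': 1,
}

# inside the partition definition above:
#     'processor': manual_processor_info,

With 'processor' set for the partition, ReFrame shouldn't need auto-detection for it anymore, so 'remote_detect' in the 'general' section could then be dropped or set to False.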

@laraPPr (Collaborator) commented Jan 14, 2025

The PyTorch tests don't run when processor information is set in the config file.

@laraPPr (Collaborator) commented Jan 14, 2025

And I'm afraid that we will be in the queue forever waiting for a free node.

@laraPPr (Collaborator) commented Jan 14, 2025

I already need this: https://github.com/laraPPr/software-layer/blob/5c77cb67231057fae05fb86a2c062866aaf5f804/bot/test.sh#L128-L130
So maybe we should do something similar for the reframe command?

@casparvl (Collaborator, Author)

> The PyTorch tests don't run when processor information is set in the config file.

What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

> And I'm afraid that we will be in the queue forever waiting for a free node.

I'm confused how that's related to the change in this PR :D You mean your bot job doesn't get allocated because the cluster is busy, i.e. you have trouble testing?

@laraPPr (Collaborator) commented Jan 14, 2025

> I'm confused how that's related to the change in this PR :D You mean your bot job doesn't get allocated because the cluster is busy, i.e. you have trouble testing?

Yes, it takes very long to get an allocation. Maybe in production we should just request a full node, but then it could take 24 hours or more to get an allocation. Right now it starts quickly because I'm only asking for 1 GPU for half an hour.

@laraPPr (Collaborator) commented Jan 14, 2025

> What's the error you're getting? Could there be some piece of processor information missing that I didn't include above?

Reason: attribute error: ../../../../../../../scratch/gent/461/vsc46128/EESSI/test-suite/eessi/testsuite/utils.py:163: Processor information (num_cores_per_numa_node) missing. Check that processor information is either autodetected (see https://reframe-hpc.readthedocs.io/en/stable/configure.html#proc-autodetection), or manually set in the ReFrame configuration file (see https://reframe-hpc.readthedocs.io/en/stable/config_reference.html#processor-info).

    raise AttributeError(msg)
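
If I read this error right, num_cores_per_numa_node is not one of the four numbers suggested above; as far as I can tell, ReFrame derives it from the 'topology' part of the processor info, so a manual specification would need to include that too. A hedged, untested sketch, where the cpuset masks are assumptions for an 18-core, single-socket, single-NUMA-node allocation (double-check the required fields against the ReFrame config reference):

# Untested sketch: extend the manual processor info with a 'topology' section
# so ReFrame can derive NUMA-related values such as num_cores_per_numa_node.
manual_processor_info = {
    'num_cpus': 18,
    'num_cpus_per_core': 1,
    'num_cpus_per_socket': 18,
    'num_sockets': 1,
    'topology': {
        'numa_nodes': ['0x3ffff'],  # one NUMA node containing all 18 cores
        'sockets': ['0x3ffff'],     # one socket containing the same 18 cores
        'cores': [hex(1 << i) for i in range(18)],  # one cpuset string per core
    },
}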
