Generic Resource (GRES) Design Guide
Overview
Generic Resources (GRES) are resources associated with a specific node that can be allocated to jobs and steps. The most obvious example of a GRES is a GPU. GRES are identified by a specific name and use an optional plugin to provide device-specific support. This document describes Slurm's implementation of GRES support, including the relevant data structures. For an overview of GRES configuration and use, see Generic Resource (GRES) Scheduling.
Data Structures
GRES are associated with Slurm nodes, jobs and job steps. Each of those data structures contains a string variable named gres, which stores the GRES configured on a node or required by a job or step (e.g. "gpu:2,nic:1"). This string is also visible to various Slurm commands that display information about those data structures (e.g. "scontrol show job"). On the slurmctld daemon, each of those data structures has a second variable, named gres_list, that is intended for program use only. Each element in gres_list provides information about a specific GRES type (e.g. one data structure for "gpu" and a second for "nic"). The structures on gres_list contain an ID number (which is faster to compare than a string) plus a pointer to another structure. This second structure differs somewhat for nodes, jobs, and steps (see gres_node_state_t, gres_job_state_t, and gres_step_state_t in src/common/gres.h for details), but contains various counters and bitmaps. Since these data structures differ by entity type, the functions used to work with them also differ. If no GRES are associated with a node, job or step, then both gres and gres_list will be NULL.
 ------------------------
 | Job Information      |
 |----------------------|
 | gres = "gpu:2,nic:1" |
 | gres_list            |
 ------------------------
      |
      +----------------------------------
      |                                 |
 ------------------            ------------------
 |  List Struct   |            |  List Struct   |
 |----------------|            |----------------|
 | id = 123 (gpu) |            | id = 124 (nic) |
 | gres_data      |            | gres_data      |
 ------------------            ------------------
      |                                 |
      |                                ....
      |
 ------------------------------------------------
 |               gres_job_state_t               |
 |----------------------------------------------|
 | gres_count = 2                               |
 | node_count = 3                               |
 | gres_bitmap(by node) = 0,1;                  |
 |                        2,3;                  |
 |                        0,2                   |
 | gres_count_allocated_to_steps(by node) = 1;  |
 |                                          1;  |
 |                                          1   |
 | gres_bitmap_allocated_to_steps(by node) = 0; |
 |                                           2; |
 |                                           0  |
 ------------------------------------------------
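To make the relationship above concrete, the sketch below shows a simplified C rendering of a gres_list element and the per-type job state from the diagram. The type and field names here are illustrative only, and a plain uint64_t stands in for Slurm's bitstr_t; see src/common/gres.h for the authoritative definitions.

/* Simplified sketch of the structures in the diagram above; the real
 * definitions live in src/common/gres.h and contain additional fields. */

#include <stdint.h>

/* One element of gres_list: identifies the GRES type and points to
 * type-specific state (node, job or step flavor). */
typedef struct {
        uint32_t id;          /* e.g. 123 for "gpu"; faster to compare than a string */
        void    *gres_data;   /* points to gres_node_state_t, gres_job_state_t or
                               * gres_step_state_t, depending on the entity */
} gres_list_elem_sketch_t;

/* Per-type job state, mirroring the diagram.  Each array is indexed by
 * the node's position within the job allocation. */
typedef struct {
        uint64_t  gres_count;                       /* GRES required per node        */
        uint32_t  node_count;                       /* nodes in the job allocation   */
        uint64_t *gres_bitmap;                      /* per-node device bitmaps       */
        uint64_t *gres_count_allocated_to_steps;    /* per-node counts held by steps */
        uint64_t *gres_bitmap_allocated_to_steps;   /* per-node step device bitmaps  */
} gres_job_state_sketch_t;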
Mode of Operation
After the slurmd daemon reads the configuration files, it calls the function node_config_load() for each configured plugin. This can be used to validate the configuration, for example to verify that the appropriate devices actually exist. If no GRES plugin exists for a resource type, the information in the configuration file is assumed to be correct. Each node's GRES information is reported by slurmd to the slurmctld daemon at node registration time.
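As an illustration of the kind of check node_config_load() might perform, the sketch below verifies that the expected GPU device files exist. The "/dev/nvidia%d" path pattern, the function name, and the return convention are assumptions made for this sketch, not part of any real plugin.

/* Illustrative only: verify that the device files a node is configured
 * with actually exist on the local system. */

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

static int validate_gpu_devices(int configured_count)
{
        struct stat st;
        char path[64];

        for (int i = 0; i < configured_count; i++) {
                snprintf(path, sizeof(path), "/dev/nvidia%d", i);
                if (stat(path, &st) != 0) {
                        fprintf(stderr, "GRES gpu: device %s not found\n", path);
                        return -1;      /* configuration does not match hardware */
                }
        }
        return 0;       /* all configured devices are present */
}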
The slurmctld daemon maintains GRES information in the data structures described above for each node, including the number of configured and allocated resources. If those resources are identified with a specific device file rather than just a count, bitmaps are used to record which specific resources have been allocated to jobs.
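The sketch below models that bitmap idea with a plain uint64_t in place of Slurm's bitstr_t: each configured device corresponds to one bit, and allocating a device to a job sets the matching bit in the node's allocation bitmap. This is a simplified model of the accounting, not the actual slurmctld code.

/* Simplified model of per-node GRES accounting: one bit per device. */

#include <stdint.h>

typedef struct {
        uint64_t gres_cnt_config;  /* devices configured on the node      */
        uint64_t gres_cnt_alloc;   /* devices currently allocated to jobs */
        uint64_t gres_bit_alloc;   /* bit i set => device i is allocated  */
} node_gres_sketch_t;

/* Allocate "count" free devices on the node, returning a bitmap of the
 * devices chosen, or 0 if not enough devices are free. */
static uint64_t alloc_node_gres(node_gres_sketch_t *node, unsigned count)
{
        uint64_t picked = 0;
        unsigned found = 0;

        for (unsigned i = 0; i < node->gres_cnt_config && found < count; i++) {
                if (!(node->gres_bit_alloc & (1ULL << i))) {
                        picked |= 1ULL << i;
                        found++;
                }
        }
        if (found < count)
                return 0;               /* request cannot be satisfied */
        node->gres_bit_alloc |= picked; /* mark the chosen devices as in use */
        node->gres_cnt_alloc += count;
        return picked;
}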
The slurmctld daemon's GRES information about jobs includes several arrays equal in length to the number of allocated nodes. The index into each of the arrays is the sequence number of the node within that job's allocation (e.g. the first element is node zero of the job allocation). The job step's GRES information is similar to that of a job, including the design in which the index into the arrays is based upon the job's allocation. This means that when a job step is allocated or terminates, the required bitmap operations are very easy to perform without computing different index values for job and step data structures.
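Because the job and step arrays share the same node index, a step allocation reduces to simple bitmap operations against the job's per-node bitmaps. The sketch below illustrates this with the simplified uint64_t bitmaps used earlier; it is not the actual step allocation code, and the function name is invented for this example.

/* Illustrative step allocation: job and step arrays are both indexed by
 * the node's position within the job allocation, so no index translation
 * is needed. */

#include <stdint.h>

/* Pick "count" devices on node "node_inx" that the job owns but no step
 * is using yet, then mark them in the step bitmap.  Returns the chosen
 * bitmap, or 0 if the request cannot be satisfied. */
static uint64_t step_alloc_on_node(uint64_t *job_bit_alloc,
                                   uint64_t *step_bit_alloc,
                                   int node_inx, unsigned count)
{
        uint64_t avail = job_bit_alloc[node_inx] & ~step_bit_alloc[node_inx];
        uint64_t picked = 0;
        unsigned found = 0;

        for (unsigned i = 0; i < 64 && found < count; i++) {
                if (avail & (1ULL << i)) {
                        picked |= 1ULL << i;
                        found++;
                }
        }
        if (found < count)
                return 0;
        step_bit_alloc[node_inx] |= picked;
        return picked;
}

/* When the step terminates, releasing its devices is just as direct:
 *   step_bit_alloc[node_inx] &= ~picked;                               */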
The most complex operation on the GRES data structures happens when a job changes size (has nodes added or removed). In that case, the arrays indexed by node must be rebuilt, with records shifting as appropriate. Note that the current software does not support different GRES counts on different nodes (a job cannot have 2 GPUs on one node and 1 GPU on a second node), although that might be addressed at a later time.
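When nodes are removed, each per-node array must be compacted so that the surviving nodes keep consecutive indexes. A minimal sketch of that shuffle, again using the simplified bitmaps and an invented helper name, is shown below.

/* Illustrative compaction of one per-node array after some nodes leave
 * the job.  "keep_node[i]" is true if node index i remains allocated. */

#include <stdbool.h>
#include <stdint.h>

static int shrink_node_array(uint64_t *bit_alloc, const bool *keep_node,
                             int old_node_cnt)
{
        int new_cnt = 0;

        for (int i = 0; i < old_node_cnt; i++) {
                if (keep_node[i])
                        bit_alloc[new_cnt++] = bit_alloc[i];
        }
        /* Entries past new_cnt are now stale; the caller would zero or
         * free them and update node_count accordingly. */
        return new_cnt;
}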
When a job or step is initiated, its credential includes the allocated GRES information. This can be used by the slurmd daemon to associate those resources with that job. Our plan is to use the Linux cgroups logic to bind a job and/or its tasks to specific GRES devices; however, that logic does not currently exist. What does exist today is a pair of plugin APIs, job_set_env() and step_set_env(), which can be used to set environment variables that direct the program to the GRES allocated for its use (the CUDA libraries base their GPU selection upon environment variables, so this logic should work for CUDA today provided users do not manipulate the environment variables reserved for CUDA use).
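As an example of the kind of value job_set_env() or step_set_env() might produce, the sketch below turns an allocation bitmap into a CUDA_VISIBLE_DEVICES setting. The helper name and the direct call to setenv() are assumptions for illustration; a real plugin manipulates the job's environment array rather than the daemon's own environment.

/* Illustrative only: derive a CUDA_VISIBLE_DEVICES value from a device
 * allocation bitmap (uint64_t standing in for Slurm's bitstr_t). */

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static void set_cuda_visible_devices(uint64_t gres_bit_alloc)
{
        char buf[256] = "";
        size_t len = 0;

        for (unsigned i = 0; i < 64; i++) {
                if (gres_bit_alloc & (1ULL << i)) {
                        len += snprintf(buf + len, sizeof(buf) - len,
                                        "%s%u", len ? "," : "", i);
                }
        }
        /* e.g. devices 0 and 2 allocated => CUDA_VISIBLE_DEVICES=0,2 */
        setenv("CUDA_VISIBLE_DEVICES", buf, 1);
}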
If you want to see how GRES logic is allocating resources, configure DebugFlags=GRES to log GRES state changes. Note the resulting output can be quite verbose, especially for larger clusters.
Last modified 6 August 2021