Hi list,
I am learning SLURM and have run into an issue on version 19.05.
When I submit a job requesting 16 cores and 1 GPU, the job sits in the PD
state with reason "Resources", which blocks the main scheduler from handling
lower-priority jobs (PD reason "Priority") in the same partition. The open
ticket is: https://bugs.schedmd.com/show_bug.cgi?id=10697
My gres.conf looks like this:
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1 Cores=8-15
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2 Cores=16-23
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3 Cores=24-31
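For reference, the pending job was submitted roughly as below; the exact
options and script name are an assumption, since the post only states that
16 cores and one GPU were requested:

```shell
# Hypothetical submission matching the description above (16 cores, 1 GPU
# of the quadro_rtx_8000 type defined in gres.conf):
sbatch -N1 -c16 --gres=gpu:quadro_rtx_8000:1 job.sh
```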
After investigating the problem, I found the following reasons:
1. Why could this job be submitted at all?
In my environment several nodes were in a power-down state. Those nodes do
not send a registration message to the controller, so slurmctld has no
core-to-GPU binding information for them. The job is therefore accepted and
pends with reason "ReqNotAvail", which does not impact the main scheduler.
2. Why did this job pend with "Resources"?
A running job had been submitted with --exclusive, so the bits for that
job's nodes were cleared from the global variable share_node_bitmap. When
the problem job is scheduled, _pick_best_nodes sets nodes_busy to true
because of the exclusive job. After select_g_job_test() is run against the
list of nodes that exist in any state and reports success, execution reaches
logic 1 below; since nodes_busy is true, it falls through to logic 2. That
is the cause of the problem.
        
        //logic 1 in _pick_best_nodes
        else if (!runable_avail && !nodes_busy) {
                error_code = ESLURM_NODE_NOT_AVAIL;
        }
        //logic 2 in _pick_best_nodes
        if (error_code == SLURM_SUCCESS) {
                error_code = ESLURM_NODES_BUSY;
                *select_bitmap = possible_bitmap;
        } else {
                FREE_NULL_BITMAP(possible_bitmap);
        }
        return error_code;
To avoid this problem, I propose the following solutions; please give your
advice.
(1) I think the "nodes_busy" variable is not a good checking condition. What
about changing "!nodes_busy" to "bit_super_set(possible_bitmap,
share_node_bitmap)" in logic 1? I have verified in my local environment that
this resolves the problem.
(2) What about removing gres.conf to eliminate the core-GPU binding?
(3) What about using cli_filter to filter out jobs with invalid bindings?
Any response would be greatly appreciated.



wenxia...@126.com