Thank you for the problem analysis and patch. It will be included in the version 2.6.7 release. The commit is here:

https://github.com/SchedMD/slurm/commit/f005e5086aa9461e5accac6ef812b92c9b0b8bf7


Quoting Carlos Bederián <b...@famaf.unc.edu.ar>:

Here's a patch to avoid the overrun assert on bit_test:

diff --git a/src/common/gres.c b/src/common/gres.c
index 5eae827..4651ff9 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -3155,7 +3155,10 @@ static int _job_dealloc(void *job_gres_data, void *node_gres_data,
        } else if (job_gres_ptr->gres_bit_alloc &&
                   job_gres_ptr->gres_bit_alloc[node_offset] &&
                   node_gres_ptr->topo_gres_cnt_alloc) {
-               for (i = 0; i < node_gres_ptr->gres_cnt_config; i++) {
+               len = MIN(node_gres_ptr->gres_cnt_config,
+                         bit_size(job_gres_ptr->
+                                  gres_bit_alloc[node_offset]));
+               for (i = 0; i < len; i++) {
                        if (bit_test(job_gres_ptr->
                                     gres_bit_alloc[node_offset], i) &&
                            node_gres_ptr->topo_gres_cnt_alloc[i])
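
For readers unfamiliar with the gres code, the idea behind the hunk above can be sketched outside of Slurm. This is only an illustration, not the Slurm implementation: toy_bitmap_t, toy_alloc, toy_free, toy_size, toy_set, toy_test, TOY_MIN and count_allocated are all hypothetical stand-ins (for bitstr_t, bit_size, bit_test and MIN), and it assumes the job's bitmap is the shorter of the two, which is what the backtrace suggests.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for Slurm's bitstr_t; NOT the real layout. */
typedef struct {
	int nbits;
	unsigned char *bits;	/* one byte per bit, for simplicity */
} toy_bitmap_t;

#define TOY_MIN(a, b) (((a) < (b)) ? (a) : (b))

static toy_bitmap_t *toy_alloc(int nbits)
{
	toy_bitmap_t *b = malloc(sizeof(*b));
	assert(b && nbits > 0);
	b->nbits = nbits;
	b->bits = calloc((size_t) nbits, 1);
	assert(b->bits);
	return b;
}

static void toy_free(toy_bitmap_t *b)
{
	free(b->bits);
	free(b);
}

static int toy_size(const toy_bitmap_t *b)
{
	return b->nbits;
}

static void toy_set(toy_bitmap_t *b, int bit)
{
	assert(bit >= 0 && bit < b->nbits);
	b->bits[bit] = 1;
}

/* Like the real bit_test, this asserts on an out-of-range index. */
static int toy_test(const toy_bitmap_t *b, int bit)
{
	assert(bit >= 0 && bit < b->nbits);
	return b->bits[bit];
}

/*
 * Count bits set in the job bitmap that also have a non-zero entry in
 * topo_alloc. The loop bound is clamped to the bitmap size, so a node
 * count larger than the bitmap can no longer trip the assert.
 */
static int count_allocated(const toy_bitmap_t *job_bits, int node_cnt_config,
			   const int *topo_alloc)
{
	int len = TOY_MIN(node_cnt_config, toy_size(job_bits));
	int i, hits = 0;

	for (i = 0; i < len; i++) {
		if (toy_test(job_bits, i) && topo_alloc[i])
			hits++;
	}
	return hits;
}
```

With a 1-bit job bitmap and a node count of 2, the loop only visits index 0; using the node count alone as the bound would call toy_test with index 1 and abort, much like frame #4 of the backtrace.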



On Tue, Feb 18, 2014 at 4:19 PM, Carlos Bederián <b...@famaf.unc.edu.ar> wrote:

Hi, I've been having sporadic slurm 2.6.5 crashes that require clearing
the state to recover from.
I haven't yet been able to isolate the hardware issue that makes GPUs
go AWOL as far as slurm is concerned (nvidia-smi still lists them), but
slurmctld begins to log:

[2014-02-17T05:26:46.000] error: gres/gpu: job 14727 and node mendieta06 bitmap sizes differ (1 != 2)
[2014-02-17T05:26:46.000] error: gres/gpu: job 14731 dealloc node mendieta06 gres count underflow

Later, when a job using the resource that went missing finishes somehow,
slurmctld crashes. Looking at the backtrace I have:

#0  0x0000003c000328e5 in raise () from /lib64/libc.so.6
#1  0x0000003c000340c5 in abort () from /lib64/libc.so.6
#2  0x0000003c0002ba0e in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000003c0002bad0 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000004a41fa in bit_test (b=<value optimized out>, bit=<value optimized out>) at bitstring.c:183
#5  0x0000000000533790 in _job_dealloc (job_gres_list=<value optimized out>, node_gres_list=0x249a0e8, node_offset=0, job_id=14727, node_name=0x246cd68 "mendieta06") at gres.c:3160
#6  gres_plugin_job_dealloc (job_gres_list=<value optimized out>, node_gres_list=0x249a0e8, node_offset=0, job_id=14727, node_name=0x246cd68 "mendieta06") at gres.c:3228
#7  0x00007f4f14741a35 in _rm_job_from_res (part_record_ptr=0x2474758, node_usage=0x24cc418, job_ptr=0x2477108, action=0) at select_cons_res.c:1169
#8  0x00007f4f14741dc2 in select_p_job_fini (job_ptr=<value optimized out>) at select_cons_res.c:2140
#9  0x000000000045bf57 in deallocate_nodes (job_ptr=0x2477108, timeout=true, suspended=false, preempted=false) at node_scheduler.c:478
#10 0x0000000000442281 in job_time_limit () at job_mgr.c:5465
#11 0x000000000042fa8f in _slurmctld_background (no_data=<value optimized out>) at controller.c:1462
#12 0x000000000043259f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:586

This looks to be due to gres_cnt_config being inconsistent with the node's
state at job deallocation time, but I'm not familiar enough with gres code
to propose a decent solution.


--
Carlos S. Bederián
Instituto de Física Enrique Gaviola - CONICET
Medina Allende S/N, Ciudad Universitaria
X5000HUA Córdoba, Argentina



