OK, when restarting slurmd on the nodes I get these errors:

Apr 07 16:52:33 node001 systemd[1]: Starting Slurm node daemon...
Apr 07 16:52:33 node001 slurmd[299181]: Message aggregation disabled
Apr 07 16:52:33 node001 slurmd[299181]: WARNING: A line in gres.conf for GRES mps has 400 more configured than expected in slurm.conf. Ignoring extra GRES.
Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service: control process exited, code=exited status=1
Apr 07 16:52:33 node001 systemd[1]: Failed to start Slurm node daemon.
Apr 07 16:52:33 node001 systemd[1]: Unit slurmd.service entered failed state.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service failed.
Apr 07 16:43:27 node002 slurmd[273406]: error: GresPlugins changed from gpu,mic to gpu,mic,mps ignored
Apr 07 16:43:27 node002 slurmd[273406]: error: Restart the slurmctld daemon to change GresPlugins
Apr 07 16:43:27 node002 slurmd[273406]: error: Ignoring gres.conf record, invalid name: mps
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: select plugin cons_tres not found
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: unpack error
Apr 07 16:44:06 node002 slurmd[273406]: error: Malformed RPC of type REQUEST_TERMINATE_JOB(6011) received
Apr 07 16:44:06 node002 slurmd[273406]: error: slurm_receive_msg_and_forward: Header lengths are longer than data received
Apr 07 16:44:06 node002 slurmd[273406]: error: service_connection: slurm_receive_msg: Header lengths are longer than dat...ceived

So that "WARNING: A line in gres.conf for GRES mps has 400" message must come from this entry in gres.conf:

NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
# END AUTOGENERATED SECTION -- DO NOT REMOVE
Name=mps Count=400
AutoDetect=nvml

Perhaps I'm misunderstanding the Count option?

On Tue, Apr 7, 2020 at 4:34 PM Davide Vanzo <davide.va...@utsouthwestern.edu> wrote:

> Robert,
>
> That error is typically due to a slurmd/slurmctld version mismatch or
> a different configuration. I would not be surprised if you need to
> restart slurmd too after changing the SelectType configuration.
>
> Also, do not forget this warning from the documentation when it comes to
> modifying SelectType:
>
> *Changing this value can only be done by restarting the slurmctld daemon
> and will result in the loss of all job information (running and pending)
> since the job state save format used by each plugin is different.*
>
> --
> *Davide Vanzo, PhD*
> *Computer Scientist*
> BioHPC – Lyda Hill Dept. of Bioinformatics
> UT Southwestern Medical Center
>
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Robert Kudyba
> *Sent:* Tuesday, April 7, 2020 3:26 PM
> *To:* Slurm User Community List <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] Header lengths are longer than data received
> after changing SelectType & GresTypes to use MPS
>
> Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the
> following options to enable MPS:
>
> SelectType=select/cons_tres
> GresTypes=gpu,mic,mps
>
> I restarted slurmctld and ran scontrol reconfigure; however, all jobs get
> the error below:
>
> [2020-04-07T15:29:00.741] debug: backfill: no jobs to backfill
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3056 Nodelist=node[001-002]
> [2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3061 Nodelist=node003
> [2020-04-07T15:29:03.051] debug: sched: Running job scheduler
> [2020-04-07T15:29:03.063] agent/is_node_resp: node:node003 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node002 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
> [2020-04-07T15:29:03.071] agent/is_node_resp: node:node001 RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
>
> Do any other options need changing? What causes these header length
> errors?