I do not have experience with NVML and MPS, but here are my thoughts.
The line I would focus on is this one:

Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect 
nvml functionality, but we weren't able to find that lib when Slurm was 
configured.

Apparently the Slurm build you are using has not been compiled against NVML, 
and as such it cannot use the autodetect functionality.
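
If rebuilding Slurm against NVML is not an option, one workaround is to drop 
the AutoDetect line and describe the devices by hand. The fragment below is 
only a sketch, not a tested configuration: it assumes one V100 per node at 
/dev/nvidia0 (as in the gres.conf quoted below) and that the 400 MPS shares are 
tied to that GPU. The Gres= value on the slurm.conf node line is my assumption; 
the "400 more configured than expected in slurm.conf" warning suggests the node 
definition does not list mps at all yet.

```
# gres.conf (sketch): AutoDetect=nvml removed, devices listed explicitly
NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
NodeName=node[001-003] Name=mps Count=400 File=/dev/nvidia0

# slurm.conf (sketch): the node definition must declare the same GRES.
# Assumption: one GPU per node; CPU, memory and other node parameters
# are omitted here and must be kept from the existing NodeName line.
NodeName=node[001-003] Gres=gpu:v100:1,mps:400
```

After changing both files, restart slurmctld first and then slurmd on the 
nodes; the node002 log below shows that a GresPlugins change is ignored until 
slurmctld has been restarted.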

--
Davide Vanzo, PhD
Computer Scientist
BioHPC – Lyda Hill Dept. of Bioinformatics
UT Southwestern Medical Center

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Robert 
Kudyba
Sent: Tuesday, April 7, 2020 3:56 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Header lengths are longer than data received after 
changing SelectType & GresTypes to use MPS

OK, when restarting slurmd on the nodes I get these errors:

Apr 07 16:52:33 node001 systemd[1]: Starting Slurm node daemon...
Apr 07 16:52:33 node001 slurmd[299181]: Message aggregation disabled
Apr 07 16:52:33 node001 slurmd[299181]: WARNING: A line in gres.conf for GRES 
mps has 400 more configured than expected in slurm.conf. Ignoring extra GRES.
Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect 
nvml functionality, but we weren't able to find that lib when Slurm was 
configured.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service: control process exited, 
code=exited status=1
Apr 07 16:52:33 node001 systemd[1]: Failed to start Slurm node daemon.
Apr 07 16:52:33 node001 systemd[1]: Unit slurmd.service entered failed state.
Apr 07 16:52:33 node001 systemd[1]: slurmd.service failed.

Apr 07 16:43:27 node002 slurmd[273406]: error: GresPlugins changed from gpu,mic 
to gpu,mic,mps ignored
Apr 07 16:43:27 node002 slurmd[273406]: error: Restart the slurmctld daemon to 
change GresPlugins
Apr 07 16:43:27 node002 slurmd[273406]: error: Ignoring gres.conf record, 
invalid name: mps
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: 
select plugin cons_tres not found
Apr 07 16:44:06 node002 slurmd[273406]: error: select_g_select_jobinfo_unpack: 
unpack error
Apr 07 16:44:06 node002 slurmd[273406]: error: Malformed RPC of type 
REQUEST_TERMINATE_JOB(6011) received
Apr 07 16:44:06 node002 slurmd[273406]: error: slurm_receive_msg_and_forward: 
Header lengths are longer than data received
Apr 07 16:44:06 node002 slurmd[273406]: error: service_connection: 
slurm_receive_msg: Header lengths are longer than dat...ceived

So that "WARNING: A line in gres.conf for GRES mps has 400" must come from 
this entry in gres.conf:
NodeName=node[001-003] Name=gpu Type=v100 File=/dev/nvidia0
# END AUTOGENERATED SECTION   -- DO NOT REMOVE
Name=mps Count=400
AutoDetect=nvml

Perhaps I'm misunderstanding the Count option?

On Tue, Apr 7, 2020 at 4:34 PM Davide Vanzo <davide.va...@utsouthwestern.edu> 
wrote:
Robert,

That error is typically due to a slurmd/slurmctld version mismatch or to the 
daemons running with different configurations. I would not be surprised if you 
need to restart slurmd too after changing the SelectType configuration.
Also, do not forget this warning from the documentation when it comes to 
modifying SelectType:

Changing this value can only be done by restarting the slurmctld daemon and 
will result in the loss of all job information (running and pending) since the 
job state save format used by each plugin is different.

--
Davide Vanzo, PhD
Computer Scientist
BioHPC – Lyda Hill Dept. of Bioinformatics
UT Southwestern Medical Center

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Robert 
Kudyba
Sent: Tuesday, April 7, 2020 3:26 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Header lengths are longer than data received after 
changing SelectType & GresTypes to use MPS

Using Slurm 20.02 on CentOS 7.7 with Bright Cluster. We changed the following 
options to enable MPS:
SelectType=select/cons_tres
GresTypes=gpu,mic,mps

I restarted slurmctld and ran scontrol reconfigure; however, all jobs get the 
error below:
[2020-04-07T15:29:00.741] debug:  backfill: no jobs to backfill
[2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3056 
Nodelist=node[001-002]
[2020-04-07T15:29:03.051] Resending TERMINATE_JOB request JobId=3061 
Nodelist=node003
[2020-04-07T15:29:03.051] debug:  sched: Running job scheduler
[2020-04-07T15:29:03.063] agent/is_node_resp: node:node003 
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
[2020-04-07T15:29:03.071] agent/is_node_resp: node:node002 
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received
[2020-04-07T15:29:03.071] agent/is_node_resp: node:node001 
RPC:REQUEST_TERMINATE_JOB : Header lengths are longer than data received

Do any other options need changing? What causes these header length errors?