On 2019-05-14 09:24, Jason Bacon wrote:
On 2018-12-16 09:02, Jason Bacon wrote:
Good morning,
We've been running Slurm 17.02.11 for a long time, and while testing an
upgrade to the 18 series we discovered a regression that appeared
somewhere between 17.02.11 and 17.11.7.
Everything works fine under 17.02.11.
Under later versions, everything is fine as long as I don't use srun, or
if I use TaskPlugin=task/none (see the excerpt below).
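For reference, the only knob that matters here is TaskPlugin (a
hypothetical excerpt from our slurm.conf; the real file has more in it,
but this is the relevant line):

TaskPlugin=task/affinity   # job steps launched with srun fail
#TaskPlugin=task/none      # works, but gives up CPU binding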
Just wondering if someone can suggest where to look in the source
code for this. If I can just pinpoint where the problem is, I'm sure
I can come up with a solution pretty quickly. I've poked around a
bit but have not spotted anything yet. If this doesn't look familiar
to anyone, I'll dig deeper and figure it out eventually. Just don't
want to duplicate someone's effort if this is something that's been
fixed already on other platforms.
Below is the output from a failed srun, followed by successful sbatch
--array and openmpi jobs.
Thanks,
Jason
Failing job:
FreeBSD login.wren bacon ~ 474: srun hostname
srun: error: slurm_receive_msgs: Zero Bytes were transmitted or received
srun: error: Task launch for 82.0 failed on node compute-001: Zero Bytes were transmitted or received
srun: error: Application launch failed: Zero Bytes were transmitted or received
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
Tail of slurmctld log:
[2018-12-01T16:29:09.873] debug2: got 1 threads to send out
[2018-12-01T16:29:09.874] debug2: Tree head got back 0 looking for 2
[2018-12-01T16:29:09.874] debug3: Tree sending to compute-001
[2018-12-01T16:29:09.874] debug3: Tree sending to compute-002
[2018-12-01T16:29:09.874] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:09.874] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:09.874] debug3: connect refused, retrying
[2018-12-01T16:29:09.874] debug4: orig_timeout was 10000 we have 0 steps and a timeout of 10000
[2018-12-01T16:29:10.087] debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from uid=0
[2018-12-01T16:29:10.087] debug2: Tree head got back 1
[2018-12-01T16:29:10.087] debug2: _slurm_rpc_node_registration complete for compute-002 usec=97
[2018-12-01T16:29:10.917] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:10.917] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:11.976] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:11.976] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
[2018-12-01T16:29:12.007] debug2: Testing job time limits and checkpoints
[2018-12-01T16:29:13.011] debug2: slurm_connect failed: Connection refused
[2018-12-01T16:29:13.011] debug2: Error connecting slurm stream socket at 192.168.1.13:6818: Connection refused
Successful sbatch --array:
#!/bin/sh -e
#SBATCH --array=1-8
hostname
FreeBSD login.wren bacon ~ 462: more slurm-69_8.out
cpu-bind=MASK - compute-002, task 0 0 [64261]: mask 0x8 set
compute-002.wren
Successful openmpi:
#!/bin/sh -e
#SBATCH --ntasks=8
mpirun --report-bindings ./mpi-bench 3
FreeBSD login.wren bacon ~/Data/mpi-bench/trunk 468: more slurm-81.out
cpu-bind=MASK - compute-001, task 0 0 [64589]: mask 0xf set
CPU 0 is set
CPU 1 is set
CPU 2 is set
CPU 3 is set
CPU 0 is set
CPU 1 is set
[compute-001.wren:64590] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[compute-001.wren:64590] MCW rank 1 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
[compute-001.wren:64590] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B][./.]
[compute-001.wren:64590] MCW rank 3 bound to socket 1[core 2[hwt 0]], socket 1[core 3[hwt 0]]: [./.][B/B]
I've been busy upgrading our CentOS clusters, but I finally got a chance
to dig into this.
I couldn't find any clues in the logs, but I noticed that slurmd was
dying every time I used srun, so I ran it manually under GDB:
root@compute-001:~ # gdb slurmd
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
(gdb) run -D
Starting program: /usr/local/sbin/slurmd -D
(no debugging symbols found)...(no debugging symbols found)...slurmd: debug: Log file re-opened
slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2 ThreadsPerCore:1
slurmd: Message aggregation disabled
slurmd: debug: CPUs:4 Boards:1 Sockets:2 CoresPerSocket:2 ThreadsPerCore:1
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000f
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /usr/local/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 18.08.7 started
slurmd: debug: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Tue, 14 May 2019 09:17:12 -0500
slurmd: CPUs=4 Boards=1 Sockets=2 Cores=2 Threads=1 Memory=16344 TmpDisk=15853 Uptime=2815493 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
slurmd: launch task 33.0 request from UID:2001 GID:2001 HOST:192.168.0.20 PORT:244
slurmd: debug: Checking credential with 336 bytes of sig data
slurmd: task affinity : enforcing 'verbose,cores' cpu bind method
slurmd: debug: task affinity : before lllp distribution cpu bind method is 'verbose,cores' ((null))
slurmd: lllp_distribution jobid [33] binding: verbose,cores,one_thread, dist 1
slurmd: _task_layout_lllp_cyclic
/usr/local/lib/slurm/task_affinity.so: Undefined symbol "slurm_strlcpy"
Program exited with code 01.
(gdb)
Looks like a simple build issue. It seems a little odd that the build
succeeded with an undefined symbol, but it should be pretty easy to
track down in any case.
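For what it's worth, the reason the build can succeed at all is that the
plugins are shared objects: undefined symbols are legal at link time in
a .so and only get resolved when slurmd dlopen()s the plugin. A quick
way to spot the problem ahead of time (command sketch, reusing the
plugin path from the GDB session above):

nm -D /usr/local/lib/slurm/task_affinity.so | grep strlcpy

An entry flagged 'U' (undefined) there means the reference won't be
resolved until load time.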
Here's the culprit:
In src/common/slurm_xlator.h, strlcpy is unconditionally defined as
slurm_strlcpy:
/* strlcpy.[ch] functions */
#define strlcpy slurm_strlcpy
But in src/common/strlcpy.c, the definition of strlcpy() and the
slurm_strlcpy alias are masked by
#if (!HAVE_STRLCPY)
So on platforms that already provide strlcpy(), like FreeBSD,
slurm_strlcpy() is never compiled, yet any plugin that includes
slurm_xlator.h still references it, and the plugin fails to load.
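To make the interaction concrete, here's a minimal C sketch condensing
the two files (illustrative only, not Slurm source verbatim;
plugin_helper is a made-up stand-in for the real plugin code):

/* strlcpy_mismatch_sketch.c -- illustrative only */
#include <stddef.h>

/* src/common/slurm_xlator.h: the rename is unconditional. */
#define strlcpy slurm_strlcpy

/* src/common/strlcpy.c: the function exists ONLY when libc lacks one. */
#if (!HAVE_STRLCPY)
size_t slurm_strlcpy(char *dst, const char *src, size_t siz);
#endif

/* Plugin code (e.g. task_affinity) that includes slurm_xlator.h: */
size_t plugin_helper(char *buf, const char *name, size_t len)
{
        /* The preprocessor rewrites this call to slurm_strlcpy().
         * With HAVE_STRLCPY set (FreeBSD), that symbol is never
         * built; the .so still links, since undefined symbols are
         * legal in shared objects, and only fails at dlopen() time. */
        return strlcpy(buf, name, len);
}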
Here's a quick fix:
--- src/common/slurm_xlator.h.orig	2019-04-12 04:20:25 UTC
+++ src/common/slurm_xlator.h
@@ -299,7 +299,9 @@
  * The header file used only for #define values. */
 
 /* strlcpy.[ch] functions */
+#if (!HAVE_STRLCPY) // Match this to src/common/strlcpy.c
 #define strlcpy slurm_strlcpy
+#endif
 
 /* switch.[ch] functions
  * None exported today.
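After rebuilding with this patch, calls in plugin code fall through to
the native libc strlcpy(), so the unresolved reference should be gone.
A quick sanity check (hypothetical, reusing the nm command from above):
re-run

nm -D /usr/local/lib/slurm/task_affinity.so | grep strlcpy

and confirm it prints nothing; srun hostname should then behave
normally again.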
--
Earth is a beta site.