Dear all,

I tried to debug this with some apparent success (for now). If anyone cares: with the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp. I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:

static void _setup_env_working_cluster(void)

With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think?). My current understanding is that this is a Slurm bug. The issue can be fixed simply by giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way. (A minimal stand-alone illustration of the failure mode is appended after the quoted message below.)

Hope this helps,
Andreas

-----"slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----

To: slurm-users@lists.schedmd.com
From: a.vita...@bioc.uzh.ch
Sent by: "slurm-users"
Date: 05/08/2018 12:44AM
Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc

Dear all,

I am trying to set up a small cluster running Slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine; Munge is taken from the system package. The configure line was something like this:

./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc

One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is present even if this is not the case). I start the daemons manually at the moment (slurmctld first).

My configuration file looks like this (I removed the node-specific parts):

SlurmdUser=root
#
AuthType=auth/munge
#
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin   # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2
# plugins
TaskPlugin=task/cgroup

There are no prolog or epilog scripts. After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes), and sinfo produces the expected results. However, as soon as I try to submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., the srun command is meaningless).
When I try to monitor slurmd in the foreground (-vvvv -D), I get something like this:

slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 1001
slurmd: debug: credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008

Here, job 100 would be a submission script with something like:

#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
########
#SBATCH -pccm
srun

This produces in camp3.err:

/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault (core dumped) srun

I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this. I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine. My understanding of the inner workings of Slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.

Andreas (UZH, Switzerland)
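For completeness, here is the minimal stand-alone illustration of the failure mode mentioned at the top. This is not Slurm's actual code; the helper and variable names are made up for the example. The point is simply that passing NULL to strcmp() is undefined behavior in C (in practice both pointers usually get dereferenced), so comparing two unset cluster-name strings can crash exactly the way I saw, while a NULL-aware wrapper does not:

/* null_strcmp_demo.c -- illustrative only, not Slurm source.
 * Shows why strcmp() on two NULL pointers can seg-fault and how a
 * NULL-aware comparison avoids it. */
#include <stdio.h>
#include <string.h>

/* Hypothetical NULL-aware comparison: two NULLs compare equal,
 * a lone NULL sorts before any real string. */
static int null_safe_strcmp(const char *a, const char *b)
{
    if (a == NULL && b == NULL)
        return 0;
    if (a == NULL)
        return -1;
    if (b == NULL)
        return 1;
    return strcmp(a, b);
}

int main(void)
{
    /* Stand-ins for the two cluster-name strings; both end up NULL
     * when no ClusterName is configured anywhere. */
    const char *conf_cluster = NULL;
    const char *env_cluster  = NULL;

    /* strcmp(conf_cluster, env_cluster) would be undefined behavior
     * here and seg-faults on our system; the guarded call is safe. */
    if (null_safe_strcmp(conf_cluster, env_cluster) == 0)
        printf("cluster names match (or both are unset)\n");

    return 0;
}

As for the workaround itself: after adding a line such as ClusterName=bla to slurm.conf and restarting slurmctld and slurmd, "scontrol show config | grep ClusterName" should report the configured name, and in my case srun inside sbatch no longer crashes.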