Dear all,

I tried to debug this with some apparent success (for now).

If anyone cares:
With the help of gdb inside sbatch, I tracked down the immediate seg fault to 
strcmp.
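In case anyone wants to reproduce this, here is a sketch of the gdb-inside-sbatch
approach (the output paths and the trailing hostname command are illustrative):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --output=/storage/andreas/gdb_srun.out
#SBATCH --error=/storage/andreas/gdb_srun.err

# run srun under gdb non-interactively and print a backtrace when it crashes
gdb -batch -ex run -ex bt --args srun hostname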
I then hacked src/srun/srun.c with some info statements and isolated this 
function as the culprit:
static void _setup_env_working_cluster(void)

With my configuration, this routine ended up performing a strcmp() of two NULL 
pointers, which seg-faults on our system (and is undefined behavior in C, as far 
as I can tell). My current understanding is that this is a slurm bug.
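As a standalone illustration of the failure mode (this is not the actual slurm
code or a proposed patch; safe_strcmp is just a hypothetical NULL-safe helper):

#include <stdio.h>
#include <string.h>

/* Passing NULL to plain strcmp() is undefined behavior; on our system it
 * seg-faults immediately. A NULL-safe wrapper avoids the crash. */
static int safe_strcmp(const char *a, const char *b)
{
    if (a == b)
        return 0;
    if (a == NULL)
        return -1;
    if (b == NULL)
        return 1;
    return strcmp(a, b);
}

int main(void)
{
    /* With no ClusterName set in slurm.conf, both names ended up NULL
     * in my case. */
    const char *conf_cluster = NULL;
    const char *env_cluster = NULL;

    /* strcmp(conf_cluster, env_cluster) would crash here; the guarded
     * version does not. */
    printf("%d\n", safe_strcmp(conf_cluster, env_cluster));
    return 0;
}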

The issue can be worked around simply by giving the cluster a name in slurm.conf 
(e.g., ClusterName=bla). I am not using slurmdbd, btw.

Hope this helps,
Andreas


-----"slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----
To: slurm-users@lists.schedmd.com
From: a.vita...@bioc.uzh.ch
Sent by: "slurm-users" 
Date: 05/08/2018 12:44AM
Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc

Dear all,

I am trying to set up a small cluster running slurm on Ubuntu 16.04.
I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. 
Installation seems fine. Munge is taken from the system package.
The configure line was something like this:
./configure --prefix=/software/slurm/slurm-17.11.5 
--exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr 
--sysconfdir=/software/slurm/etc

One of the nodes is also the control host and runs both slurmctld and slurmd 
(but the issue is also present when this is not the case). I start the daemons 
manually at the moment (slurmctld first).
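Concretely, something like this on the command line (assuming the slurm sbin
directory from the install above is on PATH):

# on the control host
slurmctld
# then on each compute node
slurmd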
My configuration file looks like this (I removed the node-specific parts):

SlurmdUser=root
#
AuthType=auth/munge
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2

# plugins
TaskPlugin=task/cgroup


There are no prolog or epilog scripts.
After some fiddling with MPI, I got the system to work with interactive jobs 
through salloc (MPI behaves correctly for jobs occupying one or all of the 
nodes). sinfo produces the expected results.
However, as soon as I try to submit through sbatch, I get an instantaneous seg 
fault regardless of the executable (even when none is specified, i.e., the srun 
command by itself is meaningless).
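For what it's worth, a minimal pair like the following captures the behavior
(the --wrap form is just an illustration; any batch script containing srun
fails the same way):

# works: srun inside an interactive allocation
salloc -N1 -n1 srun hostname

# seg-faults immediately: the same srun inside a batch job
sbatch -N1 --wrap="srun hostname"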

When I try to monitor slurmd in the foreground (-vvvv -D), I get something like 
this:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on 
this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for 
this node
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 
TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) 
FeaturesActive=(null)
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 1001
slurmd: debug:  credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008

Here, job 100 corresponds to a submission script containing something like:

#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
######## #SBATCH -pccm

srun

This produces the following in camp3.err:

/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 
144905 Segmentation fault      (core dumped) srun

I tried to recompile pmix and slurm with debug options, but I cannot seem to get 
any more information than this.

I don't think the MPI integration can be broken per se, as jobs run through 
salloc+srun seem to work fine.

My understanding of the inner workings of slurm is virtually nonexistent, so 
I'll be grateful for any clue you may offer.

Andreas (UZH, Switzerland)
