I think this should already be fixed in the upcoming release. See: https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72
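To illustrate the failure mode described below: passing NULL to the C library's strcmp() is undefined behavior and typically segfaults, so any fix has to guard the comparison. The sketch below is illustrative only; the helper name safe_strcmp and the two variables are hypothetical and are not taken from that commit or from srun.c:

#include <stdio.h>
#include <string.h>

/* Hypothetical NULL-tolerant comparison (illustration only, not Slurm's
 * actual helper). strcmp() with a NULL argument is undefined behavior in C
 * and usually crashes, which matches the segfault reported below. */
static int safe_strcmp(const char *a, const char *b)
{
    if (a == NULL && b == NULL)
        return 0;       /* treat two missing strings as equal */
    if (a == NULL)
        return -1;
    if (b == NULL)
        return 1;
    return strcmp(a, b);
}

int main(void)
{
    /* e.g. no cluster name configured anywhere, so both values stay NULL */
    const char *env_cluster  = NULL;
    const char *conf_cluster = NULL;

    /* strcmp(env_cluster, conf_cluster) would be undefined behavior here */
    printf("%d\n", safe_strcmp(env_cluster, conf_cluster));
    return 0;
}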
On 5/8/18 12:08 PM, a.vita...@bioc.uzh.ch wrote:
> Dear all,
>
> I tried to debug this with some apparent success (for now).
>
> If anyone cares:
> With the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp.
> I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:
> static void _setup_env_working_cluster(void)
>
> With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think?). My current understanding is that this is a slurm bug.
>
> The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, btw.
>
> Hope this helps,
> Andreas
>
>
> ----- "slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----
> To: slurm-users@lists.schedmd.com
> From: a.vita...@bioc.uzh.ch
> Sent by: "slurm-users"
> Date: 05/08/2018 12:44AM
> Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc
>
> Dear all,
>
> I am trying to set up a small cluster running slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. The installation seems fine. Munge is taken from the system package.
> The configure invocation was something like this:
> ./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc
>
> One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is also present when this is not the case). I start the daemons manually at the moment (slurmctld first).
> My configuration file looks like this (I removed the node-specific parts):
>
> SlurmdUser=root
> #
> AuthType=auth/munge
> # Epilog=/usr/local/slurm/etc/epilog
> FastSchedule=1
> JobCompLoc=/var/log/slurm/slurm.job.log
> JobCompType=jobcomp/filetxt
> JobCredentialPrivateKey=/usr/local/etc/slurm.key
> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
> #PluginDir=/usr/local/slurm/lib/slurm
> # Prolog=/usr/local/slurm/etc/prolog
> SchedulerType=sched/backfill
> SelectType=select/linear
> SlurmUser=cadmin # this user exists everywhere
> SlurmctldPort=7002
> SlurmctldTimeout=300
> SlurmdPort=7003
> SlurmdTimeout=300
> SwitchType=switch/none
> TreeWidth=50
> #
> # logging
> StateSaveLocation=/var/log/slurm/tmp
> SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
> SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
> SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
> #
> # job settings
> MaxTasksPerNode=64
> MpiDefault=pmix_v2
>
> # plugins
> TaskPlugin=task/cgroup
>
> There are no prolog or epilog scripts.
> After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes). sinfo produces the expected results.
> However, as soon as I try to submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., the srun command is meaningless).
>
> When I try to monitor slurmd in the foreground (-vvvv -D), I get something like this:
>
> slurmd: debug: Log file re-opened
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
> slurmd: Message aggregation disabled
> slurmd: topology NONE plugin loaded
> slurmd: route default plugin loaded
> slurmd: CPU frequency setting not configured for this node
> slurmd: debug: Resource spec: No specialized cores configured by default on this node
> slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
> slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
> slurmd: debug2: hwloc_topology_init
> slurmd: debug2: hwloc_topology_load
> slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
> slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
> slurmd: debug: task/cgroup: loaded
> slurmd: debug: Munge authentication plugin loaded
> slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
> slurmd: Munge cryptographic signature plugin loaded
> slurmd: slurmd version 17.11.5 started
> slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
> slurmd: debug: job_container none plugin loaded
> slurmd: debug: switch NONE plugin loaded
> slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
> slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
> slurmd: debug: AcctGatherEnergy NONE plugin loaded
> slurmd: debug: AcctGatherProfile NONE plugin loaded
> slurmd: debug: AcctGatherInterconnect NONE plugin loaded
> slurmd: debug: AcctGatherFilesystem NONE plugin loaded
> slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
> slurmd: debug2: got this type of message 4005
> slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
> slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
> slurmd: _run_prolog: run job script took usec=5
> slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
> slurmd: Launching batch job 100 for UID 1003
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug: _rpc_terminate_job, uid = 1001
> slurmd: debug: credential for job 100 revoked
> slurmd: debug2: No steps in jobid 100 to send signal 999
> slurmd: debug2: No steps in jobid 100 to send signal 18
> slurmd: debug2: No steps in jobid 100 to send signal 15
> slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
> slurmd: debug2: got this type of message 1008
>
> Here, job 100 would be a submission script with something like:
>
> #!/bin/bash -l
> #SBATCH --job-name=FSPMXX
> #SBATCH --output=/storage/andreas/camp3.out
> #SBATCH --error=/storage/andreas/camp3.err
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
> ######## #SBATCH -pccm
>
> srun
>
> This produces in camp3.err:
>
> /var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault (core dumped) srun
>
> I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this.
>
> I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine.
>
> My understanding of the inner workings of slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.
>
> Andreas (UZH, Switzerland)
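For completeness, the workaround Andreas describes near the top of the quoted thread amounts to one extra line in slurm.conf; any non-empty name works, and "bla" is simply the example from his message:

# slurm.conf -- the name itself is arbitrary
ClusterName=bla

With a cluster name configured, _setup_env_working_cluster() reportedly no longer ends up comparing two NULL pointers, so srun launched from sbatch stops segfaulting; the commit linked above should make this workaround unnecessary in later releases.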