Dear all,

I tried to debug this with some apparent success (for now). If anyone cares: with the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp. I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:

static void _setup_env_working_cluster(void)

With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think?). My current understanding is that this is a Slurm bug. The issue can be fixed simply by giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way. (A minimal stand-alone illustration of the failure mode is appended after the quoted message below.)

Hope this helps,
Andreas

-----"slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----

To: slurm-users@lists.schedmd.com
From: a.vita...@bioc.uzh.ch
Sent by: "slurm-users"
Date: 05/08/2018 12:44AM
Subject: [slurm-users] srun seg faults immediately from within sbatch but not salloc

Dear all,

I am trying to set up a small cluster running Slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine; Munge is taken from the system package. The configure line was something like this:

./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc

One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is present even if this is not the case). I start the daemons manually at the moment (slurmctld first).

My configuration file looks like this (I removed the node-specific parts):

SlurmdUser=root
#
AuthType=auth/munge
#
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin   # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2
# plugins
TaskPlugin=task/cgroup

There are no prolog or epilog scripts. After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes), and sinfo produces the expected results. However, as soon as I try to submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., the srun command is meaningless).
When I try to monitor slurmd in the foreground (-vvvv -D), I get something like this:

slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 1001
slurmd: debug: credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008

Here, job 100 would be a submission script with something like:

#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
########
#SBATCH -pccm
srun

This produces in camp3.err:

/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault (core dumped) srun

I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this. I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine. My understanding of the inner workings of Slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.

Andreas (UZH, Switzerland)
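For completeness, here is the minimal stand-alone illustration of the failure mode mentioned at the top. This is not Slurm's actual code; the helper and variable names are made up for the example. The point is simply that passing NULL to strcmp() is undefined behavior in C (in practice both pointers usually get dereferenced), so comparing two unset cluster-name strings can crash exactly the way I saw, while a NULL-aware wrapper does not:

/* null_strcmp_demo.c -- illustrative only, not Slurm source.
 * Shows why strcmp() on two NULL pointers can seg-fault and how a
 * NULL-aware comparison avoids it. */
#include <stdio.h>
#include <string.h>

/* Hypothetical NULL-aware comparison: two NULLs compare equal,
 * a lone NULL sorts before any real string. */
static int null_safe_strcmp(const char *a, const char *b)
{
    if (a == NULL && b == NULL)
        return 0;
    if (a == NULL)
        return -1;
    if (b == NULL)
        return 1;
    return strcmp(a, b);
}

int main(void)
{
    /* Stand-ins for the two cluster-name strings; both end up NULL
     * when no ClusterName is configured anywhere. */
    const char *conf_cluster = NULL;
    const char *env_cluster  = NULL;

    /* strcmp(conf_cluster, env_cluster) would be undefined behavior
     * here and seg-faults on our system; the guarded call is safe. */
    if (null_safe_strcmp(conf_cluster, env_cluster) == 0)
        printf("cluster names match (or both are unset)\n");

    return 0;
}

As for the workaround itself: after adding a line such as ClusterName=bla to slurm.conf and restarting slurmctld and slurmd, "scontrol show config | grep ClusterName" should report the configured name, and in my case srun inside sbatch no longer crashes.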