Hi Benjamin,

thanks for getting back to me! I somehow failed to ever arrive at this page.

Andreas

-----"slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----
To: slurm-users@lists.schedmd.com
From: Benjamin Matthews 
Sent by: "slurm-users" 
Date: 05/09/2018 01:20AM
Subject: Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc

I think this should already be fixed in the upcoming release. See:
https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72

On 5/8/18 12:08 PM, a.vita...@bioc.uzh.ch wrote:
Dear all,

I tried to debug this with some apparent success (for now).

If anyone cares:
With the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp.
I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:

static void _setup_env_working_cluster(void)

With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not standard-compliant, I would think; passing NULL to strcmp is undefined behavior). My current understanding is that this is a slurm bug.
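
To make the failure mode concrete, here is a minimal standalone sketch (it assumes nothing about the actual slurm sources or the upstream patch; the wrapper name is made up):

/* Minimal sketch of the failure mode: calling strcmp() with NULL arguments
 * is undefined behavior in C and typically segfaults, consistent with the
 * crash seen when no ClusterName is configured. A NULL-safe comparison
 * avoids the crash. This is an illustration, not Slurm's actual patch. */
#include <stdio.h>
#include <string.h>

/* hypothetical NULL-safe wrapper: two NULLs compare equal,
 * NULL vs. non-NULL compare unequal */
static int strcmp_null_safe(const char *a, const char *b)
{
	if (!a && !b)
		return 0;
	if (!a || !b)
		return a ? 1 : -1;
	return strcmp(a, b);
}

int main(void)
{
	const char *env_cluster = NULL;   /* e.g., cluster name from the environment, unset */
	const char *conf_cluster = NULL;  /* e.g., no ClusterName in slurm.conf */

	/* strcmp(env_cluster, conf_cluster) would be undefined behavior here
	 * and segfaults on many systems */
	printf("%d\n", strcmp_null_safe(env_cluster, conf_cluster));
	return 0;
}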
         
The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, btw.
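
For completeness, the workaround as it would appear in slurm.conf (the note about restarting the daemons is my assumption about how the change gets picked up):

# add to slurm.conf on all nodes, then restart slurmctld and slurmd
ClusterName=bla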
         
Hope this helps,
Andreas
         
         
         -----"slurm-users" <slurm-users-boun...@lists.schedmd.com>           
wrote: -----         
           
To: slurm-users@lists.schedmd.com
             From: a.vita...@bioc.uzh.ch
             Sent by: "slurm-users" 
               Date: 05/08/2018 12:44AM
               Subject: [slurm-users] srun seg faults immediately from          
     within sbatch but not salloc
               
Dear all,

I am trying to set up a small cluster running slurm on Ubuntu 16.04.
I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. Munge is taken from the system package.
Something like this:

./configure --prefix=/software/slurm/slurm-17.11.5 \
            --exec-prefix=/software/slurm/Gnu \
            --with-pmix=/software/pmix \
            --with-munge=/usr \
            --sysconfdir=/software/slurm/etc
                 
One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is there even if this is not the case). I start the daemons manually at the moment (slurmctld first).
My configuration file looks like this (I removed the node-specific parts):
                 
SlurmdUser=root
#
AuthType=auth/munge
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2

# plugins
TaskPlugin=task/cgroup
                 
                 
There are no prolog or epilog scripts.
After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes). sinfo produces the expected results.
However, as soon as I try to submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., the srun command is meaningless).

When I try to monitor slurmd in the foreground (-vvvv -D), I get something like this:
                 
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 1001
slurmd: debug:  credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008
                 
Here, job 100 would be a submission script with something like:
                 
#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
######## #SBATCH -pccm

srun
                 
This produces in camp3.err:

/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault      (core dumped) srun
                 
I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this.

I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine.

My understanding of the inner workings of slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.

Andreas (UZH, Switzerland)
                                         
     
