Hi Benjamin,

thanks for getting back to me! I somehow failed to ever arrive at this page.

Andreas

-----"slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote: -----
To: slurm-users@lists.schedmd.com
From: Benjamin Matthews 
Sent by: "slurm-users" 
Date: 05/09/2018 01:20AM
Subject: Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc

I think this should already be fixed in the upcoming release. See:
https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72

On 5/8/18 12:08 PM, a.vita...@bioc.uzh.ch wrote:
Dear all,

I tried to debug this with some apparent success (for now).

If anyone cares:
With the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp.
I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit:

static void _setup_env_working_cluster(void)

With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not standard-compliant, I would think; passing NULL to strcmp is undefined behavior). My current understanding is that this is a slurm bug.
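
To make the failure mode concrete, here is a minimal standalone sketch (it assumes nothing about the actual slurm sources or the upstream patch; the wrapper name is made up):

/* Minimal sketch of the failure mode: calling strcmp() with NULL arguments
 * is undefined behavior in C and typically segfaults, consistent with the
 * crash seen when no ClusterName is configured. A NULL-safe comparison
 * avoids the crash. This is an illustration, not Slurm's actual patch. */
#include <stdio.h>
#include <string.h>

/* hypothetical NULL-safe wrapper: two NULLs compare equal,
 * NULL vs. non-NULL compare unequal */
static int strcmp_null_safe(const char *a, const char *b)
{
	if (!a && !b)
		return 0;
	if (!a || !b)
		return a ? 1 : -1;
	return strcmp(a, b);
}

int main(void)
{
	const char *env_cluster = NULL;   /* e.g., cluster name from the environment, unset */
	const char *conf_cluster = NULL;  /* e.g., no ClusterName in slurm.conf */

	/* strcmp(env_cluster, conf_cluster) would be undefined behavior here
	 * and segfaults on many systems */
	printf("%d\n", strcmp_null_safe(env_cluster, conf_cluster));
	return 0;
}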
         
The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, btw.
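
For completeness, the workaround as it would appear in slurm.conf (the note about restarting the daemons is my assumption about how the change gets picked up):

# add to slurm.conf on all nodes, then restart slurmctld and slurmd
ClusterName=bla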
         
Hope this helps,
Andreas
         
         
         -----"slurm-users" <slurm-users-boun...@lists.schedmd.com>           
wrote: -----         
           
To: slurm-users@lists.schedmd.com
             From: a.vita...@bioc.uzh.ch
             Sent by: "slurm-users" 
               Date: 05/08/2018 12:44AM
               Subject: [slurm-users] srun seg faults immediately from          
     within sbatch but not salloc
               
Dear all,

I am trying to set up a small cluster running slurm on Ubuntu 16.04.
I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. Munge is taken from the system package.
Something like this:

./configure --prefix=/software/slurm/slurm-17.11.5 \
            --exec-prefix=/software/slurm/Gnu \
            --with-pmix=/software/pmix \
            --with-munge=/usr \
            --sysconfdir=/software/slurm/etc
                 
One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is there even if this is not the case). I start the daemons manually at the moment (slurmctld first).
My configuration file looks like this (I removed the node-specific parts):
                 
SlurmdUser=root
#
AuthType=auth/munge
# Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
JobCompLoc=/var/log/slurm/slurm.job.log
JobCompType=jobcomp/filetxt
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
#PluginDir=/usr/local/slurm/lib/slurm
# Prolog=/usr/local/slurm/etc/prolog
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=cadmin # this user exists everywhere
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdTimeout=300
SwitchType=switch/none
TreeWidth=50
#
# logging
StateSaveLocation=/var/log/slurm/tmp
SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool
SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid
SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h
#
# job settings
MaxTasksPerNode=64
MpiDefault=pmix_v2

# plugins
TaskPlugin=task/cgroup
                 
                 
There are no prolog or epilog scripts.
After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes). sinfo produces the expected results.
However, as soon as I try to submit through sbatch, I get an instantaneous seg fault regardless of the executable (even when none is specified, i.e., the srun command is meaningless).

When I try to monitor slurmd in the foreground (-vvvv -D), I get something like this:
                 
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug:  CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug:  Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug:  _rpc_terminate_job, uid = 1001
slurmd: debug:  credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008
                 
Here, job 100 would be a submission script with something like:
                 
#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
######## #SBATCH -pccm

srun
                 
This produces in camp3.err:

/var/log/slurm/tmp/slurmd.stromboli001.spool/job00101/slurm_script: line 9: 144905 Segmentation fault      (core dumped) srun
                 
I tried to recompile pmix and slurm with debug options, but I cannot seem to get any more information than this.

I don't think the MPI integration can be broken per se, as jobs run through salloc+srun seem to work fine.

My understanding of the inner workings of slurm is virtually nonexistent, so I'll be grateful for any clue you may offer.

Andreas (UZH, Switzerland)
                                         
     
