It would really help if you told us which version of Open MPI you are using,
and which version of SLURM.
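
In the meantime, one guess (not verified against your setup): mpirun's SLURM
support reads the job's SLURM_* allocation variables, and the "already filled"
message generally means the slot count it computed from them doesn't match what
it was asked to launch. Rather than bisecting by clearing all of them, you
could unset one candidate per trial run. A minimal sketch, assuming the usual
suspects are the task/CPU count variables (which one actually matters on your
system is an assumption to test; the values below are simulated for
illustration, since in a real job SLURM sets them itself):

```shell
#!/bin/sh
# Sketch: unset one SLURM_* variable per trial instead of all of them.
# Simulated allocation values for illustration only; inside a real job
# these are exported by SLURM.
SLURM_NNODES=1;             export SLURM_NNODES
SLURM_TASKS_PER_NODE=2;     export SLURM_TASKS_PER_NODE
SLURM_JOB_CPUS_PER_NODE=2;  export SLURM_JOB_CPUS_PER_NODE

# Candidate for this trial (assumption: this is the variable mpirun
# misreads; repeat with the others if the crash persists).
candidate=SLURM_TASKS_PER_NODE
unset "$candidate"

# Show which SLURM_* variables remain before invoking mpirun:
env | awk -F= '$1 ~ /^SLURM/ {print $1}' | sort
```

If one single unset makes the crash go away, that narrows the bug report
considerably.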


On Jul 6, 2010, at 12:16 PM, David Roundy wrote:

> Hi all,
> 
> I'm running into trouble running an openmpi job under slurm.  I
> imagine the trouble may be in my slurm configuration, but since the
> error itself involves mpirun crashing, I thought I'd best ask here
> first.  The error message I get is:
> 
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> 
> This shows up when I run my MPI job with the following script:
> 
> #!/bin/sh
> set -ev
> hostname
> mpirun pw.x < pw.in > pw.out 2> errors_pw
> (end of submit.sh)
> 
> submitted with
> 
> sbatch -c 2 submit.sh
> 
> If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
> separate nodes, rather than two separate cores on a single node (which
> makes it extremely slow).  I know that the problem is related somehow
> to the environment variables that are passed to openmpi by slurm,
> since I can fix the crash by changing my script to read:
> 
> #!/bin/sh
> set -ev
> hostname
> # clear SLURM environment variables
> for i in `env | awk -F= '$1 ~ /^SLURM/ {print $1}'`; do
>  echo unsetting $i
>  unset $i
> done
> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
> 
> So you can see that I just clear all the environment variables and
> then specify the number of processors to use manually.  I suppose I
> could use a bisection approach to figure out which environment
> variable is triggering this crash, and then could either edit my
> script to just modify that variable, or could figure out how to make
> slurm pass things differently.  But I thought that before entering
> upon this laborious process, it'd be worth asking on the list in case
> anyone has a suggestion as to what might be going wrong.  I'll be
> happy to provide my slurm config (or anything else that seems useful)
> if you think that would help!
> -- 
> David Roundy
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

