It would really help if you told us what version of OMPI you are using, and what version of SLURM.
On Jul 6, 2010, at 12:16 PM, David Roundy wrote:

> Hi all,
>
> I'm running into trouble running an openmpi job under slurm. I
> imagine the trouble may be in my slurm configuration, but since the
> error itself involves mpirun crashing, I thought I'd best ask here
> first. The error message I get is:
>
> --------------------------------------------------------------------------
> All nodes which are allocated for this job are already filled.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> This shows up when I run my MPI job with the following script:
>
> #!/bin/sh
> set -ev
> hostname
> mpirun pw.x < pw.in > pw.out 2> errors_pw
> (end of submit.sh)
>
> if I submit using
>
> sbatch -c 2 submit.sh
>
> If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
> separate nodes, rather than two separate cores on a single node (which
> makes it extremely slow).
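[For context, not part of the original thread: with SLURM's sbatch, `-N` requests a number of nodes, `-n` a number of tasks, and `-c` a number of CPUs per task, so `-c 2` allocates one task with two cores rather than two MPI process slots. A sketch of a submission asking for two tasks confined to one node; exact flag behavior may vary across SLURM versions:]

```
# -N 1: use exactly one node; -n 2: run two tasks on it.
# mpirun inside the job should then see two slots on the same node.
sbatch -N 1 -n 2 submit.sh
```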
> I know that the problem is related somehow
> to the environment variables that are passed to openmpi by slurm,
> since I can fix the crash by changing my script to read:
>
> #!/bin/sh
> set -ev
> hostname
> # clear SLURM environment variables
> for i in `env | awk -F= '/SLURM/ {print $1}' | grep SLURM`; do
> echo unsetting $i
> unset $i
> done
> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
>
> So you can see that I just clear all the environment variables and
> then specify the number of processors to use manually. I suppose I
> could use a bisection approach to figure out which environment
> variable is triggering this crash, and then could either edit my
> script to just modify that variable, or could figure out how to make
> slurm pass things differently. But I thought that before entering
> upon this laborious process, it'd be worth asking on the list to see
> if anyone has a suggestion as to what might be going wrong? I'll be
> happy to provide my slurm config (or anything else that seems useful)
> if you think that would be helpful!
>
> --
> David Roundy
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
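[Editor's sketch, not from the thread: the env-clearing workaround above can be written as a standalone script that unsets every variable whose name begins with SLURM, with the redundant `grep SLURM` dropped since awk already filters. The mpirun line is shown commented out so the sketch runs anywhere:]

```shell
#!/bin/sh
# Unset every SLURM-provided environment variable so mpirun falls back
# to its own defaults instead of the SLURM allocation description.
for var in $(env | awk -F= '/^SLURM/ {print $1}'); do
    echo "unsetting $var"
    unset "$var"
done
# Then launch with an explicit process count, e.g.:
# mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
```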