Ah yes, it's the versions of each that are packaged in Debian testing: Open MPI 1.4.1 and SLURM 2.1.9.
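In case it helps, here is roughly how I confirmed that on a node (the Debian
package names openmpi-bin and slurm-llnl below are my guess for this release;
the --version flags are the part I'd trust):

#!/bin/sh
# Quick check of the MPI and SLURM versions visible on a node.
mpirun --version     # should report Open MPI 1.4.1
srun --version       # should report slurm 2.1.9
# Cross-check against the installed Debian packages (names assumed):
dpkg -l openmpi-bin slurm-llnl | grep '^ii'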
David

On Tue, Jul 6, 2010 at 11:38 AM, Ralph Castain <r...@open-mpi.org> wrote:
> It would really help if you told us what version of OMPI you are using,
> and what version of SLURM.
>
> On Jul 6, 2010, at 12:16 PM, David Roundy wrote:
>
>> Hi all,
>>
>> I'm running into trouble running an openmpi job under slurm. I
>> imagine the trouble may be in my slurm configuration, but since the
>> error itself involves mpirun crashing, I thought I'd best ask here
>> first. The error message I get is:
>>
>> --------------------------------------------------------------------------
>> All nodes which are allocated for this job are already filled.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> This shows up when I run my MPI job with the following script:
>>
>> #!/bin/sh
>> set -ev
>> hostname
>> mpirun pw.x < pw.in > pw.out 2> errors_pw
>> (end of submit.sh)
>>
>> if I submit using
>>
>> sbatch -c 2 submit.sh
>>
>> If I use "-N 2" instead of "-c 2", the job runs fine, but runs on two
>> separate nodes, rather than two separate cores on a single node (which
>> makes it extremely slow). I know that the problem is related somehow
>> to the environment variables that are passed to openmpi by slurm,
>> since I can fix the crash by changing my script to read:
>>
>> #!/bin/sh
>> set -ev
>> hostname
>> # clear SLURM environment variables
>> for i in `env | awk -F= '/SLURM/ {print $1}' | grep SLURM`; do
>>   echo unsetting $i
>>   unset $i
>> done
>> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
>>
>> So you can see that I just clear all the environment variables and
>> then specify the number of processors to use manually. I suppose I
>> could use a bisection approach to figure out which environment
>> variable is triggering this crash, and then could either edit my
>> script to just modify that variable, or could figure out how to make
>> slurm pass things differently. But I thought that before entering
>> upon this laborious process, it'd be worth asking on the list to see
>> if anyone has a suggestion as to what might be going wrong. I'll be
>> happy to provide my slurm config (or anything else that seems useful)
>> if you think that would be helpful!
>> --
>> David Roundy
--
David Roundy
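P.S. The bisection approach I mentioned in my first message would really just
be one-variable-at-a-time elimination. Something like this sketch is what I
have in mind (submitted the same way as submit.sh, e.g. with "sbatch -c 2";
the per-variable output and error file suffixes are just so I can see which
removal makes the crash go away):

#!/bin/sh
# Sketch: re-run the same mpirun command with exactly one SLURM_*
# variable removed per iteration, to see which one triggers the crash.
# Uses GNU env's -u option to drop a single variable for one command.
set -e
hostname
for var in `env | awk -F= '/^SLURM/ {print $1}'`; do
  echo "=== testing with $var unset ==="
  if env -u "$var" mpirun pw.x < pw.in > pw.out.$var 2> errors.$var; then
    echo "job survived with $var unset"
  else
    echo "job still failed with $var unset"
  fi
done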