Ah yes,

I'm using the versions packaged in Debian testing, which
are Open MPI 1.4.1 and SLURM 2.1.9.
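
(For what it's worth, these can be double-checked on a node with
something like

ompi_info | grep "Open MPI:"
srun --version

in case the packages and what the nodes actually run ever disagree.)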

David

On Tue, Jul 6, 2010 at 11:38 AM, Ralph Castain <r...@open-mpi.org> wrote:
> It would really help if you told us what version of OMPI you are using, and 
> what version of SLURM.
>
>
> On Jul 6, 2010, at 12:16 PM, David Roundy wrote:
>
>> Hi all,
>>
>> I'm running into trouble running an Open MPI job under SLURM.  I
>> imagine the trouble may be in my SLURM configuration, but since the
>> error itself involves mpirun crashing, I thought I'd best ask here
>> first.  The error message I get is:
>>
>> --------------------------------------------------------------------------
>> All nodes which are allocated for this job are already filled.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>> This shows up when I run my MPI job with the following script:
>>
>> #!/bin/sh
>> set -ev
>> hostname
>> mpirun pw.x < pw.in > pw.out 2> errors_pw
>> (end of submit.sh)
>>
>> when I submit it using
>>
>> sbatch -c 2 submit.sh
>>
>> If I use "-N 2" instead of "-c 2", the job runs fine, but it runs on
>> two separate nodes rather than on two cores of a single node, which
>> makes it extremely slow (more on these flags after the script below).
>> I know the problem is somehow related to the environment variables
>> that SLURM passes to Open MPI, since I can fix the crash by changing
>> my script to read:
>>
>> #!/bin/sh
>> set -ev
>> hostname
>> # clear SLURM environment variables
>> for i in `env | awk -F= '/^SLURM/ {print $1}'`; do
>>   echo unsetting $i
>>   unset $i
>> done
>> mpirun -np 2 pw.x < pw.in > pw.out 2> errors_pw
>>
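>> (On the flags above: as I understand it, "-c 2" means
>> --cpus-per-task=2, i.e. one task with two CPUs, while "-N 2" means
>> --nodes=2.  What I actually want, two tasks on one node, might be
>> expressed directly with something like
>>
>> sbatch -n 2 -N 1 submit.sh
>>
>> though I haven't verified that this avoids the crash.)
>>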
>> So you can see that I just clear all the SLURM environment variables
>> and then specify the number of processes manually.  I suppose I
>> could use a bisection approach to figure out which environment
>> variable is triggering this crash, and then either edit my script to
>> modify just that variable, or figure out how to make SLURM pass
>> things differently.  But before entering upon this laborious
>> process, I thought it'd be worth asking on the list whether anyone
>> has a suggestion as to what might be going wrong.  I'll be happy to
>> provide my SLURM config (or anything else that seems useful) if you
>> think that would be helpful!
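>>
>> To make that concrete, here is an untested sketch of the simpler
>> one-at-a-time variant of that idea (a linear scan rather than a true
>> bisection), using env -u to drop a single variable per run:
>>
>> #!/bin/sh
>> # Untested sketch: rerun the job with one SLURM variable removed at
>> # a time.  If a run then succeeds, that variable is the likely
>> # trigger of the crash.
>> for v in `env | awk -F= '/^SLURM/ {print $1}'`; do
>>   echo trying without $v
>>   if env -u $v mpirun pw.x < pw.in > pw.out 2> errors_pw; then
>>     echo "$v looks like the culprit"
>>   fi
>> done
>>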
>> --
>> David Roundy
-- 
David Roundy
