I asked grid IT, and they said they had to kill the job because it was
overloading nodes: they saw loads of up to 180 instead of close to 12 on
12-core nodes. They think that blastx is not an Open MPI application, so mpirun
is spawning 64-96 blastx processes, each of which is then starting
up 96 worker threads. Or, if blastx can work with Open MPI, my
mpirun syntax is wrong. Any advice?
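
In case it is useful, this is the kind of non-MPI submission I am considering instead. It is only a sketch: it assumes blastx here is the ordinary multithreaded NCBI binary (not an MPI build), and the shared-memory PE name "smp" is a guess I would still need to confirm with grid IT:

#$ -S /bin/bash
#$ -cwd
#$ -N SC3blastx_12thr
#$ -pe smp 12        # hypothetical shared-memory PE; actual name needs confirming
#$ -l h_rt=24:00:00,vf=3G
#$ -j y
#$ -V

# a single blastx process, no mpirun; thread count matches the allocated slots
blastx -query $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta \
    -db nr -out $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out \
    -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20 \
    -lcase_masking -num_threads $NSLOTS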

I was advised earlier to use '-pe openmpi ARG', where ARG =
number_of_processes x number_of_threads, and then to pass the desired number of
threads as 'mpirun -np $NSLOTS -cpus-per-proc <number_of_threads>'. When I
did that, I got an error that more threads were requested than the number of
physical cores.
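
For reference, this is how I interpreted that advice. The numbers (8 processes x 12 threads = 96 slots) are only an example, and dividing NSLOTS by the thread count for -np is my own reading of it, which may be exactly where I went wrong:

# request ARG = number_of_processes x number_of_threads slots, e.g. 8 x 12 = 96
#$ -pe openmpi 96

THREADS=12
PROCS=$((NSLOTS / THREADS))   # 96 / 12 = 8 MPI-launched blastx processes

# my reading: -np takes the process count (not $NSLOTS), and -cpus-per-proc
# reserves THREADS cores per process; blastx arguments as in the script below
mpirun -np $PROCS -cpus-per-proc $THREADS blastx ... -num_threads $THREADS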

On Tue, Apr 30, 2013 at 2:35 PM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
>
> On 30.04.2013 at 21:26, Vladimir Yamshchikov wrote:
>
> > My recent job started normally but after a few hours of running died
> with the following message:
> >
> >
> --------------------------------------------------------------------------
> > A daemon (pid 19390) died unexpectedly with status 137 while attempting
> > to launch so we are aborting.
>
> I wonder why it raised the failure only after running for hours. As 137 =
> 128 + 9, it was killed, maybe by the queuing system due to the set time
> limit? If you check the accounting, what is the output of:
>
> $ qacct -j <job_id>
>
> -- Reuti
>
>
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> >
> --------------------------------------------------------------------------
> >
> --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> >
> > The scheduling script is below:
> >
> > #$ -S /bin/bash
> > #$ -cwd
> > #$ -N SC3blastx_64-96thr
> > #$ -pe openmpi* 64-96
> > #$ -l h_rt=24:00:00,vf=3G
> > #$ -j y
> > #$ -M yaxi...@gmail.com
> > #$ -m eas
> > #
> > # Load the appropriate module files
> > # Should be loaded already
> > #$ -V
> >
> > mpirun -np $NSLOTS blastx -query
> $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta
> -db nr -out
> $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out
> -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20
> -lcase_masking -num_threads $NSLOTS
> >
> > What caused this termination? It does not seem to be a scheduling problem, as
> the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH does have
> the /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should I modify it?
> >
> > Vladimir
