On Apr 30, 2013, at 1:34 PM, Vladimir Yamshchikov <yaxi...@gmail.com> wrote:
> I asked grid IT and they said they had to kill it, as the job was overloading
> nodes. They saw loads up to 180 instead of close to 12 on 12-core nodes. They
> think that blastx is not an openmpi application, so openMPI is spawning
> between 64-96 blastx processes, each of which is then starting up 96 worker
> threads. Or, if blastx can work with openmpi, my mpirun syntax for blastx is
> wrong. Any advice?
>
> I was advised earlier to use '-pe openmpi [ARG]', where ARG =
> number_of_processes x number_of_threads, and then to pass the desired number
> of threads as 'mpirun -np $NSLOTS -cpus-per-proc [number_of_threads]'. When I
> did that, I got an error that more threads were requested than the number of
> physical cores.

How many threads are you trying to launch?? If it is a 12-core node, then you
can't have more than 12 - sounds like you are trying to start up 96!

>
> On Tue, Apr 30, 2013 at 2:35 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> Am 30.04.2013 um 21:26 schrieb Vladimir Yamshchikov:
>
> > My recent job started normally but after a few hours of running died with
> > the following message:
> >
> > --------------------------------------------------------------------------
> > A daemon (pid 19390) died unexpectedly with status 137 while attempting
> > to launch so we are aborting.
>
> I wonder why it raised the failure only after running for hours. As 137 =
> 128 + 9, it was killed, maybe by the queuing system due to the set time
> limit? If you check the accounting, what is the output of:
>
> $ qacct -j <job_id>
>
> -- Reuti
>
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> >
> > The scheduling script is below:
> >
> > #$ -S /bin/bash
> > #$ -cwd
> > #$ -N SC3blastx_64-96thr
> > #$ -pe openmpi* 64-96
> > #$ -l h_rt=24:00:00,vf=3G
> > #$ -j y
> > #$ -M yaxi...@gmail.com
> > #$ -m eas
> > #
> > # Load the appropriate module files
> > # Should be loaded already
> > #$ -V
> >
> > mpirun -np $NSLOTS blastx -query
> > $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta
> > -db nr -out
> > $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out
> > -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20
> > -lcase_masking -num_threads $NSLOTS
> >
> > What caused this termination? It does not seem to be a scheduling problem,
> > as the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH
> > does have the /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should I
> > modify it?
> >
> > Vladimir
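
On Reuti's point about the exit status: 137 is 128 plus signal 9 (SIGKILL),
which a bash shell can confirm directly, and Grid Engine's accounting then
shows whether the scheduler delivered that signal and why. A minimal check,
assuming a bash shell and SGE's qacct on the submit host:

$ kill -l $((137 - 128))
KILL
$ qacct -j <job_id> | egrep 'failed|exit_status|maxvmem|ru_wallclock'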
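
The error Vladimir hit with -cpus-per-proc follows from the arithmetic:
-np $NSLOTS asks for one rank per slot, and -cpus-per-proc then multiplies
that by the thread count again (96 ranks x 12 cores would need 1152 cores).
For a code that really is MPI plus threads, the rank count has to be the slot
count divided by the threads per rank. A sketch only, assuming Open MPI 1.6's
--cpus-per-proc option, 12 threads per 12-core node, and a hypothetical MPI
binary my_mpi_app; it does not apply to blastx, which is not an MPI program:

# 96 slots = 8 MPI ranks x 12 threads per rank (one rank per 12-core node)
#$ -pe openmpi* 96
mpirun -np $((NSLOTS / 12)) --cpus-per-proc 12 my_mpi_app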
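
The simpler resolution the thread points toward: blastx (NCBI BLAST+)
parallelizes with threads only, so there is nothing for mpirun to distribute;
launching it under mpirun starts $NSLOTS independent copies, each spawning
$NSLOTS threads, which explains the node loads far above 12 that grid IT
observed. Below is a sketch of a single-node submission reusing the paths from
the original script; the 'smp' parallel-environment name and the 12-slot
request are assumptions that depend on the local cluster configuration:

#$ -S /bin/bash
#$ -cwd
#$ -N SC3blastx_12thr
#$ -pe smp 12                 # single-node PE; the name varies by site
#$ -l h_rt=24:00:00,vf=3G
#$ -j y
#$ -V

# One blastx process, no mpirun; threads match the slots granted on this node.
blastx -query $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta \
    -db nr \
    -out $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out \
    -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20 \
    -lcase_masking -num_threads $NSLOTS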