On Apr 30, 2013, at 1:54 PM, Vladimir Yamshchikov <yaxi...@gmail.com> wrote:

> This is the question I am trying to answer - how many threads I can use with
> blastx on a grid? If I could request resources by node, I would use the
> -pernode option to have one process per node and then specify the correct
> number of threads for each node. But I cannot; resources (slots) are
> requested per core (per process),

I don't believe that is true - resources are requested for the entire job, not
for each process.

> so I was instructed to request the total number of slots. However, as the
> allocated cores are spread across the nodes, it looks like this messes up the
> scheduling and causes overload.

I suggest you look at the SGE documentation - I don't think you are using it
correctly.
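
For reference, a rough sketch of the per-node layout being discussed, using
Open MPI 1.6-era mpirun options; the 12 cores per node comes from later in the
thread, the program name is only a placeholder, and (as noted further down)
blastx is not MPI-aware, so each rank launched this way would be an independent
copy of the program:

    # one process per node, each given 12 cores
    mpirun -pernode -cpus-per-proc 12 ./threaded_program

    # or with an explicit process count, e.g. 4 nodes x 12 cores = 48 slots
    mpirun -np 4 -npernode 1 -cpus-per-proc 12 ./threaded_program
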

> On Tue, Apr 30, 2013 at 3:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Apr 30, 2013, at 1:34 PM, Vladimir Yamshchikov <yaxi...@gmail.com> wrote:
>
>> I asked grid IT and they said they had to kill it as the job was overloading
>> nodes. They saw loads up to 180 instead of close to 12 on 12-core nodes.
>> They think that blastx is not an Open MPI application, so Open MPI is
>> spawning between 64 and 96 blastx processes, each of which is then starting
>> up 96 worker threads. Or, if blastx can work with Open MPI, my blastx/mpirun
>> syntax is wrong. Any advice?
>>
>> I was advised earlier to use '-pe openmpi [ARG]', where ARG =
>> number_of_processes x number_of_threads, and then to pass the desired number
>> of threads as 'mpirun -np $NSLOTS --cpus-per-proc [number_of_threads]'. When
>> I did that, I got an error that more threads were requested than the number
>> of physical cores.
>
> How many threads are you trying to launch?? If it is a 12-core node, then you
> can't have more than 12 - sounds like you are trying to start up 96!
>
>> On Tue, Apr 30, 2013 at 2:35 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>> Hi,
>>
>> Am 30.04.2013 um 21:26 schrieb Vladimir Yamshchikov:
>>
>> > My recent job started normally but after a few hours of running died with
>> > the following message:
>> >
>> > --------------------------------------------------------------------------
>> > A daemon (pid 19390) died unexpectedly with status 137 while attempting
>> > to launch so we are aborting.
>>
>> I wonder why it raised the failure only after running for hours. As 137 =
>> 128 + 9, it was killed, maybe by the queuing system due to the set time
>> limit? If you check the accounting, what is the output of:
>>
>> $ qacct -j <job_id>
>>
>> -- Reuti
>>
>> > There may be more information reported by the environment (see above).
>> >
>> > This may be because the daemon was unable to find all the needed shared
>> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> > location of the shared libraries on the remote nodes and this will
>> > automatically be forwarded to the remote nodes.
>> > --------------------------------------------------------------------------
>> > --------------------------------------------------------------------------
>> > mpirun noticed that the job aborted, but has no info as to the process
>> > that caused that situation.
>> >
>> > The scheduling script is below:
>> >
>> > #$ -S /bin/bash
>> > #$ -cwd
>> > #$ -N SC3blastx_64-96thr
>> > #$ -pe openmpi* 64-96
>> > #$ -l h_rt=24:00:00,vf=3G
>> > #$ -j y
>> > #$ -M yaxi...@gmail.com
>> > #$ -m eas
>> > #
>> > # Load the appropriate module files
>> > # Should be loaded already
>> > #$ -V
>> >
>> > mpirun -np $NSLOTS blastx -query
>> > $UABGRID_SCRATCH/SC/AdQ30/fasta/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.fasta
>> > -db nr -out
>> > $UABGRID_SCRATCH/SC/blastx/SC/SC1-IS4-Ind1-153ngFr1sep1run1R1AdQ30.out
>> > -evalue 0.001 -max_intron_length 100000 -outfmt 5 -num_alignments 20
>> > -lcase_masking -num_threads $NSLOTS
>> >
>> > What caused this termination? It does not seem to be a scheduling problem,
>> > as the program ran for several hours with 96 threads. My $LD_LIBRARY_PATH
>> > does have the /share/apps/openmpi/1.6.4-gcc/lib entry, so how else should
>> > I modify it?
>> >
>> > Vladimir
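
For reference, a minimal sketch of the single-node alternative implied by the
diagnosis above: blastx is multi-threaded rather than MPI-parallel, so running
one copy with one thread per allocated core avoids the oversubscription. The
PE name "smp" and the 12-slot request are assumptions about the local Grid
Engine setup, and the input/output paths are placeholders:

    #$ -S /bin/bash
    #$ -cwd
    #$ -N SC3blastx_12thr
    # hypothetical single-host PE ("smp" is site-specific): all 12 slots on one node
    #$ -pe smp 12
    #$ -l h_rt=24:00:00,vf=3G
    #$ -j y
    #$ -V

    # start blastx directly (no mpirun), with one thread per allocated slot
    blastx -query input.fasta -db nr -out output.xml \
           -evalue 0.001 -outfmt 5 -num_alignments 20 \
           -num_threads $NSLOTS

If a job gets killed again, the accounting record Reuti asks about can be
inspected after the fact, for example:

    qacct -j <job_id> | egrep 'failed|exit_status|maxvmem|ru_wallclock'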