Reuti,

I discovered what the error was: I forgot to include '-fopenmp' when I compiled the objects in the Makefile, so the program ran but never divided the work into threads. Now the program is working and I can use up to 15 cores per machine in the queue one.q.

Anyway, I would like to try to implement your advice. Since I'm not alone in the cluster, I must follow your second suggestion. The steps would be:

a) Use '$ qconf -mp orte' to change the allocation rule to 8
b) Set '#$ -pe orte 80' in the script
c) I'm not sure how to do this step and I'd appreciate your help here. I can add some lines to the script to print the PE_HOSTFILE path and contents, but I don't know how to alter it (a rough, untested sketch of what I have in mind is at the end of this message):

echo "PE_HOSTFILE:"
echo $PE_HOSTFILE
echo
echo "cat PE_HOSTFILE:"
cat $PE_HOSTFILE

Thanks for taking the time to answer these emails; your advice has been very useful.

PS: The version of SGE is OGS/GE 2011.11p1
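Here is the sketch I mentioned in step c). It is untested and only meant to show the idea; I am assuming that the slot count of every entry in the $PE_HOSTFILE is evenly divisible by OMP_NUM_THREADS, that the job-private $TMPDIR is an acceptable place for the altered copy, that the job should go to one.q, and that the rest of my script can stay as it is. Please correct anything that is wrong:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 80
#$ -N job
#$ -q one.q

export OMP_NUM_THREADS=8

# Copy the granted hostfile, dividing each slot count by OMP_NUM_THREADS
# (e.g. 8 slots -> 1 slot), so that Open MPI starts only one process per machine.
# Assumes every slot count is evenly divisible by OMP_NUM_THREADS.
awk -v t=$OMP_NUM_THREADS '{ $2 = $2 / t; print }' $PE_HOSTFILE > $TMPDIR/pe_hostfile_omp
export PE_HOSTFILE=$TMPDIR/pe_hostfile_omp

echo "altered PE_HOSTFILE:"
cat $PE_HOSTFILE

/usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe

Is that roughly what you meant in step c)? (For reference, the Makefile change that fixed the threading was adding -fopenmp to the rule that compiles the objects: '$(f90comp) -fopenmp -c $<'.)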
Oscar Fabian Mojica Ladino
Geologist M.S. in Geophysics

> From: re...@staff.uni-marburg.de
> Date: Fri, 15 Aug 2014 20:38:12 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>
> Hi,
>
> On 15.08.2014 at 19:56, Oscar Mojica wrote:
>
> > Yes, my installation of Open MPI is SGE-aware. I got the following
> >
> > [oscar@compute-1-2 ~]$ ompi_info | grep grid
> > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
>
> Fine.
>
> > I'm a bit slow and I didn't understand the last part of your message, so I made a test to resolve my doubts.
> > This is the cluster configuration (some machines are turned off, but that is not a problem):
> >
> > [oscar@aguia free-noise]$ qhost
> > HOSTNAME          ARCH        NCPU  LOAD  MEMTOT  MEMUSE   SWAPTO   SWAPUS
> > -------------------------------------------------------------------------------
> > global            -              -     -       -       -        -        -
> > compute-1-10      linux-x64     16  0.97   23.6G  558.6M   996.2M      0.0
> > compute-1-11      linux-x64     16     -   23.6G       -   996.2M        -
> > compute-1-12      linux-x64     16  0.97   23.6G  561.1M   996.2M      0.0
> > compute-1-13      linux-x64     16  0.99   23.6G  558.7M   996.2M      0.0
> > compute-1-14      linux-x64     16  1.00   23.6G  555.1M   996.2M      0.0
> > compute-1-15      linux-x64     16  0.97   23.6G  555.5M   996.2M      0.0
> > compute-1-16      linux-x64      8  0.00   15.7G  296.9M  1000.0M      0.0
> > compute-1-17      linux-x64      8  0.00   15.7G  299.4M  1000.0M      0.0
> > compute-1-18      linux-x64      8     -   15.7G       -  1000.0M        -
> > compute-1-19      linux-x64      8     -   15.7G       -   996.2M        -
> > compute-1-2       linux-x64     16  1.19   23.6G  468.1M  1000.0M      0.0
> > compute-1-20      linux-x64      8  0.04   15.7G  297.2M  1000.0M      0.0
> > compute-1-21      linux-x64      8     -   15.7G       -  1000.0M        -
> > compute-1-22      linux-x64      8  0.00   15.7G  297.2M  1000.0M      0.0
> > compute-1-23      linux-x64      8  0.16   15.7G  299.6M  1000.0M      0.0
> > compute-1-24      linux-x64      8  0.00   15.7G  291.5M   996.2M      0.0
> > compute-1-25      linux-x64      8  0.04   15.7G  293.4M   996.2M      0.0
> > compute-1-26      linux-x64      8     -   15.7G       -  1000.0M        -
> > compute-1-27      linux-x64      8  0.00   15.7G  297.0M  1000.0M      0.0
> > compute-1-29      linux-x64      8     -   15.7G       -  1000.0M        -
> > compute-1-3       linux-x64     16     -   23.6G       -   996.2M        -
> > compute-1-30      linux-x64     16     -   23.6G       -   996.2M        -
> > compute-1-4       linux-x64     16  0.97   23.6G  571.6M   996.2M      0.0
> > compute-1-5       linux-x64     16  1.00   23.6G  559.6M   996.2M      0.0
> > compute-1-6       linux-x64     16  0.66   23.6G  403.1M   996.2M      0.0
> > compute-1-7       linux-x64     16  0.95   23.6G  402.7M   996.2M      0.0
> > compute-1-8       linux-x64     16  0.97   23.6G  556.8M   996.2M      0.0
> > compute-1-9       linux-x64     16  1.02   23.6G  566.0M  1000.0M      0.0
> >
> > I ran my program using only MPI with 10 processors of the queue one.q, which has 14 machines (compute-1-2 to compute-1-15).
> > With 'qstat -t' I got:
> >
> > [oscar@aguia free-noise]$ qstat -t
> > job-ID  prior    name  user   state  submit/start at      queue                     master  ja-task-ID      task-ID  state  cpu       mem        io       stat  failed
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-2.local   MASTER                           r      00:49:12  554.13753  0.09163
> >                                                           one.q@compute-1-2.local   SLAVE
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-5.local   SLAVE   1.compute-1-5   r      00:48:53  551.49022  0.09410
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-9.local   SLAVE   1.compute-1-9   r      00:50:00  564.22764  0.09409
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-12.local  SLAVE   1.compute-1-12  r      00:47:30  535.30379  0.09379
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-13.local  SLAVE   1.compute-1-13  r      00:49:51  561.69868  0.09379
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-14.local  SLAVE   1.compute-1-14  r      00:49:14  554.60818  0.09379
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-10.local  SLAVE   1.compute-1-10  r      00:49:59  562.95487  0.09349
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-15.local  SLAVE   1.compute-1-15  r      00:50:01  563.27221  0.09361
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-8.local   SLAVE   1.compute-1-8   r      00:49:26  556.68431  0.09349
> > 2726    0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-4.local   SLAVE   1.compute-1-4   r      00:49:27  556.87510  0.04967
>
> Yes, here you got 10 slots (= cores) granted by SGE. So there is no free core left inside the allocation of SGE to allow the use of additional cores for your threads. If you use more cores than granted by SGE, it will oversubscribe the machines.
>
> The issue is now:
>
> a) If you want 8 threads per MPI process, your job will use 80 cores in total - for now SGE isn't aware of it.
>
> b) Although you specified $fill_up as allocation rule, it looks like $round_robin. Is there more than one slot defined in the queue definition of one.q to get exclusive access?
>
> c) What version of SGE are you using? Certain ones use cgroups or bind processes directly to cores (although it usually needs to be requested by the job: first line of `qconf -help`).
>
> In case you are alone in the cluster, you could bypass the allocation with b) (unless you are hit by c)). But with a mixture of users and jobs, a different handling would be necessary to do this in a proper way IMO:
>
> a) having a PE with a fixed allocation rule of 8
>
> b) requesting this PE with an overall slot count of 80
>
> c) copy and alter the $PE_HOSTFILE to show only (granted core count per machine) divided by (OMP_NUM_THREADS) per entry, and change $PE_HOSTFILE so that it points to the altered file
>
> d) Open MPI with a Tight Integration will now start only N processes per machine according to the altered hostfile, in your case one
>
> e) Your application can start the desired threads and you stay inside the granted allocation
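> As an illustration of c) (hypothetical values; assuming allocation_rule 8, an 80-slot request and OMP_NUM_THREADS=8; the exact columns of the file can differ between SGE versions), the granted $PE_HOSTFILE might look like:
>
> compute-1-2.local 8 one.q@compute-1-2.local UNDEFINED
> compute-1-5.local 8 one.q@compute-1-5.local UNDEFINED
> ... (one entry per granted machine, 10 machines x 8 slots = 80)
>
> and the altered copy that the job script writes and points $PE_HOSTFILE to would then contain:
>
> compute-1-2.local 1 one.q@compute-1-2.local UNDEFINED
> compute-1-5.local 1 one.q@compute-1-5.local UNDEFINED
> ... (so Open MPI starts one process per machine and your 8 threads fill the 8 granted cores)
>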
> -- Reuti
>
> > I accessed the MASTER machine with 'ssh compute-1-2.local', and with
> >
> > $ ps -e f
> >
> > I got this (I'm showing only the last lines):
> >
> >  2506 ?        Ss     0:00 /usr/sbin/atd
> >  2548 tty1     Ss+    0:00 /sbin/mingetty /dev/tty1
> >  2550 tty2     Ss+    0:00 /sbin/mingetty /dev/tty2
> >  2552 tty3     Ss+    0:00 /sbin/mingetty /dev/tty3
> >  2554 tty4     Ss+    0:00 /sbin/mingetty /dev/tty4
> >  2556 tty5     Ss+    0:00 /sbin/mingetty /dev/tty5
> >  2558 tty6     Ss+    0:00 /sbin/mingetty /dev/tty6
> >  3325 ?        Sl     0:04 /opt/gridengine/bin/linux-x64/sge_execd
> > 17688 ?        S      0:00  \_ sge_shepherd-2726 -bg
> > 17695 ?        Ss     0:00      \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
> > 17797 ?        S      0:00          \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> > 17798 ?        S      0:01              \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> > 17799 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local PATH=/opt/openmpi/bin:$PATH ; expo
> > 17800 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local PATH=/opt/openmpi/bin:$PATH ; expo
> > 17801 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local PATH=/opt/openmpi/bin:$PATH ; exp
> > 17802 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local PATH=/opt/openmpi/bin:$PATH ; exp
> > 17803 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local PATH=/opt/openmpi/bin:$PATH ; exp
> > 17804 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local PATH=/opt/openmpi/bin:$PATH ; exp
> > 17805 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local PATH=/opt/openmpi/bin:$PATH ; exp
> > 17806 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local PATH=/opt/openmpi/bin:$PATH ; expo
> > 17807 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local PATH=/opt/openmpi/bin:$PATH ; expo
> > 17826 ?        R     31:36                  \_ ./inverse.exe
> >  3429 ?        Ssl    0:00 automount --pid-file /var/run/autofs.pid
> >
> > So the job is using the 10 machines; up to here everything is all right. Do you think that if I change the "allocation_rule" to a number instead of $fill_up, the MPI processes would divide the work into that number of threads?
> >
> > Thanks a lot
> >
> > Oscar Fabian Mojica Ladino
> > Geologist M.S. in Geophysics
> >
> > PS: I have another doubt: what is a slot? Is it a physical core?
> >
> > > From: re...@staff.uni-marburg.de
> > > Date: Thu, 14 Aug 2014 23:54:22 +0200
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > >
> > > Hi,
> > >
> > > I think this is a broader issue in case an MPI library is used in conjunction with threads while running inside a queuing system. First: whether your actual installation of Open MPI is SGE-aware you can check with:
> > >
> > > $ ompi_info | grep grid
> > > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
> > >
> > > Then we can look at the definition of your PE: "allocation_rule $fill_up". This means that SGE will grant you 14 slots in total in any combination on the available machines, meaning an 8+4+2 slot allocation is an allowed combination, as is 4+4+3+3, and so on. Depending on the SGE-awareness it's a question: will your application just start processes on all nodes and completely disregard the granted allocation, or, as the other extreme, does it stay on one and the same machine for all started processes?
> > > On the master node of the parallel job you can issue:
> > >
> > > $ ps -e f
> > >
> > > (f w/o -) to have a look whether `ssh` or `qrsh -inherit ...` is used to reach other machines and their requested process count.
> > >
> > > Now to the common problem in such a set up:
> > >
> > > AFAICS: for now there is no way in the Open MPI + SGE combination to specify the number of MPI processes and the intended number of threads in a way that is automatically read by Open MPI while staying inside the granted slot count and allocation. So it seems to be necessary to have the intended number of threads honored by Open MPI too.
> > >
> > > Hence specifying e.g. "allocation_rule 8" in such a setup while requesting 32 processes would for now start 32 processes by MPI already, as Open MPI reads the $PE_HOSTFILE and acts accordingly.
> > >
> > > Open MPI would have to read the generated machine file in a slightly different way regarding threads: a) read the $PE_HOSTFILE, b) divide the granted slots per machine by OMP_NUM_THREADS, c) throw an error in case it's not divisible by OMP_NUM_THREADS. Then start one process per quotient.
> > >
> > > Would this work for you?
> > >
> > > -- Reuti
> > >
> > > PS: This would also mean having a couple of PEs in SGE with a fixed "allocation_rule". While this works right now, an extension in SGE could be "$fill_up_omp"/"$round_robin_omp" using OMP_NUM_THREADS there too; hence it must not be specified as an `export` in the job script but either on the command line or inside the job script in #$ lines as job requests. This would mean collecting slots in bunches of OMP_NUM_THREADS on each machine to reach the overall specified slot count. Whether OMP_NUM_THREADS or n times OMP_NUM_THREADS is allowed per machine needs to be discussed.
> > >
> > > PS2: As Univa SGE can also supply a list of granted cores in the $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow any UGE-aware binding.
> > >
> > > On 14.08.2014 at 21:52, Oscar Mojica wrote:
> > >
> > > > Guys
> > > >
> > > > I changed the line that runs the program in the script, with both options:
> > > >
> > > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
> > > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
> > > >
> > > > but I got the same results. When I use man mpirun it shows:
> > > >
> > > > -bind-to-none, --bind-to-none
> > > >     Do not bind processes. (Default.)
> > > >
> > > > and the output of 'qconf -sp orte' is
> > > >
> > > > pe_name            orte
> > > > slots              9999
> > > > user_lists         NONE
> > > > xuser_lists        NONE
> > > > start_proc_args    /bin/true
> > > > stop_proc_args     /bin/true
> > > > allocation_rule    $fill_up
> > > > control_slaves     TRUE
> > > > job_is_first_task  FALSE
> > > > urgency_slots      min
> > > > accounting_summary TRUE
> > > >
> > > > I don't know if the installed Open MPI was compiled with '--with-sge'. How can I know that?
> > > > Before thinking of a hybrid application I was using only MPI, and the program used few processors (14). The cluster has 28 machines, 15 with 16 cores and 13 with 8 cores, totaling 344 processing units. When I submitted the job (only MPI), the MPI processes were spread to the cores directly; for that reason I created a new queue with 14 machines, trying to gain more time.
> > > > The results were the same in both cases. In the latter case I could verify that the processes were distributed to all machines correctly.
> > > >
> > > > What must I do?
> > > > Thanks
> > > >
> > > > Oscar Fabian Mojica Ladino
> > > > Geologist M.S. in Geophysics
> > > >
> > > > > Date: Thu, 14 Aug 2014 10:10:17 -0400
> > > > > From: maxime.boissonnea...@calculquebec.ca
> > > > > To: us...@open-mpi.org
> > > > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > > > >
> > > > > Hi,
> > > > > You DEFINITELY need to disable OpenMPI's new default binding. Otherwise, your N threads will run on a single core. --bind-to socket would be my recommendation for hybrid jobs.
> > > > >
> > > > > Maxime
> > > > >
> > > > > On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
> > > > > > I don't know much about OpenMP, but do you need to disable Open MPI's default bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)?
> > > > > >
> > > > > > You can try "mpirun --bind-to none ...", which will have Open MPI not bind MPI processes to cores, which might allow OpenMP to think that it can use all the cores, and therefore it will spawn num_cores threads...?
> > > > > >
> > > > > > On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> wrote:
> > > > > >
> > > > > >> Hello everybody
> > > > > >>
> > > > > >> I am trying to run a hybrid MPI + OpenMP program in a cluster. I created a queue with 14 machines, each one with 16 cores. The program divides the work among the 14 processors with MPI, and within each processor a loop is also divided into 8 threads, for example, using OpenMP. The problem is that when I submit the job to the queue, the MPI processes don't divide the work into threads, and the program prints the number of threads working within each process as one.
> > > > > >>
> > > > > >> I made a simple test program that uses OpenMP, and I logged in to one machine of the fourteen.
> > > > > >> I compiled it using gfortran -fopenmp program.f -o exe, set the OMP_NUM_THREADS environment variable to 8, and when I ran it directly in the terminal the loop was effectively divided among the cores; in this case, for example, the program printed the number of threads as 8.
> > > > > >>
> > > > > >> This is my Makefile:
> > > > > >>
> > > > > >> # Start of the makefile
> > > > > >> # Defining variables
> > > > > >> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> > > > > >> #f90comp = /opt/openmpi/bin/mpif90
> > > > > >> f90comp = /usr/bin/mpif90
> > > > > >> #switch = -O3
> > > > > >> executable = inverse.exe
> > > > > >> # Makefile
> > > > > >> all : $(executable)
> > > > > >> $(executable) : $(objects)
> > > > > >> 	$(f90comp) -fopenmp -g -O -o $(executable) $(objects)
> > > > > >> 	rm $(objects)
> > > > > >> %.o: %.f
> > > > > >> 	$(f90comp) -c $<
> > > > > >> # Cleaning everything
> > > > > >> clean:
> > > > > >> 	rm $(executable)
> > > > > >> #	rm $(objects)
> > > > > >> # End of the makefile
> > > > > >>
> > > > > >> and the script that I am using is
> > > > > >>
> > > > > >> #!/bin/bash
> > > > > >> #$ -cwd
> > > > > >> #$ -j y
> > > > > >> #$ -S /bin/bash
> > > > > >> #$ -pe orte 14
> > > > > >> #$ -N job
> > > > > >> #$ -q new.q
> > > > > >>
> > > > > >> export OMP_NUM_THREADS=8
> > > > > >> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe
> > > > > >>
> > > > > >> Am I forgetting something?
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> Oscar Fabian Mojica Ladino
> > > > > >> Geologist M.S. in Geophysics
> > > > > >> _______________________________________________
> > > > > >> users mailing list
> > > > > >> us...@open-mpi.org
> > > > > >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > >> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25016.php
> > > > > >
> > > > >
> > > > > --
> > > > > ---------------------------------
> > > > > Maxime Boissonneault
> > > > > Analyste de calcul - Calcul Québec, Université Laval
> > > > > Ph. D. en physique
> > > > >
> > > > > _______________________________________________
> > > > > users mailing list
> > > > > us...@open-mpi.org
> > > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25020.php
> > > > _______________________________________________
> > > > users mailing list
> > > > us...@open-mpi.org
> > > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25032.php
> > >
> > > _______________________________________________
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25034.php
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25037.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25038.php