Reuti
I discovered what the error was. I forgot to include '-fopenmp' when I 
compiled the objects in the Makefile, so the program ran but it didn't 
divide the work among threads. Now the program is working and I can use up to 
15 cores per machine in the queue one.q.
Anyway, I would like to try to implement your advice. Since I'm not alone in 
the cluster, I must follow your second suggestion. The steps are:
a) Use '$ qconf -mp orte' to change the allocation rule to 8
b) Set '#$ -pe orte 80' in the script
c) I'm not sure how to do this step and would appreciate your help here. I can 
add some lines to the script to print the PE_HOSTFILE path and contents, but I 
don't know how to alter it. This is what I have so far:
echo "PE_HOSTFILE:"echo $PE_HOSTFILEechoecho "cat PE_HOSTFILE:"cat $PE_HOSTFILE 
Thanks for taking the time to answer these emails; your advice has been very useful.
PS: The version of SGE is OGS/GE 2011.11p1

Oscar Fabian Mojica Ladino
Geologist M.S. in  Geophysics


> From: re...@staff.uni-marburg.de
> Date: Fri, 15 Aug 2014 20:38:12 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> 
> Hi,
> 
> On 15.08.2014 at 19:56, Oscar Mojica wrote:
> 
> > Yes, my installation of Open MPI is SGE-aware. I got the following
> > 
> > [oscar@compute-1-2 ~]$ ompi_info | grep grid
> >                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
> 
> Fine.
> 
> 
> > I'm a bit slow and I didn't understand the last part of your message, so I 
> > made a test to try to clear up my doubts.
> > This is the cluster configuration (some machines are turned off, but that 
> > is not a problem):
> > 
> > [oscar@aguia free-noise]$ qhost
> > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > -------------------------------------------------------------------------------
> > global                  -               -     -       -       -       -       -
> > compute-1-10            linux-x64      16  0.97   23.6G  558.6M  996.2M     0.0
> > compute-1-11            linux-x64      16     -   23.6G       -  996.2M       -
> > compute-1-12            linux-x64      16  0.97   23.6G  561.1M  996.2M     0.0
> > compute-1-13            linux-x64      16  0.99   23.6G  558.7M  996.2M     0.0
> > compute-1-14            linux-x64      16  1.00   23.6G  555.1M  996.2M     0.0
> > compute-1-15            linux-x64      16  0.97   23.6G  555.5M  996.2M     0.0
> > compute-1-16            linux-x64       8  0.00   15.7G  296.9M 1000.0M     0.0
> > compute-1-17            linux-x64       8  0.00   15.7G  299.4M 1000.0M     0.0
> > compute-1-18            linux-x64       8     -   15.7G       - 1000.0M       -
> > compute-1-19            linux-x64       8     -   15.7G       -  996.2M       -
> > compute-1-2             linux-x64      16  1.19   23.6G  468.1M 1000.0M     0.0
> > compute-1-20            linux-x64       8  0.04   15.7G  297.2M 1000.0M     0.0
> > compute-1-21            linux-x64       8     -   15.7G       - 1000.0M       -
> > compute-1-22            linux-x64       8  0.00   15.7G  297.2M 1000.0M     0.0
> > compute-1-23            linux-x64       8  0.16   15.7G  299.6M 1000.0M     0.0
> > compute-1-24            linux-x64       8  0.00   15.7G  291.5M  996.2M     0.0
> > compute-1-25            linux-x64       8  0.04   15.7G  293.4M  996.2M     0.0
> > compute-1-26            linux-x64       8     -   15.7G       - 1000.0M       -
> > compute-1-27            linux-x64       8  0.00   15.7G  297.0M 1000.0M     0.0
> > compute-1-29            linux-x64       8     -   15.7G       - 1000.0M       -
> > compute-1-3             linux-x64      16     -   23.6G       -  996.2M       -
> > compute-1-30            linux-x64      16     -   23.6G       -  996.2M       -
> > compute-1-4             linux-x64      16  0.97   23.6G  571.6M  996.2M     0.0
> > compute-1-5             linux-x64      16  1.00   23.6G  559.6M  996.2M     0.0
> > compute-1-6             linux-x64      16  0.66   23.6G  403.1M  996.2M     0.0
> > compute-1-7             linux-x64      16  0.95   23.6G  402.7M  996.2M     0.0
> > compute-1-8             linux-x64      16  0.97   23.6G  556.8M  996.2M     0.0
> > compute-1-9             linux-x64      16  1.02   23.6G  566.0M 1000.0M     0.0
> > 
> > I ran my program using only MPI with 10 processes in the queue one.q, which 
> > has 14 machines (compute-1-2 to compute-1-15). With 'qstat -t' I got:
> > 
> > [oscar@aguia free-noise]$ qstat -t
> > job-ID  prior    name  user   state  submit/start at      queue                     master  ja-task-ID  task-ID         state  cpu       mem        io       stat  failed
> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-2.local   MASTER                              r      00:49:12  554.13753  0.09163
> >                                                            one.q@compute-1-2.local   SLAVE
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-5.local   SLAVE               1.compute-1-5   r      00:48:53  551.49022  0.09410
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-9.local   SLAVE               1.compute-1-9   r      00:50:00  564.22764  0.09409
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-12.local  SLAVE               1.compute-1-12  r      00:47:30  535.30379  0.09379
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-13.local  SLAVE               1.compute-1-13  r      00:49:51  561.69868  0.09379
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-14.local  SLAVE               1.compute-1-14  r      00:49:14  554.60818  0.09379
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-10.local  SLAVE               1.compute-1-10  r      00:49:59  562.95487  0.09349
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-15.local  SLAVE               1.compute-1-15  r      00:50:01  563.27221  0.09361
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-8.local   SLAVE               1.compute-1-8   r      00:49:26  556.68431  0.09349
> >    2726  0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-4.local   SLAVE               1.compute-1-4   r      00:49:27  556.87510  0.04967
> 
> Yes, here you got 10 slots (= cores) granted by SGE. So there is no free core 
> left inside the allocation of SGE to allow the use of additional cores for 
> your threads. If you use more cores than granted by SGE, it will 
> oversubscribe the machines.
> 
> The issue is now:
> 
> a) If you want 8 threads per MPI process, your job will use 80 cores in total 
> - for now SGE isn't aware of it.
> 
> b) Although you specified $fill_up as allocation rule, it looks like 
> $round_robin. Is there more than one slot defined in the queue definition of 
> one.q to get exclusive access?
> 
> c) What version of SGE are you using? Certain ones use cgroups or bind 
> processes directly to cores (although it usually needs to be requested by the 
> job: first line of `qconf -help`).
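> 
> (Regarding b) above: a quick way to check how many slots per host are defined 
> in one.q is the standard queue query, e.g.:
> 
> $ qconf -sq one.q | grep slots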
> 
> 
> In case you are alone in the cluster, you could bypass the allocation with b) 
> (unless you are hit by c)). But with a mixture of users and jobs, a different 
> handling would be necessary to do this in a proper way IMO:
> 
> a) having a PE with a fixed allocation rule of 8
> 
> b) requesting this PE with an overall slot count of 80
> 
> c) copy and alter the $PE_HOSTFILE to show only (granted core count per 
> machine) divided by (OMP_NUM_THREADS) per entry, change $PE_HOSTFILE so that 
> it points to the altered file
> 
> d) Open MPI with a Tight Integration will now start only N processes per 
> machine according to the altered hostfile, in your case one
> 
> e) Your application can start the desired threads and you stay inside the 
> granted allocation
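> 
> As an illustration of c) (a sketch only - the exact columns of the 
> $PE_HOSTFILE beyond the host name and slot count may differ on your 
> installation), a granted entry like
> 
>   compute-1-2.local 8 one.q@compute-1-2.local UNDEFINED
> 
> would be rewritten for OMP_NUM_THREADS=8 to
> 
>   compute-1-2.local 1 one.q@compute-1-2.local UNDEFINED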
> 
> -- Reuti
> 
> 
> > I accessed the MASTER node with 'ssh compute-1-2.local', ran '$ ps -e f', 
> > and got the following (showing only the last lines):
> > 
> >  2506 ?        Ss     0:00 /usr/sbin/atd
> >  2548 tty1     Ss+    0:00 /sbin/mingetty /dev/tty1
> >  2550 tty2     Ss+    0:00 /sbin/mingetty /dev/tty2
> >  2552 tty3     Ss+    0:00 /sbin/mingetty /dev/tty3
> >  2554 tty4     Ss+    0:00 /sbin/mingetty /dev/tty4
> >  2556 tty5     Ss+    0:00 /sbin/mingetty /dev/tty5
> >  2558 tty6     Ss+    0:00 /sbin/mingetty /dev/tty6
> >  3325 ?        Sl     0:04 /opt/gridengine/bin/linux-x64/sge_execd
> > 17688 ?        S      0:00  \_ sge_shepherd-2726 -bg
> > 17695 ?        Ss     0:00      \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
> > 17797 ?        S      0:00          \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> > 17798 ?        S      0:01              \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> > 17799 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local  PATH=/opt/openmpi/bin:$PATH ; expo
> > 17800 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local  PATH=/opt/openmpi/bin:$PATH ; expo
> > 17801 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local  PATH=/opt/openmpi/bin:$PATH ; exp
> > 17802 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local  PATH=/opt/openmpi/bin:$PATH ; exp
> > 17803 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local  PATH=/opt/openmpi/bin:$PATH ; exp
> > 17804 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local  PATH=/opt/openmpi/bin:$PATH ; exp
> > 17805 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local  PATH=/opt/openmpi/bin:$PATH ; exp
> > 17806 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local  PATH=/opt/openmpi/bin:$PATH ; expo
> > 17807 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local  PATH=/opt/openmpi/bin:$PATH ; expo
> > 17826 ?        R     31:36                  \_ ./inverse.exe
> >  3429 ?        Ssl    0:00 automount --pid-file /var/run/autofs.pid 
> > 
> > So the job is using the 10 machines; up to here everything is OK. Do you 
> > think that changing the "allocation_rule" to a number instead of $fill_up 
> > would make the MPI processes divide the work into that number of threads?
> > 
> > Thanks a lot 
> > 
> > Oscar Fabian Mojica Ladino
> > Geologist M.S. in  Geophysics
> > 
> > 
> > PS: I have another doubt: what is a slot? Is it a physical core?
> > 
> > 
> > > From: re...@staff.uni-marburg.de
> > > Date: Thu, 14 Aug 2014 23:54:22 +0200
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > > 
> > > Hi,
> > > 
> > > I think this is a broader issue in case an MPI library is used in 
> > > conjunction with threads while running inside a queuing system. First: 
> > > whether your actual installation of Open MPI is SGE-aware you can check 
> > > with:
> > > 
> > > $ ompi_info | grep grid
> > > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
> > > 
> > > Then we can look at the definition of your PE: "allocation_rule 
> > > $fill_up". This means that SGE will grant you 14 slots in total in any 
> > > combination on the available machines, meaning an 8+4+2 slot allocation 
> > > is allowed just like 4+4+3+3 and so on. Depending on the SGE-awareness 
> > > it's a question: will your application just start processes on all nodes 
> > > and completely disregard the granted allocation, or, as the other 
> > > extreme, does it stay on one and the same machine for all started 
> > > processes? On the master node of the parallel job you can issue:
> > > 
> > > $ ps -e f
> > > 
> > > (f w/o -) to have a look whether `ssh` or `qrsh -inherit ...` is used to 
> > > reach the other machines and their requested process count.
> > > 
> > > 
> > > Now to the common problem in such a set up:
> > > 
> > > AFAICS: for now there is no way in the Open MPI + SGE combination to 
> > > specify the number of MPI processes and intended number of threads which 
> > > are automatically read by Open MPI while staying inside the granted slot 
> > > count and allocation. So it seems to be necessary to have the intended 
> > > number of threads being honored by Open MPI too.
> > > 
> > > Hence specifying e.g. "allocation_rule 8" in such a setup while 
> > > requesting 32 processes would for now start 32 MPI processes already, 
> > > as Open MPI reads the $PE_HOSTFILE and acts accordingly.
> > > 
> > > Open MPI would have to read the generated machine file in a slightly 
> > > different way regarding threads: a) read the $PE_HOSTFILE, b) divide the 
> > > granted slots per machine by OMP_NUM_THREADS, c) throw an error in case 
> > > it's not divisible by OMP_NUM_THREADS. Then start one process per 
> > > quotient.
> > > 
> > > Would this work for you?
> > > 
> > > -- Reuti
> > > 
> > > PS: This would also mean to have a couple of PEs in SGE having a fixed 
> > > "allocation_rule". While this works right now, an extension in SGE could 
> > > be "$fill_up_omp"/"$round_robin_omp" and using OMP_NUM_THREADS there too, 
> > > hence it must not be specified as an `export` in the job script but 
> > > either on the command line or inside the job script in #$ lines as job 
> > > requests. This would mean to collect slots in bunches of OMP_NUM_THREADS 
> > > on each machine to reach the overall specified slot count. Whether 
> > > OMP_NUM_THREADS or n times OMP_NUM_THREADS is allowed per machine needs 
> > > to be discussed.
> > > 
> > > PS2: As Univa SGE can also supply a list of granted cores in the 
> > > $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow 
> > > any UGE aware binding.
> > > 
> > > 
> > > On 14.08.2014 at 21:52, Oscar Mojica wrote:
> > > 
> > > > Guys
> > > > 
> > > > I changed the line to run the program in the script with both options
> > > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
> > > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
> > > > 
> > > > but I got the same results. When I use 'man mpirun' it shows:
> > > > 
> > > > -bind-to-none, --bind-to-none
> > > > Do not bind processes. (Default.)
> > > > 
> > > > and the output of 'qconf -sp orte' is
> > > > 
> > > > pe_name orte
> > > > slots 9999
> > > > user_lists NONE
> > > > xuser_lists NONE
> > > > start_proc_args /bin/true
> > > > stop_proc_args /bin/true
> > > > allocation_rule $fill_up
> > > > control_slaves TRUE
> > > > job_is_first_task FALSE
> > > > urgency_slots min
> > > > accounting_summary TRUE
> > > > 
> > > > I don't know if the installed Open MPI was compiled with '--with-sge'. 
> > > > How can I know that?
> > > > Before thinking of a hybrid application I was using only MPI, and the 
> > > > program used few processors (14). The cluster has 28 machines, 15 with 
> > > > 16 cores and 13 with 8 cores, totaling 344 processing units. When I 
> > > > submitted the MPI-only job, the MPI processes were spread across the 
> > > > cores directly; for that reason I created a new queue with 14 machines, 
> > > > trying to reduce the run time. The results were the same in both cases. 
> > > > In the latter case I could verify that the processes were distributed 
> > > > to all machines correctly.
> > > > 
> > > > What should I do?
> > > > Thanks 
> > > > 
> > > > Oscar Fabian Mojica Ladino
> > > > Geologist M.S. in Geophysics
> > > > 
> > > > 
> > > > > Date: Thu, 14 Aug 2014 10:10:17 -0400
> > > > > From: maxime.boissonnea...@calculquebec.ca
> > > > > To: us...@open-mpi.org
> > > > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > > > > 
> > > > > Hi,
> > > > > You DEFINITELY need to disable OpenMPI's new default binding. 
> > > > > Otherwise, 
> > > > > your N threads will run on a single core. --bind-to socket would be 
> > > > > my 
> > > > > recommendation for hybrid jobs.
> > > > > 
> > > > > Maxime
> > > > > 
> > > > > 
> > > > > On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
> > > > > > I don't know much about OpenMP, but do you need to disable Open 
> > > > > > MPI's default bind-to-core functionality (I'm assuming you're using 
> > > > > > Open MPI 1.8.x)?
> > > > > >
> > > > > > You can try "mpirun --bind-to none ...", which will have Open MPI 
> > > > > > not bind MPI processes to cores, which might allow OpenMP to think 
> > > > > > that it can use all the cores, and therefore it will spawn 
> > > > > > num_cores threads...?
> > > > > >
> > > > > >
> > > > > > On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> 
> > > > > > wrote:
> > > > > >
> > > > > >> Hello everybody
> > > > > >>
> > > > > >> I am trying to run a hybrid MPI + OpenMP program in a cluster. I 
> > > > > >> created a queue with 14 machines, each one with 16 cores. The 
> > > > > >> program divides the work among 14 processes with MPI, and within 
> > > > > >> each process a loop is further divided into, for example, 8 
> > > > > >> threads using OpenMP. The problem is that when I submit the job 
> > > > > >> to the queue, the MPI processes don't divide the work into threads 
> > > > > >> and the program reports the number of threads working within each 
> > > > > >> process as one.
> > > > > >>
> > > > > >> I made a simple test program that uses OpenMP and logged in to 
> > > > > >> one of the fourteen machines. I compiled it using gfortran 
> > > > > >> -fopenmp program.f -o exe, set the OMP_NUM_THREADS environment 
> > > > > >> variable to 8, and when I ran it directly in the terminal the 
> > > > > >> loop was effectively divided among the cores; in this case the 
> > > > > >> program printed the number of threads as 8.
> > > > > >>
> > > > > >> This is my Makefile
> > > > > >> 
> > > > > >> # Start of the makefile
> > > > > >> # Defining variables
> > > > > >> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> > > > > >> #f90comp = /opt/openmpi/bin/mpif90
> > > > > >> f90comp = /usr/bin/mpif90
> > > > > >> #switch = -O3
> > > > > >> executable = inverse.exe
> > > > > >> # Makefile
> > > > > >> all : $(executable)
> > > > > >> $(executable) : $(objects) 
> > > > > >> $(f90comp) -fopenmp -g -O -o $(executable) $(objects)
> > > > > >> rm $(objects)
> > > > > >> %.o: %.f
> > > > > >> $(f90comp) -c $<
> > > > > >> # Cleaning everything
> > > > > >> clean:
> > > > > >> rm $(executable)
> > > > > >> #  rm $(objects)
> > > > > >> # End of the makefile
> > > > > >>
> > > > > >> and the script that I am using is
> > > > > >>
> > > > > >> #!/bin/bash
> > > > > >> #$ -cwd
> > > > > >> #$ -j y
> > > > > >> #$ -S /bin/bash
> > > > > >> #$ -pe orte 14
> > > > > >> #$ -N job
> > > > > >> #$ -q new.q
> > > > > >>
> > > > > >> export OMP_NUM_THREADS=8
> > > > > >> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS 
> > > > > >> ./inverse.exe
> > > > > >>
> > > > > >> am I forgetting something?
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> Oscar Fabian Mojica Ladino
> > > > > >> Geologist M.S. in Geophysics
> > > > > >
> > > > > 
> > > > > 
> > > > > -- 
> > > > > ---------------------------------
> > > > > Maxime Boissonneault
> > > > > Computing analyst - Calcul Québec, Université Laval
> > > > > Ph.D. in Physics
> > > > > 
> > > 
> 