Hi,

On 15.08.2014 at 19:56, Oscar Mojica wrote:
> Yes, my installation of Open MPI is SGE-aware. I got the following:
>
> [oscar@compute-1-2 ~]$ ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)

Fine.

> I'm a bit slow and I didn't understand the last part of your message, so I
> made a test to resolve my doubts. This is the cluster configuration (some
> machines are turned off, but that is not a problem):
>
> [oscar@aguia free-noise]$ qhost
> HOSTNAME                ARCH         NCPU  LOAD   MEMTOT   MEMUSE   SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -        -        -        -       -
> compute-1-10            linux-x64      16  0.97    23.6G   558.6M   996.2M     0.0
> compute-1-11            linux-x64      16     -    23.6G        -   996.2M       -
> compute-1-12            linux-x64      16  0.97    23.6G   561.1M   996.2M     0.0
> compute-1-13            linux-x64      16  0.99    23.6G   558.7M   996.2M     0.0
> compute-1-14            linux-x64      16  1.00    23.6G   555.1M   996.2M     0.0
> compute-1-15            linux-x64      16  0.97    23.6G   555.5M   996.2M     0.0
> compute-1-16            linux-x64       8  0.00    15.7G   296.9M  1000.0M     0.0
> compute-1-17            linux-x64       8  0.00    15.7G   299.4M  1000.0M     0.0
> compute-1-18            linux-x64       8     -    15.7G        -  1000.0M       -
> compute-1-19            linux-x64       8     -    15.7G        -   996.2M       -
> compute-1-2             linux-x64      16  1.19    23.6G   468.1M  1000.0M     0.0
> compute-1-20            linux-x64       8  0.04    15.7G   297.2M  1000.0M     0.0
> compute-1-21            linux-x64       8     -    15.7G        -  1000.0M       -
> compute-1-22            linux-x64       8  0.00    15.7G   297.2M  1000.0M     0.0
> compute-1-23            linux-x64       8  0.16    15.7G   299.6M  1000.0M     0.0
> compute-1-24            linux-x64       8  0.00    15.7G   291.5M   996.2M     0.0
> compute-1-25            linux-x64       8  0.04    15.7G   293.4M   996.2M     0.0
> compute-1-26            linux-x64       8     -    15.7G        -  1000.0M       -
> compute-1-27            linux-x64       8  0.00    15.7G   297.0M  1000.0M     0.0
> compute-1-29            linux-x64       8     -    15.7G        -  1000.0M       -
> compute-1-3             linux-x64      16     -    23.6G        -   996.2M       -
> compute-1-30            linux-x64      16     -    23.6G        -   996.2M       -
> compute-1-4             linux-x64      16  0.97    23.6G   571.6M   996.2M     0.0
> compute-1-5             linux-x64      16  1.00    23.6G   559.6M   996.2M     0.0
> compute-1-6             linux-x64      16  0.66    23.6G   403.1M   996.2M     0.0
> compute-1-7             linux-x64      16  0.95    23.6G   402.7M   996.2M     0.0
> compute-1-8             linux-x64      16  0.97    23.6G   556.8M   996.2M     0.0
> compute-1-9             linux-x64      16  1.02    23.6G   566.0M  1000.0M     0.0
>
> I ran my program using only MPI, with 10 processes on the queue one.q, which
> has 14 machines (compute-1-2 to compute-1-15).
> With 'qstat -t' I got:
>
> [oscar@aguia free-noise]$ qstat -t
> job-ID  prior    name  user   state  submit/start at      queue                     master  ja-task-ID      state  cpu       mem        io
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-2.local   MASTER                  r      00:49:12  554.13753  0.09163
>                                                           one.q@compute-1-2.local   SLAVE
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-5.local   SLAVE   1.compute-1-5   r      00:48:53  551.49022  0.09410
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-9.local   SLAVE   1.compute-1-9   r      00:50:00  564.22764  0.09409
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-12.local  SLAVE   1.compute-1-12  r      00:47:30  535.30379  0.09379
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-13.local  SLAVE   1.compute-1-13  r      00:49:51  561.69868  0.09379
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-14.local  SLAVE   1.compute-1-14  r      00:49:14  554.60818  0.09379
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-10.local  SLAVE   1.compute-1-10  r      00:49:59  562.95487  0.09349
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-15.local  SLAVE   1.compute-1-15  r      00:50:01  563.27221  0.09361
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-8.local   SLAVE   1.compute-1-8   r      00:49:26  556.68431  0.09349
>    2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-4.local   SLAVE   1.compute-1-4   r      00:49:27  556.87510  0.04967

Yes, here you got 10 slots (= cores) granted by SGE. So there is no free core left inside SGE's allocation to allow the use of additional cores for your threads; if you use more cores than granted by SGE, you will oversubscribe the machines.

The issues now are:

a) If you want 8 threads per MPI process, your job will use 80 cores in total - for now SGE isn't aware of this.

b) Although you specified $fill_up as the allocation rule, the result looks like $round_robin. Is more than one slot defined per host in the queue definition of one.q, so that you get exclusive access?

c) What version of SGE are you using? Certain versions use cgroups or bind processes directly to cores (although this usually needs to be requested by the job; the first line of `qconf -help` shows the version).

In case you are alone in the cluster, you could bypass the allocation with b) (unless you are hit by c)). But with a mixture of users and jobs, a different handling would be necessary to do this in a proper way IMO:

a) have a PE with a fixed allocation rule of 8

b) request this PE with an overall slot count of 80

c) copy and alter the $PE_HOSTFILE so that each entry shows only (granted core count per machine) divided by OMP_NUM_THREADS, and change $PE_HOSTFILE so that it points to the altered file (see the sketch below)

d) Open MPI with a tight integration will then start only N processes per machine according to the altered hostfile - in your case one

e) your application can start the desired threads, and you stay inside the granted allocation
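As a minimal, untested sketch of c) and d): the rewrite could be done inside the job script before mpirun is called. It assumes the usual $PE_HOSTFILE format of "hostname slots queue processor-range" per line, and that every host was granted a multiple of OMP_NUM_THREADS slots; the file name pe_hostfile_omp is made up.

    export OMP_NUM_THREADS=8

    # Shrink the granted slot count per host so that the tight integration
    # starts one MPI process per OMP_NUM_THREADS granted slots.
    # Assumes each host got a multiple of OMP_NUM_THREADS slots (allocation_rule 8).
    # $TMPDIR is created per job by SGE; pe_hostfile_omp is an arbitrary name.
    awk -v nt=$OMP_NUM_THREADS '{ $2 = int($2 / nt); if ($2 > 0) print }' \
        "$PE_HOSTFILE" > "$TMPDIR/pe_hostfile_omp"

    # Let Open MPI's gridengine component read the reduced slot counts.
    export PE_HOSTFILE="$TMPDIR/pe_hostfile_omp"

With "allocation_rule 8" and a request of 80 slots this should leave one slot per machine in the altered file, so mpirun without an explicit -np (or with -np 10) would start exactly one process per node.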
-- Reuti

> I accessed the MASTER node with 'ssh compute-1-2.local', ran '$ ps -e f', and got this (showing only the last lines):
>
>  2506 ?     Ss    0:00 /usr/sbin/atd
>  2548 tty1  Ss+   0:00 /sbin/mingetty /dev/tty1
>  2550 tty2  Ss+   0:00 /sbin/mingetty /dev/tty2
>  2552 tty3  Ss+   0:00 /sbin/mingetty /dev/tty3
>  2554 tty4  Ss+   0:00 /sbin/mingetty /dev/tty4
>  2556 tty5  Ss+   0:00 /sbin/mingetty /dev/tty5
>  2558 tty6  Ss+   0:00 /sbin/mingetty /dev/tty6
>  3325 ?     Sl    0:04 /opt/gridengine/bin/linux-x64/sge_execd
> 17688 ?     S     0:00  \_ sge_shepherd-2726 -bg
> 17695 ?     Ss    0:00      \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
> 17797 ?     S     0:00          \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> 17798 ?     S     0:01              \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> 17799 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local PATH=/opt/openmpi/bin:$PATH ; expo
> 17800 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local PATH=/opt/openmpi/bin:$PATH ; expo
> 17801 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local PATH=/opt/openmpi/bin:$PATH ; exp
> 17802 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local PATH=/opt/openmpi/bin:$PATH ; exp
> 17803 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local PATH=/opt/openmpi/bin:$PATH ; exp
> 17804 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local PATH=/opt/openmpi/bin:$PATH ; exp
> 17805 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local PATH=/opt/openmpi/bin:$PATH ; exp
> 17806 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local PATH=/opt/openmpi/bin:$PATH ; expo
> 17807 ?     Sl    0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local PATH=/opt/openmpi/bin:$PATH ; expo
> 17826 ?     R    31:36              \_ ./inverse.exe
>  3429 ?     Ssl   0:00 automount --pid-file /var/run/autofs.pid
>
> So the job is using the 10 machines. Up to here everything is all right.

OK.

> Do you think that, if I change the "allocation_rule" to a number instead of $fill_up, the MPI processes would divide the work into that number of threads?
>
> Thanks a lot
>
> Oscar Fabian Mojica Ladino
> Geologist M.S. in Geophysics
>
> PS: I have another doubt: what is a slot? Is it a physical core?
>
>
> > From: re...@staff.uni-marburg.de
> > Date: Thu, 14 Aug 2014 23:54:22 +0200
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> >
> > Hi,
> >
> > I think this is a broader issue whenever an MPI library is used in conjunction with threads while running inside a queuing system. First: whether your actual installation of Open MPI is SGE-aware you can check with:
> >
> > $ ompi_info | grep grid
> > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
> >
> > Then we can look at the definition of your PE: "allocation_rule $fill_up". This means that SGE will grant you 14 slots in total in any combination on the available machines, i.e. an 8+4+2 slot allocation is an allowed combination, as is 4+4+3+3 and so on. Depending on the SGE-awareness it's a question: will your application just start processes on all nodes and completely disregard the granted allocation, or, as the other extreme, does it stay on one and the same machine for all started processes? On the master node of the parallel job you can issue:
> >
> > $ ps -e f
> >
> > (f without a leading -) to have a look whether `ssh` or `qrsh -inherit ...` is used to reach the other machines with their requested process count.
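To keep these checks in one place (all commands as used above in this thread; "orte" is the PE name from Oscar's setup):

    ompi_info | grep grid        # is the gridengine component built in, i.e. SGE-aware?
    qconf -sp orte               # PE definition: allocation_rule, control_slaves, ...
    cat "$PE_HOSTFILE"           # inside a running job script: hosts and slots granted by SGE
    ps -e f                      # on the master node: mpirun should have "qrsh -inherit" children, not ssh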
> >
> > Now to the common problem in such a setup:
> >
> > AFAICS there is for now no way in the Open MPI + SGE combination to specify the number of MPI processes and the intended number of threads such that they are automatically read by Open MPI while staying inside the granted slot count and allocation. So it seems to be necessary to have the intended number of threads honored by Open MPI too.
> >
> > Hence specifying e.g. "allocation_rule 8" in such a setup while requesting 32 processes would for now already start 32 MPI processes, as Open MPI reads the $PE_HOSTFILE and acts accordingly.
> >
> > Open MPI would have to read the generated machine file in a slightly different way regarding threads: a) read the $PE_HOSTFILE, b) divide the granted slots per machine by OMP_NUM_THREADS, c) throw an error in case it's not divisible by OMP_NUM_THREADS. Then start one process per quotient.
> >
> > Would this work for you?
> >
> > -- Reuti
> >
> > PS: This would also mean having a couple of PEs in SGE with a fixed "allocation_rule". While this works right now, an extension in SGE could be "$fill_up_omp"/"$round_robin_omp" which uses OMP_NUM_THREADS too; it must then not be specified as an `export` in the job script, but either on the command line or inside the job script in #$ lines as a job request. This would mean collecting slots in bunches of OMP_NUM_THREADS on each machine to reach the overall specified slot count. Whether OMP_NUM_THREADS or n times OMP_NUM_THREADS is allowed per machine would need to be discussed.
> >
> > PS2: As Univa SGE can also supply a list of granted cores in the $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow a UGE-aware binding.
> >
> >
> > On 14.08.2014 at 21:52, Oscar Mojica wrote:
> > >
> > > Guys
> > >
> > > I changed the line that runs the program in the script to each of these two options:
> > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
> > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
> > >
> > > but I got the same results. When I use man mpirun it shows:
> > >
> > > -bind-to-none, --bind-to-none
> > > Do not bind processes. (Default.)
> > >
> > > and the output of 'qconf -sp orte' is:
> > >
> > > pe_name            orte
> > > slots              9999
> > > user_lists         NONE
> > > xuser_lists        NONE
> > > start_proc_args    /bin/true
> > > stop_proc_args     /bin/true
> > > allocation_rule    $fill_up
> > > control_slaves     TRUE
> > > job_is_first_task  FALSE
> > > urgency_slots      min
> > > accounting_summary TRUE
> > >
> > > I don't know if the installed Open MPI was compiled with '--with-sge'. How can I know that?
> > > Before thinking about a hybrid application I was using only MPI, and the program used few processes (14). The cluster has 28 machines, 15 with 16 cores and 13 with 8 cores, totaling 344 processing units. When I submitted the job (MPI only), the MPI processes were spread over the cores directly; for that reason I created a new queue with 14 machines, trying to gain more time. The results were the same in both cases. In the last case I could verify that the processes were distributed to all machines correctly.
> > >
> > > What must I do?
> > > Thanks
> > >
> > > Oscar Fabian Mojica Ladino
> > > Geologist M.S. in Geophysics
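For the fixed allocation rule mentioned above, such a PE could for example be cloned from the "orte" PE just quoted. Sketch only: "orte8" is a made-up name, and qconf -Ap / -aattr need manager privileges:

    # Derive a PE with a fixed allocation rule of 8 from the existing "orte" PE.
    qconf -sp orte \
      | sed -e 's/^pe_name .*/pe_name            orte8/' \
            -e 's/^allocation_rule .*/allocation_rule    8/' > /tmp/orte8
    qconf -Ap /tmp/orte8                      # add the new PE from the file
    qconf -aattr queue pe_list orte8 one.q    # attach it to the queue

A job would then request it with "#$ -pe orte8 80" to get 10 machines with 8 slots each.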
> > > >
> > > > Date: Thu, 14 Aug 2014 10:10:17 -0400
> > > > From: maxime.boissonnea...@calculquebec.ca
> > > > To: us...@open-mpi.org
> > > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > > >
> > > > Hi,
> > > > You DEFINITELY need to disable Open MPI's new default binding. Otherwise, your N threads will run on a single core. --bind-to socket would be my recommendation for hybrid jobs.
> > > >
> > > > Maxime
> > > >
> > > >
> > > > On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
> > > > > I don't know much about OpenMP, but do you need to disable Open MPI's default bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)?
> > > > >
> > > > > You can try "mpirun --bind-to none ...", which will have Open MPI not bind MPI processes to cores, which might allow OpenMP to think that it can use all the cores, and therefore it will spawn num_cores threads...?
> > > > >
> > > > >
> > > > > On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> wrote:
> > > > >
> > > > > > Hello everybody
> > > > > >
> > > > > > I am trying to run a hybrid MPI + OpenMP program in a cluster. I created a queue with 14 machines, each one with 16 cores. The program divides the work among 14 processes with MPI, and within each process a loop is further divided into 8 threads, for example, using OpenMP. The problem is that when I submit the job to the queue, the MPI processes don't divide the work into threads, and the program reports the number of threads working within each process as one.
> > > > > >
> > > > > > I made a simple test program that uses OpenMP and logged in to one machine of the fourteen. I compiled it with gfortran -fopenmp program.f -o exe, set the OMP_NUM_THREADS environment variable to 8, and when I ran it directly in the terminal the loop was effectively divided among the cores, and in this case the program printed the number of threads as 8.
> > > > > >
> > > > > > This is my Makefile:
> > > > > >
> > > > > > # Start of the makefile
> > > > > > # Defining variables
> > > > > > objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> > > > > > #f90comp = /opt/openmpi/bin/mpif90
> > > > > > f90comp = /usr/bin/mpif90
> > > > > > #switch = -O3
> > > > > > executable = inverse.exe
> > > > > > # Makefile
> > > > > > all : $(executable)
> > > > > > $(executable) : $(objects)
> > > > > > 	$(f90comp) -fopenmp -g -O -o $(executable) $(objects)
> > > > > > 	rm $(objects)
> > > > > > %.o: %.f
> > > > > > 	$(f90comp) -c $<
> > > > > > # Cleaning everything
> > > > > > clean:
> > > > > > 	rm $(executable)
> > > > > > #	rm $(objects)
> > > > > > # End of the makefile
> > > > > >
> > > > > > and the script that I am using is:
> > > > > >
> > > > > > #!/bin/bash
> > > > > > #$ -cwd
> > > > > > #$ -j y
> > > > > > #$ -S /bin/bash
> > > > > > #$ -pe orte 14
> > > > > > #$ -N job
> > > > > > #$ -q new.q
> > > > > >
> > > > > > export OMP_NUM_THREADS=8
> > > > > > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe
> > > > > >
> > > > > > Am I forgetting something?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Oscar Fabian Mojica Ladino
> > > > > > Geologist M.S. in Geophysics
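Putting the pieces of this thread together, the submission script could then look roughly like this. It is a sketch only: it assumes the made-up "orte8" PE from above, the hostfile rewrite sketched earlier, and one.q as the target queue; with Open MPI 1.6.x no binding is applied by default, while with 1.8.x "--bind-to none" (or socket) would be added as Jeff and Maxime suggested:

    #!/bin/bash
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #$ -pe orte8 80        # hypothetical PE with allocation_rule 8: 10 machines x 8 slots
    #$ -N job
    #$ -q one.q

    export OMP_NUM_THREADS=8

    # Reduce the slot count per host so that only one MPI process is started per machine.
    awk -v nt=$OMP_NUM_THREADS '{ $2 = int($2 / nt); if ($2 > 0) print }' \
        "$PE_HOSTFILE" > "$TMPDIR/pe_hostfile_omp"
    export PE_HOSTFILE="$TMPDIR/pe_hostfile_omp"

    # 80 granted slots / 8 threads = 10 MPI processes, each free to run 8 OpenMP threads.
    /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $((NSLOTS / OMP_NUM_THREADS)) ./inverse.exe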
> > > >
> > > > --
> > > > ---------------------------------
> > > > Maxime Boissonneault
> > > > Computing analyst - Calcul Québec, Université Laval
> > > > Ph.D. in physics