Hi Reuti

Yes, my installation of Open MPI is SGE-aware. I got the following:

[oscar@compute-1-2 ~]$ ompi_info | grep grid
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)

I'm a bit slow and I didn't understand the last part of your message, so I ran a test to clear up my doubts.

This is the cluster configuration (some machines are turned off, but that is not a problem):

[oscar@aguia free-noise]$ qhost
HOSTNAME                ARCH         NCPU  LOAD   MEMTOT   MEMUSE   SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -        -        -        -      -
compute-1-10            linux-x64      16  0.97    23.6G   558.6M   996.2M    0.0
compute-1-11            linux-x64      16     -    23.6G        -   996.2M      -
compute-1-12            linux-x64      16  0.97    23.6G   561.1M   996.2M    0.0
compute-1-13            linux-x64      16  0.99    23.6G   558.7M   996.2M    0.0
compute-1-14            linux-x64      16  1.00    23.6G   555.1M   996.2M    0.0
compute-1-15            linux-x64      16  0.97    23.6G   555.5M   996.2M    0.0
compute-1-16            linux-x64       8  0.00    15.7G   296.9M  1000.0M    0.0
compute-1-17            linux-x64       8  0.00    15.7G   299.4M  1000.0M    0.0
compute-1-18            linux-x64       8     -    15.7G        -  1000.0M      -
compute-1-19            linux-x64       8     -    15.7G        -   996.2M      -
compute-1-2             linux-x64      16  1.19    23.6G   468.1M  1000.0M    0.0
compute-1-20            linux-x64       8  0.04    15.7G   297.2M  1000.0M    0.0
compute-1-21            linux-x64       8     -    15.7G        -  1000.0M      -
compute-1-22            linux-x64       8  0.00    15.7G   297.2M  1000.0M    0.0
compute-1-23            linux-x64       8  0.16    15.7G   299.6M  1000.0M    0.0
compute-1-24            linux-x64       8  0.00    15.7G   291.5M   996.2M    0.0
compute-1-25            linux-x64       8  0.04    15.7G   293.4M   996.2M    0.0
compute-1-26            linux-x64       8     -    15.7G        -  1000.0M      -
compute-1-27            linux-x64       8  0.00    15.7G   297.0M  1000.0M    0.0
compute-1-29            linux-x64       8     -    15.7G        -  1000.0M      -
compute-1-3             linux-x64      16     -    23.6G        -   996.2M      -
compute-1-30            linux-x64      16     -    23.6G        -   996.2M      -
compute-1-4             linux-x64      16  0.97    23.6G   571.6M   996.2M    0.0
compute-1-5             linux-x64      16  1.00    23.6G   559.6M   996.2M    0.0
compute-1-6             linux-x64      16  0.66    23.6G   403.1M   996.2M    0.0
compute-1-7             linux-x64      16  0.95    23.6G   402.7M   996.2M    0.0
compute-1-8             linux-x64      16  0.97    23.6G   556.8M   996.2M    0.0
compute-1-9             linux-x64      16  1.02    23.6G   566.0M  1000.0M    0.0

I ran my program using only MPI with 10 processes on the queue one.q, which has 14 machines (compute-1-2 to compute-1-15).
With 'qstat -t' I got:

[oscar@aguia free-noise]$ qstat -t
job-ID  prior    name  user   state  submit/start at      queue                      master  ja-task-ID  task-ID         state  cpu       mem        io       stat failed
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-2.local    MASTER                              r      00:49:12  554.13753  0.09163
                                                          one.q@compute-1-2.local    SLAVE
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-5.local    SLAVE               1.compute-1-5   r      00:48:53  551.49022  0.09410
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-9.local    SLAVE               1.compute-1-9   r      00:50:00  564.22764  0.09409
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-12.local   SLAVE               1.compute-1-12  r      00:47:30  535.30379  0.09379
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-13.local   SLAVE               1.compute-1-13  r      00:49:51  561.69868  0.09379
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-14.local   SLAVE               1.compute-1-14  r      00:49:14  554.60818  0.09379
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-10.local   SLAVE               1.compute-1-10  r      00:49:59  562.95487  0.09349
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-15.local   SLAVE               1.compute-1-15  r      00:50:01  563.27221  0.09361
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-8.local    SLAVE               1.compute-1-8   r      00:49:26  556.68431  0.09349
   2726 0.50500  job   oscar  r      08/15/2014 12:38:21  one.q@compute-1-4.local    SLAVE               1.compute-1-4   r      00:49:27  556.87510  0.04967

I logged in to the MASTER node with 'ssh compute-1-2.local' and ran 'ps -e f'; I'm showing only the last lines:

 2506 ?        Ss     0:00 /usr/sbin/atd
 2548 tty1     Ss+    0:00 /sbin/mingetty /dev/tty1
 2550 tty2     Ss+    0:00 /sbin/mingetty /dev/tty2
 2552 tty3     Ss+    0:00 /sbin/mingetty /dev/tty3
 2554 tty4     Ss+    0:00 /sbin/mingetty /dev/tty4
 2556 tty5     Ss+    0:00 /sbin/mingetty /dev/tty5
 2558 tty6     Ss+    0:00 /sbin/mingetty /dev/tty6
 3325 ?        Sl     0:04 /opt/gridengine/bin/linux-x64/sge_execd
17688 ?        S      0:00  \_ sge_shepherd-2726 -bg
17695 ?        Ss     0:00      \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
17797 ?        S      0:00          \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
17798 ?        S      0:01              \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
17799 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local PATH=/opt/openmpi/bin:$PATH ; expo
17800 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local PATH=/opt/openmpi/bin:$PATH ; expo
17801 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local PATH=/opt/openmpi/bin:$PATH ; exp
17802 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local PATH=/opt/openmpi/bin:$PATH ; exp
17803 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local PATH=/opt/openmpi/bin:$PATH ; exp
17804 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local PATH=/opt/openmpi/bin:$PATH ; exp
17805 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local PATH=/opt/openmpi/bin:$PATH ; exp
17806 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local PATH=/opt/openmpi/bin:$PATH ; expo
17807 ?        Sl     0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local PATH=/opt/openmpi/bin:$PATH ; expo
17826 ?        R     31:36              \_ ./inverse.exe
 3429 ?        Ssl    0:00 automount --pid-file /var/run/autofs.pid

So the job is using the 10 machines; up to here everything is fine. Do you think that if I change the "allocation_rule" to a fixed number instead of $fill_up, the MPI processes would divide the work among that number of threads? (A sketch of what I have in mind follows below.)

Thanks a lot
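For illustration, this is roughly what I mean (a minimal sketch only: the PE name "orte8", the 32-slot request, and the --npernode flag are assumptions for the example, not my current setup):

# Hypothetical PE, identical to the existing "orte" PE except for:
#   allocation_rule    8        (exactly 8 slots per machine)

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte8 32                 # 32 slots = 4 machines x 8 slots each
#$ -N job
#$ -q one.q

export OMP_NUM_THREADS=8
# Start one MPI rank per machine (32 slots / 8 threads = 4 ranks) instead of
# one rank per slot; each rank then spawns 8 OpenMP threads on its machine.
# --bind-to-none is already the default in Open MPI 1.6, shown for clarity.
/usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none --npernode 1 \
    -np $(( NSLOTS / OMP_NUM_THREADS )) ./inverse.exe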
Oscar Fabian Mojica Ladino
Geologist M.S. in Geophysics

PS: I have another doubt: what is a slot? Is it a physical core?

> From: re...@staff.uni-marburg.de
> Date: Thu, 14 Aug 2014 23:54:22 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>
> Hi,
>
> I think this is a broader issue in case an MPI library is used in conjunction
> with threads while running inside a queuing system. First: whether your
> actual installation of Open MPI is SGE-aware you can check with:
>
> $ ompi_info | grep grid
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
>
> Then we can look at the definition of your PE: "allocation_rule $fill_up".
> This means that SGE will grant you 14 slots in total in any combination on
> the available machines: an 8+4+2 slot allocation is an allowed combination,
> as is 4+4+3+3, and so on. Depending on the SGE-awareness it's a question:
> will your application just start processes on all nodes and completely
> disregard the granted allocation, or, as the other extreme, does it stay on
> one and the same machine for all started processes? On the master node of
> the parallel job you can issue:
>
> $ ps -e f
>
> (f without the leading dash) to have a look at whether `ssh` or
> `qrsh -inherit ...` is used to reach the other machines, and at their
> requested process count.
>
> Now to the common problem in such a setup:
>
> AFAICS: for now there is no way in the Open MPI + SGE combination to specify
> the number of MPI processes and the intended number of threads such that
> both are automatically read by Open MPI while staying inside the granted
> slot count and allocation. So it seems to be necessary to have the intended
> number of threads honored by Open MPI too.
>
> Hence specifying e.g. "allocation_rule 8" in such a setup while requesting
> 32 slots would, for now, start 32 processes by MPI already, as Open MPI
> reads the $PE_HOSTFILE and acts accordingly.
>
> Open MPI would have to read the generated machine file in a slightly
> different way regarding threads: a) read the $PE_HOSTFILE, b) divide the
> granted slots per machine by OMP_NUM_THREADS, c) throw an error in case it's
> not divisible by OMP_NUM_THREADS. Then start one process per quotient.
>
> Would this work for you?
>
> -- Reuti
>
> PS: This would also mean having a couple of PEs in SGE with a fixed
> "allocation_rule". While this works right now, an extension in SGE could be
> "$fill_up_omp"/"$round_robin_omp" using OMP_NUM_THREADS there too; hence it
> would have to be specified not as an `export` in the job script but either
> on the command line or inside the job script in #$ lines as a job request.
> This would mean collecting slots in bunches of OMP_NUM_THREADS on each
> machine to reach the overall specified slot count. Whether OMP_NUM_THREADS
> or n times OMP_NUM_THREADS is allowed per machine needs to be discussed.
>
> PS2: As Univa SGE can also supply a list of granted cores in the
> $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow a
> UGE-aware binding.
>
> On 14.08.2014 at 21:52, Oscar Mojica wrote:
>
> > Guys
> >
> > I changed the line that runs the program in the script to both options:
> >
> > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
> > /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
> >
> > but I got the same results. When I use `man mpirun` it says:
> >
> > -bind-to-none, --bind-to-none
> >     Do not bind processes. (Default.)
> >
> > and the output of 'qconf -sp orte' is:
> >
> > pe_name            orte
> > slots              9999
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $fill_up
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary TRUE
> >
> > I don't know if the installed Open MPI was compiled with '--with-sge'.
> > How can I know that?
> >
> > Before thinking of a hybrid application I was using only MPI, and the
> > program used few processors (14). The cluster has 28 machines, 15 with
> > 16 cores and 13 with 8 cores, totaling 344 processing units. When I
> > submitted the job (MPI only), the MPI processes were spread across the
> > cores directly, so I created a new queue with 14 machines trying to gain
> > more time. The results were the same in both cases. In the last case I
> > could verify that the processes were distributed to all machines
> > correctly.
> >
> > What must I do?
> > Thanks
> >
> > Oscar Fabian Mojica Ladino
> > Geologist M.S. in Geophysics
> >
> > > Date: Thu, 14 Aug 2014 10:10:17 -0400
> > > From: maxime.boissonnea...@calculquebec.ca
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> > >
> > > Hi,
> > > You DEFINITELY need to disable Open MPI's new default binding;
> > > otherwise your N threads will run on a single core. --bind-to socket
> > > would be my recommendation for hybrid jobs.
> > >
> > > Maxime
> > >
> > > On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
> > > > I don't know much about OpenMP, but do you need to disable Open MPI's
> > > > default bind-to-core functionality (I'm assuming you're using Open MPI
> > > > 1.8.x)?
> > > >
> > > > You can try "mpirun --bind-to none ...", which will have Open MPI not
> > > > bind MPI processes to cores, which might allow OpenMP to think that it
> > > > can use all the cores, and therefore it will spawn num_cores threads...?
> > > >
> > > > On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> wrote:
> > > >
> > > >> Hello everybody
> > > >>
> > > >> I am trying to run a hybrid MPI + OpenMP program on a cluster. I
> > > >> created a queue with 14 machines, each with 16 cores. The program
> > > >> divides the work among 14 processes with MPI, and within each
> > > >> process a loop is further divided into 8 threads, for example, using
> > > >> OpenMP. The problem is that when I submit the job to the queue, the
> > > >> MPI processes don't divide the work into threads, and the program
> > > >> prints the number of threads working within each process as one.
> > > >>
> > > >> I made a simple test program that uses OpenMP and logged in to one
> > > >> machine of the fourteen.
> > > >> I compiled it using gfortran -fopenmp program.f -o exe, set the
> > > >> OMP_NUM_THREADS environment variable to 8, and when I ran it
> > > >> directly in the terminal the loop was effectively divided among the
> > > >> cores; in this case the program printed the number of threads equal
> > > >> to 8.
> > > >>
> > > >> This is my Makefile:
> > > >>
> > > >> # Start of the makefile
> > > >> # Defining variables
> > > >> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> > > >> #f90comp = /opt/openmpi/bin/mpif90
> > > >> f90comp = /usr/bin/mpif90
> > > >> #switch = -O3
> > > >> executable = inverse.exe
> > > >> # Makefile
> > > >> all : $(executable)
> > > >> $(executable) : $(objects)
> > > >> 	$(f90comp) -fopenmp -g -O -o $(executable) $(objects)
> > > >> 	rm $(objects)
> > > >> %.o: %.f
> > > >> 	$(f90comp) -c $<
> > > >> # Cleaning everything
> > > >> clean:
> > > >> 	rm $(executable)
> > > >> #	rm $(objects)
> > > >> # End of the makefile
> > > >>
> > > >> and the script that I am using is:
> > > >>
> > > >> #!/bin/bash
> > > >> #$ -cwd
> > > >> #$ -j y
> > > >> #$ -S /bin/bash
> > > >> #$ -pe orte 14
> > > >> #$ -N job
> > > >> #$ -q new.q
> > > >>
> > > >> export OMP_NUM_THREADS=8
> > > >> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe
> > > >>
> > > >> Am I forgetting something?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Oscar Fabian Mojica Ladino
> > > >> Geologist M.S. in Geophysics
> > >
> > > --
> > > ---------------------------------
> > > Maxime Boissonneault
> > > Computational Analyst - Calcul Québec, Université Laval
> > > Ph.D. in Physics
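Reuti's proposed reading of the machine file above (a: read the $PE_HOSTFILE, b: divide each machine's granted slots by OMP_NUM_THREADS, c: error if not divisible) can already be approximated by the job script itself. A minimal sketch in bash, assuming the usual four-column $PE_HOSTFILE format (hostname, slots, queue, processor range); the machinefile name "machines.$JOB_ID" is arbitrary, and whether an SGE-aware mpirun accepts a user-supplied machinefile alongside the granted allocation should be verified for the Open MPI version in use:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 112                 # total slots = ranks x threads; $fill_up decides the layout
#$ -N job
#$ -q one.q

export OMP_NUM_THREADS=8

# b) one MPI rank per OMP_NUM_THREADS granted slots on each host,
# c) abort if a host's slot count is not divisible by the thread count.
awk -v t="$OMP_NUM_THREADS" '
    $2 % t != 0 { print "slots on " $1 " not divisible by " t > "/dev/stderr"; exit 1 }
    { print $1 " slots=" $2 / t }
' "$PE_HOSTFILE" > machines.$JOB_ID || exit 1

/usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -machinefile machines.$JOB_ID \
    -np $(( NSLOTS / OMP_NUM_THREADS )) ./inverse.exe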