Reuti,

Sorry for the confusion. Under a managed condition the -np option is actually not necessary, so this command line also works for me with Torque:
$ qsub -l nodes=10:ppn=N
$ mpirun -map-by slot:pe=N ./inverse.exe

At least, Ralph confirmed it works with Slurm, and I confirmed it with Torque, as shown below:

[mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
qsub: waiting for job 8798.manage.cluster to start
qsub: job 8798.manage.cluster ready

[mishima@node09 ~]$ cat $PBS_NODEFILE
node09
node09
node09
node09
node09
node09
node09
node09
node10
node10
node10
node10
node10
node10
node10
node10
node11
node11
node11
node11
node11
node11
node11
node11
node12
node12
node12
node12
node12
node12
node12
node12

[mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map ~/mis/openmpi/demos/myprog
 Data for JOB [8050,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 0

 Data for node: node10  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 1

 Data for node: node11  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 2

 Data for node: node12  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 3

 =============================================================
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4

[mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map ~/mis/openmpi/demos/myprog
 Data for JOB [8056,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8  Max slots: 0  Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 0
        Process OMPI jobid: [8056,1] App: 0 Process rank: 1

 Data for node: node10  Num slots: 8  Max slots: 0  Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 3

 Data for node: node11  Num slots: 8  Max slots: 0  Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 4
        Process OMPI jobid: [8056,1] App: 0 Process rank: 5

 Data for node: node12  Num slots: 8  Max slots: 0  Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 6
        Process OMPI jobid: [8056,1] App: 0 Process rank: 7

 =============================================================
Hello world from process 1 of 8
Hello world from process 0 of 8
Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 5 of 8
Hello world from process 6 of 8
Hello world from process 7 of 8

I don't know why it doesn't work with SGE. Could you show me your output after adding the -display-map and -mca rmaps_base_verbose 5 options?

By the way, the option -map-by ppr:N:node or ppr:N:socket might be useful for your purpose. The ppr mapping can reduce the slot count given by the RM without binding, allocating N procs per specified resource unit.
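For the hybrid case, a minimal sketch of such a launch might be the following (I have not verified this exact combination; -bind-to none and -x OMP_NUM_THREADS are standard mpirun options, and 8 threads is only an example):

===
# one MPI rank per node via ppr; OpenMP fans out over the node's cores
export OMP_NUM_THREADS=8
mpirun -map-by ppr:1:node -bind-to none -x OMP_NUM_THREADS ./inverse.exe
===

Here is ppr:1:node by itself on my cluster: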
[mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map ~/mis/openmpi/demos/myprog
 Data for JOB [7913,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 0

 Data for node: node10  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 1

 Data for node: node11  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 2

 Data for node: node12  Num slots: 8  Max slots: 0  Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 3

 =============================================================
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4

Tetsuya

> Hi,
>
> On 20.08.2014 at 13:26, tmish...@jcity.maeda.co.jp wrote:
>
> > Reuti,
> >
> > If you want to allocate 10 procs with N threads, the Torque
> > script below should work for you:
> >
> > qsub -l nodes=10:ppn=N
> > mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>
> I played around with giving -np 10 in addition to a Tight Integration. The slot count is not really divided, I think; only 10 out of the granted maximum are used (while an `orted` is started on each of the listed machines). Due to the fixed allocation this is of course the result we want to achieve, as it subtracts bunches of 8 from the given list of machines resp. slots. In SGE it's sufficient to use the following, and AFAICS it works (without touching the $PE_HOSTFILE any longer):
>
> ===
> export OMP_NUM_THREADS=8
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
> ===
>
> and submit with:
>
> $ qsub -pe orte 80 job.sh
>
> as the variables are distributed to the slave nodes by SGE already.
>
> Nevertheless, using -np in addition to the Tight Integration gives it a taste of a kind of half-tight integration in some way. And it would not work for us, because "--bind-to none" can't be used in such a command (see below) and throws an error.
>
> > Then, Open MPI automatically reduces the logical slot count to 10
> > by dividing the real slot count of 10N by the binding width N.
> >
> > I don't know why you want to use pe=N without binding, but unfortunately
> > Open MPI so far allocates successive cores to each process when you
> > use the pe option - it forcibly binds to core.
>
> In a shared cluster with many users and different MPI libraries in use, only the queuing system can know which job got which cores granted. This avoids any oversubscription of cores while others are idle.
>
> -- Reuti
>
> > Tetsuya
> >
> >> Hi,
> >>
> >> On 20.08.2014 at 06:26, Tetsuya Mishima wrote:
> >>
> >>> Reuti and Oscar,
> >>>
> >>> I'm a Torque user and I myself have never used SGE, so I hesitated to join
> >>> the discussion.
> >>>
> >>> From my experience with Torque, the Open MPI 1.8 series has already
> >>> resolved the issue you pointed out when combining MPI with OpenMP.
> >>>
> >>> Please try adding the --map-by slot:pe=8 option if you want to use 8 threads.
> >>> Then Open MPI 1.8 should allocate processes properly without any modification
> >>> of the hostfile provided by Torque.
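Assembled into one job script, Reuti's recipe above might look like the following sketch (assumptions: the "orte" PE and inverse.exe are those discussed in this thread, NSLOTS is set by SGE, and shell arithmetic stands in for the bc call):

===
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 80
#$ -N job

# 80 granted slots = 10 MPI ranks x 8 OpenMP threads per rank
export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS \
       -np $(( NSLOTS / OMP_NUM_THREADS )) ./inverse.exe
===

With 80 slots granted and pe=8, mpirun starts 10 ranks and binds 8 cores to each.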
> >>>
> >>> In your case (8 threads and 10 procs):
> >>>
> >>> # you have to request 80 slots using an SGE command before mpirun
> >>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
> >>
> >> Thx for pointing me to this option; for now I can't get it working though (in fact, I essentially want to use it without binding). This allows telling Open MPI to bind more cores to each of the MPI processes - OK, but does it lower the slot count granted by Torque too? I mean, was your submission command like:
> >>
> >> $ qsub -l nodes=10:ppn=8 ...
> >>
> >> so that Torque knows that it should grant and remember this slot count of a total of 80 for the correct accounting?
> >>
> >> -- Reuti
> >>
> >>> where you can omit the --bind-to option, because --bind-to core is assumed
> >>> as the default when pe=N is provided by the user.
> >>> Regards,
> >>> Tetsuya
> >>>
> >>>> Hi,
> >>>>
> >>>> On 19.08.2014 at 19:06, Oscar Mojica wrote:
> >>>>
> >>>>> I discovered what the error was. I forgot to include '-fopenmp' when I compiled the objects in the Makefile, so the program worked but didn't divide the job into threads. Now the program is working and I can use up to 15 cores per machine in the queue one.q.
> >>>>>
> >>>>> Anyway, I would like to try to implement your advice. Well, I'm not alone in the cluster, so I must implement your second suggestion. The steps are:
> >>>>>
> >>>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
> >>>>
> >>>> Was the number of slots defined in your used one.q also increased to 8 (`qconf -sq one.q`)?
> >>>>
> >>>>> b) Set '#$ -pe orte 80' in the script
> >>>>
> >>>> Fine.
> >>>>
> >>>>> c) I'm not sure how to do this step. I'd appreciate your help here. I can add some lines to the script to determine the $PE_HOSTFILE path and contents, but I don't know how to alter it.
> >>>>
> >>>> For now you can put this in your job script (just after OMP_NUM_THREADS is exported):
> >>>>
> >>>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines
> >>>> export PE_HOSTFILE=$TMPDIR/machines
> >>>>
> >>>> =============
> >>>>
> >>>> Unfortunately no one else stepped into this discussion, as in my opinion it's a much broader issue which concerns all users who want to combine MPI with OpenMP. The queuing system should get a proper request for the overall amount of slots the user needs. For now this is forwarded to Open MPI, which uses the information to start the appropriate number of processes (which was an achievement of the out-of-the-box Tight Integration, of course) and ignores any setting of OMP_NUM_THREADS. So, where should the generated list of machines be adjusted? There are several options:
> >>>>
> >>>> a) The PE of the queuing system should do it:
> >>>>
> >>>> + a one-time setup for the admin
> >>>> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
> >>>> - the "start_proc_args" would need to know the number of threads, i.e. OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the job script (tricky scanning of the submitted job script for OMP_NUM_THREADS would be too nasty)
> >>>> - limits the job script to calls to libraries which behave the same way as Open MPI
> >>>>
> >>>> b) The particular queue should do it in a queue prolog:
> >>>>
> >>>> same as a), I think
> >>>>
> >>>> c) The user should do it:
> >>>>
> >>>> + no change in the SGE installation
> >>>> - each and every user must include it in all their job scripts to adjust the list and export the pointer to the altered $PE_HOSTFILE; they could change it back and forth for different steps of the job script, though
> >>>>
> >>>> d) Open MPI should do it:
> >>>>
> >>>> + no change in the SGE installation
> >>>> + no change to the job script
> >>>> + OMP_NUM_THREADS can be altered for different steps of the job script while automatically staying inside the granted allocation
> >>>> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)?
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>> echo "PE_HOSTFILE:"
> >>>>> echo $PE_HOSTFILE
> >>>>> echo
> >>>>> echo "cat PE_HOSTFILE:"
> >>>>> cat $PE_HOSTFILE
> >>>>>
> >>>>> Thanks for taking the time to answer these emails; your advice has been very useful.
> >>>>>
> >>>>> PS: The version of SGE is OGS/GE 2011.11p1
> >>>>>
> >>>>> Oscar Fabian Mojica Ladino
> >>>>> Geologist M.S. in Geophysics
> >>>>>
> >>>>>> From: re...@staff.uni-marburg.de
> >>>>>> Date: Fri, 15 Aug 2014 20:38:12 +0200
> >>>>>> To: us...@open-mpi.org
> >>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 15.08.2014 at 19:56, Oscar Mojica wrote:
> >>>>>>
> >>>>>>> Yes, my installation of Open MPI is SGE-aware. I got the following:
> >>>>>>>
> >>>>>>> [oscar@compute-1-2 ~]$ ompi_info | grep grid
> >>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
> >>>>>>
> >>>>>> Fine.
> >>>>>>
> >>>>>>> I'm a bit slow and didn't understand the last part of your message, so I made a test to try to resolve my doubts.
> >>>>>>> This is the cluster configuration (there are some machines turned off, but that is no problem):
> >>>>>>>
> >>>>>>> [oscar@aguia free-noise]$ qhost
> >>>>>>> HOSTNAME      ARCH       NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO   SWAPUS
> >>>>>>> -------------------------------------------------------------------------------
> >>>>>>> global        -             -     -       -       -       -       -
> >>>>>>> compute-1-10  linux-x64    16  0.97   23.6G  558.6M  996.2M     0.0
> >>>>>>> compute-1-11  linux-x64    16     -   23.6G       -  996.2M       -
> >>>>>>> compute-1-12  linux-x64    16  0.97   23.6G  561.1M  996.2M     0.0
> >>>>>>> compute-1-13  linux-x64    16  0.99   23.6G  558.7M  996.2M     0.0
> >>>>>>> compute-1-14  linux-x64    16  1.00   23.6G  555.1M  996.2M     0.0
> >>>>>>> compute-1-15  linux-x64    16  0.97   23.6G  555.5M  996.2M     0.0
> >>>>>>> compute-1-16  linux-x64     8  0.00   15.7G  296.9M 1000.0M     0.0
> >>>>>>> compute-1-17  linux-x64     8  0.00   15.7G  299.4M 1000.0M     0.0
> >>>>>>> compute-1-18  linux-x64     8     -   15.7G       - 1000.0M       -
> >>>>>>> compute-1-19  linux-x64     8     -   15.7G       -  996.2M       -
> >>>>>>> compute-1-2   linux-x64    16  1.19   23.6G  468.1M 1000.0M     0.0
> >>>>>>> compute-1-20  linux-x64     8  0.04   15.7G  297.2M 1000.0M     0.0
> >>>>>>> compute-1-21  linux-x64     8     -   15.7G       - 1000.0M       -
> >>>>>>> compute-1-22  linux-x64     8  0.00   15.7G  297.2M 1000.0M     0.0
> >>>>>>> compute-1-23  linux-x64     8  0.16   15.7G  299.6M 1000.0M     0.0
> >>>>>>> compute-1-24  linux-x64     8  0.00   15.7G  291.5M  996.2M     0.0
> >>>>>>> compute-1-25  linux-x64     8  0.04   15.7G  293.4M  996.2M     0.0
> >>>>>>> compute-1-26  linux-x64     8     -   15.7G       - 1000.0M       -
> >>>>>>> compute-1-27  linux-x64     8  0.00   15.7G  297.0M 1000.0M     0.0
> >>>>>>> compute-1-29  linux-x64     8     -   15.7G       - 1000.0M       -
> >>>>>>> compute-1-3   linux-x64    16     -   23.6G       -  996.2M       -
> >>>>>>> compute-1-30  linux-x64    16     -   23.6G       -  996.2M       -
> >>>>>>> compute-1-4   linux-x64    16  0.97   23.6G  571.6M  996.2M     0.0
> >>>>>>> compute-1-5   linux-x64    16  1.00   23.6G  559.6M  996.2M     0.0
> >>>>>>> compute-1-6   linux-x64    16  0.66   23.6G  403.1M  996.2M     0.0
> >>>>>>> compute-1-7   linux-x64    16  0.95   23.6G  402.7M  996.2M     0.0
> >>>>>>> compute-1-8   linux-x64    16  0.97   23.6G  556.8M  996.2M     0.0
> >>>>>>> compute-1-9   linux-x64    16  1.02   23.6G  566.0M 1000.0M     0.0
> >>>>>>>
> >>>>>>> I ran my program using only MPI with 10 processors of the queue one.q, which has 14 machines (compute-1-2 to compute-1-15).
> >>>>>>> With 'qstat -t' I got:
> >>>>>>>
> >>>>>>> [oscar@aguia free-noise]$ qstat -t
> >>>>>>> job-ID  prior    name  user   state  submit/start at      queue  master  ja-task-ID  task-ID  state  cpu  mem  io  stat  failed
> >>>>>>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-2.local   MASTER                  r  00:49:12  554.13753  0.09163
> >>>>>>>                                                    one.q@compute-1-2.local   SLAVE
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-5.local   SLAVE  1.compute-1-5    r  00:48:53  551.49022  0.09410
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-9.local   SLAVE  1.compute-1-9    r  00:50:00  564.22764  0.09409
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-12.local  SLAVE  1.compute-1-12   r  00:47:30  535.30379  0.09379
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-13.local  SLAVE  1.compute-1-13   r  00:49:51  561.69868  0.09379
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-14.local  SLAVE  1.compute-1-14   r  00:49:14  554.60818  0.09379
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-10.local  SLAVE  1.compute-1-10   r  00:49:59  562.95487  0.09349
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-15.local  SLAVE  1.compute-1-15   r  00:50:01  563.27221  0.09361
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-8.local   SLAVE  1.compute-1-8    r  00:49:26  556.68431  0.09349
> >>>>>>> 2726  0.50500  job  oscar  r  08/15/2014 12:38:21  one.q@compute-1-4.local   SLAVE  1.compute-1-4    r  00:49:27  556.87510  0.04967
> >>>>>>
> >>>>>> Yes, here you got 10 slots (= cores) granted by SGE. So there is no free core left inside the SGE allocation to allow the use of additional cores for your threads. If you use more cores than granted by SGE, it will oversubscribe the machines.
> >>>>>>
> >>>>>> The issue is now:
> >>>>>>
> >>>>>> a) If you want 8 threads per MPI process, your job will use 80 cores in total - for now SGE isn't aware of that.
> >>>>>>
> >>>>>> b) Although you specified $fill_up as the allocation rule, it looks like $round_robin. Is there more than one slot defined in the queue definition of one.q to get exclusive access?
> >>>>>>
> >>>>>> c) What version of SGE are you using? Certain ones use cgroups or bind processes directly to cores (although it usually needs to be requested by the job: see the first line of `qconf -help`).
> >>>>>>
> >>>>>> In case you are alone in the cluster, you could bypass the allocation with b) (unless you are hit by c)).
> >>>>>> But with a mixture of users and jobs, a different handling would be necessary to do this in a proper way, IMO:
> >>>>>>
> >>>>>> a) having a PE with a fixed allocation rule of 8 (a sketch of such a PE follows after this quoted message)
> >>>>>>
> >>>>>> b) requesting this PE with an overall slot count of 80
> >>>>>>
> >>>>>> c) copying and altering the $PE_HOSTFILE to show only (granted core count per machine) divided by (OMP_NUM_THREADS) per entry, and changing $PE_HOSTFILE so that it points to the altered file
> >>>>>>
> >>>>>> d) Open MPI with a Tight Integration will then start only N processes per machine according to the altered hostfile - in your case, one
> >>>>>>
> >>>>>> e) your application can start the desired threads and you stay inside the granted allocation
> >>>>>>
> >>>>>> -- Reuti
> >>>>>>
> >>>>>>> I accessed the MASTER node with 'ssh compute-1-2.local', ran '$ ps -e f', and got this (I'm showing only the last lines):
> >>>>>>>
> >>>>>>>  2506 ?    Ss   0:00 /usr/sbin/atd
> >>>>>>>  2548 tty1 Ss+  0:00 /sbin/mingetty /dev/tty1
> >>>>>>>  2550 tty2 Ss+  0:00 /sbin/mingetty /dev/tty2
> >>>>>>>  2552 tty3 Ss+  0:00 /sbin/mingetty /dev/tty3
> >>>>>>>  2554 tty4 Ss+  0:00 /sbin/mingetty /dev/tty4
> >>>>>>>  2556 tty5 Ss+  0:00 /sbin/mingetty /dev/tty5
> >>>>>>>  2558 tty6 Ss+  0:00 /sbin/mingetty /dev/tty6
> >>>>>>>  3325 ?    Sl   0:04 /opt/gridengine/bin/linux-x64/sge_execd
> >>>>>>> 17688 ?    S    0:00  \_ sge_shepherd-2726 -bg
> >>>>>>> 17695 ?    Ss   0:00      \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
> >>>>>>> 17797 ?    S    0:00          \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> >>>>>>> 17798 ?    S    0:01              \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
> >>>>>>> 17799 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local PATH=/opt/openmpi/bin:$PATH ; expo
> >>>>>>> 17800 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local PATH=/opt/openmpi/bin:$PATH ; expo
> >>>>>>> 17801 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local PATH=/opt/openmpi/bin:$PATH ; exp
> >>>>>>> 17802 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local PATH=/opt/openmpi/bin:$PATH ; exp
> >>>>>>> 17803 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local PATH=/opt/openmpi/bin:$PATH ; exp
> >>>>>>> 17804 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local PATH=/opt/openmpi/bin:$PATH ; exp
> >>>>>>> 17805 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local PATH=/opt/openmpi/bin:$PATH ; exp
> >>>>>>> 17806 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local PATH=/opt/openmpi/bin:$PATH ; expo
> >>>>>>> 17807 ?    Sl   0:00                  \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local PATH=/opt/openmpi/bin:$PATH ; expo
> >>>>>>> 17826 ?    R   31:36                  \_ ./inverse.exe
> >>>>>>>  3429 ?    Ssl  0:00 automount --pid-file /var/run/autofs.pid
> >>>>>>>
> >>>>>>> So the job is using the 10 machines; up to here everything is all right. Do you think that changing the "allocation_rule" to a number instead of $fill_up would make the MPI processes divide the work into that number of threads?
> >>>>>>>
> >>>>>>> Thanks a lot
> >>>>>>>
> >>>>>>> Oscar Fabian Mojica Ladino
> >>>>>>> Geologist M.S. in Geophysics
> >>>>>>>
> >>>>>>> PS: I have another doubt: what is a slot? Is it a physical core?
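For step a) of Reuti's list, only the allocation rule of the "orte" PE (its full definition appears later in this thread) would need to change. A sketch of the resulting definition, assuming all other fields stay as Oscar posted them, edited via `qconf -mp orte` as suggested earlier:

===
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
===

With a fixed "allocation_rule 8", SGE grants slots only in bunches of 8 per machine, so a request for 80 slots yields exactly 10 machines.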
> >>>>>>>
> >>>>>>>> From: re...@staff.uni-marburg.de
> >>>>>>>> Date: Thu, 14 Aug 2014 23:54:22 +0200
> >>>>>>>> To: us...@open-mpi.org
> >>>>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I think this is a broader issue whenever an MPI library is used in conjunction with threads while running inside a queuing system. First: you can check whether your actual installation of Open MPI is SGE-aware with:
> >>>>>>>>
> >>>>>>>> $ ompi_info | grep grid
> >>>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
> >>>>>>>>
> >>>>>>>> Then we can look at the definition of your PE: "allocation_rule $fill_up". This means that SGE will grant you 14 slots in total in any combination on the available machines, meaning an 8+4+2 slot allocation is an allowed combination, like 4+4+3+3 and so on. Depending on the SGE-awareness it's a question: will your application just start processes on all nodes and completely disregard the granted allocation, or, as the other extreme, does it stay on one and the same machine for all started processes? On the master node of the parallel job you can issue:
> >>>>>>>>
> >>>>>>>> $ ps -e f
> >>>>>>>>
> >>>>>>>> (f w/o -) to have a look at whether `ssh` or `qrsh -inherit ...` is used to reach the other machines, and at their requested process count.
> >>>>>>>>
> >>>>>>>> Now to the common problem in such a setup:
> >>>>>>>>
> >>>>>>>> AFAICS, for now there is no way in the Open MPI + SGE combination to specify the number of MPI processes and the intended number of threads such that both are automatically read by Open MPI while staying inside the granted slot count and allocation. So it seems to be necessary to have the intended number of threads honored by Open MPI too.
> >>>>>>>>
> >>>>>>>> Hence specifying e.g. "allocation_rule 8" in such a setup while requesting 32 processes would for now already start 32 MPI processes, as Open MPI reads the $PE_HOSTFILE and acts accordingly.
> >>>>>>>>
> >>>>>>>> Open MPI would have to read the generated machine file in a slightly different way regarding threads: a) read the $PE_HOSTFILE, b) divide the granted slots per machine by OMP_NUM_THREADS, c) throw an error in case it's not divisible by OMP_NUM_THREADS. Then start one process per quotient. (A user-side sketch of this follows after this message.)
> >>>>>>>>
> >>>>>>>> Would this work for you?
> >>>>>>>>
> >>>>>>>> -- Reuti
> >>>>>>>>
> >>>>>>>> PS: This would also mean having a couple of PEs in SGE with a fixed "allocation_rule". While this works right now, an extension in SGE could be "$fill_up_omp"/"$round_robin_omp", using OMP_NUM_THREADS there too; hence it must not be specified as an `export` in the job script, but either on the command line or inside the job script in #$ lines as job requests. This would mean collecting slots in bunches of OMP_NUM_THREADS on each machine to reach the overall specified slot count. Whether OMP_NUM_THREADS or n times OMP_NUM_THREADS is allowed per machine needs to be discussed.
> >>>>>>>>
> >>>>>>>> PS2: As Univa SGE can also supply a list of granted cores in the $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow UGE-aware binding.
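Until Open MPI does this itself, a user-side sketch of the division Reuti describes, extending the awk one-liner from earlier in this thread with the divisibility check (assumptions: the slot count is the second field of $PE_HOSTFILE, and $TMPDIR points to the job's scratch directory as in Reuti's snippet):

===
# divide each host's granted slot count by the thread count;
# abort if it is not a multiple of OMP_NUM_THREADS
awk -v t="$OMP_NUM_THREADS" '
  $2 % t != 0 { print "slots on " $1 " not divisible by " t > "/dev/stderr"; exit 1 }
  { $2 /= t; print }
' "$PE_HOSTFILE" > "$TMPDIR/machines" || exit 1
export PE_HOSTFILE="$TMPDIR/machines"
===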
> >>>>>>>>
> >>>>>>>> On 14.08.2014 at 21:52, Oscar Mojica wrote:
> >>>>>>>>
> >>>>>>>>> Guys,
> >>>>>>>>>
> >>>>>>>>> I changed the line that runs the program in the script to both options:
> >>>>>>>>>
> >>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
> >>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
> >>>>>>>>>
> >>>>>>>>> but I got the same results. When I use 'man mpirun' it shows:
> >>>>>>>>>
> >>>>>>>>> -bind-to-none, --bind-to-none
> >>>>>>>>> Do not bind processes. (Default.)
> >>>>>>>>>
> >>>>>>>>> and the output of 'qconf -sp orte' is:
> >>>>>>>>>
> >>>>>>>>> pe_name            orte
> >>>>>>>>> slots              9999
> >>>>>>>>> user_lists         NONE
> >>>>>>>>> xuser_lists        NONE
> >>>>>>>>> start_proc_args    /bin/true
> >>>>>>>>> stop_proc_args     /bin/true
> >>>>>>>>> allocation_rule    $fill_up
> >>>>>>>>> control_slaves     TRUE
> >>>>>>>>> job_is_first_task  FALSE
> >>>>>>>>> urgency_slots      min
> >>>>>>>>> accounting_summary TRUE
> >>>>>>>>>
> >>>>>>>>> I don't know if the installed Open MPI was compiled with '--with-sge'. How can I know that?
> >>>>>>>>> Before thinking about a hybrid application I was using only MPI, and the program used few processors (14). The cluster has 28 machines, 15 with 16 cores and 13 with 8 cores, totaling 344 processing units. When I submitted the job (MPI only), the MPI processes were spread across the cores directly; for that reason I created a new queue with 14 machines, trying to gain more time. The results were the same in both cases. In the last case I could verify that the processes were distributed to all machines correctly.
> >>>>>>>>>
> >>>>>>>>> What must I do?
> >>>>>>>>> Thanks
> >>>>>>>>>
> >>>>>>>>> Oscar Fabian Mojica Ladino
> >>>>>>>>> Geologist M.S. in Geophysics
> >>>>>>>>>
> >>>>>>>>>> Date: Thu, 14 Aug 2014 10:10:17 -0400
> >>>>>>>>>> From: maxime.boissonnea...@calculquebec.ca
> >>>>>>>>>> To: us...@open-mpi.org
> >>>>>>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> You DEFINITELY need to disable Open MPI's new default binding. Otherwise, your N threads will run on a single core. --bind-to socket would be my recommendation for hybrid jobs.
> >>>>>>>>>>
> >>>>>>>>>> Maxime
> >>>>>>>>>>
> >>>>>>>>>> On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
> >>>>>>>>>>> I don't know much about OpenMP, but do you need to disable Open MPI's default bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)?
> >>>>>>>>>>>
> >>>>>>>>>>> You can try "mpirun --bind-to none ...", which will have Open MPI not bind MPI processes to cores, which might allow OpenMP to think that it can use all the cores, and therefore it will spawn num_cores threads...?
> >>>>>>>>>>>
> >>>>>>>>>>> On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hello everybody,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am trying to run a hybrid MPI + OpenMP program on a cluster. I created a queue with 14 machines, each one with 16 cores. The program divides the work among the 14 processors with MPI, and within each process a loop is also divided into 8 threads, for example, using OpenMP.
> >>>>>>>>>>>> The problem is that when I submit the job to the queue, the MPI processes don't divide the work into threads, and the program prints the number of threads working within each process as one.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I made a simple test program that uses OpenMP, and I logged in to one machine of the fourteen. I compiled it using 'gfortran -fopenmp program.f -o exe', set the OMP_NUM_THREADS environment variable to 8, and when I ran it directly in the terminal the loop was effectively divided among the cores; in this case the program printed the number of threads as 8.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is my Makefile:
> >>>>>>>>>>>>
> >>>>>>>>>>>> # Start of the makefile
> >>>>>>>>>>>> # Defining variables
> >>>>>>>>>>>> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
> >>>>>>>>>>>> #f90comp = /opt/openmpi/bin/mpif90
> >>>>>>>>>>>> f90comp = /usr/bin/mpif90
> >>>>>>>>>>>> #switch = -O3
> >>>>>>>>>>>> executable = inverse.exe
> >>>>>>>>>>>> # Makefile
> >>>>>>>>>>>> all : $(executable)
> >>>>>>>>>>>> $(executable) : $(objects)
> >>>>>>>>>>>> 	$(f90comp) -fopenmp -g -O -o $(executable) $(objects)
> >>>>>>>>>>>> 	rm $(objects)
> >>>>>>>>>>>> %.o: %.f
> >>>>>>>>>>>> 	$(f90comp) -c $<
> >>>>>>>>>>>> # Cleaning everything
> >>>>>>>>>>>> clean:
> >>>>>>>>>>>> 	rm $(executable)
> >>>>>>>>>>>> #	rm $(objects)
> >>>>>>>>>>>> # End of the makefile
> >>>>>>>>>>>>
> >>>>>>>>>>>> and the script that I am using is:
> >>>>>>>>>>>>
> >>>>>>>>>>>> #!/bin/bash
> >>>>>>>>>>>> #$ -cwd
> >>>>>>>>>>>> #$ -j y
> >>>>>>>>>>>> #$ -S /bin/bash
> >>>>>>>>>>>> #$ -pe orte 14
> >>>>>>>>>>>> #$ -N job
> >>>>>>>>>>>> #$ -q new.q
> >>>>>>>>>>>>
> >>>>>>>>>>>> export OMP_NUM_THREADS=8
> >>>>>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe
> >>>>>>>>>>>>
> >>>>>>>>>>>> Am I forgetting something?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Oscar Fabian Mojica Ladino
> >>>>>>>>>>>> Geologist M.S. in Geophysics
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> ---------------------------------
> >>>>>>>>>> Maxime Boissonneault
> >>>>>>>>>> Computing analyst - Calcul Québec, Université Laval
> >>>>>>>>>> Ph.D. in physics
> >>>
> >>> ----
> >>> Tetsuya Mishima  tmish...@jcity.maeda.co.jp
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25087.php