Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Tetsuya Mishima
Reuti and Oscar,

I'm a Torque user and have never used SGE myself, so I hesitated to join 
the discussion.

From my experience with Torque, the Open MPI 1.8 series has already 
resolved the issue you pointed out about combining MPI with OpenMP. 

Please try adding the --map-by slot:pe=8 option if you want to use 8 threads. 
Open MPI 1.8 should then allocate the processes properly without any 
modification 
of the hostfile provided by Torque.

In your case (8 threads and 10 procs):

# you have to request 80 slots using an SGE command before mpirun 
mpirun --map-by slot:pe=8 -np 10 ./inverse.exe

where you can omit the --bind-to option, because --bind-to core is assumed
by default when pe=N is provided by the user.
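
For illustration, a complete SGE job script along these lines might look like
the sketch below (the PE name "orte", the 80-slot request and the executable
name are taken from this thread; the other directives are assumptions, so treat
it as an untested sketch):

===
#!/bin/sh
#$ -pe orte 80        # 10 MPI ranks x 8 threads = 80 slots
#$ -cwd
export OMP_NUM_THREADS=8
# one rank per 8 cores; --bind-to core is implied by pe=8
mpirun --map-by slot:pe=$OMP_NUM_THREADS -np 10 -x OMP_NUM_THREADS ./inverse.exe
===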

Regards,
Tetsuya

>Hi,
>
>Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>
>> I discovered what was the error. I forgot include the '-fopenmp' when I 
>> compiled the objects in the Makefile, so the program worked but it didn't 
>> divide the job 
in threads. Now the program is working and I can use until 15 cores for machine 
in the queue one.q.
>> 
>> Anyway i would like to try implement your advice. Well I'm not alone in the 
>> cluster so i must implement your second suggestion. The steps are
>> 
>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>
>The number of slots defined in your used one.q was also increased to 8 (`qconf 
>-sq one.q`)?
>
>
>> b) Set '#$ -pe orte 80' in the script
>
>Fine.
>
>
>> c) I'm not sure how to do this step. I'd appreciate your help here. I can 
>> add some lines to the script to determine the PE_HOSTFILE path and contents, 
>> but i 
don't know how alter it 
>
>For now you can put in your jobscript (just after OMP_NUM_THREAD is exported):
>
>awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' 
>$PE_HOSTFILE > $TMPDIR/machines
>export PE_HOSTFILE=$TMPDIR/machines
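
To illustrate what this awk line does (hypothetical host names; each line of an
SGE pe_hostfile has the form "host slots queue processor_range"): with
OMP_NUM_THREADS=8 and an allocation rule of 8, a granted PE_HOSTFILE such as

node01 8 one.q@node01 UNDEFINED
node02 8 one.q@node02 UNDEFINED

would be rewritten into $TMPDIR/machines as

node01 1 one.q@node01 UNDEFINED
node02 1 one.q@node02 UNDEFINED

so Open MPI starts only one process per host and the remaining cores are left
for the OpenMP threads.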
>
>=
>
>Unfortunately noone stepped into this discussion, as in my opinion it's a much 
>broader issue which targets all users who want to combine MPI with OpenMP. The 
queuingsystem should get a proper request for the overall amount of slots the 
user needs. For now this will be forwarded to Open MPI and it will use this 
information to start the appropriate number of processes (which was an 
achievement for the Tight Integration out-of-the-box of course) and ignores any 
setting of 
OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; 
there are several options:
>
>a) The PE of the queuingsystem should do it:
>
>+ a one time setup for the admin
>+ in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
>- the "start_proc_args" would need to know the number of threads, i.e. 
>OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript 
>(tricky scanning 
of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
>- limits to use inside the jobscript calls to libraries behaving in the same 
>way as Open MPI only
>
>
>b) The particular queue should do it in a queue prolog:
>
>same as a) I think
>
>
>c) The user should do it
>
>+ no change in the SGE installation
>- each and every user must include it in all the jobscripts to adjust the list 
>and export the pointer to the $PE_HOSTFILE, but he could change it forth and 
>back 
for different steps of the jobscript though
>
>
>d) Open MPI should do it
>
>+ no change in the SGE installation
>+ no change to the jobscript
>+ OMP_NUM_THREADS can be altered for different steps of the jobscript while 
>staying inside the granted allocation automatically
>o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)?
>
>-- Reuti
>
>
>> echo "PE_HOSTFILE:"
>> echo $PE_HOSTFILE
>> echo
>> echo "cat PE_HOSTFILE:"
>> cat $PE_HOSTFILE 
>> 
>> Thanks for take a time for answer this emails, your advices had been very 
>> useful
>> 
>> PS: The version of SGE is   OGS/GE 2011.11p1
>> 
>> 
>> Oscar Fabian Mojica Ladino
>> Geologist M.S. in  Geophysics
>> 
>> 
>> > From: re...@staff.uni-marburg.de
>> > Date: Fri, 15 Aug 2014 20:38:12 +0200
>> > To: us...@open-mpi.org
>> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>> > 
>> > Hi,
>> > 
>> > Am 15.08.2014 um 19:56 schrieb Oscar Mojica:
>> > 
>> > > Yes, my installation of Open MPI is SGE-aware. I got the following
>> > > 
>> > > [oscar@compute-1-2 ~]$ ompi_info | grep grid
>> > > MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
>> > 
>> > Fine.
>> > 
>> > 
>> > > I'm a bit slow and I didn't understand the las part of your message. So 
>> > > i made a test trying to solve my doubts.
>> > > This is the cluster configuration: There are some machines turned off 
>> > > but that is no problem
>> > > 
>> > > [oscar@aguia free-noise]$ qhost
>> > > HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
>> > > ---
>> > > global - - - - - - -
>> > > compute-1-10 linux-x64 16 0.97 23.6G 558.6M 996.2M 0.0
>> > > compute-1-11 

[OMPI users] Does multiple Irecv means concurrent receiving ?

2014-08-20 Thread Zhang,Lei(Ecom)
I have a performance problem with receiving. In a single master thread, I made 
several Irecv calls:

Irecv(buf1, ..., tag, ANY_SOURCE, COMM_WORLD)
Irecv(buf2, ..., tag, ANY_SOURCE, COMM_WORLD)
...
Irecv(bufn, ..., tag, ANY_SOURCE, COMM_WORLD)

all of which try to receive from any node for messages with the same tag.

Then, whenever one of the Irecvs completes (detected with Testany), a separate 
thread is dispatched to work on the received message.
In my program, many nodes will send to this master thread.
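
For reference, a minimal sketch of this receive pattern in MPI C (the number of
outstanding receives, buffer sizes and the polling loop are made up for
illustration and are not taken from the original program):

---
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BUFLEN 4096     /* hypothetical message length in ints */
#define TAG    42

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                     /* the "master" side */
        int nreq = size - 1;             /* expect one message per other rank */
        int *bufs = malloc((size_t)nreq * BUFLEN * sizeof(int));
        MPI_Request *req = malloc((size_t)nreq * sizeof(MPI_Request));

        for (int i = 0; i < nreq; i++)   /* post all receives up front */
            MPI_Irecv(bufs + (size_t)i * BUFLEN, BUFLEN, MPI_INT,
                      MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &req[i]);

        int done = 0;
        while (done < nreq) {            /* poll for any completed receive */
            int idx, flag;
            MPI_Status st;
            MPI_Testany(nreq, req, &idx, &flag, &st);
            if (flag && idx != MPI_UNDEFINED) {
                /* the real code would hand this buffer to a worker thread */
                printf("request %d completed, sender was rank %d\n",
                       idx, st.MPI_SOURCE);
                done++;
            }
        }
        free(req);
        free(bufs);
    } else {                             /* every other rank sends one message */
        int buf[BUFLEN];
        for (int j = 0; j < BUFLEN; j++)
            buf[j] = rank;
        MPI_Send(buf, BUFLEN, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
---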

However, I noticed that the receive speed is almost unaffected no matter how 
many Irecv calls I make.
It seems that multiple Irecv calls do not mean concurrent receiving from 
many nodes.
By profiling the node running the master thread, I can see that the network 
input bandwidth is quite low.

Is my understanding correct? If so, how can I maximize the recv throughput of 
the master thread?

Thanks !

Zhang Lei
@ Baidu, Inc.


Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Hi,

Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:

> Reuti and Oscar,
> 
> I'm a Torque user and I myself have never used SGE, so I hesitated to join 
> the discussion.
> 
> From my experience with the Torque, the openmpi 1.8 series has already 
> resolved the issue you pointed out in combining MPI with OpenMP. 
> 
> Please try to add --map-by slot:pe=8 option, if you want to use 8 threads. 
> Then, then openmpi 1.8 should allocate processes properly without any 
> modification 
> of the hostfile provided by the Torque.
> 
> In your case(8 threads and 10 procs):
> 
> # you have to request 80 slots using SGE command before mpirun 
> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe

Thanks for pointing me to this option; for now I can't get it working though 
(in fact, I essentially want to use it without binding). It allows telling Open 
MPI to bind more cores to each of the MPI processes - ok, but does it also lower 
the slot count granted by Torque? I mean, was your submission command like:

$ qsub -l nodes=10:ppn=8 ...

so that Torque knows it should grant and remember this slot count of 80 in 
total for the correct accounting?

-- Reuti


> where you can omit --bind-to option because --bind-to core is assumed
> as default when pe=N is provided by the user.
> Regards,
> Tetsuya
> 
>> Hi,
>> 
>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>> 
>>> I discovered what was the error. I forgot include the '-fopenmp' when I 
>>> compiled the objects in the Makefile, so the program worked but it didn't 
>>> divide the job 
> in threads. Now the program is working and I can use until 15 cores for 
> machine in the queue one.q.
>>> 
>>> Anyway i would like to try implement your advice. Well I'm not alone in the 
>>> cluster so i must implement your second suggestion. The steps are
>>> 
>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>> 
>> The number of slots defined in your used one.q was also increased to 8 
>> (`qconf -sq one.q`)?
>> 
>> 
>>> b) Set '#$ -pe orte 80' in the script
>> 
>> Fine.
>> 
>> 
>>> c) I'm not sure how to do this step. I'd appreciate your help here. I can 
>>> add some lines to the script to determine the PE_HOSTFILE path and 
>>> contents, but i 
> don't know how alter it 
>> 
>> For now you can put in your jobscript (just after OMP_NUM_THREAD is 
>> exported):
>> 
>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' 
>> $PE_HOSTFILE > $TMPDIR/machines
>> export PE_HOSTFILE=$TMPDIR/machines
>> 
>> =
>> 
>> Unfortunately noone stepped into this discussion, as in my opinion it's a 
>> much broader issue which targets all users who want to combine MPI with 
>> OpenMP. The 
> queuingsystem should get a proper request for the overall amount of slots the 
> user needs. For now this will be forwarded to Open MPI and it will use this 
> information to start the appropriate number of processes (which was an 
> achievement for the Tight Integration out-of-the-box of course) and ignores 
> any setting of 
> OMP_NUM_THREADS. So, where should the generated list of machines be adjusted; 
> there are several options:
>> 
>> a) The PE of the queuingsystem should do it:
>> 
>> + a one time setup for the admin
>> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
>> - the "start_proc_args" would need to know the number of threads, i.e. 
>> OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript 
>> (tricky scanning 
> of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
>> - limits to use inside the jobscript calls to libraries behaving in the same 
>> way as Open MPI only
>> 
>> 
>> b) The particular queue should do it in a queue prolog:
>> 
>> same as a) I think
>> 
>> 
>> c) The user should do it
>> 
>> + no change in the SGE installation
>> - each and every user must include it in all the jobscripts to adjust the 
>> list and export the pointer to the $PE_HOSTFILE, but he could change it 
>> forth and back 
> for different steps of the jobscript though
>> 
>> 
>> d) Open MPI should do it
>> 
>> + no change in the SGE installation
>> + no change to the jobscript
>> + OMP_NUM_THREADS can be altered for different steps of the jobscript while 
>> staying inside the granted allocation automatically
>> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS 
>> already)?
>> 
>> -- Reuti
>> 
>> 
>>> echo "PE_HOSTFILE:"
>>> echo $PE_HOSTFILE
>>> echo
>>> echo "cat PE_HOSTFILE:"
>>> cat $PE_HOSTFILE 
>>> 
>>> Thanks for take a time for answer this emails, your advices had been very 
>>> useful
>>> 
>>> PS: The version of SGE is   OGS/GE 2011.11p1
>>> 
>>> 
>>> Oscar Fabian Mojica Ladino
>>> Geologist M.S. in  Geophysics
>>> 
>>> 
 From: re...@staff.uni-marburg.de
 Date: Fri, 15 Aug 2014 20:38:12 +0200
 To: us...@open-mpi.org
 Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
 
 Hi,
 
 Am 15.08.2014 um 19:56 schrieb Oscar Mojica:
 
> Yes, m

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread tmishima
Reuti,

If you want to allocate 10 procs with N threads each, the Torque
script below should work for you:

qsub -l nodes=10:ppn=N
mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe

Then Open MPI automatically reduces the logical slot count to 10
by dividing the real slot count of 10N by the binding width N.
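
For comparison, a full Torque job script built from those two lines might look
like this sketch (N is fixed to 8 here for illustration; everything apart from
the resource request and the mpirun line is an assumption):

===
#!/bin/sh
#PBS -l nodes=10:ppn=8
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np 10 -x OMP_NUM_THREADS=$OMP_NUM_THREADS ./inverse.exe
===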

I don't know why you want to use pe=N without binding, but unfortunately
Open MPI so far allocates successive cores to each process when you
use the pe option - it forcibly binds to core.

Tetsuya


> Hi,
>
> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>
> > Reuti and Oscar,
> >
> > I'm a Torque user and I myself have never used SGE, so I hesitated to
join
> > the discussion.
> >
> > From my experience with the Torque, the openmpi 1.8 series has already
> > resolved the issue you pointed out in combining MPI with OpenMP.
> >
> > Please try to add --map-by slot:pe=8 option, if you want to use 8
threads.
> > Then, then openmpi 1.8 should allocate processes properly without any
modification
> > of the hostfile provided by the Torque.
> >
> > In your case(8 threads and 10 procs):
> >
> > # you have to request 80 slots using SGE command before mpirun
> > mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>
> Thx for pointing me to this option, for now I can't get it working though
(in fact, I want to use it without binding essentially). This allows to
tell Open MPI to bind more cores to each of the MPI
> processes - ok, but does it lower the slot count granted by Torque too? I
mean, was your submission command like:
>
> $ qsub -l nodes=10:ppn=8 ...
>
> so that Torque knows, that it should grant and remember this slot count
of a total of 80 for the correct accounting?
>
> -- Reuti
>
>
> > where you can omit --bind-to option because --bind-to core is assumed
> > as default when pe=N is provided by the user.
> > Regards,
> > Tetsuya
> >
> >> Hi,
> >>
> >> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
> >>
> >>> I discovered what was the error. I forgot include the '-fopenmp' when
I compiled the objects in the Makefile, so the program worked but it didn't
divide the job
> > in threads. Now the program is working and I can use until 15 cores for
machine in the queue one.q.
> >>>
> >>> Anyway i would like to try implement your advice. Well I'm not alone
in the cluster so i must implement your second suggestion. The steps are
> >>>
> >>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
> >>
> >> The number of slots defined in your used one.q was also increased to 8
(`qconf -sq one.q`)?
> >>
> >>
> >>> b) Set '#$ -pe orte 80' in the script
> >>
> >> Fine.
> >>
> >>
> >>> c) I'm not sure how to do this step. I'd appreciate your help here. I
can add some lines to the script to determine the PE_HOSTFILE path and
contents, but i
> > don't know how alter it
> >>
> >> For now you can put in your jobscript (just after OMP_NUM_THREAD is
exported):
> >>
> >> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads;
print }' $PE_HOSTFILE > $TMPDIR/machines
> >> export PE_HOSTFILE=$TMPDIR/machines
> >>
> >> =
> >>
> >> Unfortunately noone stepped into this discussion, as in my opinion
it's a much broader issue which targets all users who want to combine MPI
with OpenMP. The
> > queuingsystem should get a proper request for the overall amount of
slots the user needs. For now this will be forwarded to Open MPI and it
will use this
> > information to start the appropriate number of processes (which was an
achievement for the Tight Integration out-of-the-box of course) and ignores
any setting of
> > OMP_NUM_THREADS. So, where should the generated list of machines be
adjusted; there are several options:
> >>
> >> a) The PE of the queuingsystem should do it:
> >>
> >> + a one time setup for the admin
> >> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
> >> - the "start_proc_args" would need to know the number of threads, i.e.
OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript
(tricky scanning
> > of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
> >> - limits to use inside the jobscript calls to libraries behaving in
the same way as Open MPI only
> >>
> >>
> >> b) The particular queue should do it in a queue prolog:
> >>
> >> same as a) I think
> >>
> >>
> >> c) The user should do it
> >>
> >> + no change in the SGE installation
> >> - each and every user must include it in all the jobscripts to adjust
the list and export the pointer to the $PE_HOSTFILE, but he could change it
forth and back
> > for different steps of the jobscript though
> >>
> >>
> >> d) Open MPI should do it
> >>
> >> + no change in the SGE installation
> >> + no change to the jobscript
> >> + OMP_NUM_THREADS can be altered for different steps of the jobscript
while staying inside the granted allocation automatically
> >> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS
already)?
> >>
> >> -- Reuti
> >>
> >>
> >>> echo "PE_HOSTFILE:"
> >>> echo 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain
Just to clarify: OMPI will bind the process to *all* N cores, not just to one.


On Aug 20, 2014, at 4:26 AM, tmish...@jcity.maeda.co.jp wrote:

> Reuti,
> 
> If you want to allocate 10 procs with N threads, the Torque
> script below should work for you:
> 
> qsub -l nodes=10:ppn=N
> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
> 
> Then, the openmpi automatically reduces the logical slot count to 10
> by dividing real slot count 10N by binding width of N.
> 
> I don't know why you want to use pe=N without binding, but unfortunately
> the openmpi allocates successive cores to each process so far when you
> use pe option - it forcibly bind_to core.
> 
> Tetsuya
> 
> 
>> Hi,
>> 
>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>> 
>>> Reuti and Oscar,
>>> 
>>> I'm a Torque user and I myself have never used SGE, so I hesitated to
> join
>>> the discussion.
>>> 
>>> From my experience with the Torque, the openmpi 1.8 series has already
>>> resolved the issue you pointed out in combining MPI with OpenMP.
>>> 
>>> Please try to add --map-by slot:pe=8 option, if you want to use 8
> threads.
>>> Then, then openmpi 1.8 should allocate processes properly without any
> modification
>>> of the hostfile provided by the Torque.
>>> 
>>> In your case(8 threads and 10 procs):
>>> 
>>> # you have to request 80 slots using SGE command before mpirun
>>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>> 
>> Thx for pointing me to this option, for now I can't get it working though
> (in fact, I want to use it without binding essentially). This allows to
> tell Open MPI to bind more cores to each of the MPI
>> processes - ok, but does it lower the slot count granted by Torque too? I
> mean, was your submission command like:
>> 
>> $ qsub -l nodes=10:ppn=8 ...
>> 
>> so that Torque knows, that it should grant and remember this slot count
> of a total of 80 for the correct accounting?
>> 
>> -- Reuti
>> 
>> 
>>> where you can omit --bind-to option because --bind-to core is assumed
>>> as default when pe=N is provided by the user.
>>> Regards,
>>> Tetsuya
>>> 
 Hi,
 
 Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
 
> I discovered what was the error. I forgot include the '-fopenmp' when
> I compiled the objects in the Makefile, so the program worked but it didn't
> divide the job
>>> in threads. Now the program is working and I can use until 15 cores for
> machine in the queue one.q.
> 
> Anyway i would like to try implement your advice. Well I'm not alone
> in the cluster so i must implement your second suggestion. The steps are
> 
> a) Use '$ qconf -mp orte' to change the allocation rule to 8
 
 The number of slots defined in your used one.q was also increased to 8
> (`qconf -sq one.q`)?
 
 
> b) Set '#$ -pe orte 80' in the script
 
 Fine.
 
 
> c) I'm not sure how to do this step. I'd appreciate your help here. I
> can add some lines to the script to determine the PE_HOSTFILE path and
> contents, but i
>>> don't know how alter it
 
 For now you can put in your jobscript (just after OMP_NUM_THREAD is
> exported):
 
 awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads;
> print }' $PE_HOSTFILE > $TMPDIR/machines
 export PE_HOSTFILE=$TMPDIR/machines
 
 =
 
 Unfortunately noone stepped into this discussion, as in my opinion
> it's a much broader issue which targets all users who want to combine MPI
> with OpenMP. The
>>> queuingsystem should get a proper request for the overall amount of
> slots the user needs. For now this will be forwarded to Open MPI and it
> will use this
>>> information to start the appropriate number of processes (which was an
> achievement for the Tight Integration out-of-the-box of course) and ignores
> any setting of
>>> OMP_NUM_THREADS. So, where should the generated list of machines be
> adjusted; there are several options:
 
 a) The PE of the queuingsystem should do it:
 
 + a one time setup for the admin
 + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
 - the "start_proc_args" would need to know the number of threads, i.e.
> OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript
> (tricky scanning
>>> of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
 - limits to use inside the jobscript calls to libraries behaving in
> the same way as Open MPI only
 
 
 b) The particular queue should do it in a queue prolog:
 
 same as a) I think
 
 
 c) The user should do it
 
 + no change in the SGE installation
 - each and every user must include it in all the jobscripts to adjust
> the list and export the pointer to the $PE_HOSTFILE, but he could change it
> forth and back
>>> for different steps of the jobscript though
 
 
 d) Open MPI should do it
 
 + no change in the SGE installation
 + no change to the jo

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Timur Ismagilov

Hello!

As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have 
the problem:

a)
$ mpirun  -np 1 ./hello_c
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
b)
$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

c)

$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 
-mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority 
to 10
[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
priority to 75
[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
[compiler-2:14673] mca: base: components_register: registering oob components
[compiler-2:14673] mca: base: components_register: found loaded component tcp
[compiler-2:14673] mca: base: components_register: component tcp register 
function successful
[compiler-2:14673] mca: base: components_open: opening oob components
[compiler-2:14673] mca: base: components_open: found loaded component tcp
[compiler-2:14673] mca: base: components_open: component tcp open function 
successful
[compiler-2:14673] mca:oob:select: checking available component tcp
[compiler-2:14673] mca:oob:select: Querying component [tcp]
[compiler-2:14673] oob:tcp: component_available called
[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of 
V4 connections
[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[compiler-2:14673] [[49095,0],0] TCP STARTUP
[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
[compiler-2:14673] mca:oob:select: Adding component to end
[compiler-2:14673] mca:oob:select: Found 1 active transports
[compiler-2:14673] mca: base: components_register: registering rml components
[compiler-2:14673] mca: base: components_register: found loaded component oob
[compiler-2:14673] mca: base: components_register: component oob has no 
register or open function
[compiler-2:14673] mca: base: components_open: opening rml components
[compiler-2:14673] mca: base: components_open: found loaded component oob
[compiler-2:14673] mca: base: components_open: component oob open function 
successful
[compiler-2:14673] orte_rml_base_select: initializing rml component oob
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Hi,

Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:

> Reuti,
> 
> If you want to allocate 10 procs with N threads, the Torque
> script below should work for you:
> 
> qsub -l nodes=10:ppn=N
> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe

I played around with giving -np 10 in addition to a Tight Integration. The slot 
count is not really divided, I think; only 10 out of the granted maximum are 
used (while an `orted` is started on each of the listed machines). Due to the 
fixed allocation this is of course the result we want to achieve, as it 
subtracts bunches of 8 from the given list of machines resp. slots. In SGE it's 
sufficient to use the following, and AFAICS it works (without touching the 
$PE_HOSTFILE any longer):

===
export OMP_NUM_THREADS=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
$OMP_NUM_THREADS") ./inverse.exe
===

and submit with:

$ qsub -pe orte 80 job.sh

as the variables are distributed to the slave nodes by SGE already.

Nevertheless, using -np in addition to the Tight Integration gives it a taste of 
a kind of half-tight integration in some way. And it would not work for us, 
because "--bind-to none" can't be used in such a command (see below) and throws 
an error.


> Then, the openmpi automatically reduces the logical slot count to 10
> by dividing real slot count 10N by binding width of N.
> 
> I don't know why you want to use pe=N without binding, but unfortunately
> the openmpi allocates successive cores to each process so far when you
> use pe option - it forcibly bind_to core.

In a shared cluster with many users and different MPI libraries in use, only 
the queuing system can know which cores were granted to which job. This avoids 
any oversubscription of cores while others are idle.

-- Reuti


> Tetsuya
> 
> 
>> Hi,
>> 
>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>> 
>>> Reuti and Oscar,
>>> 
>>> I'm a Torque user and I myself have never used SGE, so I hesitated to
> join
>>> the discussion.
>>> 
>>> From my experience with the Torque, the openmpi 1.8 series has already
>>> resolved the issue you pointed out in combining MPI with OpenMP.
>>> 
>>> Please try to add --map-by slot:pe=8 option, if you want to use 8
> threads.
>>> Then, then openmpi 1.8 should allocate processes properly without any
> modification
>>> of the hostfile provided by the Torque.
>>> 
>>> In your case(8 threads and 10 procs):
>>> 
>>> # you have to request 80 slots using SGE command before mpirun
>>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>> 
>> Thx for pointing me to this option, for now I can't get it working though
> (in fact, I want to use it without binding essentially). This allows to
> tell Open MPI to bind more cores to each of the MPI
>> processes - ok, but does it lower the slot count granted by Torque too? I
> mean, was your submission command like:
>> 
>> $ qsub -l nodes=10:ppn=8 ...
>> 
>> so that Torque knows, that it should grant and remember this slot count
> of a total of 80 for the correct accounting?
>> 
>> -- Reuti
>> 
>> 
>>> where you can omit --bind-to option because --bind-to core is assumed
>>> as default when pe=N is provided by the user.
>>> Regards,
>>> Tetsuya
>>> 
 Hi,
 
 Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
 
> I discovered what was the error. I forgot include the '-fopenmp' when
> I compiled the objects in the Makefile, so the program worked but it didn't
> divide the job
>>> in threads. Now the program is working and I can use until 15 cores for
> machine in the queue one.q.
> 
> Anyway i would like to try implement your advice. Well I'm not alone
> in the cluster so i must implement your second suggestion. The steps are
> 
> a) Use '$ qconf -mp orte' to change the allocation rule to 8
 
 The number of slots defined in your used one.q was also increased to 8
> (`qconf -sq one.q`)?
 
 
> b) Set '#$ -pe orte 80' in the script
 
 Fine.
 
 
> c) I'm not sure how to do this step. I'd appreciate your help here. I
> can add some lines to the script to determine the PE_HOSTFILE path and
> contents, but i
>>> don't know how alter it
 
 For now you can put in your jobscript (just after OMP_NUM_THREAD is
> exported):
 
 awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads;
> print }' $PE_HOSTFILE > $TMPDIR/machines
 export PE_HOSTFILE=$TMPDIR/machines
 
 =
 
 Unfortunately noone stepped into this discussion, as in my opinion
> it's a much broader issue which targets all users who want to combine MPI
> with OpenMP. The
>>> queuingsystem should get a proper request for the overall amount of
> slots the user needs. For now this will be forwarded to Open MPI and it
> will use this
>>> information to start the appropriate number of processes (which was an
> achievement for the Tight Integration out-of-the-box of course) and ignores
> any setting of
>>> OMP_NUM_THREADS. So, where should the generate

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain

On Aug 20, 2014, at 6:58 AM, Reuti  wrote:

> Hi,
> 
> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:
> 
>> Reuti,
>> 
>> If you want to allocate 10 procs with N threads, the Torque
>> script below should work for you:
>> 
>> qsub -l nodes=10:ppn=N
>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
> 
> I played around with giving -np 10 in addition to a Tight Integration. The 
> slot count is not really divided I think, but only 10 out of the granted 
> maximum is used (while on each of the listed machines an `orted` is started). 
> Due to the fixed allocation this is of course the result we want to achieve 
> as it subtracts bunches of 8 from the given list of machines resp. slots. In 
> SGE it's sufficient to use and AFAICS it works (without touching the 
> $PE_HOSTFILE any longer):
> 
> ===
> export OMP_NUM_THREADS=8
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
> $OMP_NUM_THREADS") ./inverse.exe
> ===
> 
> and submit with:
> 
> $ qsub -pe orte 80 job.sh
> 
> as the variables are distributed to the slave nodes by SGE already.
> 
> Nevertheless, using -np in addition to the Tight Integration gives a taste of 
> a kind of half-tight integration in some way. And would not work for us 
> because "--bind-to none" can't be used in such a command (see below) and 
> throws an error.
> 
> 
>> Then, the openmpi automatically reduces the logical slot count to 10
>> by dividing real slot count 10N by binding width of N.
>> 
>> I don't know why you want to use pe=N without binding, but unfortunately
>> the openmpi allocates successive cores to each process so far when you
>> use pe option - it forcibly bind_to core.
> 
> In a shared cluster with many users and different MPI libraries in use, only 
> the queuingsystem could know which job got which cores granted. This avoids 
> any oversubscription of cores, while others are idle.

FWIW: we detect the exterior binding constraint and work within it


> 
> -- Reuti
> 
> 
>> Tetsuya
>> 
>> 
>>> Hi,
>>> 
>>> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
>>> 
 Reuti and Oscar,
 
 I'm a Torque user and I myself have never used SGE, so I hesitated to
>> join
 the discussion.
 
 From my experience with the Torque, the openmpi 1.8 series has already
 resolved the issue you pointed out in combining MPI with OpenMP.
 
 Please try to add --map-by slot:pe=8 option, if you want to use 8
>> threads.
 Then, then openmpi 1.8 should allocate processes properly without any
>> modification
 of the hostfile provided by the Torque.
 
 In your case(8 threads and 10 procs):
 
 # you have to request 80 slots using SGE command before mpirun
 mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>>> 
>>> Thx for pointing me to this option, for now I can't get it working though
>> (in fact, I want to use it without binding essentially). This allows to
>> tell Open MPI to bind more cores to each of the MPI
>>> processes - ok, but does it lower the slot count granted by Torque too? I
>> mean, was your submission command like:
>>> 
>>> $ qsub -l nodes=10:ppn=8 ...
>>> 
>>> so that Torque knows, that it should grant and remember this slot count
>> of a total of 80 for the correct accounting?
>>> 
>>> -- Reuti
>>> 
>>> 
 where you can omit --bind-to option because --bind-to core is assumed
 as default when pe=N is provided by the user.
 Regards,
 Tetsuya
 
> Hi,
> 
> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
> 
>> I discovered what was the error. I forgot include the '-fopenmp' when
>> I compiled the objects in the Makefile, so the program worked but it didn't
>> divide the job
 in threads. Now the program is working and I can use until 15 cores for
>> machine in the queue one.q.
>> 
>> Anyway i would like to try implement your advice. Well I'm not alone
>> in the cluster so i must implement your second suggestion. The steps are
>> 
>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
> 
> The number of slots defined in your used one.q was also increased to 8
>> (`qconf -sq one.q`)?
> 
> 
>> b) Set '#$ -pe orte 80' in the script
> 
> Fine.
> 
> 
>> c) I'm not sure how to do this step. I'd appreciate your help here. I
>> can add some lines to the script to determine the PE_HOSTFILE path and
>> contents, but i
 don't know how alter it
> 
> For now you can put in your jobscript (just after OMP_NUM_THREAD is
>> exported):
> 
> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads;
>> print }' $PE_HOSTFILE > $TMPDIR/machines
> export PE_HOSTFILE=$TMPDIR/machines
> 
> =
> 
> Unfortunately noone stepped into this discussion, as in my opinion
>> it's a much broader issue which targets all users who want to combine MPI
>> with OpenMP. The
 queuingsystem should get a proper request for the overall amount of
>> slots the use

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
Am 20.08.2014 um 16:26 schrieb Ralph Castain:

> On Aug 20, 2014, at 6:58 AM, Reuti  wrote:
> 
>> Hi,
>> 
>> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:
>> 
>>> Reuti,
>>> 
>>> If you want to allocate 10 procs with N threads, the Torque
>>> script below should work for you:
>>> 
>>> qsub -l nodes=10:ppn=N
>>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>> 
>> I played around with giving -np 10 in addition to a Tight Integration. The 
>> slot count is not really divided I think, but only 10 out of the granted 
>> maximum is used (while on each of the listed machines an `orted` is 
>> started). Due to the fixed allocation this is of course the result we want 
>> to achieve as it subtracts bunches of 8 from the given list of machines 
>> resp. slots. In SGE it's sufficient to use and AFAICS it works (without 
>> touching the $PE_HOSTFILE any longer):
>> 
>> ===
>> export OMP_NUM_THREADS=8
>> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
>> $OMP_NUM_THREADS") ./inverse.exe
>> ===
>> 
>> and submit with:
>> 
>> $ qsub -pe orte 80 job.sh
>> 
>> as the variables are distributed to the slave nodes by SGE already.
>> 
>> Nevertheless, using -np in addition to the Tight Integration gives a taste 
>> of a kind of half-tight integration in some way. And would not work for us 
>> because "--bind-to none" can't be used in such a command (see below) and 
>> throws an error.
>> 
>> 
>>> Then, the openmpi automatically reduces the logical slot count to 10
>>> by dividing real slot count 10N by binding width of N.
>>> 
>>> I don't know why you want to use pe=N without binding, but unfortunately
>>> the openmpi allocates successive cores to each process so far when you
>>> use pe option - it forcibly bind_to core.
>> 
>> In a shared cluster with many users and different MPI libraries in use, only 
>> the queuingsystem could know which job got which cores granted. This avoids 
>> any oversubscription of cores, while others are idle.
> 
> FWIW: we detect the exterior binding constraint and work within it

Aha, this is quite interesting - how do you do this: by scanning 
/proc/<pid>/status or the like? What happens if you don't find enough free cores 
because they are already used up by other applications?

-- Reuti


>> -- Reuti
>> 
>> 
>>> Tetsuya
>>> 
>>> 
 Hi,
 
 Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
 
> Reuti and Oscar,
> 
> I'm a Torque user and I myself have never used SGE, so I hesitated to
>>> join
> the discussion.
> 
> From my experience with the Torque, the openmpi 1.8 series has already
> resolved the issue you pointed out in combining MPI with OpenMP.
> 
> Please try to add --map-by slot:pe=8 option, if you want to use 8
>>> threads.
> Then, then openmpi 1.8 should allocate processes properly without any
>>> modification
> of the hostfile provided by the Torque.
> 
> In your case(8 threads and 10 procs):
> 
> # you have to request 80 slots using SGE command before mpirun
> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
 
 Thx for pointing me to this option, for now I can't get it working though
>>> (in fact, I want to use it without binding essentially). This allows to
>>> tell Open MPI to bind more cores to each of the MPI
 processes - ok, but does it lower the slot count granted by Torque too? I
>>> mean, was your submission command like:
 
 $ qsub -l nodes=10:ppn=8 ...
 
 so that Torque knows, that it should grant and remember this slot count
>>> of a total of 80 for the correct accounting?
 
 -- Reuti
 
 
> where you can omit --bind-to option because --bind-to core is assumed
> as default when pe=N is provided by the user.
> Regards,
> Tetsuya
> 
>> Hi,
>> 
>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>> 
>>> I discovered what was the error. I forgot include the '-fopenmp' when
>>> I compiled the objects in the Makefile, so the program worked but it didn't
>>> divide the job
> in threads. Now the program is working and I can use until 15 cores for
>>> machine in the queue one.q.
>>> 
>>> Anyway i would like to try implement your advice. Well I'm not alone
>>> in the cluster so i must implement your second suggestion. The steps are
>>> 
>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>> 
>> The number of slots defined in your used one.q was also increased to 8
>>> (`qconf -sq one.q`)?
>> 
>> 
>>> b) Set '#$ -pe orte 80' in the script
>> 
>> Fine.
>> 
>> 
>>> c) I'm not sure how to do this step. I'd appreciate your help here. I
>>> can add some lines to the script to determine the PE_HOSTFILE path and
>>> contents, but i
> don't know how alter it
>> 
>> For now you can put in your jobscript (just after OMP_NUM_THREAD is
>>> exported):
>> 
>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_thre

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-20 Thread rf
> "Mike" == Mike Dubman  writes:

Mike> so, it seems you have old ofed w/o this parameter.  Can you
Mike> install latest Mellanox ofed? or check which community ofed
Mike> has it?

Rio is using the kernel.org drivers that are part of Ubuntu/3.13.x, and
log_num_mtt is not a parameter in those drivers. In fact log_num_mtt
has never been a parameter in the kernel.org sources (I just checked the
git commit history). And it's not needed anymore either, since the
following commit (which is also part of OFED 3.12, btw; Mike, it seems
Mellanox OFED is behind in this respect):
---
commit db5a7a65c05867cb6ff5cb6d556a0edfce631d2d
Author: Roland Dreier 
List-Post: users@lists.open-mpi.org
Date:   Mon Mar 5 10:05:28 2012 -0800

mlx4_core: Scale size of MTT table with system RAM

The current driver defaults to 1M MTT segments, where each segment holds
8 MTT entries.  This limits the total memory registered to 8M * PAGE_SIZE
which is 32GB with 4K pages.  Since systems that have much more memory
are pretty common now (at least among systems with InfiniBand hardware),
this limit ends up getting hit in practice quite a bit.

Handle this by having the driver allocate at least enough MTT entries to
cover 2 * totalram pages.

Signed-off-by: Roland Dreier 
---

The relevant code segment (drivers/net/ethernet/mellanox/mlx4/profile.c):

---
/*
 * We want to scale the number of MTTs with the size of the
 * system memory, since it makes sense to register a lot of
 * memory on a system with a lot of memory.  As a heuristic,
 * make sure we have enough MTTs to cover twice the system
 * memory (with PAGE_SIZE entries).
 *
 * This number has to be a power of two and fit into 32 bits
 * due to device limitations, so cap this at 2^31 as well.
 * That limits us to 8TB of memory registration per HCA with
 * 4KB pages, which is probably OK for the next few months.
 */
si_meminfo(&si);
request->num_mtt =
        roundup_pow_of_two(max_t(unsigned, request->num_mtt,
                                 min(1UL << (31 - log_mtts_per_seg),
                                     si.totalram >> (log_mtts_per_seg - 1))));
---
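
To make the heuristic concrete (numbers for a hypothetical machine, not Rio's):
with 4 KB pages and the default log_mtts_per_seg of 3 (8 MTT entries per
segment), a host with 64 GB of RAM has si.totalram = 2^24 pages. The scaling
term is then totalram >> (3 - 1) = 2^22 segments, well below the
1UL << (31 - 3) = 2^28 cap, so the driver ends up with 2^22 segments, i.e.
2^25 MTT entries, enough to register 2^25 * 4 KB = 128 GB - twice the system
memory, exactly as the comment says.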

So the point here is that Open MPI should check the mlx4 driver version
and not output false warnings when newer drivers are used. I didn't check
whether this is fixed in the Open MPI code repositories yet. It's not
fixed in 1.8.2rc4 anyway (static uint64_t calculate_max_reg in
ompi/mca/btl/openib/btl_openib.c). Also, the Open MPI FAQ should be
corrected accordingly.

Rio as a note for you: You can safely ignore the warning.

Cheers,

Roland

---
http://www.q-leap.com / http://qlustar.com
  --- HPC / Storage / Cloud Linux Cluster OS ---

Mike> On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota
Mike>  wrote:

>> Here is what "modinfo mlx4_core" gives
>>
>> filename:
>> 
/lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
>> version: 2.2-1 license: Dual BSD/GPL description: Mellanox
>> ConnectX HCA low-level driver author: Roland Dreier srcversion:
>> 3AE29A0A6538EBBE9227361 alias:
>> pci:v15B3d1010sv*sd*bc*sc*i* alias:
>> pci:v15B3d100Fsv*sd*bc*sc*i* alias:
>> pci:v15B3d100Esv*sd*bc*sc*i* alias:
>> pci:v15B3d100Dsv*sd*bc*sc*i* alias:
>> pci:v15B3d100Csv*sd*bc*sc*i* alias:
>> pci:v15B3d100Bsv*sd*bc*sc*i* alias:
>> pci:v15B3d100Asv*sd*bc*sc*i* alias:
>> pci:v15B3d1009sv*sd*bc*sc*i* alias:
>> pci:v15B3d1008sv*sd*bc*sc*i* alias:
>> pci:v15B3d1007sv*sd*bc*sc*i* alias:
>> pci:v15B3d1006sv*sd*bc*sc*i* alias:
>> pci:v15B3d1005sv*sd*bc*sc*i* alias:
>> pci:v15B3d1004sv*sd*bc*sc*i* alias:
>> pci:v15B3d1003sv*sd*bc*sc*i* alias:
>> pci:v15B3d1002sv*sd*bc*sc*i* alias:
>> pci:v15B3d676Esv*sd*bc*sc*i* alias:
>> pci:v15B3d6746sv*sd*bc*sc*i* alias:
>> pci:v15B3d6764sv*sd*bc*sc*i* alias:
>> pci:v15B3d675Asv*sd*bc*sc*i* alias:
>> pci:v15B3d6372sv*sd*bc*sc*i* alias:
>> pci:v15B3d6750sv*sd*bc*sc*i* alias:
>> pci:v15B3d6368sv*sd*bc*sc*i* alias:
>> pci:v15B3d673Csv*sd*bc*sc*i* alias:
>> pci:v15B3d6732sv*sd*bc*sc*i* alias:
>> pci:v15B3d6354sv*sd*bc*sc*i* alias:
>> pci:v15B3d634Asv*sd*bc*sc*i* alias:
>> pci:v15B3d6340sv*sd*bc*sc*i* depends: intree: Y vermagic:
>> 3.13.0-34-generic SMP mod_unload modversions signer: Magrathea:
>> Glacier signing key sig_key:
>

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-20 Thread Rio Yokota
Dear Roland,

Thank you so much. This was very helpful.

Best,
Rio

>> "Mike" == Mike Dubman  writes:
> 
>Mike> so, it seems you have old ofed w/o this parameter.  Can you
>Mike> install latest Mellanox ofed? or check which community ofed
>Mike> has it?
> 
> Rio is using the kernel.org drivers that are part of Ubuntu/3.13.x and
> log_num_mtt is not a parameter in those drivers. In fact log_num_mtt
> has never been a parameter in the kernel.org sources (just checked the
> git commit history). And it's not needed anymore either, since the
> following commit (which is also part of OFED 3.12 btw; Mike, seems
> Mellanox OFED is behind with this respect):
> ---
> commit db5a7a65c05867cb6ff5cb6d556a0edfce631d2d
> Author: Roland Dreier 
> Date:   Mon Mar 5 10:05:28 2012 -0800
> 
>mlx4_core: Scale size of MTT table with system RAM
> 
>The current driver defaults to 1M MTT segments, where each segment holds
>8 MTT entries.  This limits the total memory registered to 8M * PAGE_SIZE
>which is 32GB with 4K pages.  Since systems that have much more memory
>are pretty common now (at least among systems with InfiniBand hardware),
>this limit ends up getting hit in practice quite a bit.
> 
>Handle this by having the driver allocate at least enough MTT entries to
>cover 2 * totalram pages.
> 
>Signed-off-by: Roland Dreier 
> ---
> 
> The relevant code segment (drivers/net/ethernet/mellanox/mlx4/profile.c):
> 
> ---
>/*
> * We want to scale the number of MTTs with the size of the
> * system memory, since it makes sense to register a lot of
> * memory on a system with a lot of memory.  As a heuristic,
> * make sure we have enough MTTs to cover twice the system
> * memory (with PAGE_SIZE entries).
> *
> * This number has to be a power of two and fit into 32 bits
> * due to device limitations, so cap this at 2^31 as well.
> * That limits us to 8TB of memory registration per HCA with
> * 4KB pages, which is probably OK for the next few months.
> */
>si_meminfo(&si);
>request->num_mtt =
>roundup_pow_of_two(max_t(unsigned, request->num_mtt,
> min(1UL << (31 - log_mtts_per_seg),
> si.totalram >> (log_mtts_per_seg 
> - 1;
> ---
> 
> So the point here is that OpenMPI should check the mlx4 driver versions
> and not output false warnings when newer drivers are used. Didn't check
> whether this is fixed in the OpenMPI code repositories yet. It's not
> fixed in 1.8.2rc4 anyway (static uint64_t calculate_max_reg in
> ompi/mca/btl/openib/btl_openib.c). Also, the OpenMPI FAQ should be
> corrected accordingly.
> 
> Rio as a note for you: You can safely ignore the warning.
> 
> Cheers,
> 
> Roland
> 
> ---
> http://www.q-leap.com / http://qlustar.com
>  --- HPC / Storage / Cloud Linux Cluster OS ---
> 
>Mike> On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota
>Mike>  wrote:
> 
>>> Here is what "modinfo mlx4_core" gives
>>> 
>>> filename:
>>> /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
>>> version: 2.2-1 license: Dual BSD/GPL description: Mellanox
>>> ConnectX HCA low-level driver author: Roland Dreier srcversion:
>>> 3AE29A0A6538EBBE9227361 alias:
>>> pci:v15B3d1010sv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Fsv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Esv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Dsv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Csv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Bsv*sd*bc*sc*i* alias:
>>> pci:v15B3d100Asv*sd*bc*sc*i* alias:
>>> pci:v15B3d1009sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1008sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1007sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1006sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1005sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1004sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1003sv*sd*bc*sc*i* alias:
>>> pci:v15B3d1002sv*sd*bc*sc*i* alias:
>>> pci:v15B3d676Esv*sd*bc*sc*i* alias:
>>> pci:v15B3d6746sv*sd*bc*sc*i* alias:
>>> pci:v15B3d6764sv*sd*bc*sc*i* alias:
>>> pci:v15B3d675Asv*sd*bc*sc*i* alias:
>>> pci:v15B3d6372sv*sd*bc*sc*i* alias:
>>> pci:v15B3d6750sv*sd*bc*sc*i* alias:
>>> pci:v15B3d6368sv*sd*bc*sc*i* alias:
>>> pci:v15B3d673Csv*sd*bc*sc*i* alias:
>>> pci:v15B3d6732sv*sd*bc*sc*i* alias:
>>> pci:v15B3d6354sv*sd*bc*sc*i* alias:
>>> pci:v15B3d634Asv*sd*bc*sc*i* alias:
>>> pci:v15B3d6340sv*sd*bc*sc*i* depends: intree: Y vermagic:
>>> 3.13.0-34-generic SMP mod_unload modversions signer: Magrathea:

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain

On Aug 20, 2014, at 9:04 AM, Reuti  wrote:

> Am 20.08.2014 um 16:26 schrieb Ralph Castain:
> 
>> On Aug 20, 2014, at 6:58 AM, Reuti  wrote:
>> 
>>> Hi,
>>> 
>>> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:
>>> 
 Reuti,
 
 If you want to allocate 10 procs with N threads, the Torque
 script below should work for you:
 
 qsub -l nodes=10:ppn=N
 mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>>> 
>>> I played around with giving -np 10 in addition to a Tight Integration. The 
>>> slot count is not really divided I think, but only 10 out of the granted 
>>> maximum is used (while on each of the listed machines an `orted` is 
>>> started). Due to the fixed allocation this is of course the result we want 
>>> to achieve as it subtracts bunches of 8 from the given list of machines 
>>> resp. slots. In SGE it's sufficient to use and AFAICS it works (without 
>>> touching the $PE_HOSTFILE any longer):
>>> 
>>> ===
>>> export OMP_NUM_THREADS=8
>>> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / 
>>> $OMP_NUM_THREADS") ./inverse.exe
>>> ===
>>> 
>>> and submit with:
>>> 
>>> $ qsub -pe orte 80 job.sh
>>> 
>>> as the variables are distributed to the slave nodes by SGE already.
>>> 
>>> Nevertheless, using -np in addition to the Tight Integration gives a taste 
>>> of a kind of half-tight integration in some way. And would not work for us 
>>> because "--bind-to none" can't be used in such a command (see below) and 
>>> throws an error.
>>> 
>>> 
 Then, the openmpi automatically reduces the logical slot count to 10
 by dividing real slot count 10N by binding width of N.
 
 I don't know why you want to use pe=N without binding, but unfortunately
 the openmpi allocates successive cores to each process so far when you
 use pe option - it forcibly bind_to core.
>>> 
>>> In a shared cluster with many users and different MPI libraries in use, 
>>> only the queuingsystem could know which job got which cores granted. This 
>>> avoids any oversubscription of cores, while others are idle.
>> 
>> FWIW: we detect the exterior binding constraint and work within it
> 
> Aha, this is quite interesting - how do you do this: scanning the 
> /proc//status or alike? What happens if you don't find enough free cores 
> as they are used up by other applications already?
> 

Remember, when you use mpirun to launch, we launch our own daemons using the 
native launcher (e.g., qsub). So the external RM will bind our daemons to the 
specified cores on each node. We use hwloc to determine what cores our daemons 
are bound to, and then bind our own child processes to cores within that range.

If the cores we are bound to are the same on each node, then we will do this 
with no further instruction. However, if the cores are different on the 
individual nodes, then you need to add --hetero-nodes to your command line (as 
the nodes appear to be heterogeneous to us).

So it is up to the RM to set the constraint - we just live within it.
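
For the curious, a minimal standalone hwloc program (this is only an
illustration of the mechanism described above, not Open MPI's actual code)
that reports the cpuset an external launcher has bound the current process to:

---
/* Build with something like:  cc report_binding.c -o report_binding -lhwloc */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Ask the OS which PUs this process is allowed to run on,
     * i.e. the binding imposed from outside (RM, taskset, etc.). */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_asprintf(&str, set);
        printf("bound to cpuset %s\n", str);
        free(str);
    } else {
        perror("hwloc_get_cpubind");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}
---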


> -- Reuti
> 
> 
>>> -- Reuti
>>> 
>>> 
 Tetsuya
 
 
> Hi,
> 
> Am 20.08.2014 um 06:26 schrieb Tetsuya Mishima:
> 
>> Reuti and Oscar,
>> 
>> I'm a Torque user and I myself have never used SGE, so I hesitated to
 join
>> the discussion.
>> 
>> From my experience with the Torque, the openmpi 1.8 series has already
>> resolved the issue you pointed out in combining MPI with OpenMP.
>> 
>> Please try to add --map-by slot:pe=8 option, if you want to use 8
 threads.
>> Then, then openmpi 1.8 should allocate processes properly without any
 modification
>> of the hostfile provided by the Torque.
>> 
>> In your case(8 threads and 10 procs):
>> 
>> # you have to request 80 slots using SGE command before mpirun
>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
> 
> Thx for pointing me to this option, for now I can't get it working though
 (in fact, I want to use it without binding essentially). This allows to
 tell Open MPI to bind more cores to each of the MPI
> processes - ok, but does it lower the slot count granted by Torque too? I
 mean, was your submission command like:
> 
> $ qsub -l nodes=10:ppn=8 ...
> 
> so that Torque knows, that it should grant and remember this slot count
 of a total of 80 for the correct accounting?
> 
> -- Reuti
> 
> 
>> where you can omit --bind-to option because --bind-to core is assumed
>> as default when pe=N is provided by the user.
>> Regards,
>> Tetsuya
>> 
>>> Hi,
>>> 
>>> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
>>> 
 I discovered what was the error. I forgot include the '-fopenmp' when
 I compiled the objects in the Makefile, so the program worked but it didn't
 divide the job
>> in threads. Now the program is working and I can use until 15 cores for
 machine in 

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Ralph Castain
It was not yet fixed - but should be now.

On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:

> Hello!
> 
> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have 
> the problem
> 
> a)
> $ mpirun  -np 1 ./hello_c
> 
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
> 
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
> 
> c)
> 
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 
> -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
> 
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
> priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
> priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
> priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob components
> [compiler-2:14673] mca: base: components_register: found loaded component tcp
> [compiler-2:14673] mca: base: components_register: component tcp register 
> function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function 
> successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: component_available called
> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
> of V4 connections
> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] TCP STARTUP
> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
> [compiler-2:14673] mca:oob:select: Adding component to end
> [compiler-2:14673] mca:oob:select: Found 1 active transports
> [compiler-2:14673] mca: base: components_register: registering rml components
> [compiler-2:14673] mca: base: components_register: found loaded component oob
> [compiler-2:14673] mca: base: components_register: component oob has no 
> register or open function
> [compiler-2:14673] mca: base: components_open: opening rml components
> [compiler-2:14673] mca: base: components_open: found loaded component oob
> [compiler-2:14673] mca: base: components_open: component oob open function 
> successful
> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer 
> [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer 
> [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer 
> [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer 
> [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer 
> [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0]

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Mike Dubman
btw, we get the same error in the v1.8 branch as well.


On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:

> It was not yet fixed - but should be now.
>
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
>
> Hello!
>
> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have
> the problem
>
> a)
> $ mpirun  -np 1 ./hello_c
>
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> c)
>
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca
> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1
> ./hello_c
>
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated]
> set priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set
> priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set
> priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> tcp
> [compiler-2:14673] mca: base: components_register: component tcp register
> function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function
> successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: component_available called
> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our
> list of V4 connections
> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] TCP STARTUP
> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
> [compiler-2:14673] mca:oob:select: Adding component to end
> [compiler-2:14673] mca:oob:select: Found 1 active transports
> [compiler-2:14673] mca: base: components_register: registering rml
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> oob
> [compiler-2:14673] mca: base: components_register: component oob has no
> register or open function
> [compiler-2:14673] mca: base: components_open: opening rml components
> [compiler-2:14673] mca: base: components_open: found loaded component oob
> [compiler-2:14673] mca: base: components_open: component oob open function
> successful
> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer
> [[WIL

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Ralph Castain
Yes, I know - it is CMR'd.

On Aug 20, 2014, at 10:26 AM, Mike Dubman  wrote:

> btw, we get same error in v1.8 branch as well.
> 
> 
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
> It was not yet fixed - but should be now.
> 
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
> 
>> Hello!
>> 
>> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have 
>> the problem
>> 
>> a)
>> $ mpirun  -np 1 ./hello_c
>> 
>> --
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --
>> 
>> b)
>> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>> --
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --
>> 
>> c)
>> 
>> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 
>> 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>> 
>> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
>> priority to 0
>> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>> priority to 10
>> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
>> priority to 75
>> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>> [compiler-2:14673] mca: base: components_register: registering oob components
>> [compiler-2:14673] mca: base: components_register: found loaded component tcp
>> [compiler-2:14673] mca: base: components_register: component tcp register 
>> function successful
>> [compiler-2:14673] mca: base: components_open: opening oob components
>> [compiler-2:14673] mca: base: components_open: found loaded component tcp
>> [compiler-2:14673] mca: base: components_open: component tcp open function 
>> successful
>> [compiler-2:14673] mca:oob:select: checking available component tcp
>> [compiler-2:14673] mca:oob:select: Querying component [tcp]
>> [compiler-2:14673] oob:tcp: component_available called
>> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>> of V4 connections
>> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>> [compiler-2:14673] [[49095,0],0] TCP STARTUP
>> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>> [compiler-2:14673] mca:oob:select: Adding component to end
>> [compiler-2:14673] mca:oob:select: Found 1 active transports
>> [compiler-2:14673] mca: base: components_register: registering rml components
>> [compiler-2:14673] mca: base: components_register: found loaded component oob
>> [compiler-2:14673] mca: base: components_register: component oob has no 
>> register or open function
>> [compiler-2:14673] mca: base: components_open: opening rml components
>> [compiler-2:14673] mca: base: components_open: found loaded component oob
>> [compiler-2:14673] mca: base: components_open: component oob open function 
>> successful
>> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persist

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Oscar Mojica
Hi

Well, with qconf -sq one.q I got the following:

[oscar@aguia free-noise]$ qconf -sq one.q
qname                 one.q
hostlist              compute-1-30.local compute-1-2.local compute-1-3.local \
                      compute-1-4.local compute-1-5.local compute-1-6.local \
                      compute-1-7.local compute-1-8.local compute-1-9.local \
                      compute-1-10.local compute-1-11.local compute-1-12.local \
                      compute-1-13.local compute-1-14.local compute-1-15.local
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpi orte
rerun                 FALSE
slots                 1,[compute-1-30.local=1],[compute-1-2.local=1], \
                      [compute-1-3.local=1],[compute-1-5.local=1], \
                      [compute-1-8.local=1],[compute-1-6.local=1], \
                      [compute-1-4.local=1],[compute-1-9.local=1], \
                      [compute-1-11.local=1],[compute-1-7.local=1], \
                      [compute-1-13.local=1],[compute-1-10.local=1], \
                      [compute-1-15.local=1],[compute-1-12.local=1], \
                      [compute-1-14.local=1]

the admin is the one who created this queue, so I have to speak to him about changing the 
number of slots to the number of threads that I wish to use.
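
(For reference, a hedged sketch of what that change could look like on the admin side - 
assuming SGE's `qconf -mq one.q` editor is used and that 8 slots per host are wanted; 
the hostnames are just the ones from the listing above:)
===
# opens the queue definition in $EDITOR; then raise the per-host slot count, e.g.
qconf -mq one.q
#   slots  8,[compute-1-30.local=8],[compute-1-2.local=8],[compute-1-3.local=8], ...
===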

Then I could make use of: 
===
export OMP_NUM_THREADS=N 
mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===
 
For now, in my case, this command line would just launch 10 processes and the 
work wouldn't be divided into threads, is that right?

Can I set a maximum number of threads in the queue one.q (e.g. 15) and change 
the number in the 'export' as convenient?
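
(For illustration only, a hedged sketch of how the whole jobscript could look once 
the slot counts are raised - the PE name 'orte', the 80-slot request and the binary 
name are taken from the earlier posts; everything else is an assumption:)
===
#!/bin/bash
#$ -pe orte 80                 # total slots = MPI procs x threads per proc
#$ -cwd
export OMP_NUM_THREADS=8
# 80 slots / 8 threads = 10 MPI processes, each bound to 8 cores via pe=8
mpirun -map-by slot:pe=$OMP_NUM_THREADS \
       -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===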

I feel like a child hearing the adults speaking
Thanks I'm learning a lot   
  

Oscar Fabian Mojica Ladino
Geologist M.S. in  Geophysics


> From: re...@staff.uni-marburg.de
> Date: Tue, 19 Aug 2014 19:51:46 +0200
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> 
> Hi,
> 
> Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
> 
> > I discovered what was the error. I forgot include the '-fopenmp' when I 
> > compiled the objects in the Makefile, so the program worked but it didn't 
> > divide the job in threads. Now the program is working and I can use until 
> > 15 cores for machine in the queue one.q.
> > 
> > Anyway i would like to try implement your advice. Well I'm not alone in the 
> > cluster so i must implement your second suggestion. The steps are
> > 
> > a) Use '$ qconf -mp orte' to change the allocation rule to 8
> 
> The number of slots defined in your used one.q was also increased to 8 
> (`qconf -sq one.q`)?
> 
> 
> > b) Set '#$ -pe orte 80' in the script
> 
> Fine.
> 
> 
> > c) I'm not sure how to do this step. I'd appreciate your help here. I can 
> > add some lines to the script to determine the PE_HOSTFILE path and 
> > contents, but i don't know how alter it 
> 
> For now you can put in your jobscript (just after OMP_NUM_THREAD is exported):
> 
> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' 
> $PE_HOSTFILE > $TMPDIR/machines
> export PE_HOSTFILE=$TMPDIR/machines
> 
> =
> 
> Unfortunately noone stepped into this discussion, as in my opinion it's a 
> much broader issue which targets all users who want to combine MPI with 
> OpenMP. The queuingsystem should get a proper request for the overall amount 
> of slots the user needs. For now this will be forwarded to Open MPI and it 
> will use this information to start the appropriate number of processes (which 
> was an achievement for the Tight Integration out-of-the-box of course) and 
> ignores any setting of OMP_NUM_THREADS. So, where should the generated list 
> of machines be adjusted; there are several options:
> 
> a) The PE of the queuingsystem should do it:
> 
> + a one time setup for the admin
> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
> - the "start_proc_args" would need to know the number of threads, i.e. 
> OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript 
> (tricky scanning of the submitted jobscript for OMP_NUM_THREADS would be too 
> nasty)
> - limits to use inside the jobscript calls to libraries behaving in the same 
> way as Open MPI only
> 
> 
> b) The particular queue should do it in a queue prolog:
> 
> same as a) I think
> 
> 
> c) The user should do it
> 
> + no change in the SGE installation
> - each and every user must include it in all the jobscripts to adjust the 
> list and export the pointer to the $PE_HOSTFILE, but he could change it forth 
> and back for different steps of the jobscript though
> 
> 
> d) Open MPI should do 

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Reuti
On 20.08.2014 at 19:05, Ralph Castain wrote:

>> 
>> Aha, this is quite interesting - how do you do this: scanning the 
>> /proc//status or alike? What happens if you don't find enough free 
>> cores as they are used up by other applications already?
>> 
> 
> Remember, when you use mpirun to launch, we launch our own daemons using the 
> native launcher (e.g., qsub). So the external RM will bind our daemons to the 
> specified cores on each node. We use hwloc to determine what cores our 
> daemons are bound to, and then bind our own child processes to cores within 
> that range.

Thx for reminding me of this. Indeed, I mixed up two different aspects in this 
discussion.

a) What will happen in case no binding was done by the RM (hence Open MPI could 
use all cores) and two Open MPI jobs (or something completely different besides 
one Open MPI job) are running on the same node (due to the Tight Integration 
with two different Open MPI directories in /tmp and two `orted`, unique for 
each job)? Will the second Open MPI job know what the first Open MPI job used 
up already? Or will both use the same set of cores, since "-bind-to none" can't 
be set in the given `mpiexec` command because "-map-by slot:pe=$OMP_NUM_THREADS" 
was used - which makes "-bind-to core" mandatory and can't be switched off? I 
see the same cores being used for both jobs.

Altering the machinefile instead: the processes are not bound to any core, and 
the OS takes care of a proper assignment.


> If the cores we are bound to are the same on each node, then we will do this 
> with no further instruction. However, if the cores are different on the 
> individual nodes, then you need to add --hetero-nodes to your command line 
> (as the nodes appear to be heterogeneous to us).

b) Aha, so it's not only about different CPU types, but also about the same CPU type 
with different core allocations between the nodes? It's not in the `mpiexec` man-page 
of 1.8.1 though. I'll have a look at it.


> So it is up to the RM to set the constraint - we just live within it.

Fine.

-- Reuti

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain

On Aug 20, 2014, at 11:16 AM, Reuti  wrote:

> Am 20.08.2014 um 19:05 schrieb Ralph Castain:
> 
>>> 
>>> Aha, this is quite interesting - how do you do this: scanning the 
>>> /proc//status or alike? What happens if you don't find enough free 
>>> cores as they are used up by other applications already?
>>> 
>> 
>> Remember, when you use mpirun to launch, we launch our own daemons using the 
>> native launcher (e.g., qsub). So the external RM will bind our daemons to 
>> the specified cores on each node. We use hwloc to determine what cores our 
>> daemons are bound to, and then bind our own child processes to cores within 
>> that range.
> 
> Thx for reminding me of this. Indeed, I mixed up two different aspects in 
> this discussion.
> 
> a) What will happen in case no binding was done by the RM (hence Open MPI 
> could use all cores) and two Open MPI jobs (or something completely different 
> besides one Open MPI job) are running on the same node (due to the Tight 
> Integration with two different Open MPI directories in /tmp and two `orted`, 
> unique for each job)? Will the second Open MPI job know what the first Open 
> MPI job used up already? Or will both use the same set of cores as "-bind-to 
> none" can't be set in the given `mpiexec` command because of "-map-by 
> slot:pe=$OMP_NUM_THREADS" was used - which triggers "-bind-to core" 
> indispensable and can't be switched off? I see the same cores being used for 
> both jobs.

Yeah, each mpirun executes completely independently of the other, so they have no 
idea what the other is doing. So the cores will be overloaded. A multi-pe mapping 
requires bind-to-core, otherwise there is no way to implement the request.
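
(A hedged way to verify what actually happened is to add --report-bindings to each 
mpirun invocation - the option exists in the 1.8 series and prints to stderr the core 
set every rank was bound to on each node, so two jobs landing on the same cores become 
visible:)
===
mpirun --report-bindings -map-by slot:pe=$OMP_NUM_THREADS \
       -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===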

> 
> Altering the machinefile instead: the processes are not bound to any core, 
> and the OS takes care of a proper assignment.
> 
> 
>> If the cores we are bound to are the same on each node, then we will do this 
>> with no further instruction. However, if the cores are different on the 
>> individual nodes, then you need to add --hetero-nodes to your command line 
>> (as the nodes appear to be heterogeneous to us).
> 
> b) Aha, it's not about different type CPU types, but also same CPU type but 
> different allocations between the nodes? It's not in the `mpiexec` man-page 
> of 1.8.1 though. I'll have a look at it.

The man page is probably a little out-of-date in this area - but yes, 
--hetero-nodes is required for *any* difference in the way the nodes appear to 
us (cpus, slot assignments, etc.). The 1.9 series may remove that requirement - 
still looking at it.
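
(As a hedged example, the command line discussed earlier in the thread would then 
simply gain the extra flag:)
===
mpirun --hetero-nodes -map-by slot:pe=$OMP_NUM_THREADS \
       -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
===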

> 
> 
>> So it is up to the RM to set the constraint - we just live within it.
> 
> Fine.
> 
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25097.php



[OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-20 Thread Filippo Spiga
Dear Open MPI experts,

I have a problem that is related to the integration of OpenMPI, slurm and PMI 
interface. I spent some time today with a colleague of mine trying to figure 
out why we were not able to obtain all H5 profile files (generated by 
acct_gather_profile) using Open MPI. When I say "all" I mean if I run using 8 
nodes (e.g. tesla[121-128]) then I always systematically miss the file related 
to the first one (the first node in the allocation list, in this case tesla121).

By comparing which processes are spawned on the compute nodes, I discovered that 
mpirun running on tesla121 calls srun only to spawn new MPI processes remotely 
on the other 7 nodes (maybe this is obvious, for me it was not)...

fs395  617  0.0  0.0 106200  1504 ?S22:41   0:00 /bin/bash 
/var/spool/slurm-test/slurmd/job390044/slurm_script
fs395  629  0.1  0.0 194552  5288 ?Sl   22:41   0:00  \_ mpirun 
-bind-to socket --map-by ppr:1:socket --host 
tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes
fs395  632  0.0  0.0 659740  9148 ?Sl   22:41   0:00  |   \_ srun 
--ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 
--nodelist=tesla122,tesla123,tesla1
fs395  633  0.0  0.0  55544  1072 ?S22:41   0:00  |   |   \_ 
srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 
--nodelist=tesla122,tesla123,te
fs395  651  0.0  0.0 106072  1392 ?S22:41   0:00  |   \_ 
/bin/bash ./run_linpack ./xhpl
fs395  654  295 35.5 120113412 23289280 ?  RLl  22:41   3:12  |   |   \_ 
./xhpl
fs395  652  0.0  0.0 106072  1396 ?S22:41   0:00  |   \_ 
/bin/bash ./run_linpack ./xhpl
fs395  656  307 35.5 120070332 23267728 ?  RLl  22:41   3:19  |   \_ 
./xhpl


The "xhpl" processes allocated on the first node of a job are not called by 
srun and because of this the SLURM profile plugin is not activated on the 
node!!! As result I always miss the first node profile information. Intel MPI 
does not have this behavior, mpiexec.hydra uses srun on the first node. 

I got to the conclusion that SLURM is configured properly, something is wrong 
in the way I lunch Open MPI using mpirun. If I disable SLURM support and I 
revert back to rsh (--mca plm rsh) everything work but there is not profiling 
because the SLURM plug-in is not activated. During the configure step, Open MPI 
1.8.1 detects slurm and libmpi/libpmi2 correctly. Honestly, I would prefer to 
avoid to use srun as job luncher if possible...

Any suggestion to get this sorted out is really appreciated!

Best Regards,
Filippo

--
Mr. Filippo SPIGA, M.Sc.
http://filippospiga.info ~ skype: filippo.spiga

«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert

*
Disclaimer: "Please note this message and any attachments are CONFIDENTIAL and 
may be privileged or otherwise protected from disclosure. The contents are not 
to be disclosed to anyone other than the addressee. Unauthorized recipients are 
requested to preserve this confidentiality and to advise the sender immediately 
of any error in transmission."




Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-20 Thread Joshua Ladd
Hi, Filippo

When launching with mpirun in a SLURM environment, srun is only being used
to launch the ORTE daemons (orteds). Since a daemon will already exist
on the node from which you invoked mpirun, this node will not be included
in the list of nodes. SLURM's PMI library is not involved (that
functionality is only necessary if you directly launch your MPI application
with srun, in which case it is used to exchange wireup info amongst the
slurmds). This is the expected behavior.

~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294
/* if the daemon already exists on this node, then
 * don't include it
 */
if (node->daemon_launched) {
    continue;
}

Do you have a frontend node that you can launch from? What happens if you
set "-np X", where X = 8*ppn? The alternative is to do a direct launch of
the MPI application with srun.
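
(A hedged sketch of the two alternatives, assuming the dual-socket nodes from the ps 
listing, i.e. 2 ranks per node on 8 nodes, and assuming Open MPI was built with SLURM 
PMI support for the direct-launch case; the --mpi selection depends on how SLURM is 
configured:)
===
# alternative 1: keep mpirun but state the process count explicitly
mpirun -np 16 -bind-to socket --map-by ppr:1:socket ./run_linpack ./xhpl

# alternative 2: direct launch by srun, no mpirun/orted involved
srun --mpi=pmi2 --ntasks-per-node=2 ./run_linpack ./xhpl
===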


Best,

Josh



On Wed, Aug 20, 2014 at 6:48 PM, Filippo Spiga 
wrote:

> Dear Open MPI experts,
>
> I have a problem that is related to the integration of OpenMPI, slurm and
> PMI interface. I spent some time today with a colleague of mine trying to
> figure out why we were not able to obtain all H5 profile files (generated
> by acct_gather_profile) using Open MPI. When I say "all" I mean if I run
> using 8 nodes (e.g. tesla[121-128]) then I always systematically miss the
> file related to the first one (the first node in the allocation list, in
> this case tesla121).
>
> By comparing which processes are spawn on the compute nodes, I discovered
> that mpirun running on tesla121 calls srun only to spawn remotely new MPI
> processes to the other 7 nodes (maybe this is obvious, for me it was not)...
>
> fs395  617  0.0  0.0 106200  1504 ?S22:41   0:00 /bin/bash
> /var/spool/slurm-test/slurmd/job390044/slurm_script
> fs395  629  0.1  0.0 194552  5288 ?Sl   22:41   0:00  \_
> mpirun -bind-to socket --map-by ppr:1:socket --host
> tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes
> fs395  632  0.0  0.0 659740  9148 ?Sl   22:41   0:00  |   \_
> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7
> --nodelist=tesla122,tesla123,tesla1
> fs395  633  0.0  0.0  55544  1072 ?S22:41   0:00  |   |
> \_ srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7
> --nodelist=tesla122,tesla123,te
> fs395  651  0.0  0.0 106072  1392 ?S22:41   0:00  |   \_
> /bin/bash ./run_linpack ./xhpl
> fs395  654  295 35.5 120113412 23289280 ?  RLl  22:41   3:12  |   |
> \_ ./xhpl
> fs395  652  0.0  0.0 106072  1396 ?S22:41   0:00  |   \_
> /bin/bash ./run_linpack ./xhpl
> fs395  656  307 35.5 120070332 23267728 ?  RLl  22:41   3:19  |
> \_ ./xhpl
>
>
> The "xhpl" processes allocated on the first node of a job are not called
> by srun and because of this the SLURM profile plugin is not activated on
> the node!!! As result I always miss the first node profile information.
> Intel MPI does not have this behavior, mpiexec.hydra uses srun on the first
> node.
>
> I got to the conclusion that SLURM is configured properly, something is
> wrong in the way I lunch Open MPI using mpirun. If I disable SLURM support
> and I revert back to rsh (--mca plm rsh) everything work but there is not
> profiling because the SLURM plug-in is not activated. During the configure
> step, Open MPI 1.8.1 detects slurm and libmpi/libpmi2 correctly. Honestly,
> I would prefer to avoid to use srun as job luncher if possible...
>
> Any suggestion to get this sorted out is really appreciated!
>
> Best Regards,
> Filippo
>
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL
> and may be privileged or otherwise protected from disclosure. The contents
> are not to be disclosed to anyone other than the addressee. Unauthorized
> recipients are requested to preserve this confidentiality and to advise the
> sender immediately of any error in transmission."
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25099.php
>


Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-20 Thread Ralph Castain
Or you can add 

   -nolocal|--nolocal      Do not run any MPI applications on the local node

to your mpirun command line and we won't run any application procs on the node 
where mpirun is executing
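
(A hedged sketch against the command line visible in the ps output above - with this 
flag the node running mpirun gets no application ranks at all, so you would normally 
invoke it from a node you do not want ranks on; the node list follows the tesla[121-128] 
allocation mentioned earlier:)
===
mpirun -nolocal -bind-to socket --map-by ppr:1:socket \
       --host tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tesla127,tesla128 \
       ./run_linpack ./xhpl
===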


On Aug 20, 2014, at 4:28 PM, Joshua Ladd  wrote:

> Hi, Filippo
> 
> When launching with mpirun in a SLURM environment, srun is only being used to 
> launch the ORTE daemons (orteds.)  Since the daemon will already exist on the 
> node from which you invoked mpirun, this node will not be included in the 
> list of nodes. SLURM's PMI library is not involved (that functionality is 
> only necessary if you directly launch your MPI application with srun, in 
> which case it is used to exchanged wireup info amongst slurmds.) This is the 
> expected behavior. 
> 
> ~/ompi-top-level/orte/mca/plm/plm_slurm_module.c +294
> /* if the daemon already exists on this node, then
>  * don't include it
>  */
> if (node->daemon_launched) {
> continue;
> }
> 
> Do you have a frontend node that you can launch from? What happens if you set 
> "-np X" where X = 8*ppn. The alternative is to do a direct launch of the MPI 
> application with srun.
> 
> 
> Best,
> 
> Josh
> 
> 
> 
> On Wed, Aug 20, 2014 at 6:48 PM, Filippo Spiga  
> wrote:
> Dear Open MPI experts,
> 
> I have a problem that is related to the integration of OpenMPI, slurm and PMI 
> interface. I spent some time today with a colleague of mine trying to figure 
> out why we were not able to obtain all H5 profile files (generated by 
> acct_gather_profile) using Open MPI. When I say "all" I mean if I run using 8 
> nodes (e.g. tesla[121-128]) then I always systematically miss the file 
> related to the first one (the first node in the allocation list, in this case 
> tesla121).
> 
> By comparing which processes are spawn on the compute nodes, I discovered 
> that mpirun running on tesla121 calls srun only to spawn remotely new MPI 
> processes to the other 7 nodes (maybe this is obvious, for me it was not)...
> 
> fs395  617  0.0  0.0 106200  1504 ?S22:41   0:00 /bin/bash 
> /var/spool/slurm-test/slurmd/job390044/slurm_script
> fs395  629  0.1  0.0 194552  5288 ?Sl   22:41   0:00  \_ mpirun 
> -bind-to socket --map-by ppr:1:socket --host 
> tesla121,tesla122,tesla123,tesla124,tesla125,tesla126,tes
> fs395  632  0.0  0.0 659740  9148 ?Sl   22:41   0:00  |   \_ srun 
> --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 
> --nodelist=tesla122,tesla123,tesla1
> fs395  633  0.0  0.0  55544  1072 ?S22:41   0:00  |   |   \_ 
> srun --ntasks-per-node=1 --kill-on-bad-exit --cpu_bind=none --nodes=7 
> --nodelist=tesla122,tesla123,te
> fs395  651  0.0  0.0 106072  1392 ?S22:41   0:00  |   \_ 
> /bin/bash ./run_linpack ./xhpl
> fs395  654  295 35.5 120113412 23289280 ?  RLl  22:41   3:12  |   |   \_ 
> ./xhpl
> fs395  652  0.0  0.0 106072  1396 ?S22:41   0:00  |   \_ 
> /bin/bash ./run_linpack ./xhpl
> fs395  656  307 35.5 120070332 23267728 ?  RLl  22:41   3:19  |   \_ 
> ./xhpl
> 
> 
> The "xhpl" processes allocated on the first node of a job are not called by 
> srun and because of this the SLURM profile plugin is not activated on the 
> node!!! As result I always miss the first node profile information. Intel MPI 
> does not have this behavior, mpiexec.hydra uses srun on the first node. 
> 
> I got to the conclusion that SLURM is configured properly, something is wrong 
> in the way I lunch Open MPI using mpirun. If I disable SLURM support and I 
> revert back to rsh (--mca plm rsh) everything work but there is not profiling 
> because the SLURM plug-in is not activated. During the configure step, Open 
> MPI 1.8.1 detects slurm and libmpi/libpmi2 correctly. Honestly, I would 
> prefer to avoid to use srun as job luncher if possible...
> 
> Any suggestion to get this sorted out is really appreciated!
> 
> Best Regards,
> Filippo
> 
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
> 
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
> 
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL 
> and may be privileged or otherwise protected from disclosure. The contents 
> are not to be disclosed to anyone other than the addressee. Unauthorized 
> recipients are requested to preserve this confidentiality and to advise the 
> sender immediately of any error in transmission."
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/25099.php
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/user

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread tmishima
Reuti,

Sorry for confusing you. Under a managed condition, the -np option is
actually not necessary. So this command line also works for me
with Torque.

$ qsub -l nodes=10:ppn=N
$ mpirun -map-by slot:pe=N ./inverse.exe

At least, Ralph confirmed it worked with Slurm and I confirmed it
with Torque as shown below:

[mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
qsub: waiting for job 8798.manage.cluster to start
qsub: job 8798.manage.cluster ready

[mishima@node09 ~]$ cat $PBS_NODEFILE
node09
node09
node09
node09
node09
node09
node09
node09
node10
node10
node10
node10
node10
node10
node10
node10
node11
node11
node11
node11
node11
node11
node11
node11
node12
node12
node12
node12
node12
node12
node12
node12
[mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map
~/mis/openmpi/demos/myprog
 Data for JOB [8050,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 0

 Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 1

 Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 2

 Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [8050,1] App: 0 Process rank: 3

 =============================================================
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4
[mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map
~/mis/openmpi/demos/myprog
 Data for JOB [8056,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 0
        Process OMPI jobid: [8056,1] App: 0 Process rank: 1

 Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 3

 Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 4
        Process OMPI jobid: [8056,1] App: 0 Process rank: 5

 Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 2
        Process OMPI jobid: [8056,1] App: 0 Process rank: 6
        Process OMPI jobid: [8056,1] App: 0 Process rank: 7

 =============================================================
Hello world from process 1 of 8
Hello world from process 0 of 8
Hello world from process 2 of 8
Hello world from process 3 of 8
Hello world from process 4 of 8
Hello world from process 5 of 8
Hello world from process 6 of 8
Hello world from process 7 of 8

I don't know why it doesn't work with SGE. Could you show me
your output after adding the -display-map and -mca rmaps_base_verbose 5 options?

By the way, the option -map-by ppr:N:node or ppr:N:socket might be
useful for your purpose. The ppr mapping can reduce the slot counts given
by the RM without binding and allocates N procs per the specified resource.

[mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map
~/mis/openmpi/demos/myprog
 Data for JOB [7913,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 0

 Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 1

 Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 2

 Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
        Process OMPI jobid: [7913,1] App: 0 Process rank: 3

 =============================================================
Hello world from process 0 of 4
Hello world from process 2 of 4
Hello world from process 1 of 4
Hello world from process 3 of 4

Tetsuya


> Hi,
>
> Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp:
>
> > Reuti,
> >
> > If you want to allocate 10 procs with N threads, the Torque
> > script below should work for you:
> >
> > qsub -l nodes=10:ppn=N
> > mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>
> I played around with giving -np 10 in addition to a Tight Integration.
The slot count is not really divided I think, but only 10 out of the
granted maximum is used (while on each of the listed
> machines an `orted` is started). Due to the fixed allocation this is of
course the result we want to achieve as it subtracts bunches of 8 from the
given list of machines resp. slots. In SGE it's
> sufficient to use and AFAICS it works (without touching the $PE_HOSTFILE
any longer):
>
> ===
> export OMP_NUM_THREADS=8
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS /
$OMP_NUM_THREADS") ./inver

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread tmishima
Oscar,

As I mentioned before, I've never used SGE, so please ask Reuti for
advice. The only thing I can tell you is that you have to use the
Open MPI 1.8 series to use the -map-by slot:pe=N option.
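
(As a quick check before trying, the installed version can be confirmed with standard 
commands - a hedged hint only, the result depends on which MPI is first in your PATH:)
===
$ mpirun --version
$ ompi_info | grep "Open MPI:"
===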

Tetsuya


> Hi
>
> Well, with qconf -sq one.q I got the following:
>
> [oscar@aguia free-noise]$ qconf -sq one.q
> qname one.q
> hostlist compute-1-30.local compute-1-2.local
compute-1-3.local \
>   compute-1-4.local compute-1-5.local
compute-1-6.local \
>   compute-1-7.local compute-1-8.local
compute-1-9.local \
>   compute-1-10.local compute-1-11.local
compute-1-12.local \
>   compute-1-13.local compute-1-14.local
compute-1-15.local
> seq_no    0
> load_thresholds np_load_avg=1.75
> suspend_thresholds  NONE
> nsuspend  1
> suspend_interval    00:05:00
> priority    0
> min_cpu_interval    00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list   NONE
> pe_list     make mpich mpi orte
> rerun FALSE
> slots  1,[compute-1-30.local=1],[compute-1-2.local=1], \
>   [compute-1-3.local=1],[compute-1-5.local=1], \
>   [compute-1-8.local=1],[compute-1-6.local=1], \
>   [compute-1-4.local=1],[compute-1-9.local=1], \
>   [compute-1-11.local=1],[compute-1-7.local=1], \
>   [compute-1-13.local=1],[compute-1-10.local=1], \
>   [compute-1-15.local=1],[compute-1-12.local=1], \
>   [compute-1-14.local=1]
>
> the admin was who created this queue, so I have to speak to him to change
the number of slots to number of threads that i wish to use.
>
> Then I could make use of:
> ===
> export OMP_NUM_THREADS=N
> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS /
$OMP_NUM_THREADS") ./inverse.exe
> ===
>
> For now in my case this command line just would work for 10 processes and
the work wouldn't be divided in threads, is it right?
>
> can I set a maximum number of threads in the queue one.q (e.g. 15 ) and
change the number in the 'export' for my convenience
>
> I feel like a child hearing the adults speaking
> Thanks I'm learning a lot
>
>
> Oscar Fabian Mojica Ladino
> Geologist M.S. in  Geophysics
>
>
> > From: re...@staff.uni-marburg.de
> > Date: Tue, 19 Aug 2014 19:51:46 +0200
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
> >
> > Hi,
> >
> > Am 19.08.2014 um 19:06 schrieb Oscar Mojica:
> >
> > > I discovered what was the error. I forgot include the '-fopenmp' when
I compiled the objects in the Makefile, so the program worked but it didn't
divide the job in threads. Now the program is
> working and I can use until 15 cores for machine in the queue one.q.
> > >
> > > Anyway i would like to try implement your advice. Well I'm not alone
in the cluster so i must implement your second suggestion. The steps are
> > >
> > > a) Use '$ qconf -mp orte' to change the allocation rule to 8
> >
> > The number of slots defined in your used one.q was also increased to 8
(`qconf -sq one.q`)?
> >
> >
> > > b) Set '#$ -pe orte 80' in the script
> >
> > Fine.
> >
> >
> > > c) I'm not sure how to do this step. I'd appreciate your help here. I
can add some lines to the script to determine the PE_HOSTFILE path and
contents, but i don't know how alter it
> >
> > For now you can put in your jobscript (just after OMP_NUM_THREAD is
exported):
> >
> > awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads;
print }' $PE_HOSTFILE > $TMPDIR/machines
> > export PE_HOSTFILE=$TMPDIR/machines
> >
> > =
> >
> > Unfortunately noone stepped into this discussion, as in my opinion it's
a much broader issue which targets all users who want to combine MPI with
OpenMP. The queuingsystem should get a proper
> request for the overall amount of slots the user needs. For now this will
be forwarded to Open MPI and it will use this information to start the
appropriate number of processes (which was an
> achievement for the Tight Integration out-of-the-box of course) and
ignores any setting of OMP_NUM_THREADS. So, where should the generated list
of machines be adjusted; there are several options:
> >
> > a) The PE of the queuingsystem should do it:
> >
> > + a one time setup for the admin
> > + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
> > - the "start_proc_args" would need to know the number of threads, i.e.
OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript
(tricky scanning of the submitted jobscript for
> OMP_NUM_THREADS would be too nasty)
> > - limits to use inside the jobscript calls to libraries behaving in the
same way as Open MPI only
> >
> >
> > b) The particular queue should do it in a queue prolog:
> >
> > same as a) I think
> >
> >
> > c) The user should