Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-11 Thread Emmanuel Thomé
Hi again,

I've been able to simplify my test case significantly. It now runs
with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.

The pattern is as follows.

 - ranks 0 and 1 both own a local buffer.
 - each fills it with (deterministically known) data.
 - rank 0 collects the data from rank 1's local buffer (whose contents
   should be no mystery), and writes this to a file-backed mmaped area.
 - rank 0 compares what it receives with what it knows it *should have*
   received.

The test fails if:

 - the openib btl is used between the two nodes.
 - a file-backed mmaped area is used for receiving the data.
 - the write is done to a newly created file.
 - the per-node buffer is large enough.

For a per-node buffer size above 12kb (12240 bytes to be exact), my
program fails, since the MPI_Recv does not receive the correct data
chunk (it just gets zeroes).

I attach the simplified test case. I hope someone will be able to
reproduce the problem.
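In outline, the attached program does something like the following. This is a
minimal sketch only, not the attachment itself; the 16 kB buffer size and the
file name "recv.bin" are made up for illustration, and error checking is
omitted for brevity.

/* sketch.c -- illustration of the failing pattern, assuming exactly 2 ranks */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    int rank, i;
    size_t n = 16384 / sizeof(unsigned long);   /* above the ~12 kB threshold */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* both ranks own a local buffer filled with deterministic data */
    unsigned long *local = malloc(n * sizeof(unsigned long));
    for (i = 0; i < (int)n; i++)
        local[i] = ((unsigned long)(rank + 1) << 24) | i;

    if (rank == 1) {
        MPI_Send(local, (int)n, MPI_UNSIGNED_LONG, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        /* receive rank 1's buffer directly into a newly created,
         * file-backed mmaped area */
        int fd = open("recv.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, n * sizeof(unsigned long));
        unsigned long *area = mmap(NULL, n * sizeof(unsigned long),
                                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        MPI_Recv(area, (int)n, MPI_UNSIGNED_LONG, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* rank 0 knows what rank 1 should have sent; compare */
        for (i = 0; i < (int)n; i++)
            if (area[i] != (((unsigned long)2 << 24) | i)) {
                printf("mismatch at word %d\n", i);
                break;
            }
        munmap(area, n * sizeof(unsigned long));
        close(fd);
    }
    free(local);
    MPI_Finalize();
    return 0;
}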

Best regards,

E.


On Mon, Nov 10, 2014 at 5:48 PM, Emmanuel Thomé wrote:
> Thanks for your answer.
>
> On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd  wrote:
>> Just really quick off the top of my head, mmaping relies on the virtual
>> memory subsystem, whereas IB RDMA operations rely on physical memory being
>> pinned (unswappable.)
>
> Yes. Does that mean that the result of computations should be
> undefined if I happen to give a user buffer which corresponds to a
> file ? That would be surprising.
>
>> For a large message transfer, the OpenIB BTL will
>> register the user buffer, which will pin the pages and make them
>> unswappable.
>
> Yes. But what are the semantics of pinning the VM area pointed to by
> ptr if ptr happens to be mmaped from a file ?
>
>> If the data being transfered is small, you'll copy-in/out to
>> internal bounce buffers and you shouldn't have issues.
>
> Are you saying that the openib layer does have provision in this case
> for letting the RDMA happen with a pinned physical memory range, and
> later perform the copy to the file-backed mmaped range ? That would
> make perfect sense indeed, although I don't have enough familiarity
> with the OMPI code to see where it happens, and more importantly
> whether the completion properly waits for this post-RDMA copy to
> complete.
>
>
>> 1.If you try to just bcast a few kilobytes of data using this technique, do
>> you run into issues?
>
> No. All "simpler" attempts were successful, unfortunately. Can you be
> a little bit more precise about what scenario you imagine ? The
> setting "all ranks mmap a local file, and rank 0 broadcasts there" is
> successful.
>
>> 2. How large is the data in the collective (input and output), is in_place
>> used? I'm guess it's large enough that the BTL tries to work with the user
>> buffer.
>
> MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
> Collectives are with communicators of 2 nodes, and we're talking (for
> the smallest failing run) 8kb per node (i.e. 16kb total for an
> allgather).
>
> E.
>
>> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé wrote:
>>>
>>> Hi,
>>>
>>> I'm stumbling on a problem related to the openib btl in
>>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>>> mmaped areas for receiving data through MPI collective calls.
>>>
>>> A test case is attached. I've tried to make it reasonably small,
>>> although I recognize that it's not extra thin. The test case is a
>>> trimmed down version of what I witness in the context of a rather
>>> large program, so there is no claim of relevance of the test case
>>> itself. It's here just to trigger the desired misbehaviour. The test
>>> case contains some detailed information on what is done, and the
>>> experiments I did.
>>>
>>> In a nutshell, the problem is as follows.
>>>
>>>  - I do a computation, which involves MPI_Reduce_scatter and
>>> MPI_Allgather.
>>>  - I save the result to a file (collective operation).
>>>
>>> *If* I save the file using something such as:
>>>  fd = open("blah", ...
>>>  area = mmap(..., fd, )
>>>  MPI_Gather(..., area, ...)
>>> *AND* the MPI_Reduce_scatter is done with an alternative
>>> implementation (which I believe is correct)
>>> *AND* communication is done through the openib btl,
>>>
>>> then the file which gets saved is inconsistent with what is obtained
>>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>>> before the save).
>>>
>>> I tried to dig a bit in the openib internals, but all I've been able
>>> to witness was beyond my expertise (an RDMA read not transferring the
>>> expected data, but I'm too uncomfortable with this layer to say
>>> anything I'm sure about).
>>>
>>> Tests have been done with several openmpi versions including 1.8.3, on
>>> a debian wheezy (7.5) + OFED 2.3 cluster.
>>>
>>> It would be great if someone could tell me if he is able to reproduce
>>> the bug, or tell me whether something which is done in this test case
>>> is illegal in any res

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread SLIM H.A.
Dear Reuti and Ralph

Below is the output of the run for openmpi 1.8.3 with this line

mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe


master=cn6050
PE=orte
JOB_ID=2482923
Got 32 slots.
slots:
cn6050 16 par6.q@cn6050 
cn6045 16 par6.q@cn6045 
Tue Nov 11 12:37:37 GMT 2014

==   ALLOCATED NODES   ==
cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
=
Data for JOB [57374,1] offset 0

   JOB MAP   

Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
Process OMPI jobid: [57374,1] App: 0 Process rank: 0
Process OMPI jobid: [57374,1] App: 0 Process rank: 1

…
Process OMPI jobid: [57374,1] App: 0 Process rank: 31


Also
ompi_info | grep grid
gives MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.3)
and
ompi_info | grep psm
gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
because the interconnect is TrueScale/QLogic

and

setenv OMPI_MCA_mtl "psm"

is set in the script. This is the PE

pe_name   orte
slots 4000
user_listsNONE
xuser_lists   NONE
start_proc_args   /bin/true
stop_proc_args/bin/true
allocation_rule   $fill_up
control_slavesTRUE
job_is_first_task FALSE
urgency_slots min

Many thanks

Henk


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: 10 November 2014 05:07
To: Open MPI Users
Subject: Re: [OMPI users] oversubscription of slots with GridEngine

You might also add the --display-allocation flag to mpirun so we can see what it 
thinks the allocation looks like. If there are only 16 slots on the node, it 
seems odd that OMPI would assign 32 procs to it unless it thinks there is only 
1 node in the job, and oversubscription is allowed (which it won’t be by 
default if it read the GE allocation)


On Nov 9, 2014, at 9:56 AM, Reuti <re...@staff.uni-marburg.de> wrote:

Hi,


On 09.11.2014 at 18:20, SLIM H.A. <h.a.s...@durham.ac.uk> wrote:

We switched on hyper threading on our cluster with two eight core sockets per 
node (32 threads per node).

We configured  gridengine with 16 slots per node to allow the 16 extra threads 
for kernel process use but this apparently does not work. Printout of the 
gridengine hostfile shows that for a 32 slots job, 16 slots are placed on each 
of two nodes as expected. Including the openmpi --display-map option shows that 
all 32 processes are incorrectly  placed on the head node.

You mean the master node of the parallel job I assume.


Here is part of the output

master=cn6083
PE=orte

What allocation rule was defined for this PE - "control_slave yes" is set?


JOB_ID=2481793
Got 32 slots.
slots:
cn6083 16 par6.q@cn6083 
cn6085 16 par6.q@cn6085 
Sun Nov  9 16:50:59 GMT 2014
Data for JOB [44767,1] offset 0

   JOB MAP   

Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
  Process OMPI jobid: [44767,1] App: 0 Process rank: 0
  Process OMPI jobid: [44767,1] App: 0 Process rank: 1
...
  Process OMPI jobid: [44767,1] App: 0 Process rank: 31

=

I found some related mailings about a new warning in 1.8.2 about 
oversubscription and  I tried a few options to avoid the use of the extra 
threads for MPI tasks by openmpi without success, e.g. variants of

--cpus-per-proc 1
--bind-to-core

and some others. Gridengine treats hw threads as cores==slots (?) but the 
content of $PE_HOSTFILE suggests it distributes the slots sensibly  so it seems 
there is an option for openmpi required to get 16 cores per node?

Was Open MPI configured with --with-sge?

-- Reuti


I tried both 1.8.2, 1.8.3 and also 1.6.5.

Thanks for some clarification that anyone can give.

Henk





Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to hosts file

2014-11-11 Thread Blosch, Edwin L
OK, that’s what I was suspecting.  It’s a bug, right?  I asked for 4 processes 
and I supplied a host file with 4 lines in it, and mpirun didn’t launch the 
processes where I told it to launch them.

Do you know when or if this changed?  I can’t recall seeing this behavior 
in 1.6.5 or 1.4 or 1.2, and I know I’ve run cases across workstation clusters, 
so I think I would have noticed this behavior.

Can I throw another one at you, most likely related?  On a system where node01, 
node02, node03, and node04 already had a full load of work (i.e. other 
applications were running a number of processes equal to the number of cores on 
each node), I had a hosts file like this:  node01, node01, node02, node02.   I 
asked for 4 processes.  mpirun launched them as I would think: rank 0 and rank 
1 on node01, and rank 2 and 3 on node02.  Then I tried node01, node01, node02, 
node03.  In this case, all 4 processes were launched on node01.  Is there a 
logical explanation for this behavior as well?

Thanks again,

Ed


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, November 07, 2014 11:51 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Question on mapping processes to hosts file

Ah, yes - so here is what is happening. When no slot info is provided, we use 
the number of detected cores on each node as the #slots. So if you want to 
load-balance across the nodes, you need to set --map-by node

Or add slots=1 to each line of your host file to override the default behavior
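
For example, a hostfile along these lines (reusing the node names from the
hosts.dat quoted below, purely for illustration) gives exactly one slot per
node, and hence one rank per node with the default mapper:

node01 slots=1
node02 slots=1
node03 slots=1
node04 slots=1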

On Nov 7, 2014, at 8:52 AM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:

Here’s my command:

/bin/mpirun  --machinefile 
hosts.dat -np 4 

Here’s my hosts.dat file:

% cat hosts.dat
node01
node02
node03
node04

All 4 ranks are launched on node01.  I don’t believe I’ve ever seen this 
before.  I had to do a sanity check, so I tried MVAPICH2-2.1a and got what I 
expected: 1 process runs on each of the 4 nodes.  The mpirun man page says 
‘round-robin’, which I take to mean that one process would be launched per line 
in the hosts file, so this really seems like incorrect behavior.

What could be the possibilities here?

Thanks for the help!






Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-11 Thread Joshua Ladd
I was able to reproduce your issue and I think I understand the problem a
bit better at least. This demonstrates exactly what I was pointing to:

It looks like when the test switches over from eager RDMA (I'll explain in
a second) to a rendezvous protocol working entirely in user buffer
space, things go bad.

If your input is smaller than some threshold, the eager RDMA limit, then
the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
buffers called "eager fragments". This pool of resources is preregistered,
pinned, and has had its rkeys exchanged. So, in the eager protocol, your
data is copied into these "locked and loaded" RDMA frags and the put/get is
handled internally. When the data is received, it's copied back out into
your buffer. In your setup, this always works.

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x0401 [ok]
node 0 iteration 1, lead word received from peer is 0x0801 [ok]
node 0 iteration 2, lead word received from peer is 0x0c01 [ok]
node 0 iteration 3, lead word received from peer is 0x1001 [ok]

When you exceed the eager threshold, this always fails on the second
iteration. To understand this, you need to understand that there is a
protocol switch where your user buffer is now used for the transfer. Hence,
the user buffer is registered with the HCA. Registration is an inherently
high-latency operation and is one of the primary motives for doing
copy-in/copy-out into preregistered buffers for small, latency-sensitive
ops. For bandwidth-bound transfers, the cost to register can be amortized
over the whole transfer, but it still affects the total bandwidth. In the
case of a rendezvous protocol where the user buffer is registered, there is
an optimization, mostly used to help improve the numbers in bandwidth
benchmarks, called a registration cache. With registration caching, the user
buffer is registered once, the mkey is put into a cache, and the memory is
kept pinned until the system provides some notification, via either memory
hooks in ptmalloc or ummunotify, that the buffer has been freed; this
signals that the mkey can be evicted from the cache. On subsequent
send/recv operations from the same user buffer address, the OpenIB BTL will
find the address in the registration cache, take the cached mkey, avoid
paying the cost of the memory registration, and start the data transfer.
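
To make the idea concrete, here is a purely conceptual sketch of an
address-keyed registration cache. This is not OMPI's rcache code; the
structure, names, and the register_with_hca() stand-in are invented for
illustration. It shows why a cached mkey can go stale when the pages behind
an address change without an invalidation, e.g. after mmap'ing a freshly
created file at the same virtual address.

#include <stddef.h>
#include <stdint.h>

struct reg_entry {
    void    *addr;   /* user buffer virtual address                    */
    size_t   len;    /* registered length                              */
    uint32_t mkey;   /* memory key handed to the HCA for RDMA          */
    int      valid;  /* cleared only when memory hooks / ummunotify
                        report that this mapping went away             */
};

#define CACHE_SLOTS 64
static struct reg_entry cache[CACHE_SLOTS];

/* stand-in for the real registration call (think ibv_reg_mr) */
static uint32_t register_with_hca(void *addr, size_t len)
{
    (void)addr; (void)len;
    static uint32_t next_key = 1;
    return next_key++;
}

uint32_t get_mkey(void *addr, size_t len)
{
    int i;
    /* hit: same address range, never invalidated -> reuse the cached mkey
     * and skip the expensive registration.  If the file backing this
     * address changed without an invalidation, the cached mkey is stale. */
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mkey;

    /* miss: pay the registration cost once and cache the result */
    uint32_t mkey = register_with_hca(addr, len);
    for (i = 0; i < CACHE_SLOTS; i++)
        if (!cache[i].valid) {
            struct reg_entry e = { addr, len, mkey, 1 };
            cache[i] = e;
            break;
        }
    return mkey;
}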

What I noticed is that when the rendezvous protocol kicks in, it always fails on
the second iteration.

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x0401 [ok]
node 0 iteration 1, lead word received from peer is 0x [NOK]
--

So, I suspected it has something to do with the way the virtual address is
being handled in this case. To test that theory, I just completely disabled
the registration cache by setting -mca mpi_leave_pinned 0 and things start
to work:

$mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
./ibtest -s 56
per-node buffer has size 448 bytes
node 0 iteration 0, lead word received from peer is 0x0401 [ok]
node 0 iteration 1, lead word received from peer is 0x0801 [ok]
node 0 iteration 2, lead word received from peer is 0x0c01 [ok]
node 0 iteration 3, lead word received from peer is 0x1001 [ok]

I don't know enough about memory hooks or the registration cache
implementation to speak with any authority, but it looks like this is where
the issue resides. As a workaround, can you try your original experiment
with -mca mpi_leave_pinned 0 and see if you get consistent results?


Josh





On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé wrote:

> Hi again,
>
> I've been able to simplify my test case significantly. It now runs
> with 2 nodes, and only a single MPI_Send / MPI_Recv pair is used.
>
> The pattern is as follows.
>
>  *  - ranks 0 and 1 both own a local buffer.
>  *  - each fills it with (deterministically known) data.
>  *  - rank 0 collects the data from rank 1's local buffer
>  *(whose contents should be no mystery), and writes this to a
>  *file-backed mmaped area.
>  *  - rank 0 compares what it receives with what it knows it *should
>  *  have* received.
>
> The test fails if:
>
>  *  - the openib btl is used among the 2 nodes
>  *  - a file-backed mmaped area is used for receiving the data.
>  *  - the

Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> There are several reasons why MPI implementations have not added explicit 
> interfaces to their mpif.h files, mostly boiling down to: they may/will break 
> real world MPI programs.
>
> 1. All modern compilers have ignore-TKR syntax,

Hang on!  (An equivalent of) ignore_tkr only appeared in gfortran 4.9
(the latest release) as far as I know.  The system compiler of the bulk
of GNU/Linux HPC systems currently is distinctly older (and the RHEL
devtoolset packaging of gcc-4.9 is still beta).  RHEL 6 has gcc 4.4 as
the system compiler and Debian stable has 4.7 and older.

I'm just pointing that out in case decisions are being made assuming
everyone has this.  No worries if not.

> so it's at least not a problem for subroutines like MPI_SEND (with a choice 
> buffer).  However: a) this was not true at the time when MPI-3 was written, 
> and b) it's not standard fortran.



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> On Nov 10, 2014, at 8:27 AM, Dave Love  wrote:
>
>>> https://github.com/open-mpi/ompi/commit/d7eaca83fac0d9783d40cac17e71c2b090437a8c
>> 
>> I don't have time to follow this properly, but am I reading right that
>> that says mpi_sizeof will now _not_ work with gcc < 4.9, i.e. the system
>> compiler of the vast majority of HPC GNU/Linux systems, whereas it did
>> before (at least in simple cases)?
>
> You raise a very good point, which raises another unfortunately good related 
> point.
>
> 1. No, the goal is to enable MPI_SIZEOF in *more* cases, and still preserve 
> all the old cases.  Your mail made me go back and test all the old cases this 
> morning, and I discovered a bug which I need to fix before 1.8.4 is released 
> (details unimportant, unless someone wants to gory details).

I haven't checked the source, but the commit message above says

  If the Fortran compiler supports both INTERFACE and ISO_FORTRAN_ENV,
  then we'll build the MPI_SIZEOF interfaces.  If not, we'll skip
  MPI_SIZEOF in mpif.h and the mpi module.

which implies it's been removed for gcc < 4.9, whereas it worked before.

> The answer actually turned out to be "yes".  :-\
>
> Specifically: the spec just says it's available in the Fortran interfaces.  
> It doesn't say "the Fortran interfaces, except MPI_SIZEOF."
>
> Indeed, the spec doesn't prohibit explicit interfaces in mpif.h (it never 
> has).  It's just that most (all?) MPI implementations have not provided 
> explicit interfaces in mpif.h.
>
> But for MPI_SIZEOF to work, explicit interfaces are *required*.

[Yes, I understand -- sorry if that wasn't clear and you wasted time
explaining.]

>> but I'd expect that to be deprecated anyhow.
>> (The man pages generally don't mention USE, only INCLUDE, which seems
>> wrong.)
>
> Mmm.  Yes, true.
>
> Any chance I could convince you to submit a patch?  :-)

Maybe, but I don't really know what it should involve or whether it can
be done mechanically; I definitely don't have time to dissect the spec.
Actually, I'd have expected the API man pages to be reference versions,
shared across implementations, but MPICH's are different.

> Fortran 77 compilers haven't existed for *many, many years*.

[I think f2c still gets some use, and g77 was only obsoleted with gcc 4
-- I'm not _that old_!  I'm not actually advocating f77, of course.]

> And I'll say it again: MPI has *never* supported Fortran 77 (it's a
> common misconception that it ever did).

Well, having "Fortran77 interface" in the standard could confuse a
stupid person!  (As a former language lawyer for it, I'd allow laxity in
"Fortran77", like the latest MPI isn't completely compatible with the
latest Fortran either.)

>> Fortran has interfaces, not prototypes!
>
> Yes, sorry -- I'm a C programmer and I dabble in Fortran

That was mainly as in it's better ☺.

> (read: I'm the guy who keeps the Fortran stuff maintained in OMPI), so
> I sometimes use the wrong terminology.  Mea culpa!

Sure, and thanks.  I dare say you can get some community help if you
need it, especially if people think Fortran isn't being properly
supported, though I'm not complaining.



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Dave Love
"SLIM H.A."  writes:

> We switched on hyper threading on our cluster with two eight core
> sockets per node (32 threads per node).

Assuming that's Xeon-ish hyperthreading, the best advice is not to.  It
will typically hurt performance of HPC applications, not least if it
defeats core binding, and it is likely to cause confusion with resource
managers.  If there are specific applications which benefit from it,
under Linux you can switch it on on the relevant cores for the duration
of jobs which ask for it.
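
For example, assuming the standard Linux CPU-hotplug interface in sysfs and
that cpu17 happens to be the hyperthread sibling in question (both are
illustrative assumptions; this needs root):

  echo 1 > /sys/devices/system/cpu/cpu17/online   # bring the sibling online for the job
  echo 0 > /sys/devices/system/cpu/cpu17/online   # take it offline again afterwards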

> We configured  gridengine with 16 slots per node to allow the 16 extra
> threads for kernel process use

Have you actually measured that?  We did, and we switch off HT at boot
time.  We've never had cause to turn it on, though there might be a few
jobs which could use it.

> but this apparently does not work. Printout of the gridengine hostfile
> shows that for a 32 slots job, 16 slots are placed on each of two
> nodes as expected. Including the openmpi --display-map option shows
> that all 32 processes are incorrectly placed on the head node. Here is
> part of the output

If OMPI is scheduling by thread, then that's what you'd expect.  (As far
as I know, SGE will DTRT, binding a core per slot in that case, but
I'll look at bug reports if not.)

> I found some related mailings about a new warning in 1.8.2 about 
> oversubscription and  I tried a few options to avoid the use of the extra 
> threads for MPI tasks by openmpi without success, e.g. variants of
>
> --cpus-per-proc 1 
> --bind-to-core 
>
> and some others. Gridengine treats hw threads as cores==slots (?)

What a slot is is up to you, but if you want to do core binding at all
sensibly, it needs to correspond to a core.  You can fiddle things in
the job itself (see the recent thread that Mark started for OMPI --np !=
SGE NSLOTS).

> but the content of $PE_HOSTFILE suggests it distributes the slots
> sensibly  so it seems there is an option for openmpi required to get
> 16 cores per node?

I'm not sure precisely what you want, but with OMPI 1.8, you should be
able to lay out the job by core if that's what you want.  That may
require exclusive node access, which makes SGE core binding a null
operation.
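
For example, something along these lines (reusing $NSLOTS and $exe from the
submission script quoted earlier, purely as an illustration) lays the job out
by core with binding under 1.8:

  mpirun --map-by core --bind-to core -np $NSLOTS $exe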

> I tried both 1.8.2, 1.8.3 and also 1.6.5.
>
> Thanks for some clarification that anyone can give.

The above is for the current SGE with a recent hwloc.  If Durham are
still using an ancient version, it may not apply, but that should be
irrelevant with -l exclusive or a fixed-count PE.



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Ralph Castain
This clearly displays the problem - if you look at the reported “allocated 
nodes”, you see that we only got one node (cn6050). This is why we mapped all 
your procs onto that node.

So the real question is - why? Can you show us the content of PE_HOSTFILE?


> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
> 
> Dear Reuti and Ralph
>  
> Below is the output of the run for openmpi 1.8.3 with this line
>  
> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe
>  
>  
> master=cn6050
> PE=orte
> JOB_ID=2482923
> Got 32 slots.
> slots:
> cn6050 16 par6.q@cn6050 
> cn6045 16 par6.q@cn6045 
> Tue Nov 11 12:37:37 GMT 2014
>  
> ==   ALLOCATED NODES   ==
> cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
> =
> Data for JOB [57374,1] offset 0
>  
>    JOB MAP   
>  
> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
> Process OMPI jobid: [57374,1] App: 0 Process rank: 0
> Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>  
> …
> Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>  
>  
> Also
> ompi_info | grep grid
> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
> v1.8.3)
> and
> ompi_info | grep psm
> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
> because the intercoonect is TrueScale/QLogic
>  
> and
>  
> setenv OMPI_MCA_mtl "psm"
>  
> is set in the script. This is the PE
>  
> pe_name   orte
> slots 4000
> user_listsNONE
> xuser_lists   NONE
> start_proc_args   /bin/true
> stop_proc_args/bin/true
> allocation_rule   $fill_up
> control_slavesTRUE
> job_is_first_task FALSE
> urgency_slots min
>  
> Many thanks
>  
> Henk
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: 10 November 2014 05:07
> To: Open MPI Users
> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>  
> You might also add the —display-allocation flag to mpirun so we can see what 
> it thinks the allocation looks like. If there are only 16 slots on the node, 
> it seems odd that OMPI would assign 32 procs to it unless it thinks there is 
> only 1 node in the job, and oversubscription is allowed (which it won’t be by 
> default if it read the GE allocation)
>  
>  
> On Nov 9, 2014, at 9:56 AM, Reuti wrote:
>  
> Hi,
> 
> 
> On 09.11.2014 at 18:20, SLIM H.A. wrote:
> 
> We switched on hyper threading on our cluster with two eight core sockets per 
> node (32 threads per node).
> 
> We configured  gridengine with 16 slots per node to allow the 16 extra 
> threads for kernel process use but this apparently does not work. Printout of 
> the gridengine hostfile shows that for a 32 slots job, 16 slots are placed on 
> each of two nodes as expected. Including the openmpi --display-map option 
> shows that all 32 processes are incorrectly  placed on the head node.
> 
> You mean the master node of the parallel job I assume.
> 
> 
> Here is part of the output
> 
> master=cn6083
> PE=orte
> 
> What allocation rule was defined for this PE - "control_slave yes" is set?
> 
> 
> JOB_ID=2481793
> Got 32 slots.
> slots:
> cn6083 16 par6.q@cn6083  
> cn6085 16 par6.q@cn6085  
> Sun Nov  9 16:50:59 GMT 2014
> Data for JOB [44767,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
>   Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>   Process OMPI jobid: [44767,1] App: 0 Process rank: 1
> ...
>   Process OMPI jobid: [44767,1] App: 0 Process rank: 31
> 
> =
> 
> I found some related mailings about a new warning in 1.8.2 about 
> oversubscription and  I tried a few options to avoid the use of the extra 
> threads for MPI tasks by openmpi without success, e.g. variants of
> 
> --cpus-per-proc 1 
> --bind-to-core 
> 
> and some others. Gridengine treats hw threads as cores==slots (?) but the 
> content of $PE_HOSTFILE suggests it distributes the slots sensibly  so it 
> seems there is an option for openmpi required to get 16 cores per node?
> 
> Was Open MPI configured with --with-sge?
> 
> -- Reuti
> 
> 
> I tried both 1.8.2, 1.8.3 and also 1.6.5.
> 
> Thanks for some clarification that anyone can give.
> 
> Henk
> 
> 

Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Jeff Squyres (jsquyres)
On Nov 11, 2014, at 9:38 AM, Dave Love  wrote:

>> 1. All modern compilers have ignore-TKR syntax,
> 
> Hang on!  (An equivalent of) ignore_tkr only appeared in gfortran 4.9
> (the latest release) as far as I know.  The system compiler of the bulk
> of GNU/Linux HPC systems currently is distinctly older (and the RHEL
> devtoolset packaging of gcc-4.9 is still beta).  RHEL 6 has gcc 4.4 as
> te system compiler and Debian stable has 4.7 and older.
> 
> I'm just pointing that out in case decisions are being made assuming
> everyone has this.  No worries if not.

Sorry, that statement was loaded with my assumption that "gfortran 4.9 is a 
modern fortran compiler; prior versions are not."

So don't worry: we're well aware that only gfortran >=4.9 has these features, 
and most everyone is not using it yet.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-11 Thread Jeff Squyres (jsquyres)
On Nov 11, 2014, at 9:43 AM, Dave Love  wrote:

> I haven't checked the source, but the commit message above says
> 
>  If the Fortran compiler supports both INTERFACE and ISO_FORTRAN_ENV,
>  then we'll build the MPI_SIZEOF interfaces.  If not, we'll skip
>  MPI_SIZEOF in mpif.h and the mpi module.
> 
> which implies it it's been removed for gcc < 4.9, whereas it worked before.

I'll update the README to be more clear.

>> Any chance I could convince you to submit a patch?  :-)
> 
> Maybe, but I don't really know what it should involve or whether it can
> be done mechanically; I definitely don't have time to dissect the spec.
> Actually, I'd have expected the API man pages to be reference versions,
> shared across implementations, but MPICH's are different.

Yeah, we don't actually share man pages.

I think the main issue would be just to edit the *.3in pages here:

https://github.com/open-mpi/ompi/tree/master/ompi/mpi/man/man3

They're all native nroff format (they're .3in instead of .3 because we 
pre-process them during "make" to substitute things like the release date and 
version in).

I'm guessing it would be a pretty mechanical kind of patch -- just adding 
Fortran interfaces at the top of each page.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti
On 11.11.2014 at 16:13, Ralph Castain wrote:

> This clearly displays the problem - if you look at the reported “allocated 
> nodes”, you see that we only got one node (cn6050). This is why we mapped all 
> your procs onto that node.
> 
> So the real question is - why? Can you show us the content of PE_HOSTFILE?
> 
> 
>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>> 
>> Dear Reuti and Ralph
>>  
>> Below is the output of the run for openmpi 1.8.3 with this line
>>  
>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe
>>  
>>  
>> master=cn6050
>> PE=orte
>> JOB_ID=2482923
>> Got 32 slots.
>> slots:
>> cn6050 16 par6.q@cn6050 
>> cn6045 16 par6.q@cn6045 

The above looks like the PE_HOSTFILE. So it should be 16 slots per node.

I wonder whether any environment variable was reset, which normally allows Open 
MPI to discover that it's running inside SGE.

I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job starts?

Supplying "-np $NSLOTS" shouldn't be necessary though.

-- Reuti



>> Tue Nov 11 12:37:37 GMT 2014
>>  
>> ==   ALLOCATED NODES   ==
>> cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =
>> Data for JOB [57374,1] offset 0
>>  
>>    JOB MAP   
>>  
>> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>>  
>> …
>> Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>>  
>>  
>> Also
>> ompi_info | grep grid
>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>> v1.8.3)
>> and
>> ompi_info | grep psm
>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>> because the intercoonect is TrueScale/QLogic
>>  
>> and
>>  
>> setenv OMPI_MCA_mtl "psm"
>>  
>> is set in the script. This is the PE
>>  
>> pe_name   orte
>> slots 4000
>> user_listsNONE
>> xuser_lists   NONE
>> start_proc_args   /bin/true
>> stop_proc_args/bin/true
>> allocation_rule   $fill_up
>> control_slavesTRUE
>> job_is_first_task FALSE
>> urgency_slots min
>>  
>> Many thanks
>>  
>> Henk
>>  
>>  
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: 10 November 2014 05:07
>> To: Open MPI Users
>> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>>  
>> You might also add the —display-allocation flag to mpirun so we can see what 
>> it thinks the allocation looks like. If there are only 16 slots on the node, 
>> it seems odd that OMPI would assign 32 procs to it unless it thinks there is 
>> only 1 node in the job, and oversubscription is allowed (which it won’t be 
>> by default if it read the GE allocation)
>>  
>>  
>> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
>>  
>> Hi,
>> 
>> 
>> Am 09.11.2014 um 18:20 schrieb SLIM H.A. :
>> 
>> We switched on hyper threading on our cluster with two eight core sockets 
>> per node (32 threads per node).
>> 
>> We configured  gridengine with 16 slots per node to allow the 16 extra 
>> threads for kernel process use but this apparently does not work. Printout 
>> of the gridengine hostfile shows that for a 32 slots job, 16 slots are 
>> placed on each of two nodes as expected. Including the openmpi --display-map 
>> option shows that all 32 processes are incorrectly  placed on the head node.
>> 
>> You mean the master node of the parallel job I assume.
>> 
>> 
>> Here is part of the output
>> 
>> master=cn6083
>> PE=orte
>> 
>> What allocation rule was defined for this PE - "control_slave yes" is set?
>> 
>> 
>> JOB_ID=2481793
>> Got 32 slots.
>> slots:
>> cn6083 16 par6.q@cn6083 
>> cn6085 16 par6.q@cn6085 
>> Sun Nov  9 16:50:59 GMT 2014
>> Data for JOB [44767,1] offset 0
>> 
>>    JOB MAP   
>> 
>> Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 1
>> ...
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 31
>> 
>> =
>> 
>> I found some related mailings about a new warning in 1.8.2 about 
>> oversubscription and  I tried a few options to avoid the use of the extra 
>> threads for MPI tasks by openmpi without success, e.g. variants of
>> 
>> --cpus-per-proc 1 
>> --bind-to-core 
>> 
>> and some others. Gridengine treats hw threads as cores==slots (?) but the 
>> content of $PE_HOSTFILE suggests it distributes the slots sensibly  so it 
>> seems there is an option for openmpi required to get 16 cores per node?
>> 
>> Was Open MPI configured with --with-sge?
>> 
>> -- Reuti
>> 
>> 
>> I tried both 1.8.2, 1.8.3 and also 1.6.5.
>> 
>> 

Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to hosts file

2014-11-11 Thread Ralph Castain

> On Nov 11, 2014, at 6:11 AM, Blosch, Edwin L  wrote:
> 
> OK, that’s what I was suspecting.  It’s a bug, right?  I asked for 4 
> processes and I supplied a host file with 4 lines in it, and mpirun didn’t 
> launch the processes where I told it to launch them. 

Actually, no - it’s an intended “feature”. When the dinosaurs still roamed the 
earth and OMPI was an infant, we had no way of detecting the number of 
processors on a node in advance of the map/launch phase. During that time, 
users were required to tell us that info in the hostfile, which was a source of 
constant complaint.

Since that time, we have changed the launch procedure so we do have access to 
that info when we need it. Accordingly, we now check to see if you told us the 
number of slots on each node in the hostfile - if not, then we autodetect it 
for you.

Quite honestly, it sounds to me like you might be happier using the 
“sequential” mapper for this use case. It will place one proc on each of the 
indicated nodes, with the rank set by the order in the hostfile. So a hostfile 
like this:

node1
node2
node1
node3

will result in
rank 0 -> node1
rank 1 -> node2
rank 2 -> node1
rank 3 -> node3

etc. To use it, just add "-mca rmaps seq" to your cmd line. Alternatively, you 
could add "--map-by node" to your cmd line and we will round-robin by node.
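
For example (hostfile and executable names here are placeholders):

  mpirun -np 4 -mca rmaps seq -hostfile myhosts ./a.out

With the hostfile shown above, that puts ranks 0 and 2 on node1, rank 1 on
node2, and rank 3 on node3.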

>  
> Do you know when or if this changed?  I can’t recall seeing this this 
> behavior in 1.6.5 or 1.4 or 1.2, and I know I’ve run cases across workstation 
> clusters, so I think I would have noticed this behavior. 

It changed early in the 1.7 series, and has remained consistent since then.

>  
> Can I throw another one at you, most likely related?  On a system where 
> node01, node02, node03, and node04 already had a full load of work (i.e. 
> other applications were running a number of processes equal to the number of 
> cores on each node), I had a hosts file like this:  node01, node01, node02, 
> node02.   I asked for 4 processes.  mpirun launched them as I would think: 
> rank 0 and rank 1 on node01, and rank 2 and 3 on node02.  Then I tried 
> node01, node01, node02, node03.  In this case, all 4 processes were launched 
> on node01.  Is there a logical explanation for this behavior as well?

Now that one is indeed a bug! I’ll dig it up and fix it.


>  
> Thanks again,
>  
> Ed
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, November 07, 2014 11:51 AM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Question on mapping processes to hosts 
> file
>  
> Ah, yes - so here is what is happening. When no slot info is provided, we use 
> the number of detected cores on each node as the #slots. So if you want to 
> loadbalance across the nodes, you need to set —map-by node
>  
> Or add slots=1 to each line of your host file to override the default behavior
>  
> On Nov 7, 2014, at 8:52 AM, Blosch, Edwin L wrote:
>  
> Here’s my command:
>  
> /bin/mpirun  --machinefile 
> hosts.dat -np 4 
>  
> Here’s my hosts.dat file:
>  
> % cat hosts.dat
> node01
> node02
> node03
> node04
>  
> All 4 ranks are launched on node01.  I don’t believe I’ve ever seen this 
> before.  I had to do a sanity check, so I tried MVAPICH2-2.1a and got what I 
> expected: 1 process runs on each of the 4 nodes.  The mpirun man page says 
> ‘round-robin’, which I take to mean that one process would be launched per 
> line in the hosts file, so this really seems like incorrect behavior.
>  
> What could be the possibilities here?
>  
> Thanks for the help!
>  
>  
>  


Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to hosts file

2014-11-11 Thread Blosch, Edwin L
Thanks Ralph.  I’ll experiment with these options.  Much appreciated.

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, November 11, 2014 10:00 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to hosts 
file


On Nov 11, 2014, at 6:11 AM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:

OK, that’s what I was suspecting.  It’s a bug, right?  I asked for 4 processes 
and I supplied a host file with 4 lines in it, and mpirun didn’t launch the 
processes where I told it to launch them.

Actually, no - it’s an intended “feature”. When the dinosaurs still roamed the 
earth and OMPI was an infant, we had no way of detecting the number of 
processors on a node in advance of the map/launch phase. During that time, 
users were required to tell us that info in the hostfile, which was a source of 
constant complaint.

Since that time, we have changed the launch procedure so we do have access to 
that info when we need it. Accordingly, we now check to see if you told us the 
number of slots on each node in the hostfile - if not, then we autodetect it 
for you.

Quite honestly, it sounds to me like you might be happier using the 
“sequential” mapper for this use case. It will place one proc on each of the 
indicated nodes, with the rank set by the order in the hostfile. So a hostfile 
like this:

node1
node2
node1
node3

will result in
rank 0 -> node1
rank 1 -> node2
rank 2 -> node1
rank 3 -> node3

etc. To use it, just add "-mca rmaps seq" to your cmd line. Alternatively, you 
could add "--map-by node" to your cmd line and we will round-robin by node.



Do you know when or if this changed?  I can’t recall seeing this this behavior 
in 1.6.5 or 1.4 or 1.2, and I know I’ve run cases across workstation clusters, 
so I think I would have noticed this behavior.

It changed early in the 1.7 series, and has remained consistent since then.



Can I throw another one at you, most likely related?  On a system where node01, 
node02, node03, and node04 already had a full load of work (i.e. other 
applications were running a number of processes equal to the number of cores on 
each node), I had a hosts file like this:  node01, node01, node02, node02.   I 
asked for 4 processes.  mpirun launched them as I would think: rank 0 and rank 
1 on node01, and rank 2 and 3 on node02.  Then I tried node01, node01, node02, 
node03.  In this case, all 4 processes were launched on node01.  Is there a 
logical explanation for this behavior as well?

Now that one is indeed a bug! I’ll dig it up and fix it.




Thanks again,

Ed


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, November 07, 2014 11:51 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Question on mapping processes to hosts file

Ah, yes - so here is what is happening. When no slot info is provided, we use 
the number of detected cores on each node as the #slots. So if you want to 
loadbalance across the nodes, you need to set —map-by node

Or add slots=1 to each line of your host file to override the default behavior

On Nov 7, 2014, at 8:52 AM, Blosch, Edwin L <edwin.l.blo...@lmco.com> wrote:

Here’s my command:

/bin/mpirun  --machinefile 
hosts.dat -np 4 

Here’s my hosts.dat file:

% cat hosts.dat
node01
node02
node03
node04

All 4 ranks are launched on node01.  I don’t believe I’ve ever seen this 
before.  I had to do a sanity check, so I tried MVAPICH2-2.1a and got what I 
expected: 1 process runs on each of the 4 nodes.  The mpirun man page says 
‘round-robin’, which I take to mean that one process would be launched per line 
in the hosts file, so this really seems like incorrect behavior.

What could be the possibilities here?

Thanks for the help!






Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Ralph Castain

> On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
> 
> Am 11.11.2014 um 16:13 schrieb Ralph Castain:
> 
>> This clearly displays the problem - if you look at the reported “allocated 
>> nodes”, you see that we only got one node (cn6050). This is why we mapped 
>> all your procs onto that node.
>> 
>> So the real question is - why? Can you show us the content of PE_HOSTFILE?
>> 
>> 
>>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>>> 
>>> Dear Reuti and Ralph
>>> 
>>> Below is the output of the run for openmpi 1.8.3 with this line
>>> 
>>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 $exe
>>> 
>>> 
>>> master=cn6050
>>> PE=orte
>>> JOB_ID=2482923
>>> Got 32 slots.
>>> slots:
>>> cn6050 16 par6.q@cn6050 
>>> cn6045 16 par6.q@cn6045 
> 
> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.

Hey Reuti

Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
module, and it looks like it is expecting a different format. I suspect that is 
the problem.

Ralph

> 
> I wonder whether any environment variable was reset, which normally allows 
> Open MPI to discover that it's running inside SGE.
> 
> I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
> starts?
> 
> Supplying "-np $NSLOTS" shouldn't be necessary though.
> 
> -- Reuti
> 
> 
> 
>>> Tue Nov 11 12:37:37 GMT 2014
>>> 
>>> ==   ALLOCATED NODES   ==
>>>cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>>> =
>>> Data for JOB [57374,1] offset 0
>>> 
>>>    JOB MAP   
>>> 
>>> Data for node: cn6050  Num slots: 16   Max slots: 0Num procs: 32
>>>Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>>>Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>>> 
>>> …
>>>Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>>> 
>>> 
>>> Also
>>> ompi_info | grep grid
>>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>>> v1.8.3)
>>> and
>>> ompi_info | grep psm
>>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>>> because the intercoonect is TrueScale/QLogic
>>> 
>>> and
>>> 
>>> setenv OMPI_MCA_mtl "psm"
>>> 
>>> is set in the script. This is the PE
>>> 
>>> pe_name   orte
>>> slots 4000
>>> user_listsNONE
>>> xuser_lists   NONE
>>> start_proc_args   /bin/true
>>> stop_proc_args/bin/true
>>> allocation_rule   $fill_up
>>> control_slavesTRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
>>> 
>>> Many thanks
>>> 
>>> Henk
>>> 
>>> 
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: 10 November 2014 05:07
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>>> 
>>> You might also add the —display-allocation flag to mpirun so we can see 
>>> what it thinks the allocation looks like. If there are only 16 slots on the 
>>> node, it seems odd that OMPI would assign 32 procs to it unless it thinks 
>>> there is only 1 node in the job, and oversubscription is allowed (which it 
>>> won’t be by default if it read the GE allocation)
>>> 
>>> 
>>> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> Am 09.11.2014 um 18:20 schrieb SLIM H.A. :
>>> 
>>> We switched on hyper threading on our cluster with two eight core sockets 
>>> per node (32 threads per node).
>>> 
>>> We configured  gridengine with 16 slots per node to allow the 16 extra 
>>> threads for kernel process use but this apparently does not work. Printout 
>>> of the gridengine hostfile shows that for a 32 slots job, 16 slots are 
>>> placed on each of two nodes as expected. Including the openmpi 
>>> --display-map option shows that all 32 processes are incorrectly  placed on 
>>> the head node.
>>> 
>>> You mean the master node of the parallel job I assume.
>>> 
>>> 
>>> Here is part of the output
>>> 
>>> master=cn6083
>>> PE=orte
>>> 
>>> What allocation rule was defined for this PE - "control_slave yes" is set?
>>> 
>>> 
>>> JOB_ID=2481793
>>> Got 32 slots.
>>> slots:
>>> cn6083 16 par6.q@cn6083 
>>> cn6085 16 par6.q@cn6085 
>>> Sun Nov  9 16:50:59 GMT 2014
>>> Data for JOB [44767,1] offset 0
>>> 
>>>    JOB MAP   
>>> 
>>> Data for node: cn6083  Num slots: 16   Max slots: 0Num procs: 32
>>>  Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>>>  Process OMPI jobid: [44767,1] App: 0 Process rank: 1
>>> ...
>>>  Process OMPI jobid: [44767,1] App: 0 Process rank: 31
>>> 
>>> =
>>> 
>>> I found some related mailings about a new warning in 1.8.2 about 
>>> oversubscription and  I tried a few options to avoid the use of the extra 
>>> threads for MPI tasks by openmpi without success, e.g. variants of
>>> 
>>> --cpus-per-proc

Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to hosts file

2014-11-11 Thread Ralph Castain
I checked that bug using the current 1.8.4 branch and I can’t replicate it - 
looks like it might have already been fixed. If I give a hostfile like the one 
you described:
node1
node1
node2
node3

and then ask to launch four processes:
mpirun -n 4 --display-allocation --display-map --do-not-launch --do-not-resolve 
-hostfile ./hosts hostname

I get the following allocation and map:

==   ALLOCATED NODES   ==
bend001: slots=6 max_slots=0 slots_inuse=0 state=UP
node1: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
node2: slots=12 max_slots=0 slots_inuse=0 state=UNKNOWN
node3: slots=12 max_slots=0 slots_inuse=0 state=UNKNOWN
=
 Data for JOB [54391,1] offset 0

    JOB MAP   

 Data for node: node1   Num slots: 2Max slots: 0Num procs: 2
Process OMPI jobid: [54391,1] App: 0 Process rank: 0
Process OMPI jobid: [54391,1] App: 0 Process rank: 1

 Data for node: node2   Num slots: 12   Max slots: 0Num procs: 2
Process OMPI jobid: [54391,1] App: 0 Process rank: 2
Process OMPI jobid: [54391,1] App: 0 Process rank: 3

Note that we see the host where mpirun is executing in the “allocation”, but it 
isn’t used as we specified a hostfile that didn’t include it. Also, you see the 
impact of the autodetect algo. Since I specified node1 more than once, we 
assume this is intended to provide a slot count and use that instead of what we 
detect. Since node2 and node3 were only given once, we autodetect those cores 
and set the slots equal to them.

The job map matches what I would have expected, so I think we are okay here.

HTH
Ralph


> On Nov 11, 2014, at 8:10 AM, Blosch, Edwin L  wrote:
> 
> Thanks Ralph.  I’ll experiment with these options.  Much appreciated.
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, November 11, 2014 10:00 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: Question on mapping processes to 
> hosts file
>  
>  
> On Nov 11, 2014, at 6:11 AM, Blosch, Edwin L wrote:
>  
> OK, that’s what I was suspecting.  It’s a bug, right?  I asked for 4 
> processes and I supplied a host file with 4 lines in it, and mpirun didn’t 
> launch the processes where I told it to launch them. 
>  
> Actually, no - it’s an intended “feature”. When the dinosaurs still roamed 
> the earth and OMPI was an infant, we had no way of detecting the number of 
> processors on a node in advance of the map/launch phase. During that time, 
> users were required to tell us that info in the hostfile, which was a source 
> of constant complaint.
>  
> Since that time, we have changed the launch procedure so we do have access to 
> that info when we need it. Accordingly, we now check to see if you told us 
> the number of slots on each node in the hostfile - if not, then we autodetect 
> it for you.
>  
> Quite honestly, it sounds to me like you might be happier using the 
> “sequential” mapper for this use case. It will place one proc on each of the 
> indicated nodes, with the rank set by the order in the hostfile. So a 
> hostfile like this:
>  
> node1
> node2
> node1
> node3
>  
> will result in
> rank 0 -> node1
> rank 1 -> node2
> rank 2 -> node1
> rank 3 -> node3
>  
> etc. To use it, just add "-mca rmaps seq" to you cmd line. Alternatively, you 
> could add “--map-by node" to your cmd line and we will round-robin by node.
> 
> 
>  
> Do you know when or if this changed?  I can’t recall seeing this this 
> behavior in 1.6.5 or 1.4 or 1.2, and I know I’ve run cases across workstation 
> clusters, so I think I would have noticed this behavior. 
>  
> It changed early in the 1.7 series, and has remained consistent since then.
> 
> 
>  
> Can I throw another one at you, most likely related?  On a system where 
> node01, node02, node03, and node04 already had a full load of work (i.e. 
> other applications were running a number of processes equal to the number of 
> cores on each node), I had a hosts file like this:  node01, node01, node02, 
> node02.   I asked for 4 processes.  mpirun launched them as I would think: 
> rank 0 and rank 1 on node01, and rank 2 and 3 on node02.  Then I tried 
> node01, node01, node02, node03.  In this case, all 4 processes were launched 
> on node01.  Is there a logical explanation for this behavior as well?
>  
> Now that one is indeed a bug! I’ll dig it up and fix it.
>  
> 
> 
>  
> Thanks again,
>  
> Ed
>  
>  
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, November 07, 2014 11:51 AM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Question on mapping processes to hosts 
> file
>  
> Ah, yes - so here is what is happening. When no slot info is provided, we use 
> the number of detec

Re: [OMPI users] what order do I get messages coming to MPI Recv from MPI_ANY_SOURCE?

2014-11-11 Thread George Bosilca
Using MPI_ANY_SOURCE will extract one message from the queue of unexpected 
messages. Fairness is not guaranteed by the MPI standard, so it is 
impossible to predict the order between servers. 

If you need fairness, your second choice is the way to go.
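
Something along these lines is what that second pattern looks like in C. It is
a sketch only: NSERVERS, the tag, and the payload are invented, and mpi4py
exposes the equivalent Irecv/Waitany calls. Run with at least NSERVERS+1 ranks.

#include <mpi.h>
#include <stdio.h>

#define NSERVERS 4
#define MSG_LEN  64

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                      /* the hub */
        double bufs[NSERVERS][MSG_LEN];
        MPI_Request reqs[NSERVERS];
        int s, idx;
        /* post one receive per server so no server can be starved */
        for (s = 0; s < NSERVERS; s++)
            MPI_Irecv(bufs[s], MSG_LEN, MPI_DOUBLE, s + 1, 0,
                      MPI_COMM_WORLD, &reqs[s]);
        for (s = 0; s < NSERVERS; s++) {  /* handle one message per server here */
            MPI_Waitany(NSERVERS, reqs, &idx, MPI_STATUS_IGNORE);
            printf("hub: got data from server %d\n", idx + 1);
            /* in the real application: process the data, then re-post an
             * MPI_Irecv for server idx+1 before waiting again */
        }
    } else if (rank <= NSERVERS) {        /* a server */
        double data[MSG_LEN] = { (double)rank };
        MPI_Send(data, MSG_LEN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}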

  George.

> On Nov 10, 2014, at 20:14 , David A. Schneider  
> wrote:
> 
> I am implementing a hub/servers MPI application. Each of the servers can get 
> tied up waiting for some data, then they do an MPI Send to the hub. It is 
> relatively simple for me to have the hub waiting around doing a Recv from 
> ANY_SOURCE. The hub can get busy working with the data. What I'm worried 
> about is skipping data from one of the servers. How likely is this scenario:
> 
>server 1 and 2 do Send's
>hub does Recv and ends up getting data from server 1
>while hub busy, server 1 gets more data, does another Send
>when hub does its next Recv, it gets the more recent server 1 data rather 
> than the older server2
> 
> I don't need a guarantee that the order the Send's occur is the order the 
> ANY_SOURCE processes them (though it would be nice), but if I knew in practice 
> it will be close to the order they are sent, I may go with the above. However 
> if it is likely I could skip over data from one of the servers, I need to 
> implement something more complicated. Which I think would be this pattern:
> 
>servers each do Send's
>hub does an Irecv for each server
>hub does a Waitany on all server requests
>upon completion of one server request, hub does a Test on all the others
>of all the Irecv's that have completed, hub selects the oldest server data 
> (there is a timing tag in the server data)
>hub communicates with the server it just chose, has it start a new Send, 
> hub a new Irecv
> 
> This requires more complex code, and my first effort crashed inside the 
> Waitany call in a way that I'm finding difficult to debug. I am using the 
> Python bindings mpi4py - so I have less control over buffers being used.
> 
> I just posted this on stackoverflow also, but maybe this is a better place to 
> post?
> 



Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti
On 11.11.2014 at 17:52, Ralph Castain wrote:

> 
>> On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
>> 
>> Am 11.11.2014 um 16:13 schrieb Ralph Castain:
>> 
>>> This clearly displays the problem - if you look at the reported “allocated 
>>> nodes”, you see that we only got one node (cn6050). This is why we mapped 
>>> all your procs onto that node.
>>> 
>>> So the real question is - why? Can you show us the content of PE_HOSTFILE?
>>> 
>>> 
 On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
 
 Dear Reuti and Ralph
 
 Below is the output of the run for openmpi 1.8.3 with this line
 
 mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 
 $exe
 
 
 master=cn6050
 PE=orte
 JOB_ID=2482923
 Got 32 slots.
 slots:
 cn6050 16 par6.q@cn6050 
 cn6045 16 par6.q@cn6045 
>> 
>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
> 
> Hey Reuti
> 
> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
> module, and it looks like it is expecting a different format. I suspect that 
> is the problem

Well, the fourth column can be a processor range in older versions of SGE and 
the binding in newer ones, but the first three columns were always this way.

-- Reuti


> Ralph
> 
>> 
>> I wonder whether any environment variable was reset, which normally allows 
>> Open MPI to discover that it's running inside SGE.
>> 
>> I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
>> starts?
>> 
>> Supplying "-np $NSLOTS" shouldn't be necessary though.
>> 
>> -- Reuti
>> 
>> 
>> 
 Tue Nov 11 12:37:37 GMT 2014
 
 ==   ALLOCATED NODES   ==
   cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
 =
 Data for JOB [57374,1] offset 0
 
    JOB MAP   
 
 Data for node: cn6050  Num slots: 16   Max slots: 0   Num procs: 32
   Process OMPI jobid: [57374,1] App: 0 Process rank: 0
   Process OMPI jobid: [57374,1] App: 0 Process rank: 1
 
 …
   Process OMPI jobid: [57374,1] App: 0 Process rank: 31
 
 
 Also
 ompi_info | grep grid
 gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
 v1.8.3)
 and
 ompi_info | grep psm
 gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
 because the interconnect is TrueScale/QLogic
 
 and
 
 setenv OMPI_MCA_mtl "psm"
 
 is set in the script. This is the PE
 
 pe_name   orte
 slots 4000
 user_listsNONE
 xuser_lists   NONE
 start_proc_args   /bin/true
 stop_proc_args/bin/true
 allocation_rule   $fill_up
 control_slavesTRUE
 job_is_first_task FALSE
 urgency_slots min
 
 Many thanks
 
 Henk
 
 
 From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
 Sent: 10 November 2014 05:07
 To: Open MPI Users
 Subject: Re: [OMPI users] oversubscription of slots with GridEngine
 
 You might also add the --display-allocation flag to mpirun so we can see 
 what it thinks the allocation looks like. If there are only 16 slots on 
 the node, it seems odd that OMPI would assign 32 procs to it unless it 
 thinks there is only 1 node in the job, and oversubscription is allowed 
 (which it won’t be by default if it read the GE allocation)
 
 
 On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
 
 Hi,
 
 
 On 09.11.2014 at 18:20, SLIM H.A. wrote:
 
 We switched on hyper threading on our cluster with two eight core sockets 
 per node (32 threads per node).
 
 We configured  gridengine with 16 slots per node to allow the 16 extra 
 threads for kernel process use but this apparently does not work. Printout 
 of the gridengine hostfile shows that for a 32 slots job, 16 slots are 
 placed on each of two nodes as expected. Including the openmpi 
 --display-map option shows that all 32 processes are incorrectly  placed 
 on the head node.
 
 You mean the master node of the parallel job I assume.
 
 
 Here is part of the output
 
 master=cn6083
 PE=orte
 
 What allocation rule was defined for this PE - "control_slave yes" is set?
 
 
 JOB_ID=2481793
 Got 32 slots.
 slots:
 cn6083 16 par6.q@cn6083 
 cn6085 16 par6.q@cn6085 
 Sun Nov  9 16:50:59 GMT 2014
 Data for JOB [44767,1] offset 0
 
    JOB MAP   
 
 Data for node: cn6083  Num slots: 16   Max slots: 0   Num procs: 32
 Process OMPI jobid: [44767,1] App: 0 Process rank: 0
 Process OMPI jobid: [44767,1] App: 0 Process rank: 1
 ...
  

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Ralph Castain

> On Nov 11, 2014, at 10:06 AM, Reuti  wrote:
> 
> On 11.11.2014 at 17:52, Ralph Castain wrote:
> 
>> 
>>> On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
>>> 
>>> On 11.11.2014 at 16:13, Ralph Castain wrote:
>>> 
 This clearly displays the problem - if you look at the reported “allocated 
 nodes”, you see that we only got one node (cn6050). This is why we mapped 
 all your procs onto that node.
 
 So the real question is - why? Can you show us the content of PE_HOSTFILE?
 
 
> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
> 
> Dear Reuti and Ralph
> 
> Below is the output of the run for openmpi 1.8.3 with this line
> 
> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 
> $exe
> 
> 
> master=cn6050
> PE=orte
> JOB_ID=2482923
> Got 32 slots.
> slots:
> cn6050 16 par6.q@cn6050 
> cn6045 16 par6.q@cn6045 
>>> 
>>> The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>> 
>> Hey Reuti
>> 
>> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
>> module, and it looks like it is expecting a different format. I suspect that 
>> is the problem
> 
> Well, the fourth column can be a processor range in older versions of SGE and 
> the binding in newer ones, but the first three columns were always this way.

Hmmm…perhaps I’m confused here. I guess you’re saying that just the last two 
lines of his output contain the PE_HOSTFILE, as opposed to the entire thing? If 
so, I’m wondering if that NULL he shows in there is the source of the trouble. 
The parser doesn’t look like it would handle that very well, though I’d need to 
test it. Is that NULL expected? Or is the NULL not really in the file?


> 
> -- Reuti
> 
> 
>> Ralph
>> 
>>> 
>>> I wonder whether any environment variable was reset, which normally allows 
>>> Open MPI to discover that it's running inside SGE.
>>> 
>>> I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
>>> starts?
>>> 
>>> Supplying "-np $NSLOTS" shouldn't be necessary though.
>>> 
>>> -- Reuti
>>> 
>>> 
>>> 
> Tue Nov 11 12:37:37 GMT 2014
> 
> ==   ALLOCATED NODES   ==
>  cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
> =
> Data for JOB [57374,1] offset 0
> 
>    JOB MAP   
> 
> Data for node: cn6050  Num slots: 16   Max slots: 0   Num procs: 32
>  Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>  Process OMPI jobid: [57374,1] App: 0 Process rank: 1
> 
> …
>  Process OMPI jobid: [57374,1] App: 0 Process rank: 31
> 
> 
> Also
> ompi_info | grep grid
> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
> v1.8.3)
> and
> ompi_info | grep psm
> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
> because the interconnect is TrueScale/QLogic
> 
> and
> 
> setenv OMPI_MCA_mtl "psm"
> 
> is set in the script. This is the PE
> 
> pe_name   orte
> slots 4000
> user_listsNONE
> xuser_lists   NONE
> start_proc_args   /bin/true
> stop_proc_args/bin/true
> allocation_rule   $fill_up
> control_slavesTRUE
> job_is_first_task FALSE
> urgency_slots min
> 
> Many thanks
> 
> Henk
> 
> 
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: 10 November 2014 05:07
> To: Open MPI Users
> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
> 
> You might also add the --display-allocation flag to mpirun so we can see 
> what it thinks the allocation looks like. If there are only 16 slots on 
> the node, it seems odd that OMPI would assign 32 procs to it unless it 
> thinks there is only 1 node in the job, and oversubscription is allowed 
> (which it won’t be by default if it read the GE allocation)
> 
> 
> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
> 
> Hi,
> 
> 
> On 09.11.2014 at 18:20, SLIM H.A. wrote:
> 
> We switched on hyper threading on our cluster with two eight core sockets 
> per node (32 threads per node).
> 
> We configured  gridengine with 16 slots per node to allow the 16 extra 
> threads for kernel process use but this apparently does not work. 
> Printout of the gridengine hostfile shows that for a 32 slots job, 16 
> slots are placed on each of two nodes as expected. Including the openmpi 
> --display-map option shows that all 32 processes are incorrectly  placed 
> on the head node.
> 
> You mean the master node of the parallel job I assume.
> 
> 
> Here is part of the output
> 
> master

Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-11 Thread Emmanuel Thomé
Thanks a lot for your analysis. This seems consistent with what I can
obtain by playing around with my different test cases.

It seems that munmap() does *not* unregister the memory chunk from the
cache. I suppose this is the reason for the bug.

In fact, using mmap(..., MAP_ANONYMOUS | MAP_PRIVATE) and munmap() as
substitutes for malloc()/free() triggers the same problem.
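
Concretely, the substitution looks like this (a minimal sketch; the helper
names are mine and error handling is omitted):

#include <stddef.h>
#include <sys/mman.h>

/* malloc()/free() stand-ins used to reproduce the problem: the buffer
 * comes from an anonymous private mapping instead of the heap. */
static void *mmap_alloc(size_t n)
{
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

static void mmap_free(void *p, size_t n)
{
    /* If the registration cache is not told about this munmap(), a later
     * mapping that reuses the same address range can hit a stale entry. */
    munmap(p, n);
}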

It looks to me like there is an oversight in the OPAL hooks around the
memory functions, then. Do you agree?

E.

On Tue, Nov 11, 2014 at 3:17 PM, Joshua Ladd  wrote:
> I was able to reproduce your issue and I think I understand the problem a
> bit better at least. This demonstrates exactly what I was pointing to:
>
> It looks like when the test switches over from eager RDMA (I'll explain in a
> second) to a rendezvous protocol working entirely in user buffer space, things
> go bad.
>
> If your input is smaller than some threshold, the eager RDMA limit, then
> the contents of your user buffer are copied into OMPI/OpenIB BTL scratch
> buffers called "eager fragments". This pool of resources is preregistered,
> pinned, and has had its rkeys exchanged. So, in the eager protocol, your
> data is copied into these "locked and loaded" RDMA frags and the put/get is
> handled internally. When the data is received, it's copied back out into
> your buffer. In your setup, this always works.
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 512 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x0401 [ok]
> node 0 iteration 1, lead word received from peer is 0x0801 [ok]
> node 0 iteration 2, lead word received from peer is 0x0c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x1001 [ok]
>
> When you exceed the eager threshold, this always fails on the second
> iteration. To understand this, you need to understand that there is a
> protocol switch where now your user buffer is used for the transfer. Hence,
> the user buffer is registered with the HCA. This operation is an inherently
> high latency operation and is one of the primary motives for doing
> copy-in/copy-out into preregistered buffers for small, latency sensitive
> ops. For bandwidth bound transfers, the cost to register can be amortized
> over the whole transfer, but it still affects the total bandwidth. In the
> case of a rendezvous protocol where the user buffer is registered, there is
> an optimization mostly used to help improve the numbers in a bandwidth
> benchmark called a registration cache. With registration caching, the user
> buffer is registered once, the mkey is put into a cache, and the memory is
> kept pinned until the system provides some notification (via either memory
> hooks in p2p malloc, or ummunotify) that the buffer has been freed, which
> signals that the mkey can be evicted from the cache. On subsequent
> send/recv operations from the same user buffer address, the OpenIB BTL will
> find the address in the registration cache, take the cached mkey, and avoid
> paying the cost of the memory registration before starting the data
> transfer.
>
> What I noticed is when the rendezvous protocol kicks in, it always fails on
> the second iteration.
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 128 -mca btl openib,self ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x0401 [ok]
> node 0 iteration 1, lead word received from peer is 0x [NOK]
> --
>
> So, I suspected it has something to do with the way the virtual address is
> being handled in this case. To test that theory, I just completely disabled
> the registration cache by setting -mca mpi_leave_pinned 0 and things start
> to work:
>
> $mpirun -np 2 --map-by node --bind-to core -mca pml ob1 -mca
> btl_openib_if_include mlx4_0:1 -mca btl_openib_use_eager_rdma 1 -mca
> btl_openib_eager_limit 128 -mca mpi_leave_pinned 0 -mca btl openib,self
> ./ibtest -s 56
> per-node buffer has size 448 bytes
> node 0 iteration 0, lead word received from peer is 0x0401 [ok]
> node 0 iteration 1, lead word received from peer is 0x0801 [ok]
> node 0 iteration 2, lead word received from peer is 0x0c01 [ok]
> node 0 iteration 3, lead word received from peer is 0x1001 [ok]
>
> I don't know enough about memory hooks or the registration cache
> implementation to speak with any authority, but it looks like this is where
> the issue resides. As a workaround, can you try your original experiment
> with -mca mpi_leave_pinned 0 and see if you get consistent results?
>
>
> Josh
>
>
>
>
>
> On Tue, Nov 11, 2014 at 7:07 AM, Emmanuel Thomé 
> wrote:
>

Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-11 Thread Reuti

On 11.11.2014 at 19:29, Ralph Castain wrote:

> 
>> On Nov 11, 2014, at 10:06 AM, Reuti  wrote:
>> 
>> On 11.11.2014 at 17:52, Ralph Castain wrote:
>> 
>>> 
 On Nov 11, 2014, at 7:57 AM, Reuti  wrote:
 
 On 11.11.2014 at 16:13, Ralph Castain wrote:
 
> This clearly displays the problem - if you look at the reported 
> “allocated nodes”, you see that we only got one node (cn6050). This is 
> why we mapped all your procs onto that node.
> 
> So the real question is - why? Can you show us the content of PE_HOSTFILE?
> 
> 
>> On Nov 11, 2014, at 4:51 AM, SLIM H.A.  wrote:
>> 
>> Dear Reuti and Ralph
>> 
>> Below is the output of the run for openmpi 1.8.3 with this line
>> 
>> mpirun -np $NSLOTS --display-map --display-allocation --cpus-per-proc 1 
>> $exe
>> 
>> 
>> master=cn6050
>> PE=orte
>> JOB_ID=2482923
>> Got 32 slots.
>> slots:
>> cn6050 16 par6.q@cn6050 
>> cn6045 16 par6.q@cn6045 
 
 The above looks like the PE_HOSTFILE. So it should be 16 slots per node.
>>> 
>>> Hey Reuti
>>> 
>>> Is that the standard PE_HOSTFILE format? I’m looking at the ras/gridengine 
>>> module, and it looks like it is expecting a different format. I suspect 
>>> that is the problem
>> 
>> Well, the fourth column can be a processor range in older versions of SGE 
>> and the binding in newer ones, but the first three columns were always this 
>> way.
> 
> Hmmm…perhaps I’m confused here. I guess you’re saying that just the last two 
> lines of his output contain the PE_HOSTFILE, as opposed to the entire thing?

Yes. The entire thing looks like output of the job script from the OP. Only 
the last two lines should be the content of the PE_HOSTFILE.


> If so, I’m wondering if that NULL he shows in there is the source of the 
> trouble. The parser doesn’t look like it would handle that very well, though 
> I’d need to test it. Is that NULL expected? Or is the NULL not really in the 
> file?

I must admit here: for me the fourth column is either literally UNDEFINED, or 
the tuple cpu,core (like 0,0) in case binding is turned on. But it's never NULL, 
neither literally nor the byte 0x00. Maybe the OP can tell us which GE version 
he uses.
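
For example, those two forms would look like this (illustrative lines only, 
using the OP's hostname):

  cn6050 16 par6.q@cn6050 UNDEFINED     (no binding / older SGE)
  cn6050 16 par6.q@cn6050 0,0           (binding turned on, cpu,core tuple)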

-- Reuti


>> -- Reuti
>> 
>> 
>>> Ralph
>>> 
 
 I wonder whether any environment variable was reset, which normally allows 
 Open MPI to discover that it's running inside SGE.
 
 I.e. SGE_ROOT, JOB_ID, ARC and PE_HOSTFILE are untouched before the job 
 starts?
 
 Supplying "-np $NSLOTS" shouldn't be necessary though.
 
 -- Reuti
 
 
 
>> Tue Nov 11 12:37:37 GMT 2014
>> 
>> ==   ALLOCATED NODES   ==
>>  cn6050: slots=16 max_slots=0 slots_inuse=0 state=UP
>> =
>> Data for JOB [57374,1] offset 0
>> 
>>    JOB MAP   
>> 
>> Data for node: cn6050  Num slots: 16   Max slots: 0   Num procs: 32
>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 0
>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 1
>> 
>> …
>>  Process OMPI jobid: [57374,1] App: 0 Process rank: 31
>> 
>> 
>> Also
>> ompi_info | grep grid
>> gives MCA ras: gridengine (MCA v2.0, API v2.0, Component 
>> v1.8.3)
>> and
>> ompi_info | grep psm
>> gives MCA mtl: psm (MCA v2.0, API v2.0, Component v1.8.3)
>> because the interconnect is TrueScale/QLogic
>> 
>> and
>> 
>> setenv OMPI_MCA_mtl "psm"
>> 
>> is set in the script. This is the PE
>> 
>> pe_name   orte
>> slots 4000
>> user_listsNONE
>> xuser_lists   NONE
>> start_proc_args   /bin/true
>> stop_proc_args/bin/true
>> allocation_rule   $fill_up
>> control_slavesTRUE
>> job_is_first_task FALSE
>> urgency_slots min
>> 
>> Many thanks
>> 
>> Henk
>> 
>> 
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph 
>> Castain
>> Sent: 10 November 2014 05:07
>> To: Open MPI Users
>> Subject: Re: [OMPI users] oversubscription of slots with GridEngine
>> 
>> You might also add the --display-allocation flag to mpirun so we can see 
>> what it thinks the allocation looks like. If there are only 16 slots on 
>> the node, it seems odd that OMPI would assign 32 procs to it unless it 
>> thinks there is only 1 node in the job, and oversubscription is allowed 
>> (which it won’t be by default if it read the GE allocation)
>> 
>> 
>> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
>> 
>> Hi,
>> 
>> 
>> On 09.11.2014 at 18:20, SLIM H.A. wrote:
>> 
>> We switched on hyper threading on our cluster with two eight core