Re: [OMPI users] oversubscription of slots with GridEngine

2014-11-10 Thread Ralph Castain
You might also add the --display-allocation flag to mpirun so we can see what it 
thinks the allocation looks like. If there are only 16 slots on the node, it 
seems odd that OMPI would assign 32 procs to it unless it thinks there is only 
1 node in the job and oversubscription is allowed (which it won’t be by 
default if it read the GE allocation).
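
For example (the executable name is just a placeholder; any MPI program started
from the same GridEngine job script will do):

  # inside the GridEngine job script
  mpirun --display-allocation --display-map -np 32 ./a.out

If the allocation printout shows only a single node with 32 slots, that would
confirm OMPI never picked up the second host from the GE allocation.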


> On Nov 9, 2014, at 9:56 AM, Reuti  wrote:
> 
> Hi,
> 
>> On 09.11.2014 at 18:20, SLIM H.A. wrote:
>> 
>> We switched on hyper threading on our cluster with two eight core sockets 
>> per node (32 threads per node).
>> 
>> We configured  gridengine with 16 slots per node to allow the 16 extra 
>> threads for kernel process use but this apparently does not work. Printout 
>> of the gridengine hostfile shows that for a 32 slots job, 16 slots are 
>> placed on each of two nodes as expected. Including the openmpi --display-map 
>> option shows that all 32 processes are incorrectly  placed on the head node.
> 
> You mean the master node of the parallel job I assume.
> 
>> Here is part of the output
>> 
>> master=cn6083
>> PE=orte
> 
> What allocation_rule was defined for this PE - and is "control_slaves TRUE" set?
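
For reference, a tight-integration PE for Open MPI usually looks something like
the following sketch (the field values are only an example, not the poster's
actual setup):

  $ qconf -sp orte
  pe_name            orte
  slots              999
  start_proc_args    /bin/true
  stop_proc_args     /bin/true
  allocation_rule    $fill_up
  control_slaves     TRUE
  job_is_first_task  FALSE

With "control_slaves TRUE" the qrsh-based startup of the remote daemons is
allowed, and the allocation_rule decides how the 32 slots get spread over the
hosts.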
> 
>> JOB_ID=2481793
>> Got 32 slots.
>> slots:
>> cn6083 16 par6.q@cn6083 
>> cn6085 16 par6.q@cn6085 
>> Sun Nov  9 16:50:59 GMT 2014
>> Data for JOB [44767,1] offset 0
>> 
>>    JOB MAP   
>> 
>> Data for node: cn6083  Num slots: 16   Max slots: 0   Num procs: 32
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 0
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 1
>> ...
>>   Process OMPI jobid: [44767,1] App: 0 Process rank: 31
>> 
>> =
>> 
>> I found some related mailings about a new warning in 1.8.2 about 
>> oversubscription and  I tried a few options to avoid the use of the extra 
>> threads for MPI tasks by openmpi without success, e.g. variants of
>> 
>> --cpus-per-proc 1 
>> --bind-to-core 
>> 
>> and some others. Gridengine treats hardware threads as cores==slots (?), but the 
>> content of $PE_HOSTFILE suggests it distributes the slots sensibly, so it 
>> seems an Open MPI option is required to get 16 cores per node?
> 
> Was Open MPI configured with --with-sge?
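
One quick way to check an already-installed build (the output line is
approximate; the important part is that a gridengine component shows up at all):

  $ ompi_info | grep gridengine
    MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.8.3)

If nothing is listed, rebuilding along these lines should pick up the SGE
support (prefix is a placeholder):

  $ ./configure --prefix=$HOME/openmpi-1.8.3 --with-sge
  $ make -j4 install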
> 
> -- Reuti
> 
>> I tried 1.8.2, 1.8.3 and also 1.6.5.
>> 
>> Thanks for some clarification that anyone can give.
>> 
>> Henk
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/11/25718.php 
>> 
> ___
> users mailing list
> us...@open-mpi.org 
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users 
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25719.php 
> 


Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

On 09.11.2014 at 05:38, Ralph Castain wrote:

> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
> process receives a complete map of that info for every process in the job. So 
> when the TCP btl sets itself up, it attempts to connect across -all- the 
> interfaces published by the other end.
> 
> So it doesn’t matter what hostname is provided by the RM. We discover and 
> “share” all of the interface info for every node, and then use them for 
> loadbalancing.

does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 
for some time and tried to use Open MPI 1.8.3 now. As there was a delay when the 
application started with my first compilation of 1.8.3, I dropped even all my 
extra options and ran it outside of any queuing system - the delay remains - on 
two different clusters.

I tracked it down: up to 1.8.1 it is working fine, but 1.8.2 already 
creates this delay when starting up a simple mpihello. I assume it may lie in 
the way other machines are reached, as with one single machine there is no 
delay. But using one (and only one - no tree spawn involved) additional machine 
already triggers this delay.

Did anyone else notice it?

-- Reuti


> HTH
> Ralph
> 
> 
>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>> 
>> Ok I figured, I'm going to have to read some more for my own curiosity. The 
>> reason I mention the Resource Manager we use, and that the hostnames given 
>> by PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would 
>> take to get to a peer node when the node list given all matches the 1gig 
>> interfaces but data is being sent out the 10gig eoib0/ib0 interfaces.  
>> 
>> I'll go do some measurements and see.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.  
>>> 
>>> This short FAQ has links to 2 other FAQs that provide detailed information 
>>> about reachability:
>>> 
>>>  http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>> 
>>> The usNIC BTL uses UDP for its wire transport and actually does a much more 
>>> standards-conformant peer reachability determination (i.e., it actually 
>>> checks routing tables to see if it can reach a given peer which has all 
>>> kinds of caching benefits, kernel controls if you want them, etc.).  We 
>>> haven't back-ported this to the TCP BTL because a) most people who use TCP 
>>> for MPI still use a single L2 address space, and b) no one has asked for 
>>> it.  :-)
>>> 
>>> As for the round robin scheduling, there's no indication from the Linux TCP 
>>> stack what the bandwidth is on a given IP interface.  So unless you use the 
>>> btl_tcp_bandwidth_<device> (e.g., btl_tcp_bandwidth_eth0) MCA 
>>> params, OMPI will round-robin across them equally.
>>> 
>>> If you have multiple IP interfaces sharing a single physical link, there 
>>> will likely be no benefit from having Open MPI use more than one of them.  
>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
>>> just one.
>>> 
>>> 
>>> 
>>> 
>>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>>> 
 I was doing a test on our IB based cluster, where I was disabling IB
 
 --mca btl ^openib --mca mtl ^mxm
 
 I was sending very large messages >1GB and I was surprised by the speed.
 
 I noticed then that of all our ethernet interfaces
 
 eth0  (1gig-e)
 ib0  (ip over ib, for lustre configuration at vendor request)
 eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
 external storage support at >1Gig speed)
 
 I saw all three were getting traffic.
 
 We use torque for our Resource Manager and use TM support, the hostnames 
 given by torque match the eth0 interfaces.
 
 How does OMPI figure out that it can also talk over the others?  How does 
 it choose to load balance?
 
 BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
 and eoib0 are the same physical device and may screw with load balancing 
 if anyone ever falls back to TCP.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
 ___
 users mailing list
 us...@open-mpi.org
 Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
 Link to this post: 
 http://www.open-mpi.org/community/lists/users/2014/11/25709.php
>>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to: 
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
On 10.11.2014 at 12:24, Reuti wrote:

> Hi,
> 
> On 09.11.2014 at 05:38, Ralph Castain wrote:
> 
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
>> process receives a complete map of that info for every process in the job. 
>> So when the TCP btl sets itself up, it attempts to connect across -all- the 
>> interfaces published by the other end.
>> 
>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>> “share” all of the interface info for every node, and then use them for 
>> loadbalancing.
> 
> does this lead to any time delay when starting up? I stayed with Open MPI 
> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a delay 
> when the applications starts in my first compilation of 1.8.3 I disregarded 
> even all my extra options and run it outside of any queuingsystem - the delay 
> remains - on two different clusters.

I forgot to mention: the delay is more or less exactly 2 minutes from the time 
I issued `mpiexec` until the `mpihello` starts up (there is no delay for the 
initial `ssh` to reach the other node though).
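
In case it helps to narrow down where those two minutes are spent, the
out-of-band and BTL setup can be made more verbose; a sketch along the lines of
the runs in this thread (hostfile and program names are placeholders, and the
verbosity levels are somewhat arbitrary):

  date; mpiexec --mca oob_base_verbose 10 --mca btl_base_verbose 10 \
      -n 4 --hostfile machines ./mpihello; date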

-- Reuti


> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
> creates this delay when starting up a simple mpihello. I assume it may lay in 
> the way how to reach other machines, as with one single machine there is no 
> delay. But using one (and only one - no tree spawn involved) additional 
> machine already triggers this delay.
> 
> Did anyone else notice it?
> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>> 
>>> Ok I figured, i'm going to have to read some more for my own curiosity. The 
>>> reason I mention the Resource Manager we use, and that the hostnames given 
>>> but PBS/Torque match the 1gig-e interfaces, i'm curious what path it would 
>>> take to get to a peer node when the node list given all match the 1gig 
>>> interfaces but yet data is being sent out the 10gig eoib0/ib0 interfaces.  
>>> 
>>> I'll go do some measurements and see.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>> 
>>> 
 On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
 wrote:
 
 Ralph is right: OMPI aggressively uses all Ethernet interfaces by default. 
  
 
 This short FAQ has links to 2 other FAQs that provide detailed information 
 about reachability:
 
 http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
 
 The usNIC BTL uses UDP for its wire transport and actually does a much 
 more standards-conformant peer reachability determination (i.e., it 
 actually checks routing tables to see if it can reach a given peer which 
 has all kinds of caching benefits, kernel controls if you want them, 
 etc.).  We haven't back-ported this to the TCP BTL because a) most people 
 who use TCP for MPI still use a single L2 address space, and b) no one has 
 asked for it.  :-)
 
 As for the round robin scheduling, there's no indication from the Linux 
 TCP stack what the bandwidth is on a given IP interface.  So unless you 
 use the btl_tcp_bandwidth_ (e.g., 
 btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
 equally.
 
 If you have multiple IP interfaces sharing a single physical link, there 
 will likely be no benefit from having Open MPI use more than one of them.  
 You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select 
 just one.
 
 
 
 
 On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
 
> I was doing a test on our IB based cluster, where I was diabling IB
> 
> --mca btl ^openib --mca mtl ^mxm
> 
> I was sending very large messages >1GB  and I was surppised by the speed.
> 
> I noticed then that of all our ethernet interfaces
> 
> eth0  (1gig-e)
> ib0  (ip over ib, for lustre configuration at vendor request)
> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
> extrnal storage support at >1Gig speed
> 
> I saw all three were getting traffic.
> 
> We use torque for our Resource Manager and use TM support, the hostnames 
> given by torque match the eth0 interfaces.
> 
> How does OMPI figure out that it can also talk over the others?  How does 
> it chose to load balance?
> 
> BTW that is fine, but we will use if_exclude on one of the IB ones as ib0 
> and eoib0  are the same physical device and may screw with load balancing 
> if anyone ver falls back to TCP.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-11-10 Thread Jeff Squyres (jsquyres)
I am sorry for the delay; I've been caught up in SC deadlines.  :-(

I don't see anything blatantly wrong in this output.

Two things:

1. Can you try a nightly v1.8.4 snapshot tarball?  This will check to see if 
whatever the bug is has been fixed for the upcoming release:

http://www.open-mpi.org/nightly/v1.8/
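
Something along these lines should do it (the tarball name below is only a
placeholder - pick whatever the newest snapshot listed on that page is called):

  $ wget http://www.open-mpi.org/nightly/v1.8/openmpi-v1.8.4-nightly.tar.bz2
  $ tar xjf openmpi-v1.8.4-nightly.tar.bz2 && cd openmpi-v1.8.4-nightly
  $ ./configure --prefix=$HOME/ompi-nightly && make -j4 install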

2. Build Open MPI with the --enable-debug option (note that this adds a 
slight-but-noticeable performance penalty).  When you run, it should dump a 
core file.  Load that core file in a debugger and see where it is failing 
(i.e., file and line in the OMPI source).
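
Roughly (prefix and program name are placeholders; the exact gdb invocation
depends on where the core file ends up and what it is named):

  $ ./configure --prefix=$HOME/ompi-debug --enable-debug
  $ make -j4 install

  $ ulimit -c unlimited        # make sure core files may be written
  $ mpirun -np 2 ./a.out       # should now leave a core file when it segfaults
  $ gdb ./a.out core           # then "bt" for the backtrace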

We don't usually have to resort to asking users to perform #2, but there's no 
additional information to give a clue as to what is happening.  :-(



On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake  wrote:

> Hi Jeff,
> 
> You are probably busy, but just checking if you had a chance to look at this.
> 
> Thanks,
> Saliya
> 
> On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake  wrote:
> Hi Jeff,
> 
> I've attached a tar file with information. 
> 
> Thank you,
> Saliya
> 
> On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres)  
> wrote:
> Looks like it's failing in the openib BTL setup.
> 
> Can you send the info listed here?
> 
> http://www.open-mpi.org/community/help/
> 
> 
> 
> On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake  wrote:
> 
> > Hi,
> >
> > I am using OpenMPI 1.8.1 in a Linux cluster that we recently setup. It 
> > builds fine, but when I try to run even the simplest hello.c program it'll 
> > cause a segfault. Any suggestions on how to correct this?
> >
> > The steps I did and error message are below.
> >
> > 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info is attached.
> > 2. cd to examples directory and mpicc hello_c.c
> > 3. mpirun -np 2 ./a.out
> > 4. Error text is attached.
> >
> > Please let me know if you need more info.
> >
> > Thank you,
> > Saliya
> >
> >
> > --
> > Saliya Ekanayake esal...@gmail.com
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/users/2014/11/25668.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25672.php
> 
> 
> 
> -- 
> Saliya Ekanayake esal...@gmail.com 
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org
> 
> 
> 
> -- 
> Saliya Ekanayake esal...@gmail.com 
> Cell 812-391-4914 Home 812-961-6383
> http://saliya.org
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25717.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Jeff Squyres (jsquyres)
Wow, that's pretty terrible!  :(

Is the behavior BTL-specific, perchance?  E.G., if you only use certain BTLs, 
does the delay disappear?

FWIW: the use-all-IP interfaces approach has been in OMPI forever. 

Sent from my phone. No type good. 

> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
> 
>> On 10.11.2014 at 12:24, Reuti wrote:
>> 
>> Hi,
>> 
>>> On 09.11.2014 at 05:38, Ralph Castain wrote:
>>> 
>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each 
>>> process receives a complete map of that info for every process in the job. 
>>> So when the TCP btl sets itself up, it attempts to connect across -all- the 
>>> interfaces published by the other end.
>>> 
>>> So it doesn’t matter what hostname is provided by the RM. We discover and 
>>> “share” all of the interface info for every node, and then use them for 
>>> loadbalancing.
>> 
>> does this lead to any time delay when starting up? I stayed with Open MPI 
>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a delay 
>> when the applications starts in my first compilation of 1.8.3 I disregarded 
>> even all my extra options and run it outside of any queuingsystem - the 
>> delay remains - on two different clusters.
> 
> I forgot to mention: the delay is more or less exactly 2 minutes from the 
> time I issued `mpiexec` until the `mpihello` starts up (there is no delay for 
> the initial `ssh` to reach the other node though).
> 
> -- Reuti
> 
> 
>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>> creates this delay when starting up a simple mpihello. I assume it may lay 
>> in the way how to reach other machines, as with one single machine there is 
>> no delay. But using one (and only one - no tree spawn involved) additional 
>> machine already triggers this delay.
>> 
>> Did anyone else notice it?
>> 
>> -- Reuti
>> 
>> 
>>> HTH
>>> Ralph
>>> 
>>> 
 On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
 
 Ok I figured, i'm going to have to read some more for my own curiosity. 
 The reason I mention the Resource Manager we use, and that the hostnames 
 given but PBS/Torque match the 1gig-e interfaces, i'm curious what path it 
 would take to get to a peer node when the node list given all match the 
 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
 interfaces.  
 
 I'll go do some measurements and see.
 
 Brock Palen
 www.umich.edu/~brockp
 CAEN Advanced Computing
 XSEDE Campus Champion
 bro...@umich.edu
 (734)936-1985
 
 
 
> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
> default.  
> 
> This short FAQ has links to 2 other FAQs that provide detailed 
> information about reachability:
> 
> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> 
> The usNIC BTL uses UDP for its wire transport and actually does a much 
> more standards-conformant peer reachability determination (i.e., it 
> actually checks routing tables to see if it can reach a given peer which 
> has all kinds of caching benefits, kernel controls if you want them, 
> etc.).  We haven't back-ported this to the TCP BTL because a) most people 
> who use TCP for MPI still use a single L2 address space, and b) no one 
> has asked for it.  :-)
> 
> As for the round robin scheduling, there's no indication from the Linux 
> TCP stack what the bandwidth is on a given IP interface.  So unless you 
> use the btl_tcp_bandwidth_ (e.g., 
> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
> equally.
> 
> If you have multiple IP interfaces sharing a single physical link, there 
> will likely be no benefit from having Open MPI use more than one of them. 
>  You should probably use btl_tcp_if_include / btl_tcp_if_exclude to 
> select just one.
> 
> 
> 
> 
>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>> 
>> I was doing a test on our IB based cluster, where I was diabling IB
>> 
>> --mca btl ^openib --mca mtl ^mxm
>> 
>> I was sending very large messages >1GB  and I was surppised by the speed.
>> 
>> I noticed then that of all our ethernet interfaces
>> 
>> eth0  (1gig-e)
>> ib0  (ip over ib, for lustre configuration at vendor request)
>> eoib0  (ethernet over IB interface for IB -> Ethernet gateway for some 
>> extrnal storage support at >1Gig speed
>> 
>> I saw all three were getting traffic.
>> 
>> We use torque for our Resource Manager and use TM support, the hostnames 
>> given by torque match the eth0 interfaces.
>> 
>> How does OMPI figure out that it can also talk over the others?  How 
>> does it chose to load balance?
>> 
>> BTW that is fine, but we will use if_exclude on one of the 

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:

> Wow, that's pretty terrible!  :(
> 
> Is the behavior BTL-specific, perchance?  E.G., if you only use certain BTLs, 
> does the delay disappear?

You mean something like:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:44:34 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 3.
Hello World from Node 2.
Mon Nov 10 13:46:42 CET 2014

(the above was even the latest v1.8.3-186-g978f61d)

Falling back to 1.8.1 gives (as expected):

reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
./mpihello; date
Mon Nov 10 13:49:51 CET 2014
Hello World from Node 1.
Total: 4
Universe: 4
Hello World from Node 0.
Hello World from Node 2.
Hello World from Node 3.
Mon Nov 10 13:49:53 CET 2014


-- Reuti

> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
> 
> Sent from my phone. No type good. 
> 
>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>> 
>>> On 10.11.2014 at 12:24, Reuti wrote:
>>> 
>>> Hi,
>>> 
 On 09.11.2014 at 05:38, Ralph Castain wrote:
 
 FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
 Each process receives a complete map of that info for every process in the 
 job. So when the TCP btl sets itself up, it attempts to connect across 
 -all- the interfaces published by the other end.
 
 So it doesn’t matter what hostname is provided by the RM. We discover and 
 “share” all of the interface info for every node, and then use them for 
 loadbalancing.
>>> 
>>> does this lead to any time delay when starting up? I stayed with Open MPI 
>>> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
>>> delay when the applications starts in my first compilation of 1.8.3 I 
>>> disregarded even all my extra options and run it outside of any 
>>> queuingsystem - the delay remains - on two different clusters.
>> 
>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>> for the initial `ssh` to reach the other node though).
>> 
>> -- Reuti
>> 
>> 
>>> I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
>>> creates this delay when starting up a simple mpihello. I assume it may lay 
>>> in the way how to reach other machines, as with one single machine there is 
>>> no delay. But using one (and only one - no tree spawn involved) additional 
>>> machine already triggers this delay.
>>> 
>>> Did anyone else notice it?
>>> 
>>> -- Reuti
>>> 
>>> 
 HTH
 Ralph
 
 
> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
> 
> Ok I figured, i'm going to have to read some more for my own curiosity. 
> The reason I mention the Resource Manager we use, and that the hostnames 
> given but PBS/Torque match the 1gig-e interfaces, i'm curious what path 
> it would take to get to a peer node when the node list given all match 
> the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
> interfaces.  
> 
> I'll go do some measurements and see.
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> XSEDE Campus Champion
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
>> default.  
>> 
>> This short FAQ has links to 2 other FAQs that provide detailed 
>> information about reachability:
>> 
>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>> 
>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>> more standards-conformant peer reachability determination (i.e., it 
>> actually checks routing tables to see if it can reach a given peer which 
>> has all kinds of caching benefits, kernel controls if you want them, 
>> etc.).  We haven't back-ported this to the TCP BTL because a) most 
>> people who use TCP for MPI still use a single L2 address space, and b) 
>> no one has asked for it.  :-)
>> 
>> As for the round robin scheduling, there's no indication from the Linux 
>> TCP stack what the bandwidth is on a given IP interface.  So unless you 
>> use the btl_tcp_bandwidth_ (e.g., 
>> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
>> equally.
>> 
>> If you have multiple IP interfaces sharing a single physical link, there 
>> will likely be no benefit from having Open MPI use more than one of 
>> them.  You should probably use btl_tcp_if_include / btl_tcp_if_exclude 
>> to select just one.
>> 
>> 
>> 
>> 
>>> On Nov 7, 2014, at 2:53 PM, Brock Palen  wrote:
>>> 
>>> I was doing a test on our IB

Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-10 Thread Dave Love
"Jeff Squyres (jsquyres)"  writes:

> There were several commits; this was the first one:
>
> https://github.com/open-mpi/ompi/commit/d7eaca83fac0d9783d40cac17e71c2b090437a8c

I don't have time to follow this properly, but am I reading right that
that says mpi_sizeof will now _not_ work with gcc < 4.9, i.e. the system
compiler of the vast majority of HPC GNU/Linux systems, whereas it did
before (at least in simple cases)?

> IIRC, it only affected certain configure situations (e.g., only
> certain fortran compilers).  I'm failing to remember the exact
> scenario offhand that was problematic right now, but it led to the
> larger question of: "hey, wait, don't we have to support MPI_SIZEOF in
> mpif.h, too?"

I'd have said the answer was a clear "no", without knowing what the
standard says about mpif.h, but I'd expect that to be deprecated anyhow.
(The man pages generally don't mention USE, only INCLUDE, which seems
wrong.)

>
>> I don't understand how it can work generally with mpif.h (f77?), as
>> implied by the man page, rather than the module.
>
> According to discussion in the Forum Fortran working group, it is
> required that MPI_SIZEOF must be supported in *all* MPI Fortran
> interfaces, including mpif.h.

Well that's generally impossible if it's meant to include Fortran77
compilers (which I must say doesn't seem worth it at this stage).

> Hence, if certain conditions are met by your Fortran compiler (i.e.,
> it's modern enough), OMPI 1.8.4 will have MPI_SIZEOF prototypes in
> mpif.h.  If not, then you get the same old mpif.h you've always had
> (i.e., no MPI_SIZEOF prototypes, and MPI_SIZEOF won't work properly if
> you use the mpif.h interfaces).

If it's any consolation, it doesn't work in the other MPIs here
(mp(va)pich and intel), as I'd expect.

> Keep in mind that MPI does not prohibit having prototypes in mpif.h --
> it's just that most (all?) MPI implementations don't tend to provide
> them.  However, in the case of MPI_SIZEOF, it is *required* that
> prototypes are available because the implementation needs the type
> information to return the size properly (in mpif.h., mpi module, and
> mpi_f08 module).
>
> Make sense?

Fortran has interfaces, not prototypes!

I understand the technicalities -- I hacked on g77 intrinsics -- but I'm
not sure how much sense it's making if things have effectively gone
backwards with gfortran.



[OMPI users] MPI_Wtime not working with -mno-sse flag

2014-11-10 Thread maxinator333

Hello again,

I have a piece of code, which worked fine on my PC, but on my notebook 
MPI_Wtime and MPI_Wtick won't work with the -mno-sse flag specified. 
MPI_Wtick will return 0 instead of 1e-6 and MPI_Wtime will also always return 
0. clock() works in all cases.


The Code is:

#include <mpi.h>
#include <iostream>
#include <ctime>
#include <unistd.h> // sleep

using namespace std;
int main(int argc, char* argv[])
{
    MPI_Init(NULL,NULL);
    cout << "Wtick: " << MPI_Wtick();
    for (int i=0; i<10; i++) {
        clock_t   t0  = MPI_Wtime();
        sleep(0.1);
        //sleep(1); // SLEEP WON'T BE COUNTED FOR CLOCK!!! FUCK,
                    // SAME SEEMS TO BE TRUE FOR MPI_WTIME
        // ALSO MPI_WTIME might not be working with -mno-sse!!!
        clock_t   t1  = MPI_Wtime();
        cout << "clock=" << clock() << ", CLOCKS_PER_SEC=" << CLOCKS_PER_SEC;
        cout << ", t0=" << t0 << ", t1=" << t1 << ", dt="
             << double(t1-t0)/CLOCKS_PER_SEC << endl;
    }
    return 0;
}

I compile and run it in cygwin 64bit, Open MPI: 1.8.3 r32794 and g++ 
(GCC) 4.8.3 with the following command lines and outputs:


$ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o 
matrix-diverror.exe -Wall -std=c++0x -mno-sse; ./matrix-diverror.exe

Wtick: 0clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314

Versus the version with SSE enabled (i.e., compiled without -mno-sse):

$ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o 
matrix-diverror.exe -Wall -std=c++0x; ./matrix-diverror.exe
Wtick: 1e-06clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, 
t1=1415626322, dt=0

clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0






Re: [OMPI users] MPI_Wtime not working with -mno-sse flag

2014-11-10 Thread Jeff Squyres (jsquyres)
On some platforms, the MPI_Wtime function essentially uses gettimeofday() under 
the covers.

See this stackoverflow question about -mno-sse:


http://stackoverflow.com/questions/3687845/error-with-mno-sse-flag-and-gettimeofday-in-c



On Nov 10, 2014, at 8:35 AM, maxinator333  wrote:

> Hello again,
> 
> I have a piece of code, which worked fine on my PC, but on my notebook 
> MPI_Wtime and MPI_Wtick won't work with the -mno-sse flag specified. 
> MPI_Wtick will return 0 instead of 1e-6 and MPI_Wtime will also return always 
> 0. clock() works in all cases.
> 
> The Code is:
> 
> #include <mpi.h>
> #include <iostream>
> #include <ctime>
> #include <unistd.h> // sleep
> 
> using namespace std;
> int main(int argc, char* argv[])
> {
>MPI_Init(NULL,NULL);
>cout << "Wtick: " << MPI_Wtick();
>for (int i=0; i<10; i++) {
>clock_t   t0  = MPI_Wtime();
>sleep(0.1);
>//sleep(1); // SLEEP WON'T BE COUNTED FOR CLOCK!!! FUCK, SAME 
> SEEMS TO BE TRUE FOR MPI_WTIME
>// ALSO MPI_WTIME might not be working with -mno-sse!!!
>clock_t   t1= MPI_Wtime();
>cout << "clock=" << clock() << ", CLOCKS_PER_SEC=" << CLOCKS_PER_SEC;
>cout << ", t0=" << t0 << ", t1=" << t1 << ", dt=" << 
> double(t1-t0)/CLOCKS_PER_SEC << endl;
>}
>return 0;
> }
> 
> I compile and run it in cygwin 64bit, Open MPI: 1.8.3 r32794 and g++ (GCC) 
> 4.8.3 with the following command lines and outputs:
> 
> $ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o 
> matrix-diverror.exe -Wall -std=c++0x -mno-sse; ./matrix-diverror.exe
> Wtick: 0clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
> 
> Versus the Version without SSE deactivated:
> 
> $ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o 
> matrix-diverror.exe -Wall -std=c++0x; ./matrix-diverror.exe
> Wtick: 1e-06clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/11/25727.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] MPI_Wtime not working with -mno-sse flag

2014-11-10 Thread Alex A. Granovsky

Hello,

use RDTSC (or RDTSCP) to read TSC directly

Kind regards,
Alex Granovsky



-Original Message- 
From: maxinator333

Sent: Monday, November 10, 2014 4:35 PM
To: us...@open-mpi.org
Subject: [OMPI users] MPI_Wtime not working with -mno-sse flag

Hello again,

I have a piece of code, which worked fine on my PC, but on my notebook
MPI_Wtime and MPI_Wtick won't work with the -mno-sse flag specified.
MPI_Wtick will return 0 instead of 1e-6 and MPI_Wtime will also return
always 0. clock() works in all cases.

The Code is:

#include <mpi.h>
#include <iostream>
#include <ctime>
#include <unistd.h> // sleep

using namespace std;
int main(int argc, char* argv[])
{
MPI_Init(NULL,NULL);
cout << "Wtick: " << MPI_Wtick();
for (int i=0; i<10; i++) {
clock_t   t0  = MPI_Wtime();
sleep(0.1);
//sleep(1); // SLEEP WON'T BE COUNTED FOR CLOCK!!! FUCK,
SAME SEEMS TO BE TRUE FOR MPI_WTIME
// ALSO MPI_WTIME might not be working with -mno-sse!!!
clock_t   t1= MPI_Wtime();
cout << "clock=" << clock() << ", CLOCKS_PER_SEC=" <<
CLOCKS_PER_SEC;
cout << ", t0=" << t0 << ", t1=" << t1 << ", dt=" <<
double(t1-t0)/CLOCKS_PER_SEC << endl;
}
return 0;
}

I compile and run it in cygwin 64bit, Open MPI: 1.8.3 r32794 and g++
(GCC) 4.8.3 with the following command lines and outputs:

$ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o
matrix-diverror.exe -Wall -std=c++0x -mno-sse; ./matrix-diverror.exe
Wtick: 0clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314
clock=451, CLOCKS_PER_SEC=1000, t0=0, t1=0, dt=8.37245e-314

Versus the Version without SSE deactivated:

$ rm matrix-diverror.exe; mpic++ -g matrix-diverror.cpp -o
matrix-diverror.exe -Wall -std=c++0x; ./matrix-diverror.exe
Wtick: 1e-06clock=483, CLOCKS_PER_SEC=1000, t0=1415626322,
t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0
clock=483, CLOCKS_PER_SEC=1000, t0=1415626322, t1=1415626322, dt=0




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25727.php 





[OMPI users] File-backed mmaped I/O and openib btl.

2014-11-10 Thread Emmanuel Thomé
Hi,

I'm stumbling on a problem related to the openib btl in
openmpi-1.[78].*, and the (I think legitimate) use of file-backed
mmaped areas for receiving data through MPI collective calls.

A test case is attached. I've tried to make it reasonably small,
although I recognize that it's not extra thin. The test case is a
trimmed down version of what I witness in the context of a rather
large program, so there is no claim of relevance of the test case
itself. It's here just to trigger the desired misbehaviour. The test
case contains some detailed information on what is done, and the
experiments I did.

In a nutshell, the problem is as follows.

 - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
 - I save the result to a file (collective operation).

*If* I save the file using something such as:
 fd = open("blah", ...
 area = mmap(..., fd, )
 MPI_Gather(..., area, ...)
*AND* the MPI_Reduce_scatter is done with an alternative
implementation (which I believe is correct)
*AND* communication is done through the openib btl,

then the file which gets saved is inconsistent with what is obtained
with the normal MPI_Reduce_scatter (although memory areas do coincide
before the save).

I tried to dig a bit in the openib internals, but all I've been able
to witness was beyond my expertise (an RDMA read not transferring the
expected data, but I'm too uncomfortable with this layer to say
anything I'm sure about).

Tests have been done with several openmpi versions including 1.8.3, on
a debian wheezy (7.5) + OFED 2.3 cluster.

It would be great if someone could tell me if he is able to reproduce
the bug, or tell me whether something which is done in this test case
is illegal in any respect. I'd be glad to provide further information
which could be of any help.

Best regards,

E. Thomé.
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

/* This test file illustrates how in certain circumstances, using a
 * file-backed mmaped area for receiving data via an MPI_Gather call
 * perturbates data flow within MPI_Sendrecv calls in the same program.
 *
 * This program wants to run on 4 distinct nodes.
 *
 * Normal behaviour of the program consists in printing identical pairs
 * of md5 hashes for the files /tmp/u and /tmp/v, e.g. as with the
 * following output:
 * 
b28afcaeb2824fb298dace8da9f5b0f0  /tmp/u
b28afcaeb2824fb298dace8da9f5b0f0  /tmp/v
511e3325d7edf693c19c25dcfd1ce231  /tmp/u
511e3325d7edf693c19c25dcfd1ce231  /tmp/v
(repeated many times ; actual hashes may differ from one execution to another).
 *
 * Abnormal behaviour is when different md5 hashes are obtained, which
 * triggers MPI_Abort().
 *
 * Each iteration of the main loop is supposed to do the same
 * thing twice, once involving reduce_scatter, and once involving a
 * supposedly identical implementation. Both computations are followed by
 * a saving computed data to a file in /tmp/. A divergence is witnessed if the
 * conditions below are met.
 *
 *  - use the openib btl among the 4 nodes
 *  - use file-backed mmaped areas for doing the save.
 * 
 * The former condition is controlled by the btl mca, while the latter is
 * controlled by the flag below. Undefining it makes the bug go away.
 */
#define USE_MMAP_FOR_FILE_IO
/* Upon failure, it seems that the file which gets written ends with an
 * unexpected chunk of zeroes.
 */


/*
 * Instructions to reproduce:
 *
 *  - the bug seems to happen with openmpi versions 1.7 to 1.8.3
 *  (current).
 *  - the bug does not seem to happen with older openmpi, nor with
 *  mvapich2-2.1a
 *  - a setup with 4 distinct IB nodes is mandated to observe the
 *  problem.
 *
 * For compiling, one may do:
 
  MPI=$HOME/Packages/openmpi-1.8.3
  $MPI/bin/mpicc -W -Wall -std=c99 -O0 -g prog2.c
 
 * For running, assuming /tmp/hosts contains the list of 4 nodes, and
 * $SSH is used to connect to these:
 
  SSH_AUTH_SOCK= DISPLAY= $MPI/bin/mpiexec -machinefile /tmp/hosts --mca plm_rsh_agent $SSH --mca rmaps_base_mapping_policy node -n 4  ./a.out
 
 * If you pass the argument --mca btl ^openib to the mpiexec command
 * line, no crash.
 */

/*
 * Tested (FAIL means that setting USE_MMAP_FOR_FILE_IO above yields to a
 * program failure, while we succeed if it is unset).
 *
 * IB boards MCX353A-FCBT, fw rev 2.32.5100, MLNX_OFED_LINUX-2.3-1.0.1-debian7.5-x86_64
 * openmpi-1.8.3 FAIL   (ok with --mca btl ^openib)
 * openmpi-1.8.2 FAIL   (ok with --mca btl ^openib)
 * openmpi-1.8.1 FAIL   (ok with --mca btl ^openib)
 * openmpi-1.7.4 FAIL   (ok with --mca btl ^openib)
 * openmpi-1.7 FAIL (ok with --mca btl ^openib)
 * openmpi-1.6.5 ok
 * openmpi-1.4.5 ok
 * mvapich2-2.1a ok
 *
 * IB boards MCX353A-FCBT, fw rev 2.30.8000, debian 7.5 stock + kernel 3.13.10-1~bpo70+1
 * openmpi-1.8.2 FAIL   (ok with --mca btl ^openib)
 * mvapich2-2.0 ok
 *
 * IB boards MHGH29-XTC, fw rev 2.8.600, debian 7.6 stock
 * openmpi-1.8.3 FAIL   

Re: [OMPI users] What could cause a segfault in OpenMPI?

2014-11-10 Thread Saliya Ekanayake
Thank you Jeff, I'll try this and  let you know.

Saliya
On Nov 10, 2014 6:42 AM, "Jeff Squyres (jsquyres)" 
wrote:

> I am sorry for the delay; I've been caught up in SC deadlines.  :-(
>
> I don't see anything blatantly wrong in this output.
>
> Two things:
>
> 1. Can you try a nightly v1.8.4 snapshot tarball?  This will check to see
> if whatever the bug is has been fixed for the upcoming release:
>
> http://www.open-mpi.org/nightly/v1.8/
>
> 2. Build Open MPI with the --enable-debug option (note that this adds a
> slight-but-noticeable performance penalty).  When you run, it should dump a
> core file.  Load that core file in a debugger and see where it is failing
> (i.e., file and line in the OMPI source).
>
> We don't usually have to resort to asking users to perform #2, but there's
> no additional information to give a clue as to what is happening.  :-(
>
>
>
> On Nov 9, 2014, at 11:43 AM, Saliya Ekanayake  wrote:
>
> > Hi Jeff,
> >
> > You are probably busy, but just checking if you had a chance to look at
> this.
> >
> > Thanks,
> > Saliya
> >
> > On Thu, Nov 6, 2014 at 9:19 AM, Saliya Ekanayake 
> wrote:
> > Hi Jeff,
> >
> > I've attached a tar file with information.
> >
> > Thank you,
> > Saliya
> >
> > On Tue, Nov 4, 2014 at 4:18 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Looks like it's failing in the openib BTL setup.
> >
> > Can you send the info listed here?
> >
> > http://www.open-mpi.org/community/help/
> >
> >
> >
> > On Nov 4, 2014, at 1:10 PM, Saliya Ekanayake  wrote:
> >
> > > Hi,
> > >
> > > I am using OpenMPI 1.8.1 in a Linux cluster that we recently setup. It
> builds fine, but when I try to run even the simplest hello.c program it'll
> cause a segfault. Any suggestions on how to correct this?
> > >
> > > The steps I did and error message are below.
> > >
> > > 1. Built OpenMPI 1.8.1 on the cluster. The ompi_info is attached.
> > > 2. cd to examples directory and mpicc hello_c.c
> > > 3. mpirun -np 2 ./a.out
> > > 4. Error text is attached.
> > >
> > > Please let me know if you need more info.
> > >
> > > Thank you,
> > > Saliya
> > >
> > >
> > > --
> > > Saliya Ekanayake esal...@gmail.com
> > > Cell 812-391-4914 Home 812-961-6383
> > > http://saliya.org
> > >
> ___
> > > users mailing list
> > > us...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25668.php
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25672.php
> >
> >
> >
> > --
> > Saliya Ekanayake esal...@gmail.com
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org
> >
> >
> >
> > --
> > Saliya Ekanayake esal...@gmail.com
> > Cell 812-391-4914 Home 812-961-6383
> > http://saliya.org
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25717.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25723.php
>


Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-10 Thread Joshua Ladd
Just really quick off the top of my head: mmapping relies on the virtual
memory subsystem, whereas IB RDMA operations rely on physical memory being
pinned (unswappable). For a large message transfer, the OpenIB BTL will
register the user buffer, which will pin the pages and make them
unswappable. If the data being transferred is small, you'll copy in/out to
internal bounce buffers and you shouldn't have issues.

1. If you try to just bcast a few kilobytes of data using this technique, do
you run into issues?

2. How large is the data in the collective (input and output), and is in_place
used? I'm guessing it's large enough that the BTL tries to work with the user
buffer.
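
For what it's worth, one knob that touches the pinning behaviour described
above is the leave-pinned / registration-cache handling; comparing a run with
it disabled is sometimes a useful data point when RDMA and mmap'ed buffers
interact (a sketch only, reusing the run line from the attached test case):

  # baseline
  mpiexec -machinefile /tmp/hosts -n 4 ./a.out

  # same run, but without leaving user buffers registered between transfers
  mpiexec -machinefile /tmp/hosts --mca mpi_leave_pinned 0 -n 4 ./a.out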

Josh

On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé 
wrote:

> Hi,
>
> I'm stumbling on a problem related to the openib btl in
> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
> mmaped areas for receiving data through MPI collective calls.
>
> A test case is attached. I've tried to make is reasonably small,
> although I recognize that it's not extra thin. The test case is a
> trimmed down version of what I witness in the context of a rather
> large program, so there is no claim of relevance of the test case
> itself. It's here just to trigger the desired misbehaviour. The test
> case contains some detailed information on what is done, and the
> experiments I did.
>
> In a nutshell, the problem is as follows.
>
>  - I do a computation, which involves MPI_Reduce_scatter and MPI_Allgather.
>  - I save the result to a file (collective operation).
>
> *If* I save the file using something such as:
>  fd = open("blah", ...
>  area = mmap(..., fd, )
>  MPI_Gather(..., area, ...)
> *AND* the MPI_Reduce_scatter is done with an alternative
> implementation (which I believe is correct)
> *AND* communication is done through the openib btl,
>
> then the file which gets saved is inconsistent with what is obtained
> with the normal MPI_Reduce_scatter (alghough memory areas do coincide
> before the save).
>
> I tried to dig a bit in the openib internals, but all I've been able
> to witness was beyond my expertise (an RDMA read not transferring the
> expected data, but I'm too uncomfortable with this layer to say
> anything I'm sure about).
>
> Tests have been done with several openmpi versions including 1.8.3, on
> a debian wheezy (7.5) + OFED 2.3 cluster.
>
> It would be great if someone could tell me if he is able to reproduce
> the bug, or tell me whether something which is done in this test case
> is illegal in any respect. I'd be glad to provide further information
> which could be of any help.
>
> Best regards,
>
> E. Thomé.
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25730.php
>


Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Ralph Castain
That is indeed bizarre - we haven’t heard of anything similar from other users. 
What is your network configuration? If you use oob_tcp_if_include or exclude, 
can you resolve the problem?
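
For example, pinning both the runtime's out-of-band traffic and the TCP BTL to
the interface the hostnames resolve to (the interface name is a placeholder for
whatever the cluster actually uses):

  mpiexec --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 \
      -mca btl self,tcp -n 4 --hostfile machines ./mpihello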


> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> 
> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> 
>> Wow, that's pretty terrible!  :(
>> 
>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
>> BTLs, does the delay disappear?
> 
> You mean something like:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
> ./mpihello; date
> Mon Nov 10 13:44:34 CET 2014
> Hello World from Node 1.
> Total: 4
> Universe: 4
> Hello World from Node 0.
> Hello World from Node 3.
> Hello World from Node 2.
> Mon Nov 10 13:46:42 CET 2014
> 
> (the above was even the latest v1.8.3-186-g978f61d)
> 
> Falling back to 1.8.1 gives (as expected):
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
> ./mpihello; date
> Mon Nov 10 13:49:51 CET 2014
> Hello World from Node 1.
> Total: 4
> Universe: 4
> Hello World from Node 0.
> Hello World from Node 2.
> Hello World from Node 3.
> Mon Nov 10 13:49:53 CET 2014
> 
> 
> -- Reuti
> 
>> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
>> 
>> Sent from my phone. No type good. 
>> 
>>> On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
>>> 
 On 10.11.2014 at 12:24, Reuti wrote:
 
 Hi,
 
> On 09.11.2014 at 05:38, Ralph Castain wrote:
> 
> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
> Each process receives a complete map of that info for every process in 
> the job. So when the TCP btl sets itself up, it attempts to connect 
> across -all- the interfaces published by the other end.
> 
> So it doesn’t matter what hostname is provided by the RM. We discover and 
> “share” all of the interface info for every node, and then use them for 
> loadbalancing.
 
 does this lead to any time delay when starting up? I stayed with Open MPI 
 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
 delay when the applications starts in my first compilation of 1.8.3 I 
 disregarded even all my extra options and run it outside of any 
 queuingsystem - the delay remains - on two different clusters.
>>> 
>>> I forgot to mention: the delay is more or less exactly 2 minutes from the 
>>> time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
>>> for the initial `ssh` to reach the other node though).
>>> 
>>> -- Reuti
>>> 
>>> 
 I tracked it down, that up to 1.8.1 it is working fine, but 1.8.2 already 
 creates this delay when starting up a simple mpihello. I assume it may lay 
 in the way how to reach other machines, as with one single machine there 
 is no delay. But using one (and only one - no tree spawn involved) 
 additional machine already triggers this delay.
 
 Did anyone else notice it?
 
 -- Reuti
 
 
> HTH
> Ralph
> 
> 
>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>> 
>> Ok I figured, i'm going to have to read some more for my own curiosity. 
>> The reason I mention the Resource Manager we use, and that the hostnames 
>> given but PBS/Torque match the 1gig-e interfaces, i'm curious what path 
>> it would take to get to a peer node when the node list given all match 
>> the 1gig interfaces but yet data is being sent out the 10gig eoib0/ib0 
>> interfaces.  
>> 
>> I'll go do some measurements and see.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> CAEN Advanced Computing
>> XSEDE Campus Champion
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> 
>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by 
>>> default.  
>>> 
>>> This short FAQ has links to 2 other FAQs that provide detailed 
>>> information about reachability:
>>> 
>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>>> 
>>> The usNIC BTL uses UDP for its wire transport and actually does a much 
>>> more standards-conformant peer reachability determination (i.e., it 
>>> actually checks routing tables to see if it can reach a given peer 
>>> which has all kinds of caching benefits, kernel controls if you want 
>>> them, etc.).  We haven't back-ported this to the TCP BTL because a) 
>>> most people who use TCP for MPI still use a single L2 address space, 
>>> and b) no one has asked for it.  :-)
>>> 
>>> As for the round robin scheduling, there's no indication from the Linux 
>>> TCP stack what the bandwidth is on a given IP interface.  So unless you 
>>> use the btl_tcp_bandwidth_ (e.g., 
>>> btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them 
>>> equally.

Re: [OMPI users] OPENMPI-1.8.3: missing fortran bindings for MPI_SIZEOF

2014-11-10 Thread Jeff Squyres (jsquyres)
On Nov 10, 2014, at 8:27 AM, Dave Love  wrote:

>> https://github.com/open-mpi/ompi/commit/d7eaca83fac0d9783d40cac17e71c2b090437a8c
> 
> I don't have time to follow this properly, but am I reading right that
> that says mpi_sizeof will now _not_ work with gcc < 4.9, i.e. the system
> compiler of the vast majority of HPC GNU/Linux systems, whereas it did
> before (at least in simple cases)?

You raise a very good point, which raises another unfortunately good related 
point.

1. No, the goal is to enable MPI_SIZEOF in *more* cases, and still preserve all 
the old cases.  Your mail made me go back and test all the old cases this 
morning, and I discovered a bug which I need to fix before 1.8.4 is released 
(details unimportant, unless someone wants to gory details).

2. I did not think about the implications of the v1.7/1.8 series ABI guarantees 
to the MPI_SIZEOF fixes we applied.  Unfortunately, it looks like we 
effectively changed the gfortran>=4.9 "use mpi" ABI in 1.8.2 without fully 
realizing it.  Essentially:

 "use mpi" ABI in:
                    1.8   1.8.1   1.8.2   1.8.3   1.8.4 (upcoming)
 gfortran <  4.9     A      A       A       A       A (*)
 gfortran >= 4.9     A      A       B       B       ?

(*) = Buggy; needs to be fixed before release

Meaning: gfortran<4.9 has had ABI "A" throughout 1.8.  But we changed it to "B" 
for gfortran>=4.9 in 1.8.2.

So the question is: what's the Right Thing to do for 1.8.4?  Assume we'll have 
something like a "--enable|disable-new-fortran-goodness-for-gfortran-ge-4.9" 
configure switch (where the --enable form will be ABI B, the --disable form 
will be ABI A).  The question is: what should be the default?

1. Have ABI A as the default
2. Have ABI B as the default

I'm leaning towards #1 because it keeps ABI with a longer series of releases 
(1.7.x and 1.8-1.8.1).

Opinions?

>> IIRC, it only affected certain configure situations (e.g., only
>> certain fortran compilers).  I'm failing to remember the exact
>> scenario offhand that was problematic right now, but it led to the
>> larger question of: "hey, wait, don't we have to support MPI_SIZEOF in
>> mpif.h, too?"
> 
> I'd have said the answer was a clear "no", without knowing what the
> standard says about mpif.h,

The answer actually turned out to be "yes".  :-\

Specifically: the spec just says it's available in the Fortran interfaces.  It 
doesn't say "the Fortran interfaces, except MPI_SIZEOF."

Indeed, the spec doesn't prohibit explicit interfaces in mpif.h (it never has). 
 It's just that most (all?) MPI implementations have not provided explicit 
interfaces in mpif.h.

But for MPI_SIZEOF to work, explicit interfaces are *required*.

In OMPI, this has translated to: if your compiler supports enough features, 
MPI_SIZEOF will now have explicit interfaces in mpif.h (modulo ABI issues with 
the v1.8 series).

I predict that the Forum will deprecate MPI_SIZEOF in MPI-4.  But MPI_SIZEOF 
unfortunately will still have to live in OMPI for quite a long time.  :-(

> but I'd expect that to be deprecated anyhow.
> (The man pages generally don't mention USE, only INCLUDE, which seems
> wrong.)

Mmm.  Yes, true.

Any chance I could convince you to submit a patch?  :-)

>> According to discussion in the Forum Fortran working group, it is
>> required that MPI_SIZEOF must be supported in *all* MPI Fortran
>> interfaces, including mpif.h.
> 
> Well that's generally impossible if it's meant to include Fortran77
> compilers (which I must say doesn't seem worth it at this stage).

Fortran 77 compilers haven't existed for *many, many years*.  And I'll say it 
again: MPI has *never* supported Fortran 77 (it's a common misconception that 
it ever did).

The MPI spec is targeting the current version of Fortran (which has supported 
what is needed for MPI_SIZEOF for quite some time.. since F2003? F95? I'm not 
sure offhand).

> If it's any consolation, it doesn't work in the other MPIs here
> (mp(va)pich and intel), as I'd expect.

Right.  I don't know what MPICH's plans are in this area, but the Forum's 
conclusions (very recently, I might add) were quite clear.

>> Keep in mind that MPI does not prohibit having prototypes in mpif.h --
>> it's just that most (all?) MPI implementations don't tend to provide
>> them.  However, in the case of MPI_SIZEOF, it is *required* that
>> prototypes are available because the implementation needs the type
>> information to return the size properly (in mpif.h., mpi module, and
>> mpi_f08 module).
> 
> Fortran has interfaces, not prototypes!

Yes, sorry -- I'm a C programmer and I dabble in Fortran (read: I'm the guy who 
keeps the Fortran stuff maintained in OMPI), so I sometimes use the wrong 
terminology.  Mea culpa!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] File-backed mmaped I/O and openib btl.

2014-11-10 Thread Emmanuel Thomé
Thanks for your answer.

On Mon, Nov 10, 2014 at 4:31 PM, Joshua Ladd  wrote:
> Just really quick off the top of my head, mmaping relies on the virtual
> memory subsystem, whereas IB RDMA operations rely on physical memory being
> pinned (unswappable.)

Yes. Does that mean that the result of computations should be
undefined if I happen to give a user buffer which corresponds to a
file ? That would be surprising.

> For a large message transfer, the OpenIB BTL will
> register the user buffer, which will pin the pages and make them
> unswappable.

Yes. But what are the semantics of pinning the VM area pointed to by
ptr if ptr happens to be mmaped from a file ?

> If the data being transfered is small, you'll copy-in/out to
> internal bounce buffers and you shouldn't have issues.

Are you saying that the openib layer does have provision in this case
for letting the RDMA happen with a pinned physical memory range, and
later performing the copy to the file-backed mmaped range? That would
make perfect sense indeed, although I don't have enough familiarity
with the OMPI code to see where it happens, and more importantly
whether the completion properly waits for this post-RDMA copy to
complete.


> 1.If you try to just bcast a few kilobytes of data using this technique, do
> you run into issues?

No. All "simpler" attempts were successful, unfortunately. Can you be
a little bit more precise about what scenario you imagine ? The
setting "all ranks mmap a local file, and rank 0 broadcasts there" is
successful.
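
(For concreteness, a minimal sketch of that kind of simple test, written in C;
the file name, the size, and the choice of MPI_Bcast with MPI_BYTE are arbitrary
here, and all error checking is omitted - this is not the attached test case.)

    /* every rank mmaps a local file and rank 0 broadcasts into the mapping */
    #include <mpi.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const size_t sz = 8192;          /* arbitrary message size */
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* each rank maps its own local file as the receive buffer */
        int fd = open("bcast-target.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
        ftruncate(fd, sz);               /* file must be long enough to map */
        char *area = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (rank == 0)
            memset(area, 42, sz);        /* rank 0 fills the data to broadcast */

        /* the file-backed mapping is handed directly to MPI */
        MPI_Bcast(area, (int)sz, MPI_BYTE, 0, MPI_COMM_WORLD);

        msync(area, sz, MS_SYNC);
        munmap(area, sz);
        close(fd);

        MPI_Finalize();
        return 0;
    }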

> 2. How large is the data in the collective (input and output), is in_place
> used? I'm guess it's large enough that the BTL tries to work with the user
> buffer.

MPI_IN_PLACE is used in reduce_scatter and allgather in the code.
Collectives are with communicators of 2 nodes, and we're talking (for
the smallest failing run) 8kb per node (i.e. 16kb total for an
allgather).

E.

> On Mon, Nov 10, 2014 at 9:29 AM, Emmanuel Thomé 
> wrote:
>>
>> Hi,
>>
>> I'm stumbling on a problem related to the openib btl in
>> openmpi-1.[78].*, and the (I think legitimate) use of file-backed
>> mmaped areas for receiving data through MPI collective calls.
>>
>> A test case is attached. I've tried to make it reasonably small,
>> although I recognize that it's not extra thin. The test case is a
>> trimmed down version of what I witness in the context of a rather
>> large program, so there is no claim of relevance of the test case
>> itself. It's here just to trigger the desired misbehaviour. The test
>> case contains some detailed information on what is done, and the
>> experiments I did.
>>
>> In a nutshell, the problem is as follows.
>>
>>  - I do a computation, which involves MPI_Reduce_scatter and
>> MPI_Allgather.
>>  - I save the result to a file (collective operation).
>>
>> *If* I save the file using something such as:
>>  fd = open("blah", ...
>>  area = mmap(..., fd, )
>>  MPI_Gather(..., area, ...)
>> *AND* the MPI_Reduce_scatter is done with an alternative
>> implementation (which I believe is correct)
>> *AND* communication is done through the openib btl,
>>
>> then the file which gets saved is inconsistent with what is obtained
>> with the normal MPI_Reduce_scatter (although memory areas do coincide
>> before the save).
>>
>> I tried to dig a bit in the openib internals, but all I've been able
>> to witness was beyond my expertise (an RDMA read not transferring the
>> expected data, but I'm too uncomfortable with this layer to say
>> anything I'm sure about).
>>
>> Tests have been done with several openmpi versions including 1.8.3, on
>> a debian wheezy (7.5) + OFED 2.3 cluster.
>>
>> It would be great if someone could tell me if he is able to reproduce
>> the bug, or tell me whether something which is done in this test case
>> is illegal in any respect. I'd be glad to provide further information
>> which could be of any help.
>>
>> Best regards,
>>
>> E. Thomé.
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/11/25730.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25732.php


Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Reuti
Hi,

Am 10.11.2014 um 16:39 schrieb Ralph Castain:

> That is indeed bizarre - we haven’t heard of anything similar from other 
> users. What is your network configuration? If you use oob_tcp_if_include or 
> exclude, can you resolve the problem?

Thx - this option helped to get it working.

These tests were made, for the sake of simplicity, between the headnode of the 
cluster and one (idle) compute node. I then tried between the (identical) 
compute nodes and this worked fine. The headnode of the cluster and the compute 
node are slightly different though (i.e. number of cores), and they use eth1 and 
eth0, respectively, for the internal network of the cluster.

I tried --hetero-nodes with no change.

Then I turned to:

reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date

and the application started instantly. On another cluster, where the headnode 
is identical to the compute nodes and has the same network setup as above, I 
observed a delay of "only" 30 seconds. Nevertheless, on this cluster too, adding 
the correct "oob_tcp_if_include" was what solved the issue.

The questions which remain: a) is this intended behavior, and b) what changed in 
this respect between 1.8.1 and 1.8.2?

-- Reuti


> 
>> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
>> 
>> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
>> 
>>> Wow, that's pretty terrible!  :(
>>> 
>>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
>>> BTLs, does the delay disappear?
>> 
>> You mean something like:
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:44:34 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 3.
>> Hello World from Node 2.
>> Mon Nov 10 13:46:42 CET 2014
>> 
>> (the above was even the latest v1.8.3-186-g978f61d)
>> 
>> Falling back to 1.8.1 gives (as expected):
>> 
>> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines 
>> ./mpihello; date
>> Mon Nov 10 13:49:51 CET 2014
>> Hello World from Node 1.
>> Total: 4
>> Universe: 4
>> Hello World from Node 0.
>> Hello World from Node 2.
>> Hello World from Node 3.
>> Mon Nov 10 13:49:53 CET 2014
>> 
>> 
>> -- Reuti
>> 
>>> FWIW: the use-all-IP interfaces approach has been in OMPI forever. 
>>> 
>>> Sent from my phone. No type good. 
>>> 
 On Nov 10, 2014, at 6:42 AM, Reuti  wrote:
 
> Am 10.11.2014 um 12:24 schrieb Reuti:
> 
> Hi,
> 
>> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
>> 
>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
>> Each process receives a complete map of that info for every process in 
>> the job. So when the TCP btl sets itself up, it attempts to connect 
>> across -all- the interfaces published by the other end.
>> 
>> So it doesn’t matter what hostname is provided by the RM. We discover 
>> and “share” all of the interface info for every node, and then use them 
>> for loadbalancing.
> 
> does this lead to any time delay when starting up? I stayed with Open MPI 
> 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there is a 
> delay when the applications starts in my first compilation of 1.8.3 I 
> disregarded even all my extra options and run it outside of any 
> queuingsystem - the delay remains - on two different clusters.
 
 I forgot to mention: the delay is more or less exactly 2 minutes from the 
 time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
 for the initial `ssh` to reach the other node though).
 
 -- Reuti
 
 
> I tracked it down: up to 1.8.1 it is working fine, but 1.8.2 already 
> creates this delay when starting up a simple mpihello. I assume it may 
> lie in the way other machines are reached, as with one single machine 
> there is no delay. But using one (and only one - no tree spawn involved) 
> additional machine already triggers this delay.
> 
> Did anyone else notice it?
> 
> -- Reuti
> 
> 
>> HTH
>> Ralph
>> 
>> 
>>> On Nov 8, 2014, at 8:13 PM, Brock Palen  wrote:
>>> 
>>> Ok I figured, I'm going to have to read some more for my own curiosity. 
>>> The reason I mention the Resource Manager we use, and that the 
>>> hostnames given by PBS/Torque match the 1gig-e interfaces, is that I'm 
>>> curious what path it would take to get to a peer node when the entries in 
>>> the node list all match the 1gig interfaces yet data is being sent out the 
>>> 10gig eoib0/ib0 interfaces.  
>>> 
>>> I'll go do some measurements and see.
>>> 
>>> Brock Palen
>>> www.umich.edu/~brockp
>>> CAEN Advanced Computing
>>> XSEDE Campus Champion
>>> bro...@umich.edu
>>> (734)936-1985
>>> 
>>>

Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Gilles Gouaillardet
Hi,

IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really
use all the published interfaces.

by any chance, are you running a firewall on your head node?
one possible explanation is the compute node tries to access the public
interface of the head node, and packets get dropped by the firewall.

if you are running a firewall, can you make a test without it ?
/* if you do need NAT, then just remove the DROP and REJECT rules */

another possible explanation is that the compute node is doing (reverse) DNS
requests with the public name and/or IP of the head node and that takes
some time to complete (success or failure, this does not really matter here)

/* a simple test is to make sure all the hosts/ip of the head node are in
the /etc/hosts of the compute node */

could you check your network config (firewall and dns) ?

can you reproduce the delay when running mpirun on the head node and with
one mpi task on the compute node ?

if yes, then the hard way to trace the delay issue would be to strace -ttt
both orted and mpi task that are launched on the compute node and see where
the time is lost.
/* at this stage, i would suspect orted ... */

Cheers,

Gilles

On Mon, Nov 10, 2014 at 5:56 PM, Reuti  wrote:

> Hi,
>
> Am 10.11.2014 um 16:39 schrieb Ralph Castain:
>
> > That is indeed bizarre - we haven’t heard of anything similar from other
> users. What is your network configuration? If you use oob_tcp_if_include or
> exclude, can you resolve the problem?
>
> Thx - this option helped to get it working.
>
> These tests were made for sake of simplicity between the headnode of the
> cluster and one (idle) compute node. I tried then between the (identical)
> compute nodes and this worked fine. The headnode of the cluster and the
> compute node are slightly different though (i.e. number of cores), and
> using eth1 resp. eth0 for the internal network of the cluster.
>
> I tried --hetero-nodes with no change.
>
> Then I turned to:
>
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca
> oob_tcp_if_include 192.168.154.0/26 -n 4 --hetero-nodes --hostfile
> machines ./mpihello; date
>
> and the application started instantly. On another cluster, where the
> headnode is identical to the compute nodes but with the same network setup
> as above, I observed a delay of "only" 30 seconds. Nevertheless, also on
> this cluster the working addition was the correct "oob_tcp_if_include" to
> solve the issue.
>
> The questions which remain: a) is this a targeted behavior, b) what
> changed in this scope between 1.8.1 and 1.8.2?
>
> -- Reuti
>
>
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti  wrote:
> >>
> >> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use
> certain BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile
> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile
> machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
>  On Nov 10, 2014, at 6:42 AM, Reuti 
> wrote:
> 
> > Am 10.11.2014 um 12:24 schrieb Reuti:
> >
> > Hi,
> >
> >> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
> >>
> >> FWIW: during MPI_Init, each process “publishes” all of its
> interfaces. Each process receives a complete map of that info for every
> process in the job. So when the TCP btl sets itself up, it attempts to
> connect across -all- the interfaces published by the other end.
> >>
> >> So it doesn’t matter what hostname is provided by the RM. We
> discover and “share” all of the interface info for every node, and then use
> them for loadbalancing.
> >
> > does this lead to any time delay when starting up? I stayed with
> Open MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there
> is a delay when the applications starts in my first compilation of 1.8.3 I
> disregarded even all my extra options and run it outside of any
> queuingsystem - the delay remains - on two different clusters.
> 
>  I forgot to mention: the delay is more or less exactly 2 minutes from
> the time I issued `mpiexec` until the `mpihello` starts up (there is no delay 
> for the initial `ssh` to reach the other node though).

[OMPI users] what order do I get messages coming to MPI Recv from MPI_ANY_SOURCE?

2014-11-10 Thread David A. Schneider
I am implementing a hub/servers MPI application. Each of the servers can 
get tied up waiting for some data, then they do an MPI Send to the hub. 
It is relatively simple for me to have the hub waiting around doing a 
Recv from ANY_SOURCE. The hub can get busy working with the data. What 
I'm worried about is skipping data from one of the servers. How likely 
is this scenario:


1. server 1 and server 2 each do a Send
2. hub does a Recv and ends up getting data from server 1
3. while the hub is busy, server 1 gets more data and does another Send
4. when the hub does its next Recv, it gets the more recent server 1 data
   rather than the older server 2 data


I don't need a guarantee that the order in which the Sends occur is the order 
in which the ANY_SOURCE Recv processes them (though it would be nice), but if I 
knew that in practice it will be close to the order they are sent, I may go with 
the above. However, if it is likely that I could skip over data from one of the 
servers, I need to implement something more complicated, which I think would be 
this pattern:


1. servers each do Sends
2. hub does an Irecv for each server
3. hub does a Waitany on all server requests
4. upon completion of one server request, hub does a Test on all the others
5. of all the Irecv's that have completed, hub selects the oldest
   server data (there is a timing tag in the server data)
6. hub communicates with the server it just chose, has it start a new
   Send, and posts a new Irecv for it


This requires more complex code, and my first effort crashed inside the 
Waitany call in a way that I'm finding difficult to debug. I am using 
the Python bindings mpi4py - so I have less control over buffers being used.
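
For reference, here is a minimal hub-side sketch of that Irecv/Waitany pattern
in plain MPI C (the calls that sit underneath mpi4py); the message length, the
tag, the number of rounds, and the timestamp-in-the-first-element layout are
invented for illustration, and error handling is omitted:

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_LEN  128   /* doubles per message (arbitrary) */
    #define NROUNDS  3     /* messages each server sends (arbitrary) */

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                      /* rank 0 acts as the hub */
            int nservers = size - 1;
            double (*buf)[MSG_LEN] = malloc(nservers * sizeof *buf);
            MPI_Request *reqs = malloc(nservers * sizeof *reqs);
            int *left = malloc(nservers * sizeof *left);

            /* exactly one outstanding receive per server */
            for (int i = 0; i < nservers; i++) {
                left[i] = NROUNDS;
                MPI_Irecv(buf[i], MSG_LEN, MPI_DOUBLE, i + 1, 0,
                          MPI_COMM_WORLD, &reqs[i]);
            }

            for (int got = 0; got < nservers * NROUNDS; got++) {
                int idx;
                MPI_Waitany(nservers, reqs, &idx, MPI_STATUS_IGNORE);

                /* ... process buf[idx]; buf[idx][0] carries the timestamp,
                 * so completed data can be handled oldest-first ... */

                /* re-post the receive for that server before moving on */
                if (--left[idx] > 0)
                    MPI_Irecv(buf[idx], MSG_LEN, MPI_DOUBLE, idx + 1, 0,
                              MPI_COMM_WORLD, &reqs[idx]);
            }

            free(left); free(reqs); free(buf);
        } else {                              /* every other rank is a server */
            double msg[MSG_LEN] = {0};
            for (int r = 0; r < NROUNDS; r++) {
                msg[0] = MPI_Wtime();         /* the "timing tag" */
                /* ... fill the rest of msg with real data here ... */
                MPI_Send(msg, MSG_LEN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }

One property worth noting: MPI already guarantees that messages from the same
sender on the same communicator and tag are matched in the order they were
sent, so with one outstanding Irecv per server nothing from any single server
can be skipped; the timestamp in the first element only has to arbitrate
ordering across servers.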


I just posted this on stackoverflow also, but maybe this is a better 
place to post?




Re: [OMPI users] How OMPI picks ethernet interfaces

2014-11-10 Thread Ralph Castain
Another thing you can do is (a) ensure you built with --enable-debug, and then 
(b) run it with -mca oob_base_verbose 100 (without the tcp_if_include option) 
so we can watch the connection handshake and see what it is doing. The 
--hetero-nodes option will have no effect here and can be ignored.

Ralph

> On Nov 10, 2014, at 5:12 PM, Gilles Gouaillardet 
>  wrote:
> 
> Hi,
> 
> IIRC there were some bug fixes between 1.8.1 and 1.8.2 in order to really use 
> all the published interfaces.
> 
> by any chance, are you running a firewall on your head node?
> one possible explanation is the compute node tries to access the public 
> interface of the head node, and packets get dropped by the firewall.
> 
> if you are running a firewall, can you make a test without it ?
> /* if you do need NAT, then just remove the DROP and REJECT rules */
> 
> another possible explanation is that the compute node is doing (reverse) DNS 
> requests with the public name and/or IP of the head node and that takes some 
> time to complete (success or failure, this does not really matter here)
> 
> /* a simple test is to make sure all the hosts/ip of the head node are in the 
> /etc/hosts of the compute node */
> 
> could you check your network config (firewall and dns) ?
> 
> can you reproduce the delay when running mpirun on the head node and with one 
> mpi task on the compute node ?
> 
> if yes, then the hard way to trace the delay issue would be to strace -ttt 
> both orted and mpi task that are launched on the compute node and see where 
> the time is lost.
> /* at this stage, i would suspect orted ... */
> 
> Cheers,
> 
> Gilles
> 
> On Mon, Nov 10, 2014 at 5:56 PM, Reuti wrote:
> Hi,
> 
> Am 10.11.2014 um 16:39 schrieb Ralph Castain:
> 
> > That is indeed bizarre - we haven’t heard of anything similar from other 
> > users. What is your network configuration? If you use oob_tcp_if_include or 
> > exclude, can you resolve the problem?
> 
> Thx - this option helped to get it working.
> 
> These tests were made for sake of simplicity between the headnode of the 
> cluster and one (idle) compute node. I tried then between the (identical) 
> compute nodes and this worked fine. The headnode of the cluster and the 
> compute node are slightly different though (i.e. number of cores), and using 
> eth1 resp. eth0 for the internal network of the cluster.
> 
> I tried --hetero-nodes with no change.
> 
> Then I turned to:
> 
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 
> 192.168.154.0/26  -n 4 --hetero-nodes --hostfile 
> machines ./mpihello; date
> 
> and the application started instantly. On another cluster, where the headnode 
> is identical to the compute nodes but with the same network setup as above, I 
> observed a delay of "only" 30 seconds. Nevertheless, also on this cluster the 
> working addition was the correct "oob_tcp_if_include" to solve the issue.
> 
> The questions which remain: a) is this a targeted behavior, b) what changed 
> in this scope between 1.8.1 and 1.8.2?
> 
> -- Reuti
> 
> 
> >
> >> On Nov 10, 2014, at 4:50 AM, Reuti wrote:
> >>
> >> Am 10.11.2014 um 12:50 schrieb Jeff Squyres (jsquyres):
> >>
> >>> Wow, that's pretty terrible!  :(
> >>>
> >>> Is the behavior BTL-specific, perchance?  E.G., if you only use certain 
> >>> BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile 
> >> machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
> >>>
>  On Nov 10, 2014, at 6:42 AM, Reuti wrote:
> 
> > Am 10.11.2014 um 12:24 schrieb Reuti:
> >
> > Hi,
> >
> >> Am 09.11.2014 um 05:38 schrieb Ralph Castain:
> >>
> >> FWIW: during MPI_Init, each process “publishes” all of its interfaces. 
> >> Each process receives a complete map of that info for every process in 
> >> the job. So when the TCP btl sets itself up, it attempts to connect 
> >> across -all- the interfaces published by the other end.
> >>
> >> So it doesn’t matter what hostname is pro