Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-07-20 Thread Ralph Castain
Well, I finally managed to make this work without requiring the ompi-server 
rendezvous point. The fix is only in the devel trunk right now - I'll have to 
ask the release managers for 1.5 and 1.4 if they want it ported to those series.

On the notion of integrating OMPI into your launch environment: remember that we 
don't necessarily require that you use mpiexec for that purpose. If your launch 
environment provides just a little info in the environment of the launched 
procs, we can usually devise a method that allows the procs to perform an 
MPI_Init as a single job without all this work you are doing.

The only difference is that your procs will all block in MPI_Init until they -all- 
have executed that function. If that isn't a problem, this would be a much more 
scalable and reliable method than doing it through massive numbers of calls to 
MPI_Comm_connect.


On Jul 18, 2010, at 4:09 PM, Philippe wrote:

> Ralph,
> 
> thanks for investigating.
> 
> I've applied the two patches you mentioned earlier and ran with the
> ompi-server. Although I was able to run our standalone test, when I
> integrated the changes into our code, the processes entered a crazy loop
> and allocated all the available memory when calling MPI_Comm_connect.
> I was not able to identify why it works standalone but not integrated
> with our code. If I find out why, I'll let you know.
> 
> looking forward to your findings. We'll be happy to test any patches
> if you have some!
> 
> p.
> 
> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain  wrote:
>> Okay, I can reproduce this problem. Frankly, I don't think this ever worked 
>> with OMPI, and I'm not sure how the choice of BTL makes a difference.
>> 
>> The program is crashing in the communicator definition, which involves a 
>> communication over our internal out-of-band messaging system. That system 
>> has zero connection to any BTL, so it should crash either way.
>> 
>> Regardless, I will play with this a little as time allows. Thanks for the 
>> reproducer!
>> 
>> 
>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>> 
>>> Hi,
>>> 
>>> I'm trying to run a test program which consists of a server creating a
>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>>> connect to the server.
>>> 
>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>>> clients, I get the following error message:
>>> 
>>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>> 
>>> This is only happening with the openib BTL. With tcp BTL it works
>>> perfectly fine (ofud also works as a matter of fact...). This has been
>>> tested on two completely different clusters, with identical results.
>>> In either case, the IB fabric works normally.
>>> 
>>> Any help would be greatly appreciated! Several people in my team
>>> looked at the problem. Google and the mailing list archive did not
>>> provide any clue. I believe that from an MPI standpoint, my test
>>> program is valid (and it works with TCP, which makes me feel better
>>> about the sequence of MPI calls).
>>> 
>>> Regards,
>>> Philippe.
>>> 
>>> 
>>> 
>>> Background:
>>> 
>>> I intend to use Open MPI to transport data inside a much larger
>>> application. Because of that, I cannot use mpiexec. Each process is
>>> started by our own "job management" and uses a name server to find
>>> the others. Once all the clients are connected, I would like the
>>> server to do an MPI_Recv to get the data from all the clients. I don't
>>> care about the order or which clients are sending data, as long as I
>>> can receive it with one call. To do that, the clients and the server
>>> go through a series of Comm_accept/Comm_connect/Intercomm_merge calls
>>> so that at the end, all the clients and the server are inside the same
>>> intracomm.
>>> 
>>> Steps:
>>> 
>>> I have a sample program that shows the issue. I tried to make it as
>>> short as possible. It needs to be executed on a shared file system
>>> like NFS because the server writes the port info to a file that the
>>> clients will read. To reproduce the issue, the following steps should
>>> be performed:
>>> 
>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>> 1. ssh to the machine that will be the server
>>> 2. run ./ben12 3 1
>>> 3. ssh to the machine that will be the client #1
>>> 4. run ./ben12 3 0
>>> 5. repeat step 3-4 for client #2 and #3
>>> 
>>> The server accepts the connection from client #1 and merges it into a new
>>> intracomm. It then accepts the connection from client #2 and merges it.
>>> When client #3 arrives, the server accepts the connection, but that
>>> causes clients #1 and #2 to die with the error above (see the complete
>>> trace in the tarball).
>>> 
>>> The exact steps are:
>>> 
>>> - server open port
>>> - server does accept
>>> - client #1 does connect
>>> - server and client #1 do merge
>>> - server does accept
>>> - client #2 does connect
>>> - server, client #1 and clien
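(For reference, a minimal C sketch of the accept/connect/merge loop described in the steps above. This is not Philippe's actual ben12.c: the role and arrival-index handling and the port-file I/O are assumptions, and error checking is omitted.)

/* hypothetical sketch of the accept/connect/merge pattern described above;
 * not Philippe's ben12.c: file I/O for the port name and error checking
 * are omitted, and role/index come from argv in the real test             */
#include <mpi.h>

int main(int argc, char **argv)
{
    int nclients = 3;          /* total number of clients (argv in real test) */
    int is_server = 1;         /* role flag (argv in real test)               */
    int my_index = 0;          /* arrival order of this client, 0..nclients-1 */
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intracomm, inter, merged;

    MPI_Init(&argc, &argv);
    intracomm = MPI_COMM_WORLD;                 /* grows as clients join */

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        /* ... publish 'port' via the shared file system ... */
    } else {
        /* ... read 'port' from the shared file ... */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Intercomm_merge(inter, 1, &merged); /* new client: "high" side */
        MPI_Comm_free(&inter);
        intracomm = merged;
    }

    /* every process already in the intracomm takes part in each later accept;
     * the port string only matters on the root (rank 0, i.e. the server)      */
    for (int i = (is_server ? 0 : my_index + 1); i < nclients; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &inter);
        MPI_Intercomm_merge(inter, 0, &merged); /* accepting side: "low" */
        MPI_Comm_free(&inter);
        intracomm = merged;
    }

    if (is_server)
        MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

The point to note is that every process already in the growing intracomm must take part in each subsequent MPI_Comm_accept, since the accept is collective over that communicator.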

[OMPI users] does OpenMPI 1.3.2 support MPI 2.2 standard?

2010-07-20 Thread Anton Shterenlikht
Hi

Does OpenMPI 1.3.2 support MPI 2.2 standard? 

I didn't get a clear answer from our sysadmin.

many thanks
anton


-- 
Anton Shterenlikht
Room 2.6, Queen's Building
Mech Eng Dept
Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944
Fax: +44 (0)117 929 4423


Re: [OMPI users] does OpenMPI 1.3.2 support MPI 2.2 standard?

2010-07-20 Thread Jeff Squyres
On Jul 20, 2010, at 6:42 AM, Anton Shterenlikht wrote:

> Does OpenMPI 1.3.2 support MPI 2.2 standard?

No; it supports MPI-2.1.
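(For reference, a program can check what its MPI library reports at compile time and at run time via the MPI_VERSION/MPI_SUBVERSION macros and MPI_Get_version(); a minimal sketch:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int version, subversion;
    MPI_Init(&argc, &argv);
    MPI_Get_version(&version, &subversion);       /* run-time answer from the library */
    printf("compile-time headers: MPI %d.%d, run-time library: MPI %d.%d\n",
           MPI_VERSION, MPI_SUBVERSION, version, subversion);
    MPI_Finalize();
    return 0;
}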

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-20 Thread Ralph Castain
Grzegorz: something occurred to me. When you start all these processes, how are 
you staggering their wireup? Are they flooding us, or are you time-shifting 
them a little?


On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:

> Hm, so I am not sure how to approach this. First of all, the test case
> works for me. I used up to 80 clients, and for both optimized and
> non-optimized compilation. I ran the tests with trunk (not with 1.4
> series, but the communicator code is identical in both cases). Clearly,
> the patch from Ralph is necessary to make it work.
> 
> Additionally, I went through the communicator creation code for dynamic
> communicators trying to find spots that could create problems. The only
> place where I found the number 64 appearing is the Fortran-to-C mapping
> arrays (e.g. for communicators), where the initial size of the table is
> 64. I looked twice over the pointer-array code to see whether we could
> have a problem there (since it is a key piece of the cid allocation code
> for communicators), but I am fairly confident that it is correct.
> 
> Note that we have other (non-dynamic) tests, where comm_set is called
> 100,000 times, and the code per se does not seem to have a problem due
> to being called too often. So I am not sure what else to look at.
> 
> Edgar
> 
> 
> 
> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>> As far as I can tell, it appears the problem is somewhere in our 
>> communicator setup. The people knowledgeable on that area are going to look 
>> into it later this week.
>> 
>> I'm creating a ticket to track the problem and will copy you on it.
>> 
>> 
>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>> 
>>> 
>>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>> 
 Bad news..
 I've tried the latest patch with and without the prior one, but it
 hasn't changed anything. I've also tried using the old code but with
 the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't
 help.
 While looking through the sources of openmpi-1.4.2 I couldn't find any
 call of the function ompi_dpm_base_mark_dyncomm.
>>> 
>>> It isn't directly called - it shows in ompi_comm_set as 
>>> ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but I 
>>> guess something else is also being hit. Have to look further...
>>> 
>>> 
 
 
 2010/7/12 Ralph Castain :
> Just so you don't have to wait for 1.4.3 release, here is the patch 
> (doesn't include the prior patch).
> 
> 
> 
> 
> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
> 
>> 2010/7/12 Ralph Castain :
>>> Dug around a bit and found the problem!!
>>> 
>>> I have no idea who did this or why, but somebody set a limit of 64 
>>> separate jobids in the dynamic init called by ompi_comm_set, which 
>>> builds the intercommunicator. Unfortunately, they hard-wired the array 
>>> size but never checked that size before adding to it.
>>> 
>>> So after 64 calls to connect_accept, you are overwriting other areas of 
>>> the code. As you found, hitting 66 causes it to segfault.
>>> 
>>> I'll fix this on the developer's trunk (I'll also add that original 
>>> patch to it). Rather than my searching this thread in detail, can you 
>>> remind me what version you are using so I can patch it too?
>> 
>> I'm using 1.4.2
>> Thanks a lot; I'm looking forward to the patch.
>> 
>>> 
>>> Thanks for your patience with this!
>>> Ralph
>>> 
>>> 
>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>> 
 1024 is not the problem: changing it to 2048 hasn't changed anything.
 Following your advice I've run my process using gdb. Unfortunately I
 didn't get anything more than:
 
 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
 0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
 
 (gdb) bt
 #0  0xf7f39905 in ompi_comm_set () from 
 /home/gmaj/openmpi/lib/libmpi.so.0
 #1  0xf7e3ba95 in connect_accept () from
 /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
 #2  0xf7f62013 in PMPI_Comm_connect () from 
 /home/gmaj/openmpi/lib/libmpi.so.0
 #3  0x080489ed in main (argc=825832753, argv=0x34393638) at client.c:43
 
 What's more: when I added a breakpoint on ompi_comm_set in the 66th
 process and stepped a couple of instructions, one of the other
 processes crashed (as usual, in ompi_comm_set) earlier than the 66th did.
 
 Finally I decided to recompile openmpi using the -g flag for gcc. In this
 case the 66-process issue is gone! I was running my applications
 exactly the same way as before (even without recompiling them) and
 successfully ran over 130 processes.
 When switching back to the openmpi 
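(For illustration only, not the actual OMPI source: the overrun described above follows a familiar C pattern in which a hard-wired table is written without a bounds check. A minimal sketch of the unguarded store and a guarded variant:)

/* hypothetical illustration of the overrun described above; this is NOT the
 * actual OMPI code, just the general pattern of the bug and of a fix        */
#define MAX_JOBIDS 64                 /* hard-wired table size */

static int jobids[MAX_JOBIDS];
static int njobids = 0;

/* unguarded: the 65th store writes past the end of the array and silently
 * corrupts whatever happens to live next to it in memory                   */
void add_jobid_unsafe(int jobid)
{
    jobids[njobids++] = jobid;
}

/* guarded: fail (or, in a real fix, grow the table) once it is full */
int add_jobid_safe(int jobid)
{
    if (njobids >= MAX_JOBIDS)
        return -1;                    /* caller must handle the error */
    jobids[njobids++] = jobid;
    return 0;
}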

Re: [OMPI users] mpiexec seems to be resolving names on server insteadof each node

2010-07-20 Thread Jeff Squyres
Micha --

(re-digging up this really, really old issue because Manuel just pointed me at 
the Debian bug for the same issue: 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524553)

Can you confirm that this is still an issue on the latest Open MPI?

If so, it should probably piggyback onto these Open MPI tickets:

https://svn.open-mpi.org/trac/ompi/ticket/2045
https://svn.open-mpi.org/trac/ompi/ticket/2383
https://svn.open-mpi.org/trac/ompi/ticket/1983



On Apr 17, 2009, at 8:45 PM, Micha Feigin wrote:

> I am having problems running openmpi 1.3 on my cluster and I was wondering if
> anyone else is seeing this problem and/or can give hints on how to solve it.
> 
> As far as I understand the error, mpiexec resolves host names on the master
> node it is run on instead of on each host separately. This works in an
> environment where each hostname resolves to the same address on each host
> (cluster connected via a switch) but fails where it resolves to different
> addresses (ring/star setups for example, where each computer is connected
> directly to all/some of the others).
> 
> I'm not 100% sure that this is the problem, as I'm seeing success in a single
> case where this should probably fail, but it is my best bet from the error
> message.
> 
> Version 1.2.8 worked fine for the same simple program (a simple hello world
> that just communicated the computer name for each process).
> 
> An example output:
> 
> mpiexec is run on the master node hubert and is set to run the processes on
> two nodes, fry and leela. As I understand from the error messages, leela tries
> to connect to fry on address 192.168.1.2, which is its address on hubert but
> not on leela (where it is 192.168.4.1).
> 
> This is a four-node cluster, all interconnected:
> 
> 192.168.1.1        192.168.1.2
>   hubert ------------ fry
>     |  \            /  |  192.168.4.1
>     |    \        /    |
>     |      \    /      |
>     |        \/        |
>     |        /\        |
>     |      /    \      |
>     |    /        \    |
>     |  /            \  |  192.168.4.2
>   hermes ----------- leela
> 
> =
> mpiexec -np 8 -H fry,leela test_mpi
> Hello MPI from the server process of 8 on fry!
> [[36620,1],1][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [[36620,1],3][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [[36620,1],7][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> [leela:4436] *** An error occurred in MPI_Send
> [leela:4436] *** on communicator MPI_COMM_WORLD
> [leela:4436] *** MPI_ERR_INTERN: internal error
> [leela:4436] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [[36620,1],5][../../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:589:mca_btl_tcp_endpoint_start_connect]
>  from leela to: fry Unable to connect to the peer 192.168.1.2 on port 154: 
> Network is unreachable
> 
> --
> mpiexec has exited due to process rank 1 with PID 4433 on
> node leela exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).
> --
> [hubert:11312] 3 more processes have sent help message help-mpi-errors.txt / 
> mpi_errors_are_fatal
> [hubert:11312] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> help / error messages
> =
> 
> This seems to be a directional issue: running the program with -H fry,leela
> fails where -H leela,fry works. Same behaviour for all scenarios except those
> that include the master node (hubert), where it resolves the external IP (from
> an external DNS) instead of the internal IP (from the hosts file). Thus one
> direction fails (no external connection at the moment for all but the master)
> and the other causes a lockup.
> 
> I hope that the explanation is not too convoluted.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] Partitioning problem set data

2010-07-20 Thread Alexandru Blidaru
Hi,

I have a 3D array, which I need to split into n equal parts, so that each
part would run on a different node. I found the picture in the attachment
from this website
(https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning)
on the different ways to partition data. I am interested in the block
methods, as the cyclic methods wouldn't really work for me at all.
Obviously the *, BLOCK and the BLOCK, * methods would be really easy to
implement for 3D arrays, assuming that the 2D picture would be looking at
the array from the top. My question is whether there are other, better
ways to do it from a performance standpoint?

Thanks for your replies,
Alex
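
(A hedged sketch of the simplest of these, the (BLOCK, *, *) decomposition: slabs along the first axis of a C-ordered 3D array are contiguous, so a plain MPI_Scatter distributes them. The sizes, and the assumption that the first dimension divides evenly by the number of ranks, are made up for illustration.)

/* hypothetical sketch: distribute a 3D array as (BLOCK, *, *) slabs,
 * assuming NX is divisible by the number of ranks                     */
#include <mpi.h>
#include <stdlib.h>

#define NX 8
#define NY 4
#define NZ 4

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_nx = NX / size;                    /* planes per rank  */
    int slab = local_nx * NY * NZ;               /* doubles per slab */

    double *full = NULL;
    if (rank == 0) {
        full = malloc((size_t)NX * NY * NZ * sizeof(double));
        for (int i = 0; i < NX * NY * NZ; i++)
            full[i] = (double)i;                 /* fill with something */
    }

    double *local = malloc((size_t)slab * sizeof(double));

    /* slabs along the first (slowest-varying) axis are contiguous in a
     * C-ordered array, so a plain Scatter is enough                     */
    MPI_Scatter(full, slab, MPI_DOUBLE,
                local, slab, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... each rank now works on its local_nx x NY x NZ block ... */

    free(local);
    free(full);
    MPI_Finalize();
    return 0;
}

For blocks that are not contiguous in memory (the other BLOCK layouts), MPI_Type_create_subarray is the usual building block.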


Re: [OMPI users] Partitioning problem set data

2010-07-20 Thread Alexandru Blidaru
If there is an already existing implementation of the *Block or Block*
methods that splits the array and sends the individual pieces to
the proper nodes, can you point me to it please?

On Tue, Jul 20, 2010 at 9:52 AM, Alexandru Blidaru wrote:

> Hi,
>
> I have a 3D array, which I need to split into n equal parts, so that each
> part would run on a different node. I found the picture in the attachment
> from this website
> (https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning)
> on the different ways to partition data. I am interested in the block
> methods, as the cyclic methods wouldn't really work for me at all.
> Obviously the *, BLOCK and the BLOCK, * methods would be really easy to
> implement for 3D arrays, assuming that the 2D picture would be looking at
> the array from the top. My question is whether there are other, better
> ways to do it from a performance standpoint?
>
> Thanks for your replies,
> Alex
>


Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

2010-07-20 Thread Grzegorz Maj
My start script looks almost exactly the same as the one published by
Edgar, i.e. the processes are started one by one with no delay.

2010/7/20 Ralph Castain :
> Grzegorz: something occurred to me. When you start all these processes, how 
> are you staggering their wireup? Are they flooding us, or are you 
> time-shifting them a little?
>
>
> On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
>
>> Hm, so I am not sure how to approach this. First of all, the test case
>> works for me. I used up to 80 clients, and for both optimized and
>> non-optimized compilation. I ran the tests with trunk (not with 1.4
>> series, but the communicator code is identical in both cases). Clearly,
>> the patch from Ralph is necessary to make it work.
>>
>> Additionally, I went through the communicator creation code for dynamic
>> communicators trying to find spots that could create problems. The only
>> place where I found the number 64 appearing is the Fortran-to-C mapping
>> arrays (e.g. for communicators), where the initial size of the table is
>> 64. I looked twice over the pointer-array code to see whether we could
>> have a problem there (since it is a key piece of the cid allocation code
>> for communicators), but I am fairly confident that it is correct.
>>
>> Note that we have other (non-dynamic) tests, where comm_set is called
>> 100,000 times, and the code per se does not seem to have a problem due
>> to being called too often. So I am not sure what else to look at.
>>
>> Edgar
>>
>>
>>
>> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>>> As far as I can tell, it appears the problem is somewhere in our 
>>> communicator setup. The people knowledgeable on that area are going to look 
>>> into it later this week.
>>>
>>> I'm creating a ticket to track the problem and will copy you on it.
>>>
>>>
>>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>>>

 On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:

> Bad news..
> I've tried the latest patch with and without the prior one, but it
> hasn't changed anything. I've also tried using the old code but with
> the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't
> help.
> While looking through the sources of openmpi-1.4.2 I couldn't find any
> call of the function ompi_dpm_base_mark_dyncomm.

 It isn't directly called - it shows in ompi_comm_set as 
 ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but I 
 guess something else is also being hit. Have to look further...


>
>
> 2010/7/12 Ralph Castain :
>> Just so you don't have to wait for 1.4.3 release, here is the patch 
>> (doesn't include the prior patch).
>>
>>
>>
>>
>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>
>>> 2010/7/12 Ralph Castain :
 Dug around a bit and found the problem!!

 I have no idea who did this or why, but somebody set a limit of 
 64 separate jobids in the dynamic init called by ompi_comm_set, which 
 builds the intercommunicator. Unfortunately, they hard-wired the array 
 size but never checked that size before adding to it.

 So after 64 calls to connect_accept, you are overwriting other areas 
 of the code. As you found, hitting 66 causes it to segfault.

 I'll fix this on the developer's trunk (I'll also add that original 
 patch to it). Rather than my searching this thread in detail, can you 
 remind me what version you are using so I can patch it too?
>>>
>>> I'm using 1.4.2
>>> Thanks a lot; I'm looking forward to the patch.
>>>

 Thanks for your patience with this!
 Ralph


 On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:

> 1024 is not the problem: changing it to 2048 hasn't changed anything.
> Following your advice I've run my process using gdb. Unfortunately I
> didn't get anything more than:
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
> 0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
>
> (gdb) bt
> #0  0xf7f39905 in ompi_comm_set () from 
> /home/gmaj/openmpi/lib/libmpi.so.0
> #1  0xf7e3ba95 in connect_accept () from
> /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
> #2  0xf7f62013 in PMPI_Comm_connect () from 
> /home/gmaj/openmpi/lib/libmpi.so.0
> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at 
> client.c:43
>
> What's more: when I added a breakpoint on ompi_comm_set in the 66th
> process and stepped a couple of instructions, one of the other
> processes crashed (as usual, in ompi_comm_set) earlier than the 66th did.
>
> Finally I decided to recompile openmpi using the -g flag for gcc. In 

Re: [OMPI users] Partitioning problem set data

2010-07-20 Thread Eugene Loh
The reason so many different distributions are described is that what is
optimal depends so much on your own case.

Even if one disregards CYCLIC axes, there are still all those BLOCK
choices you mention.  It isn't just a matter of choosing which axes
will be * since * is just a special case of BLOCK.  You have to choose
the shape or aspect ratio of the subgrids.

In some problems, it is desirable to minimize the surface-to-volume
ratio of each local subgrid so that interprocess communication costs
are minimized relative to the computational work each process has to
do.  So, you would want to make each subgrid as "cubic" as possible. 
Some sort of BLOCK,BLOCK,BLOCK.

The "physics" may argue otherwise.  Not all directions may be the
same.  E.g., in atmospheric models, the vertical direction is very
different from the horizontal ones.

Algorithms may also drive your choice.  E.g., for multidimensional
FFTs, you might want one axis to be local.  Then, you would transpose
axes to make another one local.

You might also want the "innermost axis" (in the process's linear
address space) to be as long as possible to benefit from
software/hardware vectorization of computationally expensive loops.

Lots of choices.  It depends on your problem.
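
(A hedged sketch of the BLOCK,BLOCK,BLOCK case: MPI_Dims_create picks a near-cubic process grid and MPI_Cart_create/MPI_Cart_coords give each rank its block. The global sizes, and the assumption that they divide evenly by the grid dimensions, are made up for illustration.)

/* hypothetical sketch: pick a near-cubic 3D process grid and derive each
 * rank's block extents; global sizes are made up and assumed to divide
 * evenly by the process-grid dimensions                                  */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int global[3] = {256, 256, 256};   /* assumed global array shape */
    int size, rank, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
    int start[3], len[3];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 3, dims);          /* factor size into a ~cubic grid */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    /* each rank owns a global[d]/dims[d] block along dimension d */
    for (int d = 0; d < 3; d++) {
        len[d]   = global[d] / dims[d];
        start[d] = coords[d] * len[d];
    }

    printf("rank %d: grid %dx%dx%d, block start (%d,%d,%d), size %dx%dx%d\n",
           rank, dims[0], dims[1], dims[2],
           start[0], start[1], start[2], len[0], len[1], len[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}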

Alexandru Blidaru wrote:
> If there is an already existing implementation of the *Block or Block*
> methods that splits the array and sends the individual pieces to the
> proper nodes, can you point me to it please?
> 
> On Tue, Jul 20, 2010 at 9:52 AM, Alexandru Blidaru wrote:
>> I have a 3D array, which I need to split into n equal parts, so that
>> each part would run on a different node. I found the picture in the
>> attachment from this website
>> (https://computing.llnl.gov/tutorials/parallel_comp/#DesignPartitioning)
>> on the different ways to partition data. I am interested in the block
>> methods, as the cyclic methods wouldn't really work for me at all.
>> Obviously the *, BLOCK and the BLOCK, * methods would be really easy to
>> implement for 3D arrays, assuming that the 2D picture would be looking
>> at the array from the top. My question is whether there are other,
>> better ways to do it from a performance standpoint?