On Wed, Jul 21, 2010 at 10:44 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jul 21, 2010, at 7:44 AM, Philippe wrote:
>
>> Ralph,
>>
>> Sorry for the late reply -- I was away on vacation.
>
> no problem at all!
>
>>
>> regarding your earlier question about how many processes were
>> involved when the memory was entirely allocated, it was only two, a
>> sender and a receiver. I'm still trying to pinpoint what can be
>> different between the standalone case and the "integrated" case. I
>> will try to find out what part of the code is allocating memory in a
>> loop.
>
> hmmm....that sounds like a bug in your program. let me know what you find
>
>>
>>
>> On Tue, Jul 20, 2010 at 12:51 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Well, I finally managed to make this work without the required ompi-server 
>>> rendezvous point. The fix is only in the devel trunk right now - I'll have 
>>> to ask the release managers for 1.5 and 1.4 if they want it ported to those 
>>> series.
>>>
>>
>> great -- I'll give it a try
>>
>>> On the notion of integrating OMPI to your launch environment: remember that 
>>> we don't necessarily require that you use mpiexec for that purpose. If your 
>>> launch environment provides just a little info in the environment of the 
>>> launched procs, we can usually devise a method that allows the procs to 
>>> perform an MPI_Init as a single job without all this work you are doing.
>>>
>>
>> I'm working on creating operators using MPI for the IBM product
>> "InfoSphere Streams". It has its own launching mechanism to start the
>> processes. However, I can pass some information to the processes that
>> belong to the same job (Streams job -- which should neatly map to MPI
>> job).
>>
>>> Only difference is that your procs will all block in MPI_Init until they 
>>> -all- have executed that function. If that isn't a problem, this would be a 
>>> much more scalable and reliable method than doing it through massive 
>>> calls to MPI_Comm_connect.
>>>
>>
>> in the general case, that would be a problem, but for my prototype,
>> this is acceptable.
>>
>> In general, each process is composed of operators, some may be MPI
>> related and some may not. But in my case, I know ahead of time which
>> processes will be part of the MPI job, so I can easily deal with the
>> fact that they would block on MPI_Init (actually MPI_Init_thread,
>> since it's using a lot of threads).
>
> We have talked in the past about creating a non-blocking MPI_Init as an 
> extension to the standard. It would lock you to Open MPI, though...
>
> Regardless, at some point you would have to know how many processes are going 
> to be part of the job so you can know when MPI_Init is complete. I would 
> think you would require that info for the singleton wireup anyway - yes? 
> Otherwise, how would you know when to quit running connect-accept?
>

The short answer is yes... although the longer answer is a bit more
complicated. Currently I do know the number of connects I need to do on
a per-port basis. A job can contain an arbitrary number of MPI
processes, each opening one or more ports, so I know the count port by
port, but I don't need to worry about how many MPI processes there are
globally. To make things a bit more complicated, each MPI operator can
be "fused" with other operators to make a process, and each fused
operator may or may not require MPI. The bottom line is, to get the
total number of processes and calculate rank and size, I need to
reverse engineer the fusing that the compiler may do.

But that's OK, I'm willing to do that for our prototype :-)

>>
>> Is there a documentation or example I can use to see what information
>> I can pass to the processes to enable that? Is it just environment
>> variables?
>
> No real documentation - a lack I should probably fill. At the moment, we 
> don't have a "generic" module for standalone launch, but I can create one as 
> it is pretty trivial. I would then need you to pass each process envars 
> telling it the total number of processes in the MPI job, its rank within that 
> job, and a file where some rendezvous process (can be rank=0) has provided 
> that port string. Armed with that info, I can wireup the job.
>
> Won't be as scalable as an mpirun-initiated startup, but will be much better 
> than doing it from singletons.

That would be great. I can definitely pass environment variables to
each process.
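
For instance, something along these lines on the launcher side. This is
just a rough sketch, not the real Streams launcher, and the variable names
(STREAMS_MPI_NUM_PROCS, STREAMS_MPI_RANK, STREAMS_MPI_PORT_FILE) are
placeholders until the generic module defines the real ones:

/* Rough sketch only: fork a process and hand it the three pieces of info
 * listed above (job size, rank, and the file holding the rendezvous port
 * string) through the environment.  The variable names are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t launch_one(const char *binary, int rank, int nprocs,
                        const char *port_file)
{
    pid_t pid = fork();
    if (pid == 0) {                              /* child: set env, then exec */
        char buf[32];

        snprintf(buf, sizeof(buf), "%d", nprocs);
        setenv("STREAMS_MPI_NUM_PROCS", buf, 1); /* total procs in the MPI job */

        snprintf(buf, sizeof(buf), "%d", rank);
        setenv("STREAMS_MPI_RANK", buf, 1);      /* this process's rank */

        /* file where the rendezvous process (rank 0) wrote its port string */
        setenv("STREAMS_MPI_PORT_FILE", port_file, 1);

        execl(binary, binary, (char *)NULL);
        _exit(127);                              /* only reached if exec fails */
    }
    return pid;                                  /* parent keeps launching */
}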

>
> Or if you prefer, we could setup an "infosphere" module that we could 
> customize for this system. Main thing here would be to provide us with some 
> kind of regex (or access to a file containing the info) that describes the 
> map of rank to node so we can construct the wireup communication pattern.
>

I think for our prototype we are fine with the first method. I'd leave
the cleaner implementation as a task for the product team ;-)

Regarding the "generic" module, is that something you can put together
quickly? Can I help in any way?

Thanks!
p

> Either way would work. The second is more scalable, but I don't know if you 
> have (or can construct) the map info.
>
>>
>> Many thanks!
>> p.
>>
>>>
>>> On Jul 18, 2010, at 4:09 PM, Philippe wrote:
>>>
>>>> Ralph,
>>>>
>>>> thanks for investigating.
>>>>
>>>> I've applied the two patches you mentioned earlier and ran with the
>>>> ompi-server. Although I was able to run our standalone test, when I
>>>> integrated the changes into our code, the processes entered a crazy loop
>>>> and allocated all the available memory when calling MPI_Comm_connect.
>>>> I was not able to identify why it works standalone but not integrated
>>>> with our code. If I find out why, I'll let you know.
>>>>
>>>> looking forward to your findings. We'll be happy to test any patches
>>>> if you have some!
>>>>
>>>> p.
>>>>
>>>> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Okay, I can reproduce this problem. Frankly, I don't think this ever 
>>>>> worked with OMPI, and I'm not sure how the choice of BTL makes a 
>>>>> difference.
>>>>>
>>>>> The program is crashing in the communicator definition, which involves a 
>>>>> communication over our internal out-of-band messaging system. That system 
>>>>> has zero connection to any BTL, so it should crash either way.
>>>>>
>>>>> Regardless, I will play with this a little as time allows. Thanks for the 
>>>>> reproducer!
>>>>>
>>>>>
>>>>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to run a test program which consists of a server creating a
>>>>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
>>>>>> connect to the server.
>>>>>>
>>>>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
>>>>>> clients, I get the following error message:
>>>>>>
>>>>>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
>>>>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>>>>>
>>>>>> This is only happening with the openib BTL. With tcp BTL it works
>>>>>> perfectly fine (ofud also works as a matter of fact...). This has been
>>>>>> tested on two completely different clusters, with identical results.
>>>>>> In both cases, the IB fabric works normally.
>>>>>>
>>>>>> Any help would be greatly appreciated! Several people in my team
>>>>>> looked at the problem. Google and the mailing list archive did not
>>>>>> provide any clue. I believe that from an MPI standpoint, my test
>>>>>> program is valid (and it works with TCP, which makes me feel better
>>>>>> about the sequence of MPI calls).
>>>>>>
>>>>>> Regards,
>>>>>> Philippe.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Background:
>>>>>>
>>>>>> I intend to use Open MPI to transport data inside a much larger
>>>>>> application. Because of that, I cannot use mpiexec. Each process is
>>>>>> started by our own "job management" and uses a name server to find
>>>>>> out about the others. Once all the clients are connected, I would like
>>>>>> the server to do MPI_Recv to get the data from all the clients. I don't
>>>>>> care about the order or which client is sending data, as long as I can
>>>>>> receive it with one call. To do that, the clients and the server go
>>>>>> through a series of MPI_Comm_accept/MPI_Comm_connect/MPI_Intercomm_merge
>>>>>> calls so that at the end, all the clients and the server are inside
>>>>>> the same intracomm.
>>>>>>
>>>>>> Steps:
>>>>>>
>>>>>> I have a sample program that shows the issue. I tried to make it as
>>>>>> short as possible. It needs to be executed on a shared file system
>>>>>> like NFS because the server writes the port info to a file that the
>>>>>> clients will read. To reproduce the issue, the following steps should
>>>>>> be performed:
>>>>>>
>>>>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>>>>> 1. ssh to the machine that will be the server
>>>>>> 2. run ./ben12 3 1
>>>>>> 3. ssh to the machine that will be the client #1
>>>>>> 4. run ./ben12 3 0
>>>>>> 5. repeat step 3-4 for client #2 and #3
>>>>>>
>>>>>> The server accepts the connection from client #1 and merges it into a
>>>>>> new intracomm. It then accepts the connection from client #2 and merges
>>>>>> it. When client #3 arrives, the server accepts the connection, but that
>>>>>> causes clients #1 and #2 to die with the error above (see the complete
>>>>>> trace in the tarball).
>>>>>>
>>>>>> The exact steps are:
>>>>>>
>>>>>>     - server open port
>>>>>>     - server does accept
>>>>>>     - client #1 does connect
>>>>>>     - server and client #1 do merge
>>>>>>     - server does accept
>>>>>>     - client #2 does connect
>>>>>>     - server, client #1 and client #2 do merge
>>>>>>     - server does accept
>>>>>>     - client #3 does connect
>>>>>>     - server, client #1, client #2 and client #3 do merge
>>>>>>
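
For reference, a condensed sketch of that accept/connect/merge sequence is
below. It is NOT the actual ben12.c from the tarball: the client takes its
join order as an extra argument instead of discovering it, "port.txt" stands
in for the file on the shared file system, and error handling is omitted.

/* Condensed sketch of the accept/connect/merge pattern described above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int nclients  = atoi(argv[1]);                  /* e.g. 3                 */
    int is_server = atoi(argv[2]);                  /* 1 = server, 0 = client */
    int joined    = is_server ? 0 : atoi(argv[3]);  /* client join order 1..N */
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intra = MPI_COMM_SELF, inter, merged;
    FILE *f;
    int i;

    MPI_Init(&argc, &argv);

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        f = fopen("port.txt", "w");                 /* rendezvous on shared FS */
        fprintf(f, "%s\n", port);
        fclose(f);
    } else {
        f = fopen("port.txt", "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Intercomm_merge(inter, 1 /* new client joins as "high" */, &intra);
        MPI_Comm_free(&inter);
    }

    /* MPI_Comm_accept is collective over the growing intracomm, so every
     * client that already merged keeps participating until all N are in.
     * The server stays rank 0 in the merged intracomm and acts as root. */
    for (i = (is_server ? 1 : joined + 1); i <= nclients; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 0 /* existing group stays "low" */, &merged);
        MPI_Comm_free(&inter);
        if (intra != MPI_COMM_SELF) MPI_Comm_free(&intra);
        intra = merged;
    }

    if (is_server) MPI_Close_port(port);
    /* here the server and all clients share one intracomm ("intra") */
    MPI_Finalize();
    return 0;
}

Built with mpicc as in step 0 above; in this simplified version the clients
would additionally pass their join order (1, 2, 3) as a third argument.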
>>>>>>
>>>>>> My infiniband network works normally with other test programs or
>>>>>> applications (MPI or others like Verbs).
>>>>>>
>>>>>> Info about my setup:
>>>>>>
>>>>>>    openMPI version = 1.4.1 (I also tried 1.4.2, nightly snapshot of
>>>>>> 1.4.3, nightly snapshot of 1.5 --- all show the same error)
>>>>>>    config.log in the tarball
>>>>>>    "ompi_info --all" in the tarball
>>>>>>    OFED version = 1.3 installed from RHEL 5.3
>>>>>>    Distro = Red Hat Enterprise Linux 5.3
>>>>>>    Kernel = 2.6.18-128.4.1.el5 x86_64
>>>>>>    subnet manager = built-in SM from the cisco/topspin switch
>>>>>>    output of ibv_devinfo included in the tarball (there are no "bad" 
>>>>>> nodes)
>>>>>>    "ulimit -l" says "unlimited"
>>>>>>
>>>>>> The tarball contains:
>>>>>>
>>>>>>   - ben12.c: my test program showing the behavior
>>>>>>   - config.log / config.out / make.out / make-install.out /
>>>>>> ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>>>>>>   - trace-tcp.txt: output of the server and each client when it works
>>>>>> with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>>>>>>   - trace-ib.txt: output of the server and each client when it fails
>>>>>> with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
>>>>>>
>>>>>> I hope I provided enough info for somebody to reproduce the problem...
>>>>>> <ompi-output.tar.bz2>