It's worth noting that this new component will likely get pulled into 1.5.1 
(we're refreshing a bunch of stuff in 1.5.1, and this component will be 
included in that refresh).

No specific timeline on 1.5.1 yet, though.


On Jul 22, 2010, at 5:53 PM, Ralph Castain wrote:

> Dev trunk looks okay right now - I think you'll be fine using it. My new 
> component -might- work with 1.5, but probably not with 1.4. I haven't checked 
> either of them.
> 
> Anything at r23478 or above will have the new module. Let me know how it 
> works for you. I haven't tested it myself, but am pretty sure it should work.
> 
> 
> On Jul 22, 2010, at 3:22 PM, Philippe wrote:
> 
> > Ralph,
> >
> > Thank you so much!!
> >
> > I'll give it a try and let you know.
> >
> > I know it's a tough question, but how stable is the dev trunk? Can I
> > just grab the latest and run, or am I better off taking your changes
> > and copying them back into a stable release? (If so, which one? 1.4? 1.5?)
> >
> > p.
> >
> > On Thu, Jul 22, 2010 at 3:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> It was easier for me to just construct this module than to explain how to 
> >> do so :-)
> >>
> >> I will commit it this evening (couple of hours from now) as that is our 
> >> standard practice. You'll need to use the developer's trunk, though, to 
> >> use it.
> >>
> >> Here are the envars you'll need to provide:
> >>
> >> Each process needs to get the same following values:
> >>
> >> * OMPI_MCA_ess=generic
> >> * OMPI_MCA_orte_num_procs=<number of MPI procs>
> >> * OMPI_MCA_orte_nodes=<a comma-separated list of nodenames where MPI procs 
> >> reside>
> >> * OMPI_MCA_orte_ppn=<number of procs/node>
> >>
> >> Note that I have assumed this last value is a constant for simplicity. If 
> >> that isn't the case, let me know - you could instead provide it as a 
> >> comma-separated list of values with an entry for each node.
> >>
> >> In addition, you need to provide the following value that will be unique 
> >> to each process:
> >>
> >> * OMPI_MCA_orte_rank=<MPI rank>
> >>
> >> Finally, you have to provide a range of static TCP ports for use by the 
> >> processes. Pick any range that you know will be available across all the 
> >> nodes. You then need to ensure that each process sees the following envar:
> >>
> >> * OMPI_MCA_oob_tcp_static_ports=6000-6010  <== obviously, replace this 
> >> with your range
> >>
> >> You will need a port range that is at least equal to the ppn for the job 
> >> (each proc on a node will take one of the provided ports).
> >>
> >> That should do it. I compute everything else I need from those values.
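> >>
> >> For illustration, here is a minimal sketch in C of how a launcher could feed
> >> those envars to one process. The binary name, node list, and counts below are
> >> made up -- substitute whatever your job manager actually knows:
> >>
> >>   #include <stdio.h>
> >>   #include <stdlib.h>
> >>   #include <unistd.h>
> >>
> >>   /* values that must be identical for every process in the job */
> >>   static void set_common_env(void)
> >>   {
> >>       setenv("OMPI_MCA_ess", "generic", 1);
> >>       setenv("OMPI_MCA_orte_num_procs", "4", 1);               /* total MPI procs  */
> >>       setenv("OMPI_MCA_orte_nodes", "node001,node002", 1);     /* nodes with procs */
> >>       setenv("OMPI_MCA_orte_ppn", "2", 1);                     /* procs per node   */
> >>       setenv("OMPI_MCA_oob_tcp_static_ports", "6000-6010", 1); /* static TCP range */
> >>   }
> >>
> >>   int main(int argc, char **argv)
> >>   {
> >>       char buf[16];
> >>
> >>       if (argc < 2) {
> >>           fprintf(stderr, "usage: %s <mpi-rank>\n", argv[0]);
> >>           return 1;
> >>       }
> >>       set_common_env();
> >>       /* the rank is the one value that differs per process */
> >>       snprintf(buf, sizeof(buf), "%d", atoi(argv[1]));
> >>       setenv("OMPI_MCA_orte_rank", buf, 1);
> >>
> >>       /* hand off to the real MPI program (the name here is hypothetical) */
> >>       execl("./my_mpi_app", "my_mpi_app", (char *)NULL);
> >>       perror("execl");
> >>       return 1;
> >>   }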
> >>
> >> Does that work for you?
> >> Ralph
> >>
> >>
> >> On Jul 22, 2010, at 6:48 AM, Philippe wrote:
> >>
> >>> On Wed, Jul 21, 2010 at 10:44 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>
> >>>> On Jul 21, 2010, at 7:44 AM, Philippe wrote:
> >>>>
> >>>>> Ralph,
> >>>>>
> >>>>> Sorry for the late reply -- I was away on vacation.
> >>>>
> >>>> no problem at all!
> >>>>
> >>>>>
> >>>>> Regarding your earlier question about how many processes were
> >>>>> involved when the memory was entirely allocated: it was only two, a
> >>>>> sender and a receiver. I'm still trying to pinpoint what can be
> >>>>> different between the standalone case and the "integrated" case. I
> >>>>> will try to find out what part of the code is allocating memory in a
> >>>>> loop.
> >>>>
> >>>> Hmmm... that sounds like a bug in your program. Let me know what you find.
> >>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Jul 20, 2010 at 12:51 AM, Ralph Castain <r...@open-mpi.org> 
> >>>>> wrote:
> >>>>>> Well, I finally managed to make this work without the required 
> >>>>>> ompi-server rendezvous point. The fix is only in the devel trunk right 
> >>>>>> now - I'll have to ask the release managers for 1.5 and 1.4 if they 
> >>>>>> want it ported to those series.
> >>>>>>
> >>>>>
> >>>>> Great -- I'll give it a try.
> >>>>>
> >>>>>> On the notion of integrating OMPI to your launch environment: remember 
> >>>>>> that we don't necessarily require that you use mpiexec for that 
> >>>>>> purpose. If your launch environment provides just a little info in the 
> >>>>>> environment of the launched procs, we can usually devise a method that 
> >>>>>> allows the procs to perform an MPI_Init as a single job without all 
> >>>>>> this work you are doing.
> >>>>>>
> >>>>>
> >>>>> I'm working on creating operators using MPI for the IBM product
> >>>>> "InfoSphere Streams". It has its own launching mechanism to start the
> >>>>> processes. However, I can pass some information to the processes that
> >>>>> belong to the same job (a Streams job, which should map neatly to an
> >>>>> MPI job).
> >>>>>
> >>>>>> Only difference is that your procs will all block in MPI_Init until 
> >>>>>> they -all- have executed that function. If that isn't a problem, this 
> >>>>>> would be a much more scalable and reliable method than doing it thru 
> >>>>>> massive calls to MPI_Port_connect.
> >>>>>>
> >>>>>
> >>>>> In the general case, that would be a problem, but for my prototype,
> >>>>> this is acceptable.
> >>>>>
> >>>>> In general, each process is composed of operators; some may be MPI
> >>>>> related and some may not. But in my case, I know ahead of time which
> >>>>> processes will be part of the MPI job, so I can easily deal with the
> >>>>> fact that they would block on MPI_Init (actually MPI_Init_thread,
> >>>>> since it's using a lot of threads).
> >>>>
> >>>> We have talked in the past about creating a non-blocking MPI_Init as an 
> >>>> extension to the standard. It would lock you to Open MPI, though...
> >>>>
> >>>> Regardless, at some point you would have to know how many processes are 
> >>>> going to be part of the job so you can know when MPI_Init is complete. I 
> >>>> would think you would require that info for the singleton wireup anyway 
> >>>> - yes? Otherwise, how would you know when to quit running connect-accept?
> >>>>
> >>>
> >>> The short answer is yes... although the longer answer is a bit more
> >>> complicated. Currently I do know the number of connects I need to do on
> >>> a per-port basis. A job can contain an arbitrary number of MPI
> >>> processes, each opening one or more ports, so I know the count port by
> >>> port, but I don't need to worry about how many MPI processes there are
> >>> globally. To make things a bit more complicated, each MPI operator can
> >>> be "fused" with other operators to make a process, and each fused
> >>> operator may or may not require MPI. The bottom line is, to get the
> >>> total number of processes to calculate rank & size, I need to reverse
> >>> engineer the fusing that the compiler may do.
> >>>
> >>> But that's OK, I'm willing to do that for our prototype :-)
> >>>
> >>>>>
> >>>>> Is there any documentation or example I can use to see what information
> >>>>> I can pass to the processes to enable that? Is it just environment
> >>>>> variables?
> >>>>
> >>>> No real documentation - a lack I should probably fill. At the moment, we 
> >>>> don't have a "generic" module for standalone launch, but I can create 
> >>>> one as it is pretty trivial. I would then need you to pass each process 
> >>>> envars telling it the total number of processes in the MPI job, its rank 
> >>>> within that job, and a file where some rendezvous process (can be 
> >>>> rank=0) has provided that port string. Armed with that info, I can 
> >>>> wire up the job.
> >>>>
> >>>> Won't be as scalable as an mpirun-initiated startup, but will be much 
> >>>> better than doing it from singletons.
> >>>
> >>> that would be great. I can definitely pass environment variables to
> >>> each process.
> >>>
> >>>>
> >>>> Or if you prefer, we could setup an "infosphere" module that we could 
> >>>> customize for this system. Main thing here would be to provide us with 
> >>>> some kind of regex (or access to a file containing the info) that 
> >>>> describes the map of rank to node so we can construct the wireup 
> >>>> communication pattern.
> >>>>
> >>>
> >>> I think for our prototype we are fine with the first method. I'd leave
> >>> the cleaner implementation as a task for the product team ;-)
> >>>
> >>> Regarding the "generic" module, is that something you can put together
> >>> quickly? Can I help in any way?
> >>>
> >>> Thanks!
> >>> p
> >>>
> >>>> Either way would work. The second is more scalable, but I don't know if 
> >>>> you have (or can construct) the map info.
> >>>>
> >>>>>
> >>>>> Many thanks!
> >>>>> p.
> >>>>>
> >>>>>>
> >>>>>> On Jul 18, 2010, at 4:09 PM, Philippe wrote:
> >>>>>>
> >>>>>>> Ralph,
> >>>>>>>
> >>>>>>> Thanks for investigating.
> >>>>>>>
> >>>>>>> I've applied the two patches you mentioned earlier and ran with the
> >>>>>>> ompi-server. Although I was able to run our standalone test, when I
> >>>>>>> integrated the changes into our code, the processes entered a crazy loop
> >>>>>>> and allocated all the available memory when calling MPI_Comm_connect.
> >>>>>>> I was not able to identify why it works standalone but not integrated
> >>>>>>> with our code. If I find out why, I'll let you know.
> >>>>>>>
> >>>>>>> Looking forward to your findings. We'll be happy to test any patches
> >>>>>>> if you have some!
> >>>>>>>
> >>>>>>> p.
> >>>>>>>
> >>>>>>> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> 
> >>>>>>> wrote:
> >>>>>>>> Okay, I can reproduce this problem. Frankly, I don't think this ever 
> >>>>>>>> worked with OMPI, and I'm not sure how the choice of BTL makes a 
> >>>>>>>> difference.
> >>>>>>>>
> >>>>>>>> The program is crashing in the communicator definition, which 
> >>>>>>>> involves a communication over our internal out-of-band messaging 
> >>>>>>>> system. That system has zero connection to any BTL, so it should 
> >>>>>>>> crash either way.
> >>>>>>>>
> >>>>>>>> Regardless, I will play with this a little as time allows. Thanks 
> >>>>>>>> for the reproducer!
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm trying to run a test program which consists of a server 
> >>>>>>>>> creating a
> >>>>>>>>> port using MPI_Open_port and N clients using MPI_Comm_connect to
> >>>>>>>>> connect to the server.
> >>>>>>>>>
> >>>>>>>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3
> >>>>>>>>> clients, I get the following error message:
> >>>>>>>>>
> >>>>>>>>>   [node003:32274] [[37084,0],0]:route_callback tried routing message
> >>>>>>>>> from [[37084,1],0] to [[40912,1],0]:102, can't find route
> >>>>>>>>>
> >>>>>>>>> This is only happening with the openib BTL. With the tcp BTL it works
> >>>>>>>>> perfectly fine (ofud also works, as a matter of fact...). This has been
> >>>>>>>>> tested on two completely different clusters, with identical results.
> >>>>>>>>> In both cases, the IB fabric works normally.
> >>>>>>>>>
> >>>>>>>>> Any help would be greatly appreciated! Several people in my team
> >>>>>>>>> looked at the problem. Google and the mailing list archive did not
> >>>>>>>>> provide any clue. I believe that from an MPI standpoint, my test
> >>>>>>>>> program is valid (and it works with TCP, which makes me feel better
> >>>>>>>>> about the sequence of MPI calls).
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Philippe.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Background:
> >>>>>>>>>
> >>>>>>>>> I intend to use Open MPI to transport data inside a much larger
> >>>>>>>>> application. Because of that, I cannot use mpiexec. Each process is
> >>>>>>>>> started by our own "job management" and uses a name server to find
> >>>>>>>>> out about the others. Once all the clients are connected, I would like
> >>>>>>>>> the server to do MPI_Recv to get the data from all the clients. I don't
> >>>>>>>>> care about the order or which clients are sending data, as long as I
> >>>>>>>>> can receive it with one call. To do that, the clients and the server go
> >>>>>>>>> through a series of Comm_accept/Comm_connect/Intercomm_merge calls
> >>>>>>>>> so that at the end, all the clients and the server are inside the same
> >>>>>>>>> intracomm.
> >>>>>>>>>
> >>>>>>>>> Steps:
> >>>>>>>>>
> >>>>>>>>> I have a sample program that shows the issue. I tried to make it as
> >>>>>>>>> short as possible. It needs to be executed on a shared file system
> >>>>>>>>> like NFS because the server writes the port info to a file that the
> >>>>>>>>> clients will read. To reproduce the issue, the following steps should
> >>>>>>>>> be performed:
> >>>>>>>>>
> >>>>>>>>> 0. compile the test with "mpicc -o ben12 ben12.c"
> >>>>>>>>> 1. ssh to the machine that will be the server
> >>>>>>>>> 2. run ./ben12 3 1
> >>>>>>>>> 3. ssh to the machine that will be the client #1
> >>>>>>>>> 4. run ./ben12 3 0
> >>>>>>>>> 5. repeat step 3-4 for client #2 and #3
> >>>>>>>>>
> >>>>>>>>> The server accepts the connection from client #1 and merges it into a
> >>>>>>>>> new intracomm. It then accepts the connection from client #2 and merges
> >>>>>>>>> it. When client #3 arrives, the server accepts the connection, but that
> >>>>>>>>> causes clients #1 and #2 to die with the error above (see the complete
> >>>>>>>>> trace in the tarball).
> >>>>>>>>>
> >>>>>>>>> The exact steps are:
> >>>>>>>>>
> >>>>>>>>>     - server open port
> >>>>>>>>>     - server does accept
> >>>>>>>>>     - client #1 does connect
> >>>>>>>>>     - server and client #1 do merge
> >>>>>>>>>     - server does accept
> >>>>>>>>>     - client #2 does connect
> >>>>>>>>>     - server, client #1 and client #2 do merge
> >>>>>>>>>     - server does accept
> >>>>>>>>>     - client #3 does connect
> >>>>>>>>>     - server, client #1, client #2 and client #3 do merge
> >>>>>>>>>
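> >>>>>>>>> In code, that pattern looks roughly like this (a simplified sketch,
> >>>>>>>>> not the actual ben12.c -- error handling, the port file on NFS, and
> >>>>>>>>> how each client learns its own index are omitted or made up here):
> >>>>>>>>>
> >>>>>>>>>   #include <mpi.h>
> >>>>>>>>>
> >>>>>>>>>   /* call after MPI_Init; my_index is 1..nclients for clients, 0 for
> >>>>>>>>>    * the server (how it is obtained is launcher-specific) */
> >>>>>>>>>   void wireup(int nclients, int is_server, int my_index)
> >>>>>>>>>   {
> >>>>>>>>>       char port[MPI_MAX_PORT_NAME];
> >>>>>>>>>       MPI_Comm intra = MPI_COMM_SELF, inter, merged;
> >>>>>>>>>       int first = 0;
> >>>>>>>>>
> >>>>>>>>>       if (is_server) {
> >>>>>>>>>           MPI_Open_port(MPI_INFO_NULL, port);
> >>>>>>>>>           /* ... write 'port' to the shared file for the clients ... */
> >>>>>>>>>       } else {
> >>>>>>>>>           /* ... read 'port' back from the shared file ... */
> >>>>>>>>>           MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
> >>>>>>>>>           MPI_Intercomm_merge(inter, 1, &intra);  /* join as "high" group */
> >>>>>>>>>           first = my_index;
> >>>>>>>>>       }
> >>>>>>>>>
> >>>>>>>>>       /* the server and every already-merged client must take part in
> >>>>>>>>>        * each later accept/merge, since both calls are collective over
> >>>>>>>>>        * the growing intracomm */
> >>>>>>>>>       for (int i = first; i < nclients; i++) {
> >>>>>>>>>           MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
> >>>>>>>>>           MPI_Intercomm_merge(inter, 0, &merged); /* accepting side = "low" */
> >>>>>>>>>           intra = merged;                         /* server + clients 1..i+1 */
> >>>>>>>>>       }
> >>>>>>>>>       /* 'intra' now contains the server and all the clients */
> >>>>>>>>>   }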
> >>>>>>>>>
> >>>>>>>>> My infiniband network works normally with other test programs or
> >>>>>>>>> applications (MPI or others like Verbs).
> >>>>>>>>>
> >>>>>>>>> Info about my setup:
> >>>>>>>>>
> >>>>>>>>>    Open MPI version = 1.4.1 (I also tried 1.4.2, nightly snapshots of
> >>>>>>>>> 1.4.3 and 1.5 --- all show the same error)
> >>>>>>>>>    config.log in the tarball
> >>>>>>>>>    "ompi_info --all" in the tarball
> >>>>>>>>>    OFED version = 1.3 installed from RHEL 5.3
> >>>>>>>>>    Distro = Red Hat Enterprise Linux 5.3
> >>>>>>>>>    Kernel = 2.6.18-128.4.1.el5 x86_64
> >>>>>>>>>    subnet manager = built-in SM from the cisco/topspin switch
> >>>>>>>>>    output of ibv_devinfo included in the tarball (there are no 
> >>>>>>>>> "bad" nodes)
> >>>>>>>>>    "ulimit -l" says "unlimited"
> >>>>>>>>>
> >>>>>>>>> The tarball contains:
> >>>>>>>>>
> >>>>>>>>>   - ben12.c: my test program showing the behavior
> >>>>>>>>>   - config.log / config.out / make.out / make-install.out /
> >>>>>>>>> ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
> >>>>>>>>>   - trace-tcp.txt: output of the server and each client when it 
> >>>>>>>>> works
> >>>>>>>>> with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
> >>>>>>>>>   - trace-ib.txt: output of the server and each client when it fails
> >>>>>>>>> with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
> >>>>>>>>>
> >>>>>>>>> I hope I provided enough info for somebody to reproduce the 
> >>>>>>>>> problem...
> >>>>>>>>> <ompi-output.tar.bz2>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

