On Wed, Jul 21, 2010 at 10:44 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jul 21, 2010, at 7:44 AM, Philippe wrote:
>
>> Ralph,
>>
>> Sorry for the late reply -- I was away on vacation.
>
> no problem at all!
>
>> Regarding your earlier question about how many processes were involved when the memory was entirely allocated: it was only two, a sender and a receiver. I'm still trying to pinpoint what can be different between the standalone case and the "integrated" case. I will try to find out what part of the code is allocating memory in a loop.
>
> hmmm.... that sounds like a bug in your program. Let me know what you find.
>
>> On Tue, Jul 20, 2010 at 12:51 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Well, I finally managed to make this work without the required ompi-server rendezvous point. The fix is only in the devel trunk right now - I'll have to ask the release managers for 1.5 and 1.4 if they want it ported to those series.
>>
>> great -- I'll give it a try
>>
>>> On the notion of integrating OMPI into your launch environment: remember that we don't necessarily require that you use mpiexec for that purpose. If your launch environment provides just a little info in the environment of the launched procs, we can usually devise a method that allows the procs to perform an MPI_Init as a single job without all this work you are doing.
>>
>> I'm working on creating operators using MPI for the IBM product "InfoSphere Streams". It has its own launching mechanism to start the processes. However, I can pass some information to the processes that belong to the same job (a Streams job, which should map neatly to an MPI job).
>>
>>> Only difference is that your procs will all block in MPI_Init until they -all- have executed that function. If that isn't a problem, this would be a much more scalable and reliable method than doing it through massive calls to MPI_Comm_connect.
>>
>> In the general case that would be a problem, but for my prototype this is acceptable.
>>
>> In general, each process is composed of operators; some may be MPI-related and some may not. But in my case, I know ahead of time which processes will be part of the MPI job, so I can easily deal with the fact that they would block in MPI_Init (actually MPI_Init_thread, since it uses a lot of threads).
>
> We have talked in the past about creating a non-blocking MPI_Init as an extension to the standard. It would lock you to Open MPI, though...
>
> Regardless, at some point you would have to know how many processes are going to be part of the job so you can know when MPI_Init is complete. I would think you would require that info for the singleton wireup anyway - yes? Otherwise, how would you know when to quit running connect-accept?
>
The short answer is yes... although the longer answer is a bit more complicated. Currently I do know the number of connects I need to do on a per-port basis. A job can contain an arbitrary number of MPI processes, each opening one or more ports, so I know the count port by port, but I don't need to worry about how many MPI processes there are globally. To make things a bit more complicated, each MPI operator can be "fused" with other operators to make a process, and each fused operator may or may not require MPI. The bottom line is: to get the total number of processes and calculate rank & size, I need to reverse-engineer the fusing that the compiler may do. But that's ok, I'm willing to do that for our prototype :-)

>> Is there documentation or an example I can use to see what information I can pass to the processes to enable that? Is it just environment variables?
>
> No real documentation - a lack I should probably fill. At the moment, we don't have a "generic" module for standalone launch, but I can create one as it is pretty trivial. I would then need you to pass each process envars telling it the total number of processes in the MPI job, its rank within that job, and a file where some rendezvous process (can be rank=0) has provided that port string. Armed with that info, I can wire up the job.
>
> Won't be as scalable as an mpirun-initiated startup, but will be much better than doing it from singletons.

That would be great. I can definitely pass environment variables to each process.

> Or if you prefer, we could set up an "infosphere" module that we could customize for this system. Main thing here would be to provide us with some kind of regex (or access to a file containing the info) that describes the map of rank to node so we can construct the wireup communication pattern.

I think for our prototype we are fine with the first method. I'd leave the cleaner implementation as a task for the product team ;-)

Regarding the "generic" module, is that something you can put together quickly? Can I help in any way?

Thanks!
p

> Either way would work. The second is more scalable, but I don't know if you have (or can construct) the map info.
>
>> Many thanks!
>> p.
>>
>>> On Jul 18, 2010, at 4:09 PM, Philippe wrote:
>>>
>>>> Ralph,
>>>>
>>>> thanks for investigating.
>>>>
>>>> I've applied the two patches you mentioned earlier and ran with the ompi-server. Although I was able to run our standalone test, when I integrated the changes into our code, the processes entered a crazy loop and allocated all the memory available when calling MPI_Comm_connect. I was not able to identify why it works standalone but not integrated with our code. If I find out why, I'll let you know.
>>>>
>>>> Looking forward to your findings. We'll be happy to test any patches if you have some!
>>>>
>>>> p.
>>>>
>>>> On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Okay, I can reproduce this problem. Frankly, I don't think this ever worked with OMPI, and I'm not sure how the choice of BTL makes a difference.
>>>>>
>>>>> The program is crashing in the communicator definition, which involves a communication over our internal out-of-band messaging system. That system has zero connection to any BTL, so it should crash either way.
>>>>>
>>>>> Regardless, I will play with this a little as time allows. Thanks for the reproducer!
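To make the wireup contract discussed above concrete, here is a minimal sketch of what a standalone-launched process might check at startup. The environment variable names (MYJOB_SIZE, MYJOB_RANK, MYJOB_RENDEZVOUS_FILE) are hypothetical placeholders invented for illustration only; the real names would be whatever the proposed "generic" module ends up defining.

/*
 * Sketch of the wireup information a standalone-launched process could
 * expect in its environment, per the discussion above.  The variable
 * names below are hypothetical placeholders -- the "generic" module did
 * not exist yet at the time of this thread, so the real names are Open
 * MPI's to define.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *size_s = getenv("MYJOB_SIZE");            /* total procs in the MPI job */
    const char *rank_s = getenv("MYJOB_RANK");            /* this process's rank in that job */
    const char *rdv    = getenv("MYJOB_RENDEZVOUS_FILE"); /* file where the rendezvous proc (e.g. rank 0) publishes its port string */

    if (!size_s || !rank_s || !rdv) {
        fprintf(stderr, "wireup info missing from the environment\n");
        return 1;
    }

    printf("rank %s of %s, rendezvous file: %s\n", rank_s, size_s, rdv);

    /* With these three pieces of information present in every process,
     * a single MPI_Init (or MPI_Init_thread) could wire up the whole job
     * at once, instead of N-1 connect/accept/merge rounds. */
    return 0;
}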
>>>>>
>>>>>
>>>>> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to run a test program which consists of a server creating a port using MPI_Open_port and N clients using MPI_Comm_connect to connect to the server.
>>>>>>
>>>>>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3 clients, I get the following error message:
>>>>>>
>>>>>> [node003:32274] [[37084,0],0]:route_callback tried routing message from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>>>>>
>>>>>> This is only happening with the openib BTL. With the tcp BTL it works perfectly fine (ofud also works, as a matter of fact...). This has been tested on two completely different clusters, with identical results. In either case, the IB fabric works normally.
>>>>>>
>>>>>> Any help would be greatly appreciated! Several people in my team have looked at the problem. Google and the mailing list archive did not provide any clue. I believe that from an MPI standpoint, my test program is valid (and it works with TCP, which makes me feel better about the sequence of MPI calls).
>>>>>>
>>>>>> Regards,
>>>>>> Philippe.
>>>>>>
>>>>>>
>>>>>> Background:
>>>>>>
>>>>>> I intend to use Open MPI to transport data inside a much larger application. Because of that, I cannot use mpiexec. Each process is started by our own "job management" and uses a name server to find out about the others. Once all the clients are connected, I would like the server to do MPI_Recv to get the data from all the clients. I don't care about the order or which clients are sending data, as long as I can receive it with one call. To do that, the clients and the server go through a series of Comm_accept/Comm_connect/Intercomm_merge calls so that at the end, all the clients and the server are inside the same intracomm.
>>>>>>
>>>>>> Steps:
>>>>>>
>>>>>> I have a sample program that shows the issue. I tried to make it as short as possible. It needs to be executed on a shared file system like NFS because the server writes the port info to a file that the clients will read. To reproduce the issue, the following steps should be performed:
>>>>>>
>>>>>> 0. compile the test with "mpicc -o ben12 ben12.c"
>>>>>> 1. ssh to the machine that will be the server
>>>>>> 2. run ./ben12 3 1
>>>>>> 3. ssh to the machine that will be client #1
>>>>>> 4. run ./ben12 3 0
>>>>>> 5. repeat steps 3-4 for clients #2 and #3
>>>>>>
>>>>>> The server accepts the connection from client #1 and merges it into a new intracomm. It then accepts the connection from client #2 and merges it. When client #3 arrives, the server accepts the connection, but that causes clients #1 and #2 to die with the error above (see the complete trace in the tarball).
>>>>>>
>>>>>> The exact steps are:
>>>>>>
>>>>>> - server opens port
>>>>>> - server does accept
>>>>>> - client #1 does connect
>>>>>> - server and client #1 do merge
>>>>>> - server does accept
>>>>>> - client #2 does connect
>>>>>> - server, client #1 and client #2 do merge
>>>>>> - server does accept
>>>>>> - client #3 does connect
>>>>>> - server, client #1, client #2 and client #3 do merge
>>>>>>
>>>>>>
>>>>>> My InfiniBand network works normally with other test programs or applications (MPI or others like Verbs).
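For reference, below is a minimal, self-contained sketch of the accept/connect/merge sequence described above. It is not the actual ben12.c from the tarball; the port file name, the argument order, and the (absent) error handling are assumptions made purely for illustration.

/*
 * Minimal sketch of the accept/connect/merge sequence described above.
 * NOT the actual ben12.c -- the port file name, argument handling and
 * missing error checks are assumptions for illustration only.
 *
 * Assumed usage:  ./sketch <nclients> <is_server>
 *   server:  ./sketch 3 1     (start first)
 *   client:  ./sketch 3 0     (start once per client)
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PORT_FILE "mpi_port.txt"   /* assumed rendezvous file on a shared FS (e.g. NFS) */

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME] = "";
    MPI_Comm intra, inter, merged;
    int nclients, is_server, size;

    if (argc < 3) return 1;
    nclients  = atoi(argv[1]);
    is_server = atoi(argv[2]);

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_SELF, &intra);     /* every process starts out as a singleton */

    if (is_server) {
        /* Server: open a port and publish it through a file the clients can read. */
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE *f = fopen(PORT_FILE, "w");
        fprintf(f, "%s\n", port);
        fclose(f);
    } else {
        /* Client: read the port string, connect, and merge into one intracomm. */
        FILE *f = fopen(PORT_FILE, "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 1 /* newcomer is the "high" group */, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        intra = merged;
    }

    /* Everyone already merged keeps accepting -- MPI_Comm_accept is collective
     * over 'intra' -- until all clients have joined.  The port string only
     * matters at the root (rank 0, i.e. the server, which always merges "low"). */
    MPI_Comm_size(intra, &size);
    while (size < nclients + 1) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 0 /* existing group stays "low" */, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        intra = merged;
        MPI_Comm_size(intra, &size);
    }

    /* At this point the server (rank 0) and all clients share one intracomm,
     * so the server could MPI_Recv data from any client, as described above. */
    if (is_server) MPI_Close_port(port);
    MPI_Comm_free(&intra);
    MPI_Finalize();
    return 0;
}

The structural point is that each accept is collective over the growing intracomm, so every client that has already merged participates in accepting the next one; that matches the step at which clients #1 and #2 die in the openib case.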
>>>>>>
>>>>>> Info about my setup:
>>>>>>
>>>>>> Open MPI version = 1.4.1 (I also tried 1.4.2, a nightly snapshot of 1.4.3, and a nightly snapshot of 1.5 --- all show the same error)
>>>>>> config.log in the tarball
>>>>>> "ompi_info --all" in the tarball
>>>>>> OFED version = 1.3, installed from RHEL 5.3
>>>>>> Distro = Red Hat Enterprise Linux 5.3
>>>>>> Kernel = 2.6.18-128.4.1.el5 x86_64
>>>>>> subnet manager = built-in SM from the Cisco/Topspin switch
>>>>>> output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>>>>>> "ulimit -l" says "unlimited"
>>>>>>
>>>>>> The tarball contains:
>>>>>>
>>>>>> - ben12.c: my test program showing the behavior
>>>>>> - config.log / config.out / make.out / make-install.out / ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>>>>>> - trace-tcp.txt: output of the server and each client when it works with TCP (I added "btl = tcp,self" in ~/.openmpi/mca-params.conf)
>>>>>> - trace-ib.txt: output of the server and each client when it fails with IB (I added "btl = openib,self" in ~/.openmpi/mca-params.conf)
>>>>>>
>>>>>> I hope I provided enough info for somebody to reproduce the problem...
>>>>>> <ompi-output.tar.bz2>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>