Hm, this actually looks correct. The question now is basically why the
intermediate handshake by the processes with rank 0 on the
inter-communicator is not finishing.
I am wondering whether this could be related to a problem reported in
another thread (Processes stuck after MPI_Waitall() in 1.4.1)?

http://www.open-mpi.org/community/lists/users/2010/07/13720.php
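For reference, the handshake that the merge performs amounts to roughly the
following sketch at the MPI level. This is a simplified illustration, not the
actual Open MPI implementation (which goes through an inter-allgatherv inside
ompi_comm_determine_first(), as the stack traces quoted below show); the
local_comm argument is assumed to be an intra-communicator spanning just the
local group.

#include <mpi.h>

/* Simplified sketch: rank 0 of each group exchanges its "high" flag with
 * rank 0 of the remote group over the inter-communicator, then shares the
 * result inside its local group. */
static int exchange_high_flag(MPI_Comm intercomm, MPI_Comm local_comm, int high)
{
    int local_rank, remote_high = 0;

    MPI_Comm_rank(local_comm, &local_rank);

    if (local_rank == 0) {
        /* point-to-point ranks on an inter-communicator address the remote
         * group, so this is the rank-0 <-> rank-0 handshake */
        MPI_Sendrecv(&high, 1, MPI_INT, 0, 0,
                     &remote_high, 1, MPI_INT, 0, 0,
                     intercomm, MPI_STATUS_IGNORE);
    }

    /* the rest of the local group learns the result from its rank 0 */
    MPI_Bcast(&remote_high, 1, MPI_INT, 0, local_comm);
    return remote_high;
}

If that rank-0 exchange never completes, everyone else stays blocked in the
local broadcast, which matches both backtraces below.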




On 7/28/2010 4:01 AM, Grzegorz Maj wrote:
> I've attached gdb to the client which has just connected to the grid.
> Its bt is almost exactly the same as the server's one:
> #0  0x428066d7 in sched_yield () from /lib/libc.so.6
> #1  0x00933cbf in opal_progress () at ../../opal/runtime/opal_progress.c:220
> #2  0x00d460b8 in opal_condition_wait (c=0xdc3160, m=0xdc31a0) at
> ../../opal/threads/condition.h:99
> #3  0x00d463cc in ompi_request_default_wait_all (count=2,
> requests=0xff8a36d0, statuses=0x0) at
> ../../ompi/request/req_wait.c:262
> #4  0x00a1431f in mca_coll_inter_allgatherv_inter (sbuf=0xff8a3794,
> scount=1, sdtype=0x8049400, rbuf=0xff8a3750, rcounts=0x80948e0,
> disps=0x8093938, rdtype=0x8049400, comm=0x8094fb8, module=0x80954a0)
>     at ../../../../../ompi/mca/coll/inter/coll_inter_allgatherv.c:127
> #5  0x00d3198f in ompi_comm_determine_first (intercomm=0x8094fb8,
> high=1) at ../../ompi/communicator/comm.c:1199
> #6  0x00d75833 in PMPI_Intercomm_merge (intercomm=0x8094fb8, high=1,
> newcomm=0xff8a4c00) at pintercomm_merge.c:84
> #7  0x08048a16 in main (argc=892352312, argv=0x32323038) at client.c:28
> 
> I've tried both scenarios described: when hangs a client connecting
> from machines B and C. In both cases bt looks the same.
> How does it look to you?
> Shall I repost that using a different subject as Ralph suggested?
> 
> Regards,
> Grzegorz
> 
> 
> 
> 2010/7/27 Edgar Gabriel <gabr...@cs.uh.edu>:
>> based on your output shown here, there is absolutely nothing wrong
>> (yet). Both processes are in the same function and do what they are
>> supposed to do.
>>
>> However, I am fairly sure that the client process whose bt you show is
>> already part of current_intracomm. Could you try to create a bt of the
>> process that is not yet part of current_intracomm? (If I understand your
>> code correctly, the intercommunicator is in an n-1 configuration, with each
>> client process becoming part of n after the intercomm_merge). It would be
>> interesting to see where that process is...
>>
>> Thanks
>> Edgar
>>
>> On 7/27/2010 1:42 PM, Ralph Castain wrote:
>>> This slides outside of my purview - I would suggest you post this question 
>>> with a different subject line specifically mentioning failure of 
>>> intercomm_merge to work so it attracts the attention of those with 
>>> knowledge of that area.
>>>
>>>
>>> On Jul 27, 2010, at 9:30 AM, Grzegorz Maj wrote:
>>>
>>>> So now I have a new question.
>>>> When I run my server and a lot of clients on the same machine,
>>>> everything looks fine.
>>>>
>>>> But when I try to run the clients on several machines the most
>>>> frequent scenario is:
>>>> * server is started on machine A
>>>> * X (= 1, 4, 10, ..) clients are started on machine B and they connect
>>>> successfully
>>>> * the first client starting on machine C connects successfully to the
>>>> server, but then the whole grid hangs in MPI_Intercomm_merge (all the
>>>> processes from the intercommunicator get there).
>>>>
>>>> As I said, that is the most frequent scenario. Sometimes I can connect the
>>>> clients from several machines. Sometimes it hangs (always in
>>>> MPI_Intercomm_merge) when connecting the clients from machine B.
>>>> The interesting thing is that if, before MPI_Intercomm_merge, I send a
>>>> dummy message on the intercommunicator from the process with rank 0 in one
>>>> group to the process with rank 0 in the other one, it does not hang in
>>>> MPI_Intercomm_merge.
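The workaround described above would look roughly like the following sketch.
The names intercomm, i_send and high are placeholders for whatever the real
client.c/server.c use (the attached sources are not reproduced in this
thread), so this is a reconstruction, not the actual code:

#include <mpi.h>

/* Hypothetical reconstruction of the dummy-message workaround: force a
 * rank-0 <-> rank-0 message across the inter-communicator before merging. */
static MPI_Comm merge_with_dummy_message(MPI_Comm intercomm, int i_send, int high)
{
    int local_rank, dummy = 0;
    MPI_Comm merged;

    MPI_Comm_rank(intercomm, &local_rank);   /* rank within the local group */

    if (local_rank == 0) {
        if (i_send) {
            /* rank 0 of one group sends a dummy message ... */
            MPI_Send(&dummy, 1, MPI_INT, 0, 99, intercomm);
        } else {
            /* ... and rank 0 of the other group receives it */
            MPI_Recv(&dummy, 1, MPI_INT, 0, 99, intercomm, MPI_STATUS_IGNORE);
        }
    }

    /* with the dummy exchange in place, the merge reportedly no longer hangs */
    MPI_Intercomm_merge(intercomm, high, &merged);
    return merged;
}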
>>>>
>>>> I've tried both versions with and without the first patch (ompi-server
>>>> as orted) but it doesn't change the behavior.
>>>>
>>>> I've attached gdb to my server, this is bt:
>>>> #0  0xffffe410 in __kernel_vsyscall ()
>>>> #1  0x00637afc in sched_yield () from /lib/libc.so.6
>>>> #2  0xf7e8ce31 in opal_progress () at 
>>>> ../../opal/runtime/opal_progress.c:220
>>>> #3  0xf7f60ad4 in opal_condition_wait (c=0xf7fd7dc0, m=0xf7fd7e00) at
>>>> ../../opal/threads/condition.h:99
>>>> #4  0xf7f60dee in ompi_request_default_wait_all (count=2,
>>>> requests=0xff8d7754, statuses=0x0) at
>>>> ../../ompi/request/req_wait.c:262
>>>> #5  0xf7d3e221 in mca_coll_inter_allgatherv_inter (sbuf=0xff8d7824,
>>>> scount=1, sdtype=0x8049200, rbuf=0xff8d77e0, rcounts=0x9783df8,
>>>> disps=0x9755520, rdtype=0x8049200, comm=0x978c2a8, module=0x9794b08)
>>>>    at ../../../../../ompi/mca/coll/inter/coll_inter_allgatherv.c:127
>>>> #6  0xf7f4c615 in ompi_comm_determine_first (intercomm=0x978c2a8,
>>>> high=0) at ../../ompi/communicator/comm.c:1199
>>>> #7  0xf7f8d1d9 in PMPI_Intercomm_merge (intercomm=0x978c2a8, high=0,
>>>> newcomm=0xff8d78c0) at pintercomm_merge.c:84
>>>> #8  0x0804893c in main (argc=Cannot access memory at address 0xf
>>>> ) at server.c:50
>>>>
>>>> And this is bt from one of the clients:
>>>> #0  0xffffe410 in __kernel_vsyscall ()
>>>> #1  0x0064993b in poll () from /lib/libc.so.6
>>>> #2  0xf7de027f in poll_dispatch (base=0x8643fb8, arg=0x86442d8,
>>>> tv=0xff82299c) at ../../../opal/event/poll.c:168
>>>> #3  0xf7dde4b2 in opal_event_base_loop (base=0x8643fb8, flags=2) at
>>>> ../../../opal/event/event.c:807
>>>> #4  0xf7dde34f in opal_event_loop (flags=2) at 
>>>> ../../../opal/event/event.c:730
>>>> #5  0xf7dcfc77 in opal_progress () at 
>>>> ../../opal/runtime/opal_progress.c:189
>>>> #6  0xf7ea80b8 in opal_condition_wait (c=0xf7f25160, m=0xf7f251a0) at
>>>> ../../opal/threads/condition.h:99
>>>> #7  0xf7ea7ff3 in ompi_request_wait_completion (req=0x8686680) at
>>>> ../../ompi/request/request.h:375
>>>> #8  0xf7ea7ef1 in ompi_request_default_wait (req_ptr=0xff822ae8,
>>>> status=0x0) at ../../ompi/request/req_wait.c:37
>>>> #9  0xf7c663a6 in ompi_coll_tuned_bcast_intra_generic
>>>> (buffer=0xff822d20, original_count=1, datatype=0x868bd00, root=0,
>>>> comm=0x86aa7f8, module=0x868b700, count_by_segment=1, tree=0x868b3d8)
>>>>    at ../../../../../ompi/mca/coll/tuned/coll_tuned_bcast.c:237
>>>> #10 0xf7c668ea in ompi_coll_tuned_bcast_intra_binomial
>>>> (buffer=0xff822d20, count=1, datatype=0x868bd00, root=0,
>>>> comm=0x86aa7f8, module=0x868b700, segsize=0)
>>>>    at ../../../../../ompi/mca/coll/tuned/coll_tuned_bcast.c:368
>>>> #11 0xf7c5af12 in ompi_coll_tuned_bcast_intra_dec_fixed
>>>> (buff=0xff822d20, count=1, datatype=0x868bd00, root=0, comm=0x86aa7f8,
>>>> module=0x868b700)
>>>>    at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:256
>>>> #12 0xf7c73269 in mca_coll_sync_bcast (buff=0xff822d20, count=1,
>>>> datatype=0x868bd00, root=0, comm=0x86aa7f8, module=0x86aaa28) at
>>>> ../../../../../ompi/mca/coll/sync/coll_sync_bcast.c:44
>>>> #13 0xf7c80381 in mca_coll_inter_allgatherv_inter (sbuf=0xff822d64,
>>>> scount=0, sdtype=0x8049400, rbuf=0xff822d20, rcounts=0x868a188,
>>>> disps=0x868abb8, rdtype=0x8049400, comm=0x86aa300,
>>>>    module=0x86aae18) at
>>>> ../../../../../ompi/mca/coll/inter/coll_inter_allgatherv.c:134
>>>> #14 0xf7e9398f in ompi_comm_determine_first (intercomm=0x86aa300,
>>>> high=0) at ../../ompi/communicator/comm.c:1199
>>>> #15 0xf7ed7833 in PMPI_Intercomm_merge (intercomm=0x86aa300, high=0,
>>>> newcomm=0xff8241d0) at pintercomm_merge.c:84
>>>> #16 0x08048afd in main (argc=943274038, argv=0x33393133) at client.c:47
>>>>
>>>>
>>>>
>>>> What do you think may cause the problem?
>>>>
>>>>
>>>> 2010/7/26 Ralph Castain <r...@open-mpi.org>:
>>>>> No problem at all - glad it works!
>>>>>
>>>>> On Jul 26, 2010, at 7:58 AM, Grzegorz Maj wrote:
>>>>>
>>>>>> Hi,
>>>>>> I'm very sorry, but the problem was on my side. My installation
>>>>>> process was not always taking the newest sources of openmpi. In this
>>>>>> case it hasn't installed the version with the latest patch. Now I
>>>>>> think everything works fine - I could run over 130 processes with no
>>>>>> problems.
>>>>>> I'm sorry again that I've wasted your time. And thank you for the patch.
>>>>>>
>>>>>> 2010/7/21 Ralph Castain <r...@open-mpi.org>:
>>>>>>> We're having some problem replicating this once my patches are applied. 
>>>>>>> Can you send us your configure cmd? Just the output from "head 
>>>>>>> config.log" will do for now.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Jul 20, 2010, at 9:09 AM, Grzegorz Maj wrote:
>>>>>>>
>>>>>>>> My start script looks almost exactly the same as the one published by
>>>>>>>> Edgar, i.e. the processes are started one by one with no delay.
>>>>>>>>
>>>>>>>> 2010/7/20 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>> Grzegorz: something occurred to me. When you start all these 
>>>>>>>>> processes, how are you staggering their wireup? Are they flooding us, 
>>>>>>>>> or are you time-shifting them a little?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
>>>>>>>>>
>>>>>>>>>> Hm, so I am not sure how to approach this. First of all, the test 
>>>>>>>>>> case
>>>>>>>>>> works for me. I used up to 80 clients, and for both optimized and
>>>>>>>>>> non-optimized compilation. I ran the tests with trunk (not with 1.4
>>>>>>>>>> series, but the communicator code is identical in both cases). 
>>>>>>>>>> Clearly,
>>>>>>>>>> the patch from Ralph is necessary to make it work.
>>>>>>>>>>
>>>>>>>>>> Additionally, I went through the communicator creation code for dynamic
>>>>>>>>>> communicators trying to find spots that could create problems. The only
>>>>>>>>>> place where I found the number 64 appear is the fortran-to-c mapping
>>>>>>>>>> arrays (e.g. for communicators), where the initial size of the table is
>>>>>>>>>> 64. I looked twice over the pointer-array code to see whether we could
>>>>>>>>>> have a problem there (since it is a key piece of the cid allocation code
>>>>>>>>>> for communicators), but I am fairly confident that it is correct.
>>>>>>>>>>
>>>>>>>>>> Note that we have other (non-dynamic) tests where comm_set is called
>>>>>>>>>> 100,000 times, and the code per se does not seem to have a problem due
>>>>>>>>>> to being called that often. So I am not sure what else to look at.
>>>>>>>>>>
>>>>>>>>>> Edgar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>>>>>>>>>>> As far as I can tell, it appears the problem is somewhere in our 
>>>>>>>>>>> communicator setup. The people knowledgeable on that area are going 
>>>>>>>>>>> to look into it later this week.
>>>>>>>>>>>
>>>>>>>>>>> I'm creating a ticket to track the problem and will copy you on it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Bad news..
>>>>>>>>>>>>> I've tried the latest patch with and without the prior one, but it
>>>>>>>>>>>>> hasn't changed anything. I've also tried using the old code but 
>>>>>>>>>>>>> with
>>>>>>>>>>>>> the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also 
>>>>>>>>>>>>> didn't
>>>>>>>>>>>>> help.
>>>>>>>>>>>>> While looking through the sources of openmpi-1.4.2 I couldn't 
>>>>>>>>>>>>> find any
>>>>>>>>>>>>> call of the function ompi_dpm_base_mark_dyncomm.
>>>>>>>>>>>>
>>>>>>>>>>>> It isn't directly called - it shows in ompi_comm_set as 
>>>>>>>>>>>> ompi_dpm.mark_dyncomm. You were definitely overrunning that array, 
>>>>>>>>>>>> but I guess something else is also being hit. Have to look 
>>>>>>>>>>>> further...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>> Just so you don't have to wait for 1.4.3 release, here is the 
>>>>>>>>>>>>>> patch (doesn't include the prior patch).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>> Dug around a bit and found the problem!!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have no idea who did this or why, but somebody set a limit of
>>>>>>>>>>>>>>>> 64 separate jobids in the dynamic init called by ompi_comm_set,
>>>>>>>>>>>>>>>> which builds the intercommunicator. Unfortunately, they hard-wired
>>>>>>>>>>>>>>>> the array size but never check that size before adding to it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So after 64 calls to connect_accept, you are overwriting other 
>>>>>>>>>>>>>>>> areas of the code. As you found, hitting 66 causes it to 
>>>>>>>>>>>>>>>> segfault.
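In other words, the failure mode is a classic unchecked fixed-size table. A
generic illustration of that pattern (made-up names, not the actual Open MPI
source):

#define MAX_JOBIDS 64                    /* the hard-wired limit */

static int known_jobids[MAX_JOBIDS];
static int num_jobids = 0;

/* buggy: nothing checks against MAX_JOBIDS, so the 65th entry writes past
 * the end of the array and corrupts neighbouring memory */
static void remember_jobid_buggy(int jobid)
{
    known_jobids[num_jobids++] = jobid;
}

/* fixed: reject (or grow the table) once the limit is reached */
static int remember_jobid_fixed(int jobid)
{
    if (num_jobids >= MAX_JOBIDS) {
        return -1;                       /* caller must handle the error */
    }
    known_jobids[num_jobids++] = jobid;
    return 0;
}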
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll fix this on the developer's trunk (I'll also add that 
>>>>>>>>>>>>>>>> original patch to it). Rather than my searching this thread in 
>>>>>>>>>>>>>>>> detail, can you remind me what version you are using so I can 
>>>>>>>>>>>>>>>> patch it too?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using 1.4.2
>>>>>>>>>>>>>>> Thanks a lot; I'm looking forward to the patch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your patience with this!
>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1024 is not the problem: changing it to 2048 hasn't changed
>>>>>>>>>>>>>>>>> anything.
>>>>>>>>>>>>>>>>> Following your advice I've run my process using gdb. 
>>>>>>>>>>>>>>>>> Unfortunately I
>>>>>>>>>>>>>>>>> didn't get anything more than:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>>>>>>>> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
>>>>>>>>>>>>>>>>> 0xf7f39905 in ompi_comm_set () from 
>>>>>>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>>>>>>> #0  0xf7f39905 in ompi_comm_set () from 
>>>>>>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>>>>>>> #1  0xf7e3ba95 in connect_accept () from
>>>>>>>>>>>>>>>>> /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
>>>>>>>>>>>>>>>>> #2  0xf7f62013 in PMPI_Comm_connect () from 
>>>>>>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>>>>>>> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at 
>>>>>>>>>>>>>>>>> client.c:43
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What's more: when I added a breakpoint on ompi_comm_set in the
>>>>>>>>>>>>>>>>> 66th process and stepped a couple of instructions, one of the other
>>>>>>>>>>>>>>>>> processes crashed (as usual, in ompi_comm_set) earlier than the
>>>>>>>>>>>>>>>>> 66th did.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Finally I decided to recompile Open MPI using the -g flag for gcc.
>>>>>>>>>>>>>>>>> In this case the 66-process issue was gone! I was running my
>>>>>>>>>>>>>>>>> applications exactly the same way as previously (even without
>>>>>>>>>>>>>>>>> recompiling them) and I successfully ran over 130 processes.
>>>>>>>>>>>>>>>>> When switching back to the Open MPI build without -g, it
>>>>>>>>>>>>>>>>> segfaults again.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Any ideas? I'm really confused.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>> I would guess the #files limit of 1024. However, if it 
>>>>>>>>>>>>>>>>>> behaves the same way when spread across multiple machines, I 
>>>>>>>>>>>>>>>>>> would suspect it is somewhere in your program itself. Given 
>>>>>>>>>>>>>>>>>> that the segfault is in your process, can you use gdb to 
>>>>>>>>>>>>>>>>>> look at the core file and see where and why it fails?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jul 7, 2010, at 10:17 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>>>> sorry for the late response, but I couldn't find free 
>>>>>>>>>>>>>>>>>>>>> time to play
>>>>>>>>>>>>>>>>>>>>> with this. Finally I've applied the patch you prepared. 
>>>>>>>>>>>>>>>>>>>>> I've launched
>>>>>>>>>>>>>>>>>>>>> my processes in the way you've described and I think it's 
>>>>>>>>>>>>>>>>>>>>> working as
>>>>>>>>>>>>>>>>>>>>> you expected. None of my processes runs the orted daemon 
>>>>>>>>>>>>>>>>>>>>> and they can
>>>>>>>>>>>>>>>>>>>>> perform MPI operations. Unfortunately I'm still hitting 
>>>>>>>>>>>>>>>>>>>>> the 65
>>>>>>>>>>>>>>>>>>>>> processes issue :(
>>>>>>>>>>>>>>>>>>>>> Maybe I'm doing something wrong.
>>>>>>>>>>>>>>>>>>>>> I attach my source code. If anybody could have a look at
>>>>>>>>>>>>>>>>>>>>> this, I would be grateful.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> When I run that code with clients_count <= 65 everything 
>>>>>>>>>>>>>>>>>>>>> works fine:
>>>>>>>>>>>>>>>>>>>>> all the processes create a common grid, exchange some 
>>>>>>>>>>>>>>>>>>>>> information and
>>>>>>>>>>>>>>>>>>>>> disconnect.
>>>>>>>>>>>>>>>>>>>>> When I set clients_count > 65 the 66th process crashes on
>>>>>>>>>>>>>>>>>>>>> MPI_Comm_connect (segmentation fault).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I didn't have time to check the code, but my guess is that 
>>>>>>>>>>>>>>>>>>>> you are still hitting some kind of file descriptor or 
>>>>>>>>>>>>>>>>>>>> other limit. Check to see what your limits are - usually 
>>>>>>>>>>>>>>>>>>>> "ulimit" will tell you.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> My limitations are:
>>>>>>>>>>>>>>>>>>> time(seconds)        unlimited
>>>>>>>>>>>>>>>>>>> file(blocks)         unlimited
>>>>>>>>>>>>>>>>>>> data(kb)             unlimited
>>>>>>>>>>>>>>>>>>> stack(kb)            10240
>>>>>>>>>>>>>>>>>>> coredump(blocks)     0
>>>>>>>>>>>>>>>>>>> memory(kb)           unlimited
>>>>>>>>>>>>>>>>>>> locked memory(kb)    64
>>>>>>>>>>>>>>>>>>> process              200704
>>>>>>>>>>>>>>>>>>> nofiles              1024
>>>>>>>>>>>>>>>>>>> vmemory(kb)          unlimited
>>>>>>>>>>>>>>>>>>> locks                unlimited
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Which one do you think could be responsible for that?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was trying to run all the 66 processes on one machine or 
>>>>>>>>>>>>>>>>>>> spread them
>>>>>>>>>>>>>>>>>>> across several machines and it always crashes the same way 
>>>>>>>>>>>>>>>>>>> on the 66th
>>>>>>>>>>>>>>>>>>> process.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Another thing I would like to know is whether it's normal that
>>>>>>>>>>>>>>>>>>>>> any of my processes calling MPI_Comm_connect or MPI_Comm_accept
>>>>>>>>>>>>>>>>>>>>> while the other side is not ready eats up a full CPU.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes - the waiting process is polling in a tight loop 
>>>>>>>>>>>>>>>>>>>> waiting for the connection to be made.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Any help would be appreciated,
>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2010/4/24 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>> Actually, OMPI is distributed with a daemon that does 
>>>>>>>>>>>>>>>>>>>>>> pretty much what you
>>>>>>>>>>>>>>>>>>>>>> want. Checkout "man ompi-server". I originally wrote 
>>>>>>>>>>>>>>>>>>>>>> that code to support
>>>>>>>>>>>>>>>>>>>>>> cross-application MPI publish/subscribe operations, but 
>>>>>>>>>>>>>>>>>>>>>> we can utilize it
>>>>>>>>>>>>>>>>>>>>>> here too. Have to blame me for not making it more 
>>>>>>>>>>>>>>>>>>>>>> publicly known.
>>>>>>>>>>>>>>>>>>>>>> The attached patch upgrades ompi-server and modifies the 
>>>>>>>>>>>>>>>>>>>>>> singleton startup
>>>>>>>>>>>>>>>>>>>>>> to provide your desired support. This solution works in 
>>>>>>>>>>>>>>>>>>>>>> the following
>>>>>>>>>>>>>>>>>>>>>> manner:
>>>>>>>>>>>>>>>>>>>>>> 1. launch "ompi-server -report-uri <filename>". This 
>>>>>>>>>>>>>>>>>>>>>> starts a persistent
>>>>>>>>>>>>>>>>>>>>>> daemon called "ompi-server" that acts as a rendezvous 
>>>>>>>>>>>>>>>>>>>>>> point for
>>>>>>>>>>>>>>>>>>>>>> independently started applications.  The problem with 
>>>>>>>>>>>>>>>>>>>>>> starting different
>>>>>>>>>>>>>>>>>>>>>> applications and wanting them to MPI connect/accept lies 
>>>>>>>>>>>>>>>>>>>>>> in the need to have
>>>>>>>>>>>>>>>>>>>>>> the applications find each other. If they can't discover 
>>>>>>>>>>>>>>>>>>>>>> contact info for
>>>>>>>>>>>>>>>>>>>>>> the other app, then they can't wire up their 
>>>>>>>>>>>>>>>>>>>>>> interconnects. The
>>>>>>>>>>>>>>>>>>>>>> "ompi-server" tool provides that rendezvous point. I 
>>>>>>>>>>>>>>>>>>>>>> don't like that
>>>>>>>>>>>>>>>>>>>>>> comm_accept segfaulted - should have just error'd out.
>>>>>>>>>>>>>>>>>>>>>> 2. set OMPI_MCA_orte_server=file:<filename>" in the 
>>>>>>>>>>>>>>>>>>>>>> environment where you
>>>>>>>>>>>>>>>>>>>>>> will start your processes. This will allow your 
>>>>>>>>>>>>>>>>>>>>>> singleton processes to find
>>>>>>>>>>>>>>>>>>>>>> the ompi-server. I automatically also set the envar to 
>>>>>>>>>>>>>>>>>>>>>> connect the MPI
>>>>>>>>>>>>>>>>>>>>>> publish/subscribe system for you.
>>>>>>>>>>>>>>>>>>>>>> 3. run your processes. As they think they are 
>>>>>>>>>>>>>>>>>>>>>> singletons, they will detect
>>>>>>>>>>>>>>>>>>>>>> the presence of the above envar and automatically 
>>>>>>>>>>>>>>>>>>>>>> connect themselves to the
>>>>>>>>>>>>>>>>>>>>>> "ompi-server" daemon. This provides each process with 
>>>>>>>>>>>>>>>>>>>>>> the ability to perform
>>>>>>>>>>>>>>>>>>>>>> any MPI-2 operation.
>>>>>>>>>>>>>>>>>>>>>> I tested this on my machines and it worked, so hopefully 
>>>>>>>>>>>>>>>>>>>>>> it will meet your
>>>>>>>>>>>>>>>>>>>>>> needs. You only need to run one "ompi-server" period, so 
>>>>>>>>>>>>>>>>>>>>>> long as you locate
>>>>>>>>>>>>>>>>>>>>>> it where all of the processes can find the contact file 
>>>>>>>>>>>>>>>>>>>>>> and can open a TCP
>>>>>>>>>>>>>>>>>>>>>> socket to the daemon. There is a way to knit multiple 
>>>>>>>>>>>>>>>>>>>>>> ompi-servers into a
>>>>>>>>>>>>>>>>>>>>>> broader network (e.g., to connect processes that cannot 
>>>>>>>>>>>>>>>>>>>>>> directly access a
>>>>>>>>>>>>>>>>>>>>>> server due to network segmentation), but it's a tad 
>>>>>>>>>>>>>>>>>>>>>> tricky - let me know if
>>>>>>>>>>>>>>>>>>>>>> you require it and I'll try to help.
>>>>>>>>>>>>>>>>>>>>>> If you have trouble wiring them all into a single 
>>>>>>>>>>>>>>>>>>>>>> communicator, you might
>>>>>>>>>>>>>>>>>>>>>> ask separately about that and see if one of our MPI 
>>>>>>>>>>>>>>>>>>>>>> experts can provide
>>>>>>>>>>>>>>>>>>>>>> advice (I'm just the RTE grunt).
>>>>>>>>>>>>>>>>>>>>>> HTH - let me know how this works for you and I'll 
>>>>>>>>>>>>>>>>>>>>>> incorporate it into future
>>>>>>>>>>>>>>>>>>>>>> OMPI releases.
>>>>>>>>>>>>>>>>>>>>>> Ralph
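At the MPI level, the rendezvous that ompi-server brokers is the standard
MPI-2 publish/lookup plus connect/accept pattern. A minimal sketch, assuming
ompi-server is already running and OMPI_MCA_orte_server is set as in steps
1-2 above; the service name "grid-rendezvous" and the argv-based role switch
are just examples:

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;
    int i_am_server = (argc > 1 && strcmp(argv[1], "server") == 0);

    MPI_Init(&argc, &argv);   /* singleton init; the envar tells it where
                                 to find ompi-server */

    if (i_am_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("grid-rendezvous", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Unpublish_name("grid-rendezvous", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("grid-rendezvous", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    }

    /* ... use or merge the inter-communicator here ... */

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}

Both sides are started as plain singletons (no mpirun), with the ompi-server
contact file exported in the environment as described in steps 1-2.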
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Apr 24, 2010, at 1:49 AM, Krzysztof Zarzycki wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>>>>>>> I'm Krzysztof and I'm working with Grzegorz Maj on this
>>>>>>>>>>>>>>>>>>>>>> small project/experiment of ours.
>>>>>>>>>>>>>>>>>>>>>> We definitely would like to give your patch a try. But 
>>>>>>>>>>>>>>>>>>>>>> could you please
>>>>>>>>>>>>>>>>>>>>>> explain your solution a little more?
>>>>>>>>>>>>>>>>>>>>>> You still would like to start one mpirun per mpi grid, 
>>>>>>>>>>>>>>>>>>>>>> and then have
>>>>>>>>>>>>>>>>>>>>>> processes started by us to join the MPI comm?
>>>>>>>>>>>>>>>>>>>>>> It is a good solution of course.
>>>>>>>>>>>>>>>>>>>>>> But it would be especially preferable to have one daemon 
>>>>>>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>>>>>>> persistently on our "entry" machine that can handle 
>>>>>>>>>>>>>>>>>>>>>> several mpi grid starts.
>>>>>>>>>>>>>>>>>>>>>> Can your patch help us this way too?
>>>>>>>>>>>>>>>>>>>>>> Thanks for your help!
>>>>>>>>>>>>>>>>>>>>>> Krzysztof
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 24 April 2010 03:51, Ralph Castain 
>>>>>>>>>>>>>>>>>>>>>> <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> In thinking about this, my proposed solution won't 
>>>>>>>>>>>>>>>>>>>>>>> entirely fix the
>>>>>>>>>>>>>>>>>>>>>>> problem - you'll still wind up with all those daemons. 
>>>>>>>>>>>>>>>>>>>>>>> I believe I can
>>>>>>>>>>>>>>>>>>>>>>> resolve that one as well, but it would require a patch.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Would you like me to send you something you could try? 
>>>>>>>>>>>>>>>>>>>>>>> Might take a couple
>>>>>>>>>>>>>>>>>>>>>>> of iterations to get it right...
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hmmm....I -think- this will work, but I cannot 
>>>>>>>>>>>>>>>>>>>>>>>> guarantee it:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. launch one process (can just be a spinner) using 
>>>>>>>>>>>>>>>>>>>>>>>> mpirun that includes
>>>>>>>>>>>>>>>>>>>>>>>> the following option:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> mpirun -report-uri file
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> where file is some filename that mpirun can create and 
>>>>>>>>>>>>>>>>>>>>>>>> insert its
>>>>>>>>>>>>>>>>>>>>>>>> contact info into it. This can be a relative or 
>>>>>>>>>>>>>>>>>>>>>>>> absolute path. This process
>>>>>>>>>>>>>>>>>>>>>>>> must remain alive throughout your application - 
>>>>>>>>>>>>>>>>>>>>>>>> doesn't matter what it does.
>>>>>>>>>>>>>>>>>>>>>>>> It's purpose is solely to keep mpirun alive.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 2. set OMPI_MCA_dpm_orte_server=FILE:file in your 
>>>>>>>>>>>>>>>>>>>>>>>> environment, where
>>>>>>>>>>>>>>>>>>>>>>>> "file" is the filename given above. This will tell 
>>>>>>>>>>>>>>>>>>>>>>>> your processes how to
>>>>>>>>>>>>>>>>>>>>>>>> find mpirun, which is acting as a meeting place to 
>>>>>>>>>>>>>>>>>>>>>>>> handle the connect/accept
>>>>>>>>>>>>>>>>>>>>>>>> operations
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Now run your processes, and have them connect/accept 
>>>>>>>>>>>>>>>>>>>>>>>> to each other.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The reason I cannot guarantee this will work is that 
>>>>>>>>>>>>>>>>>>>>>>>> these processes
>>>>>>>>>>>>>>>>>>>>>>>> will all have the same rank && name since they all 
>>>>>>>>>>>>>>>>>>>>>>>> start as singletons.
>>>>>>>>>>>>>>>>>>>>>>>> Hence, connect/accept is likely to fail.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> But it -might- work, so you might want to give it a 
>>>>>>>>>>>>>>>>>>>>>>>> try.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> To be more precise: by 'server process' I mean some 
>>>>>>>>>>>>>>>>>>>>>>>>> process that I
>>>>>>>>>>>>>>>>>>>>>>>>> could run once on my system and it could help in 
>>>>>>>>>>>>>>>>>>>>>>>>> creating those
>>>>>>>>>>>>>>>>>>>>>>>>> groups.
>>>>>>>>>>>>>>>>>>>>>>>>> My typical scenario is:
>>>>>>>>>>>>>>>>>>>>>>>>> 1. run N separate processes, each without mpirun
>>>>>>>>>>>>>>>>>>>>>>>>> 2. connect them into MPI group
>>>>>>>>>>>>>>>>>>>>>>>>> 3. do some job
>>>>>>>>>>>>>>>>>>>>>>>>> 4. exit all N processes
>>>>>>>>>>>>>>>>>>>>>>>>> 5. goto 1
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you Ralph for your explanation.
>>>>>>>>>>>>>>>>>>>>>>>>>> And, apart from that descriptors' issue, is there 
>>>>>>>>>>>>>>>>>>>>>>>>>> any other way to
>>>>>>>>>>>>>>>>>>>>>>>>>> solve my problem, i.e. to run separately a number of 
>>>>>>>>>>>>>>>>>>>>>>>>>> processes,
>>>>>>>>>>>>>>>>>>>>>>>>>> without mpirun and then to collect them into an MPI 
>>>>>>>>>>>>>>>>>>>>>>>>>> intracomm group?
>>>>>>>>>>>>>>>>>>>>>>>>>> If I for example would need to run some 'server 
>>>>>>>>>>>>>>>>>>>>>>>>>> process' (even using
>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun) for this task, that's OK. Any ideas?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Okay, but here is the problem. If you don't use 
>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun, and are not
>>>>>>>>>>>>>>>>>>>>>>>>>>> operating in an environment we support for "direct" 
>>>>>>>>>>>>>>>>>>>>>>>>>>> launch (i.e., starting
>>>>>>>>>>>>>>>>>>>>>>>>>>> processes outside of mpirun), then every one of 
>>>>>>>>>>>>>>>>>>>>>>>>>>> those processes thinks it is
>>>>>>>>>>>>>>>>>>>>>>>>>>> a singleton - yes?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> What you may not realize is that each singleton 
>>>>>>>>>>>>>>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>>>>>>>>>>>>> fork/exec's an orted daemon that is configured to 
>>>>>>>>>>>>>>>>>>>>>>>>>>> behave just like mpirun.
>>>>>>>>>>>>>>>>>>>>>>>>>>> This is required in order to support MPI-2 
>>>>>>>>>>>>>>>>>>>>>>>>>>> operations such as
>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Comm_spawn, MPI_Comm_connect/accept, etc.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> So if you launch 64 processes that think they are 
>>>>>>>>>>>>>>>>>>>>>>>>>>> singletons, then
>>>>>>>>>>>>>>>>>>>>>>>>>>> you have 64 copies of orted running as well. This 
>>>>>>>>>>>>>>>>>>>>>>>>>>> eats up a lot of file
>>>>>>>>>>>>>>>>>>>>>>>>>>> descriptors, which is probably why you are hitting 
>>>>>>>>>>>>>>>>>>>>>>>>>>> this 65 process limit -
>>>>>>>>>>>>>>>>>>>>>>>>>>> your system is probably running out of file 
>>>>>>>>>>>>>>>>>>>>>>>>>>> descriptors. You might check your
>>>>>>>>>>>>>>>>>>>>>>>>>>> system limits and see if you can get them revised upward.
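One quick way to check the descriptor limit from inside the processes
themselves is plain POSIX getrlimit() (equivalent to "ulimit -n", nothing
Open MPI specific):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the per-process limit on open file descriptors */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        printf("open files: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    }
    return 0;
}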
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I know. The problem is that I need to use 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> some special way for
>>>>>>>>>>>>>>>>>>>>>>>>>>>> running my processes provided by the environment 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> in which I'm
>>>>>>>>>>>>>>>>>>>>>>>>>>>> working
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and unfortunately I can't use mpirun.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Guess I don't understand why you can't use mpirun 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - all it does is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> start things, provide a means to forward io, etc. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> It mainly sits there
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> quietly without using any cpu unless required to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> support the job.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sounds like it would solve your problem. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Otherwise, I know of no
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> way to get all these processes into comm_world.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'd like to dynamically create a group of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> processes communicating
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI. Those processes need to be run without 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mpirun and create
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intracommunicator after the startup. Any ideas 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> how to do this
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> efficiently?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I came up with a solution in which the processes 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are connecting
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one by
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> one using MPI_Comm_connect, but unfortunately 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> all the processes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are already in the group need to call 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Comm_accept. This means
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> when the n-th process wants to connect I need to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> collect all the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> n-1
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> processes on the MPI_Comm_accept call. After I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> run about 40
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> every subsequent call takes more and more time, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which I'd like to
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> avoid.
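The incremental scheme described above, reconstructed as a sketch of the
accepting side (assumed from the description only; build_group, port and
nclients are placeholder names). Every current member has to take part in
each accept, which is why the cost grows with the group size:

#include <mpi.h>

/* 'port' comes from MPI_Open_port() on the root and is distributed out of
 * band; 'nclients' is how many processes are expected to join. */
static MPI_Comm build_group(char *port, int nclients)
{
    MPI_Comm group = MPI_COMM_SELF;      /* start with just this process */
    MPI_Comm intercomm, merged;
    int i;

    for (i = 0; i < nclients; ++i) {
        /* every process already in the group participates in the accept */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, group, &intercomm);

        /* fold the newcomer into a single intra-communicator */
        MPI_Intercomm_merge(intercomm, 0, &merged);
        MPI_Comm_disconnect(&intercomm);

        if (group != MPI_COMM_SELF)
            MPI_Comm_free(&group);
        group = merged;
    }
    return group;
}

Each newcomer would do the mirror image (MPI_Comm_connect plus a merge with
the opposite high value) and then keep looping over the remaining accepts.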
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Another problem in this solution is that when I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> try to connect
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 66-th
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> process the root of the existing group segfaults 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Comm_accept.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Maybe it's my bug, but it's weird as everything 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> works fine for at
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 65 processes. Is there any limitation I don't 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> know about?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> My last question is about MPI_COMM_WORLD. When I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> run my processes
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> without mpirun their MPI_COMM_WORLD is the same 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> as MPI_COMM_SELF.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Is
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> there any way to change MPI_COMM_WORLD and set 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> it to the
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> intracommunicator that I've created?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <client.c><server.c>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>> --
>> Edgar Gabriel
>> Assistant Professor
>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>> Department of Computer Science          University of Houston
>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
>>
>>
>>
> 

-- 
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
