My start script looks almost exactly the same as the one published by
Edgar, i.e. the processes are started one by one, with no delay.

2010/7/20 Ralph Castain <r...@open-mpi.org>:
> Grzegorz: something occurred to me. When you start all these processes, how 
> are you staggering their wireup? Are they flooding us, or are you 
> time-shifting them a little?
>
>
> On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
>
>> Hm, so I am not sure how to approach this. First of all, the test case
>> works for me. I used up to 80 clients, and for both optimized and
>> non-optimized compilation. I ran the tests with trunk (not with 1.4
>> series, but the communicator code is identical in both cases). Clearly,
>> the patch from Ralph is necessary to make it work.
>>
>> Additionally, I went through the communicator creation code for dynamic
>> communicators trying to find spots that could create problems. The only
>> place where I found the number 64 appearing is the Fortran-to-C mapping
>> arrays (e.g. for communicators), where the initial size of the table is
>> 64. I looked twice over the pointer-array code to see whether we could
>> have a problem there (since it is a key piece of the cid allocation code
>> for communicators), but I am fairly confident that it is correct.
>>
>> Note that we have other (non-dynamic) tests where comm_set is called
>> 100,000 times, and the code per se does not seem to have a problem with
>> being called that often. So I am not sure what else to look at.
>>
>> Edgar
>>
>>
>>
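The pointer-array Edgar refers to is a grow-on-demand table, so an initial size of 64 is not by itself a limit. Below is a minimal, schematic sketch of that pattern; the names are invented for illustration, and this is not the actual Open MPI pointer-array or Fortran-to-C handle table code.

/* Schematic sketch only -- invented names, not the real Open MPI code.
 * Illustrates a table that starts with 64 slots and grows on demand,
 * so running past the initial size is harmless. */
#include <stdlib.h>

typedef struct {
    void  **slots;
    size_t  size;   /* current capacity, initially 64 */
    size_t  used;   /* number of occupied slots */
} toy_pointer_array_t;

static int toy_pa_init(toy_pointer_array_t *pa)
{
    pa->size  = 64;                          /* initial table size */
    pa->used  = 0;
    pa->slots = calloc(pa->size, sizeof(void *));
    return pa->slots ? 0 : -1;
}

/* Add an entry, growing the table when it is full; returns the new
 * index or -1 on allocation failure.  The bounds check before the
 * write is the important part. */
static int toy_pa_add(toy_pointer_array_t *pa, void *item)
{
    if (pa->used == pa->size) {
        size_t  newsize = 2 * pa->size;
        void  **tmp = realloc(pa->slots, newsize * sizeof(void *));
        if (NULL == tmp) return -1;
        for (size_t i = pa->size; i < newsize; ++i) tmp[i] = NULL;
        pa->slots = tmp;
        pa->size  = newsize;
    }
    pa->slots[pa->used] = item;
    return (int)pa->used++;
}

int main(void)
{
    toy_pointer_array_t pa;
    if (toy_pa_init(&pa) != 0) return 1;
    for (int i = 0; i < 65; ++i)      /* the 65th add just grows the table */
        toy_pa_add(&pa, &pa);
    free(pa.slots);
    return 0;
}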
>> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>>> As far as I can tell, it appears the problem is somewhere in our 
>>> communicator setup. The people knowledgeable in that area are going to look 
>>> into it later this week.
>>>
>>> I'm creating a ticket to track the problem and will copy you on it.
>>>
>>>
>>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>>>
>>>>
>>>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>>>
>>>>> Bad news..
>>>>> I've tried the latest patch with and without the prior one, but it
>>>>> hasn't changed anything. I've also tried using the old code but with
>>>>> the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't
>>>>> help.
>>>>> While looking through the sources of openmpi-1.4.2 I couldn't find any
>>>>> call to the function ompi_dpm_base_mark_dyncomm.
>>>>
>>>> It isn't directly called - it shows up in ompi_comm_set as 
>>>> ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but I 
>>>> guess something else is also being hit. Have to look further...
>>>>
>>>>
>>>>>
>>>>>
>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>> Just so you don't have to wait for the 1.4.3 release, here is the patch 
>>>>>> (doesn't include the prior patch).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>>>>>
>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>> Dug around a bit and found the problem!!
>>>>>>>>
>>>>>>>> I have no idea who did this or why, but somebody set a limit of
>>>>>>>> 64 separate jobids in the dynamic init called by ompi_comm_set, which
>>>>>>>> builds the intercommunicator. Unfortunately, the array size is
>>>>>>>> hard-wired, and the code never checks it before adding to the array.
>>>>>>>>
>>>>>>>> So after 64 calls to connect_accept, you are overwriting other areas 
>>>>>>>> of the code. As you found, hitting 66 causes it to segfault.
>>>>>>>>
>>>>>>>> I'll fix this on the developer's trunk (I'll also add that original 
>>>>>>>> patch to it). Rather than my searching this thread in detail, can you 
>>>>>>>> remind me what version you are using so I can patch it too?
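To make the failure mode concrete, here is a small illustrative sketch of the kind of bug described above: a hard-wired 64-entry jobid table written to without a bounds check, so the 65th distinct jobid silently corrupts neighbouring memory. All names are invented; this is not the actual ompi_comm_set/dpm code, and the "checked" variant just shows the obvious bounds check.

/* Illustrative sketch only -- invented names, not the real OMPI code. */
#include <stdint.h>
#include <stdio.h>

#define MAX_DYN_JOBIDS 64                  /* hard-wired limit */

static uint32_t dyn_jobids[MAX_DYN_JOBIDS];
static int      num_dyn_jobids = 0;

/* Buggy pattern: the 65th distinct jobid is written past the end of
 * the array, trashing whatever happens to live next to it. */
static void toy_mark_jobid_buggy(uint32_t jobid)
{
    for (int i = 0; i < num_dyn_jobids; i++)
        if (dyn_jobids[i] == jobid) return;      /* already known */
    dyn_jobids[num_dyn_jobids++] = jobid;        /* no bounds check! */
}

/* Checked pattern: error out (or grow the table) instead of overrunning. */
static int toy_mark_jobid_checked(uint32_t jobid)
{
    for (int i = 0; i < num_dyn_jobids; i++)
        if (dyn_jobids[i] == jobid) return 0;
    if (num_dyn_jobids >= MAX_DYN_JOBIDS) {
        fprintf(stderr, "too many dynamic jobids (limit %d)\n", MAX_DYN_JOBIDS);
        return -1;                               /* error out, don't corrupt */
    }
    dyn_jobids[num_dyn_jobids++] = jobid;
    return 0;
}

int main(void)
{
    for (uint32_t j = 0; j < 65; ++j)      /* 65 distinct jobids */
        (void)toy_mark_jobid_checked(j);   /* reports an error on the 65th */
    (void)toy_mark_jobid_buggy;            /* unused here on purpose */
    return 0;
}

An overrun like this also helps explain the later observation that rebuilding with -g makes the crash move or disappear: the memory layout changes, so something less critical ends up being overwritten.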
>>>>>>>
>>>>>>> I'm using 1.4.2
>>>>>>> Thanks a lot, and I'm looking forward to the patch.
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for your patience with this!
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>>>>>>>
>>>>>>>>> 1024 is not the problem: changing it to 2048 hasn't changed anything.
>>>>>>>>> Following your advice I've run my process using gdb. Unfortunately I
>>>>>>>>> didn't get anything more than:
>>>>>>>>>
>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
>>>>>>>>> 0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>
>>>>>>>>> (gdb) bt
>>>>>>>>> #0  0xf7f39905 in ompi_comm_set () from 
>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>> #1  0xf7e3ba95 in connect_accept () from
>>>>>>>>> /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
>>>>>>>>> #2  0xf7f62013 in PMPI_Comm_connect () from 
>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at 
>>>>>>>>> client.c:43
>>>>>>>>>
>>>>>>>>> What's more: when I added a breakpoint on ompi_comm_set in the 66th
>>>>>>>>> process and stepped through a couple of instructions, one of the other
>>>>>>>>> processes crashed (as usual, in ompi_comm_set) before the 66th did.
>>>>>>>>>
>>>>>>>>> Finally I decided to recompile openmpi using the -g flag for gcc. In
>>>>>>>>> this case the 66-process issue is gone! I was running my applications
>>>>>>>>> exactly the same way as before (without even recompiling them) and
>>>>>>>>> successfully ran over 130 processes.
>>>>>>>>> When switching back to the openmpi build without -g, it segfaults again.
>>>>>>>>>
>>>>>>>>> Any ideas? I'm really confused.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>> I would guess the #files limit of 1024. However, if it behaves the 
>>>>>>>>>> same way when spread across multiple machines, I would suspect it is 
>>>>>>>>>> somewhere in your program itself. Given that the segfault is in your 
>>>>>>>>>> process, can you use gdb to look at the core file and see where and 
>>>>>>>>>> why it fails?
>>>>>>>>>>
>>>>>>>>>> On Jul 7, 2010, at 10:17 AM, Grzegorz Maj wrote:
>>>>>>>>>>
>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>
>>>>>>>>>>>> On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>> sorry for the late response, but I couldn't find free time to play
>>>>>>>>>>>>> with this. Finally I've applied the patch you prepared. I've 
>>>>>>>>>>>>> launched
>>>>>>>>>>>>> my processes in the way you've described and I think it's working 
>>>>>>>>>>>>> as
>>>>>>>>>>>>> you expected. None of my processes runs the orted daemon and they 
>>>>>>>>>>>>> can
>>>>>>>>>>>>> perform MPI operations. Unfortunately I'm still hitting the
>>>>>>>>>>>>> 65-process issue :(
>>>>>>>>>>>>> Maybe I'm doing something wrong.
>>>>>>>>>>>>> I attach my source code. If anybody could have a look at this, I 
>>>>>>>>>>>>> would
>>>>>>>>>>>>> be grateful.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I run that code with clients_count <= 65 everything works 
>>>>>>>>>>>>> fine:
>>>>>>>>>>>>> all the processes create a common grid, exchange some information 
>>>>>>>>>>>>> and
>>>>>>>>>>>>> disconnect.
>>>>>>>>>>>>> When I set clients_count > 65 the 66th process crashes on
>>>>>>>>>>>>> MPI_Comm_connect (segmentation fault).
>>>>>>>>>>>>
>>>>>>>>>>>> I didn't have time to check the code, but my guess is that you are 
>>>>>>>>>>>> still hitting some kind of file descriptor or other limit. Check 
>>>>>>>>>>>> to see what your limits are - usually "ulimit" will tell you.
>>>>>>>>>>>
>>>>>>>>>>> My limits are:
>>>>>>>>>>> time(seconds)        unlimited
>>>>>>>>>>> file(blocks)         unlimited
>>>>>>>>>>> data(kb)             unlimited
>>>>>>>>>>> stack(kb)            10240
>>>>>>>>>>> coredump(blocks)     0
>>>>>>>>>>> memory(kb)           unlimited
>>>>>>>>>>> locked memory(kb)    64
>>>>>>>>>>> process              200704
>>>>>>>>>>> nofiles              1024
>>>>>>>>>>> vmemory(kb)          unlimited
>>>>>>>>>>> locks                unlimited
>>>>>>>>>>>
>>>>>>>>>>> Which one do you think could be responsible for that?
>>>>>>>>>>>
>>>>>>>>>>> I tried running all 66 processes on one machine and also spreading
>>>>>>>>>>> them across several machines, and it always crashes the same way on
>>>>>>>>>>> the 66th process.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another thing I would like to know is whether it's normal that any
>>>>>>>>>>>>> of my processes calling MPI_Comm_connect or MPI_Comm_accept while
>>>>>>>>>>>>> the other side is not ready eats up a full CPU.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes - the waiting process is polling in a tight loop waiting for 
>>>>>>>>>>>> the connection to be made.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any help would be appreciated,
>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2010/4/24 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>> Actually, OMPI is distributed with a daemon that does pretty 
>>>>>>>>>>>>>> much what you
>>>>>>>>>>>>>> want. Check out "man ompi-server". I originally wrote that code 
>>>>>>>>>>>>>> to support
>>>>>>>>>>>>>> cross-application MPI publish/subscribe operations, but we can 
>>>>>>>>>>>>>> utilize it
>>>>>>>>>>>>>> here too. Have to blame me for not making it more publicly known.
>>>>>>>>>>>>>> The attached patch upgrades ompi-server and modifies the 
>>>>>>>>>>>>>> singleton startup
>>>>>>>>>>>>>> to provide your desired support. This solution works in the 
>>>>>>>>>>>>>> following
>>>>>>>>>>>>>> manner:
>>>>>>>>>>>>>> 1. launch "ompi-server -report-uri <filename>". This starts a 
>>>>>>>>>>>>>> persistent
>>>>>>>>>>>>>> daemon called "ompi-server" that acts as a rendezvous point for
>>>>>>>>>>>>>> independently started applications.  The problem with starting 
>>>>>>>>>>>>>> different
>>>>>>>>>>>>>> applications and wanting them to MPI connect/accept lies in the 
>>>>>>>>>>>>>> need to have
>>>>>>>>>>>>>> the applications find each other. If they can't discover contact 
>>>>>>>>>>>>>> info for
>>>>>>>>>>>>>> the other app, then they can't wire up their interconnects. The
>>>>>>>>>>>>>> "ompi-server" tool provides that rendezvous point. I don't like 
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>> comm_accept segfaulted - it should have just errored out.
>>>>>>>>>>>>>> 2. set "OMPI_MCA_orte_server=file:<filename>" in the environment 
>>>>>>>>>>>>>> where you
>>>>>>>>>>>>>> will start your processes. This will allow your singleton 
>>>>>>>>>>>>>> processes to find
>>>>>>>>>>>>>> the ompi-server. I also automatically set the envar that connects
>>>>>>>>>>>>>> the MPI publish/subscribe system for you.
>>>>>>>>>>>>>> 3. run your processes. As they think they are singletons, they 
>>>>>>>>>>>>>> will detect
>>>>>>>>>>>>>> the presence of the above envar and automatically connect 
>>>>>>>>>>>>>> themselves to the
>>>>>>>>>>>>>> "ompi-server" daemon. This provides each process with the 
>>>>>>>>>>>>>> ability to perform
>>>>>>>>>>>>>> any MPI-2 operation.
>>>>>>>>>>>>>> I tested this on my machines and it worked, so hopefully it will 
>>>>>>>>>>>>>> meet your
>>>>>>>>>>>>>> needs. You only need to run one "ompi-server", period, so long as 
>>>>>>>>>>>>>> you locate
>>>>>>>>>>>>>> it where all of the processes can find the contact file and can 
>>>>>>>>>>>>>> open a TCP
>>>>>>>>>>>>>> socket to the daemon. There is a way to knit multiple 
>>>>>>>>>>>>>> ompi-servers into a
>>>>>>>>>>>>>> broader network (e.g., to connect processes that cannot directly 
>>>>>>>>>>>>>> access a
>>>>>>>>>>>>>> server due to network segmentation), but it's a tad tricky - let 
>>>>>>>>>>>>>> me know if
>>>>>>>>>>>>>> you require it and I'll try to help.
>>>>>>>>>>>>>> If you have trouble wiring them all into a single communicator, 
>>>>>>>>>>>>>> you might
>>>>>>>>>>>>>> ask separately about that and see if one of our MPI experts can 
>>>>>>>>>>>>>> provide
>>>>>>>>>>>>>> advice (I'm just the RTE grunt).
>>>>>>>>>>>>>> HTH - let me know how this works for you and I'll incorporate it 
>>>>>>>>>>>>>> into future
>>>>>>>>>>>>>> OMPI releases.
>>>>>>>>>>>>>> Ralph
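From the application side, the rendezvous described above can be exercised through the MPI-2 name-publishing calls, which is the publish/subscribe system the envar wires up to ompi-server. The sketch below is one plausible shape for it, assuming ompi-server is running and OMPI_MCA_orte_server is set as described; the service name "my_rendezvous" and the server/client switch are invented, and this is not the client.c/server.c attached later in this thread.

/* Hedged sketch of an MPI-2 rendezvous through published names.
 * Assumes ompi-server is running and OMPI_MCA_orte_server points at
 * its URI file, so MPI_Publish_name/MPI_Lookup_name resolve across
 * independently started singletons. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter, intra;

    MPI_Init(&argc, &argv);

    if (argc > 1 && 0 == strcmp(argv[1], "server")) {
        /* "Server" singleton: open a port, publish it, wait for one peer. */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my_rendezvous", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("my_rendezvous", MPI_INFO_NULL, port);
        MPI_Close_port(port);
        MPI_Intercomm_merge(inter, 0, &intra);   /* server ranks low */
    } else {
        /* "Client" singleton: look the port up and connect.  In a real
         * run the lookup may need to be retried until the name exists. */
        MPI_Lookup_name("my_rendezvous", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Intercomm_merge(inter, 1, &intra);   /* client ranks high */
    }

    int rank, size;
    MPI_Comm_rank(intra, &rank);
    MPI_Comm_size(intra, &size);
    printf("merged intracomm: rank %d of %d\n", rank, size);

    MPI_Comm_disconnect(&inter);
    MPI_Comm_free(&intra);
    MPI_Finalize();
    return 0;
}

The same pair of calls generalizes to many clients by having the accepting side loop over accept and merge, which is essentially the incremental pattern discussed further down in this thread.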
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Apr 24, 2010, at 1:49 AM, Krzysztof Zarzycki wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>> I'm Krzysztof and I'm working with Grzegorz Maj on this small
>>>>>>>>>>>>>> project/experiment of ours.
>>>>>>>>>>>>>> We definitely would like to give your patch a try. But could you 
>>>>>>>>>>>>>> please
>>>>>>>>>>>>>> explain your solution a little more?
>>>>>>>>>>>>>> You would still like to start one mpirun per MPI grid, and then
>>>>>>>>>>>>>> have the processes started by us join the MPI comm?
>>>>>>>>>>>>>> It is a good solution of course.
>>>>>>>>>>>>>> But it would be especially preferable to have one daemon running
>>>>>>>>>>>>>> persistently on our "entry" machine that can handle several MPI
>>>>>>>>>>>>>> grid starts.
>>>>>>>>>>>>>> Can your patch help us with this too?
>>>>>>>>>>>>>> Thanks for your help!
>>>>>>>>>>>>>> Krzysztof
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 24 April 2010 03:51, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In thinking about this, my proposed solution won't entirely fix 
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> problem - you'll still wind up with all those daemons. I 
>>>>>>>>>>>>>>> believe I can
>>>>>>>>>>>>>>> resolve that one as well, but it would require a patch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would you like me to send you something you could try? Might 
>>>>>>>>>>>>>>> take a couple
>>>>>>>>>>>>>>> of iterations to get it right...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hmmm....I -think- this will work, but I cannot guarantee it:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. launch one process (can just be a spinner) using mpirun 
>>>>>>>>>>>>>>>> that includes
>>>>>>>>>>>>>>>> the following option:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> mpirun -report-uri file
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> where file is some filename that mpirun can create and insert
>>>>>>>>>>>>>>>> its contact info into. This can be a relative or absolute path. 
>>>>>>>>>>>>>>>> This process
>>>>>>>>>>>>>>>> must remain alive throughout your application - doesn't matter 
>>>>>>>>>>>>>>>> what it does.
>>>>>>>>>>>>>>>> Its purpose is solely to keep mpirun alive.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. set OMPI_MCA_dpm_orte_server=FILE:file in your environment, 
>>>>>>>>>>>>>>>> where
>>>>>>>>>>>>>>>> "file" is the filename given above. This will tell your 
>>>>>>>>>>>>>>>> processes how to
>>>>>>>>>>>>>>>> find mpirun, which is acting as a meeting place to handle the 
>>>>>>>>>>>>>>>> connect/accept
>>>>>>>>>>>>>>>> operations
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Now run your processes, and have them connect/accept to each 
>>>>>>>>>>>>>>>> other.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The reason I cannot guarantee this will work is that these 
>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>> will all have the same rank && name since they all start as 
>>>>>>>>>>>>>>>> singletons.
>>>>>>>>>>>>>>>> Hence, connect/accept is likely to fail.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But it -might- work, so you might want to give it a try.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> To be more precise: by 'server process' I mean some process 
>>>>>>>>>>>>>>>>> that I
>>>>>>>>>>>>>>>>> could run once on my system and it could help in creating 
>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>> groups.
>>>>>>>>>>>>>>>>> My typical scenario is:
>>>>>>>>>>>>>>>>> 1. run N separate processes, each without mpirun
>>>>>>>>>>>>>>>>> 2. connect them into MPI group
>>>>>>>>>>>>>>>>> 3. do some job
>>>>>>>>>>>>>>>>> 4. exit all N processes
>>>>>>>>>>>>>>>>> 5. goto 1
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>>>>>>>>>>>>>>>>> Thank you Ralph for your explanation.
>>>>>>>>>>>>>>>>>> And, apart from that descriptors issue, is there any other
>>>>>>>>>>>>>>>>>> way to solve my problem, i.e. to start a number of processes
>>>>>>>>>>>>>>>>>> separately, without mpirun, and then collect them into an MPI
>>>>>>>>>>>>>>>>>> intracomm group?
>>>>>>>>>>>>>>>>>> If, for example, I would need to run some 'server process'
>>>>>>>>>>>>>>>>>> (even using mpirun) for this task, that's OK. Any ideas?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>> Okay, but here is the problem. If you don't use mpirun, and 
>>>>>>>>>>>>>>>>>>> are not
>>>>>>>>>>>>>>>>>>> operating in an environment we support for "direct" launch 
>>>>>>>>>>>>>>>>>>> (i.e., starting
>>>>>>>>>>>>>>>>>>> processes outside of mpirun), then every one of those 
>>>>>>>>>>>>>>>>>>> processes thinks it is
>>>>>>>>>>>>>>>>>>> a singleton - yes?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What you may not realize is that each singleton immediately
>>>>>>>>>>>>>>>>>>> fork/exec's an orted daemon that is configured to behave 
>>>>>>>>>>>>>>>>>>> just like mpirun.
>>>>>>>>>>>>>>>>>>> This is required in order to support MPI-2 operations such 
>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>> MPI_Comm_spawn, MPI_Comm_connect/accept, etc.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So if you launch 64 processes that think they are 
>>>>>>>>>>>>>>>>>>> singletons, then
>>>>>>>>>>>>>>>>>>> you have 64 copies of orted running as well. This eats up a 
>>>>>>>>>>>>>>>>>>> lot of file
>>>>>>>>>>>>>>>>>>> descriptors, which is probably why you are hitting this
>>>>>>>>>>>>>>>>>>> 65-process limit -
>>>>>>>>>>>>>>>>>>> your system is probably running out of file descriptors.
>>>>>>>>>>>>>>>>>>> You might check your
>>>>>>>>>>>>>>>>>>> system limits and see if you can get them revised upward.
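As a companion to the ulimit output earlier in the thread, the per-process descriptor limit can also be inspected, and raised up to the hard limit, from inside a process with the POSIX getrlimit/setrlimit calls; a small sketch follows. Going beyond the hard limit still requires administrator action (e.g. limits.conf).

/* Small sketch: inspect and (attempt to) raise the open-file limit
 * from inside a process, as an alternative to the shell's ulimit. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open files: soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);

    /* Try to raise the soft limit to the hard limit; anything beyond
     * the hard limit needs root / limits.conf changes. */
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");

    return 0;
}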
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Yes, I know. The problem is that I need to use a special
>>>>>>>>>>>>>>>>>>>> way of running my processes, provided by the environment in
>>>>>>>>>>>>>>>>>>>> which I'm working, and unfortunately I can't use mpirun.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>> Guess I don't understand why you can't use mpirun - all 
>>>>>>>>>>>>>>>>>>>>> it does is
>>>>>>>>>>>>>>>>>>>>> start things, provide a means to forward io, etc. It 
>>>>>>>>>>>>>>>>>>>>> mainly sits there
>>>>>>>>>>>>>>>>>>>>> quietly without using any cpu unless required to support 
>>>>>>>>>>>>>>>>>>>>> the job.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Sounds like it would solve your problem. Otherwise, I 
>>>>>>>>>>>>>>>>>>>>> know of no
>>>>>>>>>>>>>>>>>>>>> way to get all these processes into comm_world.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> I'd like to dynamically create a group of processes 
>>>>>>>>>>>>>>>>>>>>>> communicating
>>>>>>>>>>>>>>>>>>>>>> via
>>>>>>>>>>>>>>>>>>>>>> MPI. Those processes need to be run without mpirun and 
>>>>>>>>>>>>>>>>>>>>>> create
>>>>>>>>>>>>>>>>>>>>>> an intracommunicator after startup. Any ideas how to do 
>>>>>>>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>>>> efficiently?
>>>>>>>>>>>>>>>>>>>>>> I came up with a solution in which the processes are 
>>>>>>>>>>>>>>>>>>>>>> connecting
>>>>>>>>>>>>>>>>>>>>>> one by
>>>>>>>>>>>>>>>>>>>>>> one using MPI_Comm_connect, but unfortunately all the 
>>>>>>>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>> are already in the group need to call MPI_Comm_accept. 
>>>>>>>>>>>>>>>>>>>>>> This means
>>>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>>> when the n-th process wants to connect I need to collect 
>>>>>>>>>>>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>>>>>>> n-1
>>>>>>>>>>>>>>>>>>>>>> processes on the MPI_Comm_accept call. After I run about 
>>>>>>>>>>>>>>>>>>>>>> 40
>>>>>>>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>>>>>>>> every subsequent call takes more and more time, which 
>>>>>>>>>>>>>>>>>>>>>> I'd like to
>>>>>>>>>>>>>>>>>>>>>> avoid.
>>>>>>>>>>>>>>>>>>>>>> Another problem in this solution is that when I try to 
>>>>>>>>>>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>>>>>>>> the 66th
>>>>>>>>>>>>>>>>>>>>>> process, the root of the existing group segfaults on
>>>>>>>>>>>>>>>>>>>>>> MPI_Comm_accept.
>>>>>>>>>>>>>>>>>>>>>> Maybe it's my bug, but it's weird as everything works 
>>>>>>>>>>>>>>>>>>>>>> fine for at
>>>>>>>>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>>>>>>> 65 processes. Is there any limitation I don't know about?
>>>>>>>>>>>>>>>>>>>>>> My last question is about MPI_COMM_WORLD. When I run my 
>>>>>>>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>>>>>>>> without mpirun their MPI_COMM_WORLD is the same as 
>>>>>>>>>>>>>>>>>>>>>> MPI_COMM_SELF.
>>>>>>>>>>>>>>>>>>>>>> Is
>>>>>>>>>>>>>>>>>>>>>> there any way to change MPI_COMM_WORLD and set it to the
>>>>>>>>>>>>>>>>>>>>>> intracommunicator that I've created?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
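For reference, the incremental pattern described in this message (every new process connects, all current members accept, and the intracommunicator is regrown by a merge) looks roughly like the sketch below. It is a schematic reconstruction, not the attached code: the helper names are invented, and the port string is assumed to come from MPI_Open_port on the root and be distributed out of band. It also makes the cost visible: each join is a collective over all existing members, which is why later joins get slower.

/* Schematic reconstruction of the "grow the group one process at a
 * time" pattern discussed above; invented helper names, not the
 * attached client.c/server.c. */
#include <mpi.h>

/* Called collectively by every process already in the group.  'port'
 * comes from MPI_Open_port on the root and is distributed out of band. */
MPI_Comm group_accept_one(MPI_Comm group, const char *port)
{
    MPI_Comm inter, larger;
    /* Collective over 'group': this is why all n-1 members must sit in
     * MPI_Comm_accept while the n-th process joins. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, group, &inter);
    MPI_Intercomm_merge(inter, 0, &larger);    /* existing group ranks low */
    MPI_Comm_disconnect(&inter);
    return larger;   /* caller replaces (and eventually frees) the old comm */
}

/* Called by the joining process, which starts as a singleton. */
MPI_Comm join_group(const char *port)
{
    MPI_Comm inter, group;
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, 1, &group);     /* newcomer ranks high */
    MPI_Comm_disconnect(&inter);
    return group;    /* from now on it must take part in further accepts */
}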
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> <client.c><server.c>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>
>
>
>
