Re: [OMPI users] Dynamic processes connection and segfault on MPI_Comm_accept

Ralph Castain Fri, 23 Apr 2010 21:51:19 -0400

In thinking about this, my proposed solution won't entirely fix the problem - 
you'll still wind up with all those daemons. I believe I can resolve that one 
as well, but it would require a patch.


Would you like me to send you something you could try? Might take a couple of 
iterations to get it right...

On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:

> Hmmm....I -think- this will work, but I cannot guarantee it:
> 
> 1. launch one process (can just be a spinner) using mpirun that includes the 
> following option:
> 
> mpirun -report-uri file
> 
> where file is a filename that mpirun can create and write its contact info 
> into. This can be a relative or absolute path. This process must remain 
> alive throughout your application - doesn't matter what it does. Its 
> purpose is solely to keep mpirun alive.
> 
> 2. set OMPI_MCA_dpm_orte_server=FILE:file in your environment, where "file" 
> is the filename given above. This will tell your processes how to find 
> mpirun, which is acting as a meeting place to handle the connect/accept 
> operations.
> 
> Now run your processes, and have them connect/accept to each other.
> 
> The reason I cannot guarantee this will work is that these processes will all 
> have the same rank and name since they all start as singletons. Hence, 
> connect/accept is likely to fail.
> 
> But it -might- work, so you might want to give it a try.
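
Step 2 above can be scripted if the processes are started from a wrapper. The following is a minimal Python sketch under that assumption. The helper names rendezvous_env and uri_file_ready are hypothetical; only the variable name OMPI_MCA_dpm_orte_server and the FILE: prefix come from the steps above.

```python
import os


def rendezvous_env(uri_file, base_env=None):
    """Return a copy of the environment with the rendezvous variable set.

    uri_file is the same filename that was passed to `mpirun -report-uri`.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and FILE: prefix are taken from step 2 of the message.
    env["OMPI_MCA_dpm_orte_server"] = "FILE:" + uri_file
    return env


def uri_file_ready(uri_file):
    """True once the file exists and is non-empty.

    Assumption: a non-empty file means mpirun has finished writing its
    contact info; this is a heuristic, not something Open MPI guarantees.
    """
    return os.path.isfile(uri_file) and os.path.getsize(uri_file) > 0
```

The returned dictionary can be passed as the env argument of subprocess.Popen when starting each process. If the processes run in a different working directory than mpirun, an absolute path for the file is the safer choice.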
> 
> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
> 
>> To be more precise: by 'server process' I mean some process that I
>> could run once on my system and it could help in creating those
>> groups.
>> My typical scenario is:
>> 1. run N separate processes, each without mpirun
>> 2. connect them into an MPI group
>> 3. do some job
>> 4. exit all N processes
>> 5. goto 1
>> 
>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>> Thank you Ralph for your explanation.
>>> And, apart from that file descriptor issue, is there any other way to
>>> solve my problem, i.e. to run a number of processes separately,
>>> without mpirun, and then to collect them into an MPI intracomm group?
>>> If, for example, I needed to run some 'server process' (even using
>>> mpirun) for this task, that's OK. Any ideas?
>>> 
>>> Thanks,
>>> Grzegorz Maj
>>> 
>>> 
>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>> Okay, but here is the problem. If you don't use mpirun, and are not 
>>>> operating in an environment we support for "direct" launch (i.e., starting 
>>>> processes outside of mpirun), then every one of those processes thinks it 
>>>> is a singleton - yes?
>>>> 
>>>> What you may not realize is that each singleton immediately fork/exec's an 
>>>> orted daemon that is configured to behave just like mpirun. This is 
>>>> required in order to support MPI-2 operations such as MPI_Comm_spawn, 
>>>> MPI_Comm_connect/accept, etc.
>>>> 
>>>> So if you launch 64 processes that think they are singletons, then you 
>>>> have 64 copies of orted running as well. This eats up a lot of file 
>>>> descriptors, which is probably why you are hitting this 65 process limit - 
>>>> your system is probably running out of file descriptors. You might check 
>>>> your system limits and see if you can get them revised upward.
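
One way to check the per-process file descriptor limit mentioned above is Python's standard resource module (Unix only). This is a minimal sketch, not a diagnosis of the segfault; the function names are hypothetical, and it only inspects RLIMIT_NOFILE for the current process.

```python
import resource


def nofile_limits():
    """Return (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_to_hard():
    """Try to raise the soft limit up to the hard limit; return the new limits.

    Only affects this process and children it starts afterwards. Raising the
    hard limit itself normally needs administrator privileges.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse certain values; keep the current limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("after: ", raise_soft_to_hard())
```

From a shell, `ulimit -n` reports the same soft limit.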
>>>> 
>>>> 
>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>> 
>>>>> Yes, I know. The problem is that I need to use some special way for
>>>>> running my processes provided by the environment in which I'm working
>>>>> and unfortunately I can't use mpirun.
>>>>> 
>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>> Guess I don't understand why you can't use mpirun - all it does is start 
>>>>>> things, provide a means to forward io, etc. It mainly sits there quietly 
>>>>>> without using any cpu unless required to support the job.
>>>>>> 
>>>>>> Sounds like it would solve your problem. Otherwise, I know of no way to 
>>>>>> get all these processes into comm_world.
>>>>>> 
>>>>>> 
>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> I'd like to dynamically create a group of processes communicating via
>>>>>>> MPI. Those processes need to be run without mpirun and create an
>>>>>>> intracommunicator after startup. Any ideas how to do this
>>>>>>> efficiently?
>>>>>>> I came up with a solution in which the processes are connecting one by
>>>>>>> one using MPI_Comm_connect, but unfortunately all the processes that
>>>>>>> are already in the group need to call MPI_Comm_accept. This means that
>>>>>>> when the n-th process wants to connect I need to collect all the n-1
>>>>>>> processes on the MPI_Comm_accept call. After I run about 40 processes
>>>>>>> every subsequent call takes more and more time, which I'd like to
>>>>>>> avoid.
>>>>>>> Another problem with this solution is that when I try to connect the 66th
>>>>>>> process, the root of the existing group segfaults on MPI_Comm_accept.
>>>>>>> Maybe it's my bug, but it's weird as everything works fine for at most
>>>>>>> 65 processes. Is there any limitation I don't know about?
>>>>>>> My last question is about MPI_COMM_WORLD. When I run my processes
>>>>>>> without mpirun their MPI_COMM_WORLD is the same as MPI_COMM_SELF. Is
>>>>>>> there any way to change MPI_COMM_WORLD and set it to the
>>>>>>> intracommunicator that I've created?
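
The growing cost of the one-by-one connect scheme described in this question can be made concrete by counting calls. The sketch below assumes that when the k-th process joins, each of the k-1 processes already in the group calls MPI_Comm_accept once. It counts calls only and is not a timing model; the function names are hypothetical.

```python
def accepts_for_join(k):
    """Number of existing processes that must call accept when process k joins."""
    if k < 1:
        raise ValueError("process index starts at 1")
    return k - 1


def total_accepts(n):
    """Total accept calls needed to build a group of n processes one by one."""
    return sum(accepts_for_join(k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 40, 65):
        print(n, total_accepts(n))
```

The total is n(n-1)/2, so building a group of 65 processes this way takes 2080 accept calls in all.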
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Grzegorz Maj
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
