Yeah, I’d put my money on a race condition under that scenario. I don’t have 
anything that large I can test on, but I’ll see what I can do.

> On Jun 11, 2015, at 1:17 PM, Leiter, Kenneth W CIV USARMY ARL (US) 
> <kenneth.w.leiter2....@mail.mil> wrote:
> 
> Yes, each parent launches ten children and no other parents participate in 
> that spawn (i.e. the spawn uses MPI_COMM_SELF as the communicator).
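> 
> For reference, a minimal C sketch of that pattern, just to make the spawn shape 
> concrete (the "./worker" executable name and the surrounding boilerplate are 
> placeholders, not the actual demo code):
> 
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Comm intercomm;
>         MPI_Init(&argc, &argv);
> 
>         /* Every parent rank does this independently; MPI_COMM_SELF makes the
>          * spawn a single-process collective, so the spawns from different
>          * parents proceed concurrently rather than as one joint operation. */
>         MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 10, MPI_INFO_NULL,
>                        0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
> 
>         MPI_Comm_disconnect(&intercomm);
>         MPI_Finalize();
>         return 0;
>     }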
> 
> No threading. 
> 
> I am using the example from 
> https://github.com/bfroehle/mpi4py/tree/master/demo/spawning 
> in lieu of my actual application, which has a lot more moving parts.
> 
> After rerunning many times, it sometimes completes successfully and other 
> times segfaults the daemon.
> 
> - Ken
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: Thursday, June 11, 2015 4:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
> 
> So to be clear: each parent launches 10 children, and no other parents 
> participate in that spawn?
> 
> And there is no threading in the app, yes?
> 
> 
>> On Jun 11, 2015, at 12:53 PM, Leiter, Kenneth W CIV USARMY ARL (US) 
>> <kenneth.w.leiter2....@mail.mil> wrote:
>> 
>> Howard,
>> 
>> I do not run into a problem when I have one parent spawning many children 
>> (tested up to 100 child ranks), but I am seeing the problem when I have, 
>> for example, 8 parents launching 10 children each.
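>> 
>> To make the shape of the failing case concrete (the executable name below is 
>> just a placeholder, not the actual command used): something like 
>> 
>>     mpirun -np 8 ./parent 
>> 
>> where each of the 8 parent ranks then calls MPI_Comm_spawn over MPI_COMM_SELF 
>> with maxprocs=10, for 80 children in total.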
>> 
>> - Ken
>> From: users [users-boun...@open-mpi.org] on behalf of Howard Pritchard 
>> [hpprit...@gmail.com]
>> Sent: Thursday, June 11, 2015 2:36 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>> 
>> Hi Ken,
>> 
>> Could you post the output of your ompi_info?
>> 
>> I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my environment on the NERSC 
>> system, with the following configure line:
>> 
>> ./configure --enable-mpi-java --prefix=my_favorite_install_location
>> 
>> The general rule of thumb on Crays with master (though not with older 
>> versions) is that you should be able to do a plain ./configure with an 
>> install location and be ready to go; no complicated platform files or the 
>> like are needed for a vanilla build.
>> 
>> As you're probably guessing, I'm going to say it works for me, at least up 
>> to 68 slave ranks.
>> 
>> I do notice there's some glitch with the mapping of the ranks though.  The 
>> binding logic seems
>> to think there's oversubscription of cores even when there should not be.  I 
>> had to use the
>> 
>> --bind-to none
>> 
>> option on the command line once I asked for more than 22 slave ranks. The 
>> Edison system has 24 cores per node.
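>> 
>> Concretely, an invocation of the form (the parent binary name here is just a 
>> placeholder):
>> 
>>     mpirun -np 1 --bind-to none ./cpi-master
>> 
>> avoided the spurious oversubscription complaint once more than 22 slave ranks 
>> were requested.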
>> 
>> Howard
>> 
>> 
>> 
>> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US) 
>> <kenneth.w.leiter2....@mail.mil>:
>> I will try on a non-Cray machine as well.
>> 
>> - Ken
>> 
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>> 
>> Hello Ken,
>> 
>> Could you give the details of the allocation request (qsub args) as well as 
>> the mpirun command line args? I'm trying to reproduce on the NERSC system.
>> 
>> It would also be interesting to know whether you get the same problem on a 
>> similarly sized non-Cray cluster, if you have access to one.
>> 
>> Howard
>> 
>> 
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>> 
>> 
>>         I don’t have a Cray, but let me see if I can reproduce this on 
>> something else
>> 
>>         > On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL (US) 
>>         > <kenneth.w.leiter2....@mail.mil> wrote:
>>         >
>>         > Hello,
>>         >
>>         > I am attempting to use the Open MPI development master for a code 
>>         > that uses dynamic process management (i.e. MPI_Comm_spawn) on our 
>>         > Cray XC40 at the Army Research Laboratory. After reading through the 
>>         > mailing list I came to the conclusion that the master branch is the 
>>         > only hope for getting this to work on the newer Cray machines.
>>         >
>>         > To test, I am using the cpi-master.c / cpi-worker.c example. The 
>>         > test works when executing on a small number of processors, five or 
>>         > fewer, but begins to fail with segmentation faults in orted when 
>>         > using more processors. Even with five or fewer processors, I am 
>>         > spreading the computation across more than one node. I am using the 
>>         > Cray uGNI BTL through the ALPS scheduler.
>>         >
>>         > I get a core file from orted and have tracked the segfault down to 
>>         > pmix_server_process_msgs.c:420, where req->proxy is NULL. I have 
>>         > tried reading the code to understand how this happens, but am unsure. 
>>         > I do see that in the if statement where I take the else branch, the 
>>         > other branch specifically checks "if (NULL == req->proxy)" - however, 
>>         > no such check is done in the else branch.
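>>         > 
>>         > Schematically, the shape I am describing looks something like the 
>>         > sketch below (a dummy stand-in for illustration only, not the actual 
>>         > pmix_server_process_msgs.c code):
>>         > 
>>         >     #include <stddef.h>
>>         > 
>>         >     typedef struct { int value; } proxy_t;
>>         >     typedef struct { proxy_t *proxy; } req_t;
>>         > 
>>         >     int process(req_t *req, int local)
>>         >     {
>>         >         if (local) {
>>         >             if (NULL == req->proxy) {  /* guarded branch */
>>         >                 return -1;
>>         >             }
>>         >             return req->proxy->value;
>>         >         } else {
>>         >             /* unguarded branch: crashes when req->proxy is NULL,
>>         >              * as it is in my core file */
>>         >             return req->proxy->value;
>>         >         }
>>         >     }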
>>         >
>>         > I have debug output dumped for the failing runs. I can provide the 
>>         > output along with ompi_info output and config.log to anyone who is 
>>         > interested.
>>         >
>>         > - Ken Leiter
>>         >