Yeah, I’d put my money on a race condition under that scenario. I don’t have anything that large I can test on, but I’ll see what I can do.
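For anyone trying to reproduce this, a minimal sketch of the spawn pattern
Ken describes below, along the lines of the cpi-master.c demo (the worker
binary name and error-code handling are my assumptions, not his actual
code):

/* spawn-parent.c: every parent rank performs its own independent spawn
 * of ten children over MPI_COMM_SELF, so no other parent participates.
 * Run with, e.g., "mpirun -np 8 ./spawn-parent" to get the
 * 8-parents-by-10-children case. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int rank, errcodes[10];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* root is rank 0 of MPI_COMM_SELF, i.e. this process itself */
    MPI_Comm_spawn("./spawn-child", MPI_ARGV_NULL, 10, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, errcodes);

    printf("parent %d spawned its children\n", rank);

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}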
> On Jun 11, 2015, at 1:17 PM, Leiter, Kenneth W CIV USARMY ARL (US)
> <kenneth.w.leiter2....@mail.mil> wrote:
>
> Yes, each parent launches ten children and no other parents participate in
> that spawn (i.e. the spawn uses MPI_COMM_SELF as the communicator).
>
> No threading.
>
> I am using the example from
> https://github.com/bfroehle/mpi4py/tree/master/demo/spawning
> in lieu of my actual application, which has a lot more moving parts.
>
> After rerunning many times, it sometimes completes successfully and
> other times seg faults the daemon.
>
> - Ken
>
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Thursday, June 11, 2015 4:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>
> So to be clear: each parent launches 10 children, and no other parents
> participate in that spawn?
>
> And there is no threading in the app, yes?
>
>
>> On Jun 11, 2015, at 12:53 PM, Leiter, Kenneth W CIV USARMY ARL (US)
>> <kenneth.w.leiter2....@mail.mil> wrote:
>>
>> Howard,
>>
>> I do not run into a problem when I have one parent spawning many children
>> (tested up to 100 child ranks), but am seeing the problem when I have,
>> for example, 8 parents launching 10 children each.
>>
>> - Ken
>>
>> From: users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [hpprit...@gmail.com]
>> Sent: Thursday, June 11, 2015 2:36 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hi Ken,
>>
>> Could you post the output of your ompi_info?
>>
>> I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my env on the NERSC
>> system, with the following configure line:
>>
>> ./configure --enable-mpi-java --prefix=my_favorite_install_location
>>
>> The general rule of thumb on Crays with master (not with older versions,
>> though) is that you should be able to do a ./configure (install location)
>> and you're ready to go; no need for complicated platform files, etc.,
>> to just build vanilla.
>>
>> As you're probably guessing, I'm going to say it works for me, at least
>> up to 68 slave ranks.
>>
>> I do notice there's some glitch with the mapping of the ranks, though.
>> The binding logic seems to think there's oversubscription of cores even
>> when there should not be. I had to use the
>>
>> --bind-to none
>>
>> option on the command line once I asked for more than 22 slave ranks.
>> The Edison system has 24 cores/node.
>>
>> Howard
>>
>>
>> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US)
>> <kenneth.w.leiter2....@mail.mil>:
>>
>> I will try on a non-Cray machine as well.
>>
>> - Ken
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hello Ken,
>>
>> Could you give the details of the allocation request (qsub args) as well
>> as the mpirun command line args? I'm trying to reproduce on the NERSC
>> system.
>>
>> It would be interesting, if you have access to a similarly sized non-Cray
>> cluster, to see whether you get the same problems.
>>
>> Howard
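For completeness, a matching child side in the spirit of the cpi-worker.c
demo referenced above (a sketch only; the real demo also performs the pi
computation):

/* spawn-child.c: a spawned child retrieves the intercommunicator to its
 * parent, does its work, and disconnects. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL == parent) {
        fprintf(stderr, "must be started via MPI_Comm_spawn\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("child %d connected to its parent\n", rank);

    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}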
>>
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>> I don’t have a Cray, but let me see if I can reproduce this on
>> something else.
>>
>> > On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL (US)
>> > <kenneth.w.leiter2....@mail.mil> wrote:
>> >
>> > Hello,
>> >
>> > I am attempting to use the Open MPI development master for a code that
>> > uses dynamic process management (i.e. MPI_Comm_spawn) on our Cray XC40
>> > at the Army Research Laboratory. After reading through the mailing list
>> > I came to the conclusion that the master branch is the only hope for
>> > getting this to work on the newer Cray machines.
>> >
>> > To test, I am using the cpi-master.c / cpi-worker.c example. The test
>> > works when executing on a small number of processors, five or fewer,
>> > but begins to fail with segmentation faults in orted when using more
>> > processors. Even with five or fewer processors, I am spreading the
>> > computation to more than one node. I am using the Cray ugni btl through
>> > the alps scheduler.
>> >
>> > I get a core file from orted and have tracked the seg fault down to
>> > pmix_server_process_msgs.c:420, where req->proxy is NULL. I have tried
>> > reading the code to understand how this happens, but am unsure. I do
>> > see that in the if statement where I take the else branch, the other
>> > branch specifically checks "if (NULL == req->proxy)"; however, no such
>> > check is done in the else branch.
>> >
>> > I have debug output dumped for the failing runs. I can provide the
>> > output along with ompi_info output and config.log to anyone who is
>> > interested.
>> >
>> > - Ken Leiter
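Schematically, the hazard Ken describes looks like the sketch below. This
is an illustrative reconstruction, not the actual pmix_server_process_msgs.c
source; the struct and function names are invented:

/* One branch guards against a NULL proxy while the other path does not;
 * if a racing spawn can leave req->proxy NULL on the second path, that
 * path segfaults when the proxy is used. */
#include <stddef.h>

typedef struct {
    void *proxy;   /* peer that sent the request; NULL in the crash */
    int   local;
} server_req_t;

static int process_req(server_req_t *req)
{
    if (req->local) {
        if (NULL == req->proxy) {  /* this branch checks... */
            return -1;             /* ...and drops the request */
        }
        /* ... use req->proxy ... */
    } else {
        /* ...but this branch proceeds without the same check, which is
         * where a defensive "if (NULL == req->proxy)" guard would
         * prevent the segfault */
        /* ... use req->proxy ... */
    }
    return 0;
}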