Hi Ken,

I haven't forgotten you. We've been in meetings this week, which has limited my time, but I am working on a replacement for that entire code block that should resolve the problem. I hope to have it soon.

Ralph
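For reference while that replacement is pending, here is a minimal sketch of the allocation Ken found missing (see his message below). Only the "reply = OBJ_NEW(opal_buffer_t);" line comes from his report; the payload, error handling, and control flow are illustrative stand-ins for the real pmix_server_process_msgs.c path, not the actual source:

    /* Sketch, not the actual OMPI code: allocate the reply buffer before
     * filling it, per the valgrind report below. opal_buffer_t, OBJ_NEW,
     * opal_dss.pack, and ORTE_ERROR_LOG are real OPAL/ORTE facilities;
     * everything else here is hypothetical. */
    opal_buffer_t *reply;
    int rc, status = 0;                      /* placeholder payload */

    reply = OBJ_NEW(opal_buffer_t);          /* the allocation valgrind flags as missing */
    if (OPAL_SUCCESS != (rc = opal_dss.pack(reply, &status, 1, OPAL_INT))) {
        ORTE_ERROR_LOG(rc);
        OBJ_RELEASE(reply);                  /* avoid leaking on the error path */
        return;
    }
    /* ... hand 'reply' to the send path, releasing it when the send completes ... */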
On Thu, Jun 25, 2015 at 2:12 PM, Leiter, Kenneth W CIV USARMY ARL (US) <kenneth.w.leiter2....@mail.mil> wrote:

> Hi Ralph,
>
> I had some time this afternoon to work on this problem further and
> discovered some more info.
>
> I used valgrind to attach to orted and collected logs of its output.
>
> I get many uninitialized-value errors in pmix_server_process_msgs.c
> beginning at line 378. It appears that reply is never allocated. If I add
> "reply = OBJ_NEW(opal_buffer_t);" before filling reply, those errors from
> valgrind go away. Whether that is the correct fix I do not know.
>
> Unfortunately this doesn't solve my problem of orted crashing. I now
> consistently get a single error detected by valgrind:
>
> ==29602== Process terminating with default action of signal 11 (SIGSEGV): dumping core
> ==29602==  Access not within mapped region at address 0x48
> ==29602==    at 0x4E6E2FA: orte_util_print_name_args (name_fns.c:142)
> ==29602==    by 0xCABE394: orte_rml_oob_send_buffer_nb (rml_oob_send.c:269)
> ==29602==    by 0x4ED621E: pmix_server_process_message (pmix_server_process_msgs.c:421)
> ==29602==    by 0x4EC2606: pmix_server_recv_handler (pmix_server_sendrecv.c:446)
> ==29602==    by 0x528D31C: opal_libevent2022_event_base_loop (event.c:1321)
> ==29602==    by 0x4EA3142: orte_daemon (orted_main.c:864)
> ==29602==    by 0x401073: main (orted.c:60)
>
> From the core file I get from orted, I see that req->proxy is NULL in
> pmix_server_process_msgs.c:421. How this arises, I do not know.
>
> Thanks,
> Ken Leiter
>
> ------------------------------
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Thursday, June 11, 2015 4:27 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>
> Yeah, I'd put my money on a race condition under that scenario. I don't
> have anything that large I can test on, but I'll see what I can do.
>
> On Jun 11, 2015, at 1:17 PM, Leiter, Kenneth W CIV USARMY ARL (US)
> <kenneth.w.leiter2....@mail.mil> wrote:
>
> Yes, each parent launches ten children and no other parents participate
> in that spawn (i.e. the spawn uses MPI_COMM_SELF as the communicator).
>
> No threading.
>
> I am using the example from
> https://github.com/bfroehle/mpi4py/tree/master/demo/spawning
> in lieu of my actual application, which has a lot more moving parts.
>
> After rerunning many times, it sometimes completes successfully and
> other times seg faults the daemon.
>
> - Ken
>
> ------------------------------
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Thursday, June 11, 2015 4:09 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>
> So to be clear: each parent launches 10 children, and no other parents
> participate in that spawn?
>
> And there is no threading in the app, yes?
>
> On Jun 11, 2015, at 12:53 PM, Leiter, Kenneth W CIV USARMY ARL (US)
> <kenneth.w.leiter2....@mail.mil> wrote:
>
> Howard,
>
> I do not run into a problem when I have one parent spawning many
> children (tested up to 100 child ranks), but am seeing the problem when
> I have, for example, 8 parents launching 10 children each.
>
> - Ken
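The spawn pattern described above - several parents, each spawning its own children over MPI_COMM_SELF - looks roughly like this in C. This is a generic reproducer sketch, not Ken's actual test code; the "./worker" executable name is a placeholder:

    /* parent.c: every parent rank independently spawns 10 children over
     * MPI_COMM_SELF, so multiple spawns proceed concurrently. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm children;
        int errcodes[10];

        MPI_Init(&argc, &argv);

        /* root = 0 is relative to MPI_COMM_SELF, i.e. this rank alone */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 10, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, errcodes);

        /* ... exchange data with the children over the intercommunicator ... */

        MPI_Comm_disconnect(&children);
        MPI_Finalize();
        return 0;
    }

Launched with, e.g., "mpirun -np 8 ./parent", this matches the 8-parents-by-10-children case that triggers the crash.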
> ------------------------------
> From: users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [hpprit...@gmail.com]
> Sent: Thursday, June 11, 2015 2:36 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>
> Hi Ken,
>
> Could you post the output of your ompi_info?
>
> I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my env on the nersc
> system, with the following configure line:
>
> ./configure --enable-mpi-java --prefix=my_favorite_install_location
>
> The general rule of thumb on Crays with master (not with older versions,
> though) is that you should be able to do a ./configure (install location)
> and be ready to go; no need for complicated platform files, etc., just to
> build vanilla.
>
> As you're probably guessing, I'm going to say it works for me, at least
> up to 68 slave ranks.
>
> I do notice there's some glitch with the mapping of the ranks, though.
> The binding logic seems to think there's oversubscription of cores even
> when there should not be. I had to use the
>
> --bind-to none
>
> option on the command line once I asked for more than 22 slave ranks.
> The edison system has 24 cores/node.
>
> Howard
>
> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US)
> <kenneth.w.leiter2....@mail.mil>:
>
>> I will try on a non-cray machine as well.
>>
>> - Ken
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hello Ken,
>>
>> Could you give the details of the allocation request (qsub args) as well
>> as the mpirun command line args? I'm trying to reproduce on the nersc
>> system.
>>
>> It would be interesting, if you have access to a similar-size non-cray
>> cluster, to see whether you get the same problems.
>>
>> Howard
>>
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>> I don't have a Cray, but let me see if I can reproduce this on
>> something else.
>>
>> > On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL (US)
>> > <kenneth.w.leiter2....@mail.mil> wrote:
>> >
>> > Hello,
>> >
>> > I am attempting to use the Open MPI development master for a code that
>> > uses dynamic process management (i.e. MPI_Comm_spawn) on our Cray XC40
>> > at the Army Research Laboratory. After reading through the mailing list
>> > I came to the conclusion that the master branch is the only hope for
>> > getting this to work on the newer Cray machines.
>> >
>> > To test, I am using the cpi-master.c / cpi-worker.c example. The test
>> > works when executing on a small number of processors, five or fewer,
>> > but begins to fail with segmentation faults in orted when using more
>> > processors. Even with five or fewer processors, I am spreading the
>> > computation across more than one node. I am using the Cray ugni btl
>> > through the alps scheduler.
>> >
>> > I get a core file from orted and have the seg fault tracked down to
>> > pmix_server_process_msgs.c:420, where req->proxy is NULL. I have tried
>> > reading the code to understand how this happens, but am unsure. I do
>> > see that in the if statement where I take the else branch, the other
>> > branch specifically checks "if (NULL == req->proxy)" - however, no such
>> > check is done in the else branch.
>> >
>> > I have debug output dumped for the failing runs. I can provide the
>> > output along with ompi_info output and config.log to anyone who is
>> > interested.
>> >
>> > - Ken Leiter
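Both stack traces in this thread point at the reply being addressed through a NULL req->proxy reached from pmix_server_process_msgs.c. A defensive guard mirroring the check Ken saw on the other branch might look like the sketch below; apart from req->proxy and the ORTE macros, the names here are assumptions, not the actual source:

    /* Sketch only: replicate the other branch's NULL check before using
     * req->proxy as the reply destination. ORTE_ERROR_LOG and
     * ORTE_ERR_NOT_FOUND are real ORTE facilities; the surrounding
     * control flow is hypothetical. */
    if (NULL == req->proxy) {
        ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);  /* proxy never set for this request */
        return;
    }
    /* safe to send the reply to req->proxy from here on */

Of course, a guard like this would only avoid the dereference; it would not explain why req->proxy was never set, which is consistent with Ralph's race-condition guess earlier in the thread.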