Hi Ralph,

Attached is the map and reservation output. (I was adjusting the number of spawned ranks using an environment variable; I had one master which spawned 23 children.)
Howard

2015-06-11 12:39 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> Howard: could you add --display-devel-map --display-allocation and send the
> output along? I'd like to see why it thinks you are oversubscribed.
>
> Thanks
>
> On Jun 11, 2015, at 11:36 AM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hi Ken,
>
> Could you post the output of your ompi_info?
>
> I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my env on the NERSC system,
> with the following configure line:
>
> ./configure --enable-mpi-java --prefix=my_favorite_install_location
>
> The general rule of thumb on Crays with master (not with older versions,
> though) is that you should be able to do a ./configure with an install
> location and be ready to go; no need for complicated platform files, etc.,
> just to build vanilla.
>
> As you're probably guessing, I'm going to say it works for me, at least up
> to 68 slave ranks.
>
> I do notice there's some glitch with the mapping of the ranks, though. The
> binding logic seems to think there's oversubscription of cores even when
> there should not be. I had to use the
>
> --bind-to none
>
> option on the command line once I asked for more than 22 slave ranks. The
> Edison system has 24 cores/node.
>
> Howard
>
> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US)
> <kenneth.w.leiter2....@mail.mil>:
>
>> I will try on a non-Cray machine as well.
>>
>> - Ken
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hello Ken,
>>
>> Could you give the details of the allocation request (qsub args) as well
>> as the mpirun command line args? I'm trying to reproduce on the NERSC
>> system.
>>
>> It would also be interesting to know, if you have access to a similarly
>> sized non-Cray cluster, whether you get the same problems there.
>>
>> Howard
>>
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>> I don't have a Cray, but let me see if I can reproduce this on
>> something else.
>>
>>> On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL (US)
>>> <kenneth.w.leiter2....@mail.mil> wrote:
>>>
>>> Hello,
>>>
>>> I am attempting to use the Open MPI development master for a code that
>>> uses dynamic process management (i.e. MPI_Comm_spawn) on our Cray XC40
>>> at the Army Research Laboratory. After reading through the mailing list,
>>> I came to the conclusion that the master branch is the only hope for
>>> getting this to work on the newer Cray machines.
>>>
>>> To test, I am using the cpi-master.c / cpi-worker.c example. The test
>>> works when executing on a small number of processors, five or less, but
>>> begins to fail with segmentation faults in orted when using more
>>> processors. Even with five or fewer processors, I am spreading the
>>> computation to more than one node. I am using the Cray ugni btl through
>>> the alps scheduler.
>>>
>>> I get a core file from orted and have tracked the segfault down to
>>> pmix_server_process_msgs.c:420, where req->proxy is NULL. I have tried
>>> reading the code to understand how this happens, but am unsure. I do see
>>> that in the if statement where I take the else branch, the other branch
>>> specifically checks "if (NULL == req->proxy)"; however, no such check is
>>> done in the else branch.
>>>
>>> I have debug output dumped for the failing runs. I can provide the
>>> output along with ompi_info output and config.log to anyone who is
>>> interested.
>>>
>>> - Ken Leiter
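For reference, the spawn pattern under discussion (one master launching N workers, with N taken from an environment variable as in the run below) looks roughly like the following sketch. This is not the actual cpi-master.c from the thread; the NUM_RANKS variable, the worker binary name, and the structure are assumptions based on the run output, not the example's real source.

    /* Minimal sketch of the master side of the spawn test, under the
     * assumptions stated above (NOT the actual cpi-master.c). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        int nworkers = 23;                     /* default; overridden by NUM_RANKS */
        const char *env = getenv("NUM_RANKS"); /* knob used to vary the spawn size */

        MPI_Init(&argc, &argv);
        if (env != NULL) {
            nworkers = atoi(env);
        }

        /* One master spawns nworkers children; the runtime maps and binds them,
         * which is where the oversubscription complaint appears past 22 ranks. */
        MPI_Comm_spawn("./cpi-worker", MPI_ARGV_NULL, nworkers,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm,
                       MPI_ERRCODES_IGNORE);

        /* ... master's share of the pi computation and a reduce over the
         * intercommunicator would go here ... */

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }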
hpp@nid02051:~> export NUM_RANKS=23
hpp@nid02051:~> mpirun -np 1 --bind-to core --oversubscribe -display-allocation --display-map ./cpi-master

======================   ALLOCATED NODES   ======================
    5600: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
    5609: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
    5610: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
    5636: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
 Data for JOB [46913,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: 5600    Num slots: 24   Max slots: 0    Num procs: 1
    resolved from 5600
    resolved from 10.128.22.13
    resolved from 128.55.72.44
    Process OMPI jobid: [46913,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

 =============================================================
./cpi-master -> ./cpi-worker

======================   ALLOCATED NODES   ======================
    5600: slots=24 max_slots=0 slots_inuse=1 state=UNKNOWN
    5609: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
    5610: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
    5636: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
 Data for JOB [46913,2] offset 1

 ========================   JOB MAP   ========================

 Data for node: 5600    Num slots: 24   Max slots: 0    Num procs: 24
    resolved from 5600
    resolved from 10.128.22.13
    resolved from 128.55.72.44
    Process OMPI jobid: [46913,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 0 Bound: socket 0[core 1[hwt 0-1]]:[../BB/../../../../../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 1 Bound: socket 1[core 12[hwt 0-1]]:[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 2 Bound: socket 0[core 2[hwt 0-1]]:[../../BB/../../../../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 3 Bound: socket 1[core 13[hwt 0-1]]:[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 4 Bound: socket 0[core 3[hwt 0-1]]:[../../../BB/../../../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 5 Bound: socket 1[core 14[hwt 0-1]]:[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 6 Bound: socket 0[core 4[hwt 0-1]]:[../../../../BB/../../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 7 Bound: socket 1[core 15[hwt 0-1]]:[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 8 Bound: socket 0[core 5[hwt 0-1]]:[../../../../../BB/../../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 9 Bound: socket 1[core 16[hwt 0-1]]:[../../../../../../../../../../../..][../../../../BB/../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 10 Bound: socket 0[core 6[hwt 0-1]]:[../../../../../../BB/../../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 11 Bound: socket 1[core 17[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../BB/../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 12 Bound: socket 0[core 7[hwt 0-1]]:[../../../../../../../BB/../../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 13 Bound: socket 1[core 18[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../../BB/../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 14 Bound: socket 0[core 8[hwt 0-1]]:[../../../../../../../../BB/../../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 15 Bound: socket 1[core 19[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../../../BB/../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 16 Bound: socket 0[core 9[hwt 0-1]]:[../../../../../../../../../BB/../..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 17 Bound: socket 1[core 20[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../../../../BB/../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 18 Bound: socket 0[core 10[hwt 0-1]]:[../../../../../../../../../../BB/..][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 19 Bound: socket 1[core 21[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../../../../../BB/../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 20 Bound: socket 0[core 11[hwt 0-1]]:[../../../../../../../../../../../BB][../../../../../../../../../../../..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 21 Bound: socket 1[core 22[hwt 0-1]]:[../../../../../../../../../../../..][../../../../../../../../../../BB/..]
    Process OMPI jobid: [46913,2] App: 0 Process rank: 22 Bound: socket 0[core 0[hwt 0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

 =============================================================
pi: 3.1416009869231249, error: 0.0000083333333318
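The last entry in the map is the notable one: with the master (job [46913,1], rank 0) plus 23 spawned ranks on a 24-core node, spawned rank 22 is bound back onto socket 0, core 0, the core the master already occupies, while socket 1, core 23 is left idle. That doubling-up matches the binding logic's oversubscription complaint once more than 22 slave ranks are requested. The --bind-to none workaround described earlier would look roughly like this (the exact command line is an assumption based on that description, not taken from the thread):

    export NUM_RANKS=23
    mpirun -np 1 --bind-to none -display-allocation --display-map ./cpi-master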