Hi Ralph,

Attached are the map and reservation output (I was adjusting the number of
spawned ranks using an environment variable).
I had one master which spawned 23 children.
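
For context, the spawning side of the test looks roughly like the minimal
sketch below. This is only an illustration, not the actual cpi-master.c: the
NUM_RANKS handling and the elided pi computation are assumptions here.

    /* Minimal sketch of a master that spawns workers with MPI_Comm_spawn,
     * taking the worker count from the NUM_RANKS environment variable. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        int nspawn = 23;                     /* default worker count */
        char *env = getenv("NUM_RANKS");     /* assumed variable name */

        MPI_Init(&argc, &argv);
        if (env != NULL) {
            nspawn = atoi(env);
        }

        /* Spawn the worker executable from the single master rank. */
        MPI_Comm_spawn("./cpi-worker", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        printf("master spawned %d workers\n", nspawn);

        /* ... the pi computation would use the "children" intercomm ... */

        MPI_Finalize();
        return 0;
    }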

Howard


2015-06-11 12:39 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> Howard: could you add --display-devel-map --display-allocation and send the
> output along? I'd like to see why it thinks you are oversubscribed.
>
> Thanks
>
>
> On Jun 11, 2015, at 11:36 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
> Hi Ken,
>
> Could you post the output of your ompi_info?
>
> I have PrgEnv-gnu/5.2.56 and gcc/4.9.2 loaded in my environment on the NERSC
> system, and I used the following configure line:
>
> ./configure --enable-mpi-java --prefix=my_favorite_install_location
>
> The general rule of thumb on Crays with master (though not with older
> versions) is that a plain ./configure with an install location is all you
> need for a vanilla build; there is no need for complicated platform files,
> etc.
>
> As you're probably guessing, I'm going to say it works for me, at least up
> to 68 slave ranks.
>
> I do notice there's some glitch with the mapping of the ranks though.  The
> binding logic seems
> to think there's oversubscription of cores even when there should not be.
> I had to use the
>
> --bind-to none
>
> option on the command line once I asked for more than 22 slave ranks.
> The edison system has 24 cores/node.
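>
> (For illustration, the workaround invocation is then along the lines of
>
> mpirun -np 1 --bind-to none ./cpi-master
>
> rather than binding to cores.)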
>
> Howard
>
>
>
> 2015-06-11 12:10 GMT-06:00 Leiter, Kenneth W CIV USARMY ARL (US) <
> kenneth.w.leiter2....@mail.mil>:
>
>> I will try on a non-cray machine as well.
>>
>> - Ken
>>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
>> Pritchard
>> Sent: Thursday, June 11, 2015 12:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] orted segmentation fault in pmix on master
>>
>> Hello Ken,
>>
>> Could you give the details of the allocation request (qsub args) as well
>> as the mpirun command line args? I'm trying to reproduce on the nersc
>> system.
>>
>> It would also be interesting to know, if you have access to a similarly
>> sized non-Cray cluster, whether you get the same problems there.
>>
>> Howard
>>
>>
>> 2015-06-11 9:13 GMT-06:00 Ralph Castain <r...@open-mpi.org>:
>>
>>
>>         I don’t have a Cray, but let me see if I can reproduce this on
>> something else
>>
>>         > On Jun 11, 2015, at 7:26 AM, Leiter, Kenneth W CIV USARMY ARL
>>         > (US) <kenneth.w.leiter2....@mail.mil> wrote:
>>         >
>>         > Hello,
>>         >
>>         > I am attempting to use the openmpi development master for a
>> code that uses
>>         > dynamic process management (i.e. MPI_Comm_spawn) on our Cray
>> XC40 at the
>>         > Army Research Laboratory. After reading through the mailing
>> list I came to
>>         > the conclusion that the master branch is the only hope for
>> getting this to
>>         > work on the newer Cray machines.
>>         >
>>         > To test I am using the cpi-master.c cpi-worker.c example. The
>> test works
>>         > when executing on a small number of processors, five or less,
>> but begins to
>>         > fail with segmentation faults in orted when using more
>> processors. Even with
>>         > five or fewer processors, I am spreading the computation to
>> more than one
>>         > node. I am using the cray ugni btl through the alps scheduler.
>>         >
>>         > I get a core file from orted and have the seg fault tracked
>> down to
>>         > pmix_server_process_msgs.c:420 where req->proxy is NULL. I have
>> tried
>>         > reading the code to understand how this happens, but am unsure.
>> I do see
>>         > that the failing path takes the else branch of an if statement
>>         > whose other branch specifically checks "if (NULL == req->proxy)" -
>>         > however, no such check is done in the else branch.
>>         >
>>         > I have debug output dumped for the failing runs. I can provide
>> the output
>>         > along with ompi_info output and config.log to anyone who is
>> interested.
>>         >
>>         > - Ken Leiter
>>         >
>
hpp@nid02051:~>export NUM_RANKS=23
hpp@nid02051:~>mpirun -np 1 --bind-to core --oversubscribe  -display-allocation 
--display-map ./cpi-master

======================   ALLOCATED NODES   ======================
        5600: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
        5609: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
        5610: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
        5636: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
 Data for JOB [46913,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: 5600    Num slots: 24   Max slots: 0    Num procs: 1    
resolved from 5600
        resolved from 10.128.22.13
        resolved from 128.55.72.44

        Process OMPI jobid: [46913,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 
0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

 =============================================================
./cpi-master -> ./cpi-worker

======================   ALLOCATED NODES   ======================
        5600: slots=24 max_slots=0 slots_inuse=1 state=UNKNOWN
        5609: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
        5610: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
        5636: slots=24 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
 Data for JOB [46913,2] offset 1

 ========================   JOB MAP   ========================

 Data for node: 5600    Num slots: 24   Max slots: 0    Num procs: 24   
resolved from 5600
        resolved from 10.128.22.13
        resolved from 128.55.72.44

        Process OMPI jobid: [46913,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 
0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 0 Bound: socket 
0[core 1[hwt 
0-1]]:[../BB/../../../../../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 1 Bound: socket 
1[core 12[hwt 
0-1]]:[../../../../../../../../../../../..][BB/../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 2 Bound: socket 
0[core 2[hwt 
0-1]]:[../../BB/../../../../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 3 Bound: socket 
1[core 13[hwt 
0-1]]:[../../../../../../../../../../../..][../BB/../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 4 Bound: socket 
0[core 3[hwt 
0-1]]:[../../../BB/../../../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 5 Bound: socket 
1[core 14[hwt 
0-1]]:[../../../../../../../../../../../..][../../BB/../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 6 Bound: socket 
0[core 4[hwt 
0-1]]:[../../../../BB/../../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 7 Bound: socket 
1[core 15[hwt 
0-1]]:[../../../../../../../../../../../..][../../../BB/../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 8 Bound: socket 
0[core 5[hwt 
0-1]]:[../../../../../BB/../../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 9 Bound: socket 
1[core 16[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../BB/../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 10 Bound: socket 
0[core 6[hwt 
0-1]]:[../../../../../../BB/../../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 11 Bound: socket 
1[core 17[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../BB/../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 12 Bound: socket 
0[core 7[hwt 
0-1]]:[../../../../../../../BB/../../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 13 Bound: socket 
1[core 18[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../../BB/../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 14 Bound: socket 
0[core 8[hwt 
0-1]]:[../../../../../../../../BB/../../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 15 Bound: socket 
1[core 19[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../../../BB/../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 16 Bound: socket 
0[core 9[hwt 
0-1]]:[../../../../../../../../../BB/../..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 17 Bound: socket 
1[core 20[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../../../../BB/../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 18 Bound: socket 
0[core 10[hwt 
0-1]]:[../../../../../../../../../../BB/..][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 19 Bound: socket 
1[core 21[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../../../../../BB/../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 20 Bound: socket 
0[core 11[hwt 
0-1]]:[../../../../../../../../../../../BB][../../../../../../../../../../../..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 21 Bound: socket 
1[core 22[hwt 
0-1]]:[../../../../../../../../../../../..][../../../../../../../../../../BB/..]
        Process OMPI jobid: [46913,2] App: 0 Process rank: 22 Bound: socket 
0[core 0[hwt 
0-1]]:[BB/../../../../../../../../../../..][../../../../../../../../../../../..]

 =============================================================
pi: 3.1416009869231249, error: 0.0000083333333318
