Been tied up the last few days, but I did spend some time thinking about this some more, and I think I'm going to leave the current behavior as-is and add a check that generates an error if you specify map-by core along with cpus-per-proc. My reasoning is that map-by core is a very specific directive: you are telling me to map each process to a specific core. If you then tell me to bind that process to multiple cpus, you are creating an inherent conflict that I don't readily know how to resolve.
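To be concrete, the check I have in mind would sit near the top of the mapper and look roughly like this (a sketch only - the exact macro names would need to match whatever the 1.7 rmaps base actually uses, and the help-text tag shown here is made up, not an existing one):

    /* refuse the conflicting combination before we try to map anything */
    if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY(jdata->map->mapping) &&
        orte_rmaps_base.cpus_per_rank > 1) {
        /* print a help message and silently abort the mapping */
        orte_show_help("help-orte-rmaps-base.txt",
                       "orte-rmaps-base:mapby-core-cpus-conflict",  /* hypothetical tag */
                       true, orte_rmaps_base.cpus_per_rank);
        return ORTE_ERR_SILENT;
    }

Failing fast in the mapper keeps the resolution unambiguous rather than trying to guess which of the two directives should win.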
IMO, the best solution is to generate an error and suggest you map-by slot instead. This frees me to bind as many cpus to that allocated slot as you care to specify, and removes the conflict.

HTH
Ralph

On Jan 22, 2014, at 9:37 PM, tmish...@jcity.maeda.co.jp wrote:

> Thanks for your explanation, Ralph.
>
> But it's really subtle to understand for me ...
> Anyway, I'd like to report what I found through verbose output.
>
> "-map-by core" calls "bind in place" as below:
>
> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:11362] mca:rmaps: compute bindings for job [8729,1] with policy CORE
> [manage.cluster:11362] mca:rmaps: bindings for job [8729,1] - core to core
> [manage.cluster:11362] mca:rmaps: bind in place for job [8729,1] with bindings CORE
> ...
>
> On the other hand, "-map-by slot" calls "bind downward" as below:
>
> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:12032] mca:rmaps: compute bindings for job [8571,1] with policy CORE
> [manage.cluster:12032] mca:rmaps: bind downward for job [8571,1] with bindings CORE
> ...
>
> I think your best guess is right and something is wrong with
> bind_in_place function. I have to say the logic of source code
> is so complex that I could not figure it out.
>
> Regards,
> Tetsuya Mishima
>
>> On Jan 22, 2014, at 8:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Thanks, Ralph.
>>>
>>> I have one more question. I'm sorry to ask you many things ...
>>
>> Not a problem
>>
>>> Could you tell me the difference between "map-by slot" and "map-by core".
>>> From my understanding, slot is the synonym of core.
>>
>> Not really - see below
>>
>>> But those behaviors using openmpi-1.7.4rc2 with the cpus-per-proc option
>>> are quite different as shown below. I tried to browse the source code
>>> but I could not make it clear so far.
>>
>> It is a little subtle, I fear. When you tell us "map-by slot", we assign
>> each process to an allocated slot without associating it to any specific
>> cpu or core. When we then bind to core (as we do by default), we balance
>> the binding across the sockets to improve performance.
>>
>> When you tell us "map-by core", then we directly associate each process
>> with a specific core. So when we bind, we bind you to that core. This
>> will cause us to fully use all the cores on the first socket before we
>> move to the next.
>>
>> I'm a little puzzled by your output as it appears that cpus-per-proc was
>> ignored, so that's something I'd have to look at more carefully. Best
>> guess is that we aren't skipping cores to account for the cpus-per-proc
>> setting, and thus the procs are being mapped to consecutive cores - which
>> wouldn't be very good if we then bound them to multiple neighboring cores
>> as they'd fall on top of each other.
>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> [ un-managed environment] (node05,06 has 8 cores each)
>>>
>>> [mishima@manage work]$ cat pbs_hosts
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> [node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> Hello world from process 2 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>>
>>> (note) I have the same behavior in the managed environment by Torque
>>>
>>>> Seems like a reasonable, minimal risk request - will do
>>>>
>>>> On Jan 22, 2014, at 4:28 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph, I want to ask you one more thing about default setting of
>>>>> num_procs when we don't specify the -np option and we set the
>>>>> cpus-per-proc > 1.
>>>>>
>>>>> In this case, the round_robin_mapper sets num_procs = num_slots as below:
>>>>>
>>>>> rmaps_rr.c:
>>>>> 130     if (0 == app->num_procs) {
>>>>> 131         /* set the num_procs to equal the number of slots on these mapped nodes */
>>>>> 132         app->num_procs = num_slots;
>>>>> 133     }
>>>>>
>>>>> However, because of cpus_per_rank > 1, this num_procs will be refused at
>>>>> the line 61 in rmaps_rr_mappers.c as below, unless we switch on the
>>>>> oversubscribe directive.
>>>>>
>>>>> rmaps_rr_mappers.c:
>>>>> 61     if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
>>>>> 62         if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
>>>>> 63             orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
>>>>> 64                            true, app->num_procs, app->app);
>>>>> 65             return ORTE_ERR_SILENT;
>>>>> 66         }
>>>>> 67     }
>>>>>
>>>>> Therefore, I think the default num_procs should be equal to the number of
>>>>> num_slots divided by cpus/rank:
>>>>>
>>>>> app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
>>>>>
>>>>> This would be more convenient for most of people who want to use the
>>>>> -cpus-per-proc option. I already confirmed it worked well. Please consider
>>>>> to apply this fix to 1.7.4.
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
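PS - for anyone skimming the thread, Tetsuya's proposed change amounts to dividing the default proc count by the cpus assigned to each rank at the spot quoted above. A minimal sketch, based only on the snippets in this thread and not a tested patch against 1.7.4:

    if (0 == app->num_procs) {
        /* let the default proc count account for the cpus given to each rank
         * so the oversubscription check further down doesn't reject it */
        app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
    }

A real patch would still want to guard against cpus_per_rank exceeding num_slots, since the integer division would then leave num_procs at zero.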