Been tied up the last few days, but I did spend some time thinking about this some more, and I think I'm going to leave the current behavior as-is and add a check that generates an error if you specify map-by core along with cpus-per-proc. My reasoning is that map-by core is a very specific directive: you are telling me to map each process to a specific core. If you then tell me to bind that process to multiple cpus, you are creating an inherent conflict that I don't readily know how to resolve.
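To be concrete, the check I have in mind would sit near the top of the mapper and look roughly like this (a sketch only - the exact macro names would need to match whatever the 1.7 rmaps base actually uses, and the help-text tag shown here is made up, not an existing one):

    /* refuse the conflicting combination before we try to map anything */
    if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY(jdata->map->mapping) &&
        orte_rmaps_base.cpus_per_rank > 1) {
        /* print a help message and silently abort the mapping */
        orte_show_help("help-orte-rmaps-base.txt",
                       "orte-rmaps-base:mapby-core-cpus-conflict",  /* hypothetical tag */
                       true, orte_rmaps_base.cpus_per_rank);
        return ORTE_ERR_SILENT;
    }

Failing fast in the mapper keeps the resolution unambiguous rather than trying to guess which of the two directives should win.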
IMO, the best solution is to generate an error and suggest you map-by slot instead. This frees me to bind as many cpus to that allocated slot as you care to specify, and removes the conflict.

HTH
Ralph

On Jan 22, 2014, at 9:37 PM, tmish...@jcity.maeda.co.jp wrote:

> Thanks for your explanation, Ralph.
>
> But it's really subtle to understand for me ...
> Anyway, I'd like to report what I found through verbose output.
>
> "-map-by core" calls "bind in place" as below:
>
> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:11362] mca:rmaps: compute bindings for job [8729,1] with policy CORE
> [manage.cluster:11362] mca:rmaps: bindings for job [8729,1] - core to core
> [manage.cluster:11362] mca:rmaps: bind in place for job [8729,1] with bindings CORE
> ...
>
> On the other hand, "-map-by slot" calls "bind downward" as below:
>
> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
> ...
> [manage.cluster:12032] mca:rmaps: compute bindings for job [8571,1] with policy CORE
> [manage.cluster:12032] mca:rmaps: bind downward for job [8571,1] with bindings CORE
> ...
>
> I think your best guess is right and something is wrong with
> bind_in_place function. I have to say the logic of source code
> is so complex that I could not figure it out.
>
> Regards,
> Tetsuya Mishima
>
>> On Jan 22, 2014, at 8:08 PM, tmish...@jcity.maeda.co.jp wrote:
>>
>>> Thanks, Ralph.
>>>
>>> I have one more question. I'm sorry to ask you many things ...
>>
>> Not a problem
>>
>>> Could you tell me the difference between "map-by slot" and "map-by core".
>>> From my understanding, slot is the synonym of core.
>>
>> Not really - see below
>>
>>> But those behaviors using openmpi-1.7.4rc2 with the cpus-per-proc option
>>> are quite different as shown below. I tried to browse the source code
>>> but I could not make it clear so far.
>>
>> It is a little subtle, I fear. When you tell us "map-by slot", we assign
>> each process to an allocated slot without associating it to any specific
>> cpu or core. When we then bind to core (as we do by default), we balance
>> the binding across the sockets to improve performance.
>>
>> When you tell us "map-by core", then we directly associate each process
>> with a specific core. So when we bind, we bind you to that core. This
>> will cause us to fully use all the cores on the first socket before we
>> move to the next.
>>
>> I'm a little puzzled by your output as it appears that cpus-per-proc was
>> ignored, so that's something I'd have to look at more carefully. Best
>> guess is that we aren't skipping cores to account for the cpus-per-proc
>> setting, and thus the procs are being mapped to consecutive cores - which
>> wouldn't be very good if we then bound them to multiple neighboring cores
>> as they'd fall on top of each other.
>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>> [ un-managed environment] (node05,06 has 8 cores each)
>>>
>>> [mishima@manage work]$ cat pbs_hosts
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node05
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> node06
>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>>> [node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 2 of 4
>>> [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
>>> [node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> [node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
>>> [node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
>>> Hello world from process 2 of 4
>>> Hello world from process 3 of 4
>>> Hello world from process 0 of 4
>>> Hello world from process 1 of 4
>>>
>>> (note) I have the same behavior in the managed environment by Torque
>>>
>>>> Seems like a reasonable, minimal risk request - will do
>>>>
>>>> On Jan 22, 2014, at 4:28 PM, tmish...@jcity.maeda.co.jp wrote:
>>>>
>>>>> Hi Ralph, I want to ask you one more thing about default setting of
>>>>> num_procs when we don't specify the -np option and we set the
>>>>> cpus-per-proc > 1.
>>>>>
>>>>> In this case, the round_robin_mapper sets num_procs = num_slots as below:
>>>>>
>>>>> rmaps_rr.c:
>>>>> 130     if (0 == app->num_procs) {
>>>>> 131         /* set the num_procs to equal the number of slots on these mapped nodes */
>>>>> 132         app->num_procs = num_slots;
>>>>> 133     }
>>>>>
>>>>> However, because of cpus_per_rank > 1, this num_procs will be refused at
>>>>> the line 61 in rmaps_rr_mappers.c as below, unless we switch on the
>>>>> oversubscribe directive.
>>>>>
>>>>> rmaps_rr_mappers.c:
>>>>> 61     if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
>>>>> 62         if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
>>>>> 63             orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
>>>>> 64                            true, app->num_procs, app->app);
>>>>> 65             return ORTE_ERR_SILENT;
>>>>> 66         }
>>>>> 67     }
>>>>>
>>>>> Therefore, I think the default num_procs should be equal to the number of
>>>>> num_slots divided by cpus/rank:
>>>>>
>>>>> app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
>>>>>
>>>>> This would be more convenient for most of people who want to use the
>>>>> -cpus-per-proc option. I already confirmed it worked well. Please consider
>>>>> to apply this fix to 1.7.4.
>>>>>
>>>>> Regards,
>>>>> Tetsuya Mishima
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
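PS - for anyone skimming the thread, Tetsuya's proposed change amounts to dividing the default proc count by the cpus assigned to each rank at the spot quoted above. A minimal sketch, based only on the snippets in this thread and not a tested patch against 1.7.4:

    if (0 == app->num_procs) {
        /* let the default proc count account for the cpus given to each rank
         * so the oversubscription check further down doesn't reject it */
        app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
    }

A real patch would still want to guard against cpus_per_rank exceeding num_slots, since the integer division would then leave num_procs at zero.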