Thanks for your explanation, Ralph.
But it's still quite subtle for me to understand ... Anyway, I'd like to report
what I found through the verbose output. "-map-by core" calls "bind in place",
as below:

[mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
...
[manage.cluster:11362] mca:rmaps: compute bindings for job [8729,1] with policy CORE
[manage.cluster:11362] mca:rmaps: bindings for job [8729,1] - core to core
[manage.cluster:11362] mca:rmaps: bind in place for job [8729,1] with bindings CORE
...

On the other hand, "-map-by slot" calls "bind downward", as below:

[mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot -mca rmaps_base_verbose 10 ~/mis/openmpi/demos/myprog
...
[manage.cluster:12032] mca:rmaps: compute bindings for job [8571,1] with policy CORE
[manage.cluster:12032] mca:rmaps: bind downward for job [8571,1] with bindings CORE
...

I think your best guess is right and something is wrong with the bind_in_place
function. I have to say the logic of the source code is so complex that I could
not figure it out.

Regards,
Tetsuya Mishima


> On Jan 22, 2014, at 8:08 PM, tmish...@jcity.maeda.co.jp wrote:
>
> > Thanks, Ralph.
> >
> > I have one more question. I'm sorry to ask you many things ...
>
> Not a problem
>
> > Could you tell me the difference between "map-by slot" and "map-by core"?
> > From my understanding, slot is a synonym of core.
>
> Not really - see below
>
> > But those behaviors using openmpi-1.7.4rc2 with the cpus-per-proc option
> > are quite different, as shown below. I tried to browse the source code
> > but I could not make it clear so far.
>
> It is a little subtle, I fear. When you tell us "map-by slot", we assign
> each process to an allocated slot without associating it to any specific
> cpu or core. When we then bind to core (as we do by default), we balance
> the binding across the sockets to improve performance.
> When you tell us "map-by core", we directly associate each process with a
> specific core. So when we bind, we bind you to that core. This will cause
> us to fully use all the cores on the first socket before we move to the
> next.
>
> I'm a little puzzled by your output, as it appears that cpus-per-proc was
> ignored, so that's something I'd have to look at more carefully. Best
> guess is that we aren't skipping cores to account for the cpus-per-proc
> setting, and thus the procs are being mapped to consecutive cores - which
> wouldn't be very good if we then bound them to multiple neighboring cores,
> as they'd fall on top of each other.
>
> > Regards,
> > Tetsuya Mishima
> >
> > [un-managed environment] (node05 and node06 have 8 cores each)
> >
> > [mishima@manage work]$ cat pbs_hosts
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node05
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > node06
> > [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by slot ~/mis/openmpi/demos/myprog
> > [node05.cluster:23949] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > [node05.cluster:23949] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > [node06.cluster:22139] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], socket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> > [node06.cluster:22139] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> > Hello world from process 0 of 4
> > Hello world from process 1 of 4
> > Hello world from process 3 of 4
> > Hello world from process 2 of 4
> > [mishima@manage work]$ mpirun -np 4 -hostfile pbs_hosts -report-bindings -cpus-per-proc 4 -map-by core ~/mis/openmpi/demos/myprog
> > [node05.cluster:23985] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> > [node05.cluster:23985] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> > [node06.cluster:22175] MCW rank 3 bound to socket 0[core 1[hwt 0]]: [./B/./.][./././.]
> > [node06.cluster:22175] MCW rank 2 bound to socket 0[core 0[hwt 0]]: [B/././.][./././.]
> > Hello world from process 2 of 4
> > Hello world from process 3 of 4
> > Hello world from process 0 of 4
> > Hello world from process 1 of 4
> >
> > (note) I see the same behavior in the managed environment under Torque.
> >
> >> Seems like a reasonable, minimal-risk request - will do
> >>
> >> On Jan 22, 2014, at 4:28 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>> Hi Ralph, I want to ask you one more thing, about the default setting
> >>> of num_procs when we don't specify the -np option and we set
> >>> cpus-per-proc > 1.
> >>>
> >>> In this case, the round-robin mapper sets num_procs = num_slots, as
> >>> below:
> >>>
> >>> rmaps_rr.c:
> >>> 130    if (0 == app->num_procs) {
> >>> 131        /* set the num_procs to equal the number of slots on these
> >>>               mapped nodes */
> >>> 132        app->num_procs = num_slots;
> >>> 133    }
> >>>
> >>> However, because cpus_per_rank > 1, this num_procs will be refused at
> >>> line 61 in rmaps_rr_mappers.c, as below, unless we switch on the
> >>> oversubscribe directive.
> >>> rmaps_rr_mappers.c:
> >>> 61    if (num_slots < ((int)app->num_procs * orte_rmaps_base.cpus_per_rank)) {
> >>> 62        if (ORTE_MAPPING_NO_OVERSUBSCRIBE & ORTE_GET_MAPPING_DIRECTIVE(jdata->map->mapping)) {
> >>> 63            orte_show_help("help-orte-rmaps-base.txt", "orte-rmaps-base:alloc-error",
> >>> 64                           true, app->num_procs, app->app);
> >>> 65            return ORTE_ERR_SILENT;
> >>> 66        }
> >>> 67    }
> >>>
> >>> Therefore, I think the default num_procs should be equal to num_slots
> >>> divided by cpus_per_rank:
> >>>
> >>>     app->num_procs = num_slots / orte_rmaps_base.cpus_per_rank;
> >>>
> >>> This would be more convenient for most people who want to use the
> >>> -cpus-per-proc option. I already confirmed it works well. Please
> >>> consider applying this fix to 1.7.4.
> >>>
> >>> Regards,
> >>> Tetsuya Mishima
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users