I would consider that a bug, myself - if there is some resource available, we should use it
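(For illustration only: at the hwloc level, "use whatever the cpuset gives us" amounts to enumerating the cores that are actually inside the allowed cpuset instead of assuming a fixed number of cores per socket. The sketch below is a hypothetical example against the hwloc 1.x C API, not Open MPI's actual rmaps code; with a cpuset like the one described below, one allowed core on socket 0 and two on socket 1, handing out the cores in the order they are found places the third process on socket 1 rather than overloading socket 0. Build with -lhwloc.)

    #include <hwloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        /* The cpuset we are actually allowed to use (e.g. the
         * cgroup/cpuset handed to us by slurm). */
        hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);

        int ncores = hwloc_get_nbobjs_inside_cpuset_by_type(topo, allowed,
                                                            HWLOC_OBJ_CORE);
        if (ncores == 0) {
            fprintf(stderr, "no cores inside the allowed cpuset\n");
            return 1;
        }

        int nprocs = 3;  /* e.g. mpirun -np 3 */
        for (int p = 0; p < nprocs; p++) {
            /* Take the next core that is inside the cpuset, regardless of
             * which socket it sits on; wrap around only when every allowed
             * core already has a process (the "overload" case). */
            hwloc_obj_t core = hwloc_get_obj_inside_cpuset_by_type(
                topo, allowed, HWLOC_OBJ_CORE, p % ncores);
            char *str;
            hwloc_bitmap_asprintf(&str, core->cpuset);
            printf("proc %d -> core os_index %u (cpuset %s)\n",
                   p, core->os_index, str);
            free(str);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }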
> On Oct 4, 2015, at 5:42 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Marcin,
>
> I ran a simple test with v1.10.1rc1 under a cpuset with
> - one core (two threads 0,16) on socket 0
> - two cores (two threads each 8,9,24,25) on socket 1
>
> $ mpirun -np 3 -bind-to core ./hello_c
> --------------------------------------------------------------------------
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node: rapid
> #processes: 2
> #cpus: 1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --------------------------------------------------------------------------
>
> As you already pointed out, the default mapping is by socket.
>
> So on one hand, we can consider this behavior a feature: we try to bind two processes to socket 0, so the --oversubscribe option is required, and it does what it should:
>
> $ mpirun -np 3 -bind-to core --oversubscribe -report-bindings ./hello_c
> [rapid:16278] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
> [rapid:16278] MCW rank 1 bound to socket 1[core 8[hwt 0-1]]: [../../../../../../../..][BB/../../../../../../..]
> [rapid:16278] MCW rank 2 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]
> Hello, world, I am 1 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 2 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
> Hello, world, I am 0 of 3, (Open MPI v1.10.1rc1, package: Open MPI gilles@rapid Distribution, ident: 1.10.1rc1, repo rev: v1.10.0-84-g15ae63f, Oct 03, 2015, 128)
>
> On the other hand, we could consider that ompi should be a bit smarter and use socket 1 for task 2, since socket 0 is fully allocated and there is room on socket 1.
>
> Ralph, any thoughts? Bug or feature?
>
>
> Marcin,
>
> You mentioned you had one failure with 1.10.1rc1 and -bind-to core. Could you please send the full details (script, allocation and output)? In your slurm script, you can run
>
> srun -N $SLURM_NNODES -n $SLURM_NNODES --cpu_bind=none -l grep Cpus_allowed_list /proc/self/status
>
> before invoking mpirun.
>
> Cheers,
>
> Gilles
>
> On 10/4/2015 11:55 PM, marcin.krotkiewski wrote:
>> Hi, all,
>>
>> I played a bit more and it seems that the problem results from
>>
>> trg_obj = opal_hwloc_base_find_min_bound_target_under_obj()
>>
>> called in rmaps_base_binding.c / bind_downwards returning a wrong object. I do not know the reason, but I think I know when the problem happens (at least on 1.10.1rc1). It seems that by default openmpi maps by socket. The error happens when, for a given compute node, a different number of cores is used on each socket. Consider the previously studied case (the debug outputs I sent in the last post). c1-8, which was the source of the error, has 5 MPI processes assigned, and the cpuset is the following:
>>
>> 0, 5, 9, 13, 14, 16, 21, 25, 29, 30
>>
>> Cores 0 and 5 are on socket 0; cores 9, 13 and 14 are on socket 1. Binding progresses correctly up to and including core 13 (see the end of file out.1.10.1rc2, before the error). That is 2 cores on socket 0 and 2 cores on socket 1. The error is thrown when core 14 should be bound - the extra core on socket 1 with no corresponding core on socket 0.
>> At that point, the returned trg_obj points to the first core on the node (os_index 0, socket 0).
>>
>> I have submitted a few other jobs and I always had an error in such a situation. Moreover, if I now use --map-by core instead of socket, the error is gone, and I get my expected binding:
>>
>> rank 0 @ compute-1-2.local 1, 17,
>> rank 1 @ compute-1-2.local 2, 18,
>> rank 2 @ compute-1-2.local 3, 19,
>> rank 3 @ compute-1-2.local 4, 20,
>> rank 4 @ compute-1-4.local 1, 17,
>> rank 5 @ compute-1-4.local 15, 31,
>> rank 6 @ compute-1-8.local 0, 16,
>> rank 7 @ compute-1-8.local 5, 21,
>> rank 8 @ compute-1-8.local 9, 25,
>> rank 9 @ compute-1-8.local 13, 29,
>> rank 10 @ compute-1-8.local 14, 30,
>> rank 11 @ compute-1-13.local 3, 19,
>> rank 12 @ compute-1-13.local 4, 20,
>> rank 13 @ compute-1-13.local 5, 21,
>> rank 14 @ compute-1-13.local 6, 22,
>> rank 15 @ compute-1-13.local 7, 23,
>> rank 16 @ compute-1-16.local 12, 28,
>> rank 17 @ compute-1-16.local 13, 29,
>> rank 18 @ compute-1-16.local 14, 30,
>> rank 19 @ compute-1-16.local 15, 31,
>> rank 20 @ compute-1-23.local 2, 18,
>> rank 29 @ compute-1-26.local 11, 27,
>> rank 21 @ compute-1-23.local 3, 19,
>> rank 30 @ compute-1-26.local 13, 29,
>> rank 22 @ compute-1-23.local 4, 20,
>> rank 31 @ compute-1-26.local 15, 31,
>> rank 23 @ compute-1-23.local 8, 24,
>> rank 27 @ compute-1-26.local 1, 17,
>> rank 24 @ compute-1-23.local 13, 29,
>> rank 28 @ compute-1-26.local 6, 22,
>> rank 25 @ compute-1-23.local 14, 30,
>> rank 26 @ compute-1-23.local 15, 31,
>>
>> Using --map-by core seems to fix the issue on 1.8.8, 1.10.0 and 1.10.1rc1. However, there is still a difference in behavior between 1.10.1rc1 and the earlier versions. In the SLURM job described in the last post, 1.10.1rc1 fails to bind in only 1 case, while the earlier versions fail in 21 out of 32 cases. You mentioned there was a bug in hwloc; I am not sure whether it can explain the difference in behavior.
>>
>> Hope this helps to nail this down.
>>
>> Marcin
>>
>> On 10/04/2015 09:55 AM, Gilles Gouaillardet wrote:
>>> Ralph,
>>>
>>> I suspect ompi tries to bind to threads outside the cpuset. This could be pretty similar to a previous issue, when ompi tried to bind to cores outside the cpuset. /* when a core has more than one thread, would ompi assume all the threads are available if the core is available? */ I will investigate this starting tomorrow.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sunday, October 4, 2015, Ralph Castain <r...@open-mpi.org> wrote:
>>> Thanks - please go ahead and release that allocation, as I'm not going to get to this immediately. I've got several hot irons in the fire right now, and I'm not sure when I'll get a chance to track this down.
>>>
>>> Gilles or anyone else who might have time - feel free to take a gander and see if something pops out at you.
>>>
>>> Ralph
>>>
>>>> On Oct 3, 2015, at 11:05 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>
>>>> Done. I have compiled 1.10.0 and 1.10.1rc1 with --enable-debug and executed
>>>>
>>>> mpirun --mca rmaps_base_verbose 10 --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>>>
>>>> In the case of 1.10.1rc1 I have also added :overload-allowed - the output is in a separate file. This option did not make much difference for 1.10.0, so I did not attach it here.
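(The source of the ./affinity test program used above was not posted in this thread. A minimal sketch of a program that would print the "rank N @ host cpu-list" lines shown above, assuming Linux and the GNU sched_getaffinity() interface, might look like the following; the actual program may of course differ.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Print "rank N @ hostname c0, c1, ..." where the list is the set of
     * logical cpus present in this process's sched_getaffinity() mask. */
    int main(int argc, char **argv)
    {
        int rank, len, cpu;
        char host[MPI_MAX_PROCESSOR_NAME];
        cpu_set_t mask;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);

        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_getaffinity");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        printf("rank %d @ %s ", rank, host);
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &mask))
                printf("%d, ", cpu);
        printf("\n");

        MPI_Finalize();
        return 0;
    }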
>>>>
>>>> The first thing I noted for 1.10.0 are lines like
>>>>
>>>> [login-0-1.local:03399] [[37945,0],0] GOT 1 CPUS
>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] BITMAP
>>>> [login-0-1.local:03399] [[37945,0],0] PROC [[37945,1],27] ON c1-26 IS NOT BOUND
>>>>
>>>> with an empty BITMAP.
>>>>
>>>> The SLURM environment is
>>>>
>>>> set | grep SLURM
>>>> SLURM_JOBID=12714491
>>>> SLURM_JOB_CPUS_PER_NODE='4,2,5(x2),4,7,5'
>>>> SLURM_JOB_ID=12714491
>>>> SLURM_JOB_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> SLURM_JOB_NUM_NODES=7
>>>> SLURM_JOB_PARTITION=normal
>>>> SLURM_MEM_PER_CPU=2048
>>>> SLURM_NNODES=7
>>>> SLURM_NODELIST='c1-[2,4,8,13,16,23,26]'
>>>> SLURM_NODE_ALIASES='(null)'
>>>> SLURM_NPROCS=32
>>>> SLURM_NTASKS=32
>>>> SLURM_SUBMIT_DIR=/cluster/home/marcink
>>>> SLURM_SUBMIT_HOST=login-0-1.local
>>>> SLURM_TASKS_PER_NODE='4,2,5(x2),4,7,5'
>>>>
>>>> I have submitted an interactive job on screen for 120 hours now, so that I can work with one example and not change it for every post :)
>>>>
>>>> If you need anything else, let me know. I could introduce some patches/printfs and recompile, if you need it.
>>>>
>>>> Marcin
>>>>
>>>> On 10/03/2015 07:17 PM, Ralph Castain wrote:
>>>>> Rats - I just realized I have no way to test this, as none of the machines I can access are set up for cgroup-based multi-tenancy. Is this a debug version of OMPI? If not, can you rebuild OMPI with --enable-debug?
>>>>>
>>>>> Then please run it with --mca rmaps_base_verbose 10 and pass along the output.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>> On Oct 3, 2015, at 10:09 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> What version of slurm is this? I might try to debug it here. I'm not sure where the problem lies just yet.
>>>>>>
>>>>>>> On Oct 3, 2015, at 8:59 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>
>>>>>>> Here is the output of lstopo. In short, (0,16) are core 0, (1,17) are core 1, etc.
>>>>>>>
>>>>>>> Machine (64GB)
>>>>>>>   NUMANode L#0 (P#0 32GB)
>>>>>>>     Socket L#0 + L3 L#0 (20MB)
>>>>>>>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>>>>>>>         PU L#0 (P#0)
>>>>>>>         PU L#1 (P#16)
>>>>>>>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>>>>>>>         PU L#2 (P#1)
>>>>>>>         PU L#3 (P#17)
>>>>>>>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>>>>>>>         PU L#4 (P#2)
>>>>>>>         PU L#5 (P#18)
>>>>>>>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>>>>>>>         PU L#6 (P#3)
>>>>>>>         PU L#7 (P#19)
>>>>>>>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>>>>>>>         PU L#8 (P#4)
>>>>>>>         PU L#9 (P#20)
>>>>>>>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>>>>>>>         PU L#10 (P#5)
>>>>>>>         PU L#11 (P#21)
>>>>>>>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>>>>>>>         PU L#12 (P#6)
>>>>>>>         PU L#13 (P#22)
>>>>>>>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>>>>>>>         PU L#14 (P#7)
>>>>>>>         PU L#15 (P#23)
>>>>>>>     HostBridge L#0
>>>>>>>       PCIBridge
>>>>>>>         PCI 8086:1521
>>>>>>>           Net L#0 "eth0"
>>>>>>>         PCI 8086:1521
>>>>>>>           Net L#1 "eth1"
>>>>>>>       PCIBridge
>>>>>>>         PCI 15b3:1003
>>>>>>>           Net L#2 "ib0"
>>>>>>>           OpenFabrics L#3 "mlx4_0"
>>>>>>>       PCIBridge
>>>>>>>         PCI 102b:0532
>>>>>>>       PCI 8086:1d02
>>>>>>>         Block L#4 "sda"
>>>>>>>   NUMANode L#1 (P#1 32GB) + Socket L#1 + L3 L#1 (20MB)
>>>>>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>>>>>>>       PU L#16 (P#8)
>>>>>>>       PU L#17 (P#24)
>>>>>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>>>>>>>       PU L#18 (P#9)
>>>>>>>       PU L#19 (P#25)
>>>>>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>>>>>>>       PU L#20 (P#10)
>>>>>>>       PU L#21 (P#26)
>>>>>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>>>>>>>       PU L#22 (P#11)
>>>>>>>       PU L#23 (P#27)
>>>>>>>     L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>>>>>>>       PU L#24 (P#12)
>>>>>>>       PU L#25 (P#28)
>>>>>>>     L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>>>>>>>       PU L#26 (P#13)
>>>>>>>       PU L#27 (P#29)
>>>>>>>     L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>>>>>>>       PU L#28 (P#14)
>>>>>>>       PU L#29 (P#30)
>>>>>>>     L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>>>>>>>       PU L#30 (P#15)
>>>>>>>       PU L#31 (P#31)
>>>>>>>
>>>>>>> On 10/03/2015 05:46 PM, Ralph Castain wrote:
>>>>>>>> Maybe I'm just misreading your HT map - that slurm nodelist syntax is a new one to me, but they tend to change things around. Could you run lstopo on one of those compute nodes and send the output?
>>>>>>>>
>>>>>>>> I'm just suspicious because I'm not seeing a clear pairing of HT numbers in your output, but HT numbering is BIOS-specific and I may just not be understanding your particular pattern. Our error message is clearly indicating that we are seeing individual HTs (and not complete cores) assigned, and I don't know the source of that confusion.
>>>>>>>>
>>>>>>>>> On Oct 3, 2015, at 8:28 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On 10/03/2015 04:38 PM, Ralph Castain wrote:
>>>>>>>>>> If mpirun isn't trying to do any binding, then you will of course get the right mapping as we'll just inherit whatever we received.
>>>>>>>>> Yes. I meant that whatever you received (what SLURM gives) is a correct cpu map and assigns _whole_ CPUs, not single HTs, to MPI processes. In the case mentioned earlier, openmpi should start 6 tasks on c1-30. If HTs were treated as separate and independent cores, sched_getaffinity of an MPI process started on c1-30 would return a map with only 6 entries. In my case it returns a map with 12 entries - 2 for each core. So one process is in fact allocated both HTs, not only one. Is what I'm saying correct?
>>>>>>>>>
>>>>>>>>>> Looking at your output, it's pretty clear that you are getting independent HTs assigned and not full cores.
>>>>>>>>> How do you mean? Is the above understanding wrong? I would expect that on c1-30 with --bind-to core openmpi should bind to logical cores 0 and 16 (rank 0), 1 and 17 (rank 1), and so on. All those logical cores are available in the sched_getaffinity map, and there are twice as many logical cores as there are MPI processes started on the node.
>>>>>>>>>
>>>>>>>>>> My guess is that something in slurm has changed such that it detects that HT has been enabled, and then begins treating the HTs as completely independent cpus.
>>>>>>>>>>
>>>>>>>>>> Try changing "-bind-to core" to "-bind-to hwthread -use-hwthread-cpus" and see if that works.
>>>>>>>>>>
>>>>>>>>> I have, and the binding is wrong. For example, I got this output:
>>>>>>>>>
>>>>>>>>> rank 0 @ compute-1-30.local 0,
>>>>>>>>> rank 1 @ compute-1-30.local 16,
>>>>>>>>>
>>>>>>>>> which means that two ranks have been bound to the same physical core (logical cores 0 and 16 are two HTs of the same core). If I use --bind-to core, I get the following correct binding:
>>>>>>>>>
>>>>>>>>> rank 0 @ compute-1-30.local 0, 16,
>>>>>>>>>
>>>>>>>>> The problem is that many other ranks get a bad binding, with a 'rank XXX is not bound (or bound to all available processors)' warning.
>>>>>>>>>
>>>>>>>>> But I think I was not entirely correct in saying that 1.10.1rc1 did not fix things. It still might have improved something, but not everything. Consider this job:
>>>>>>>>>
>>>>>>>>> SLURM_JOB_CPUS_PER_NODE='5,4,6,5(x2),7,5,9,5,7,6'
>>>>>>>>> SLURM_JOB_NODELIST='c8-[31,34],c9-[30-32,35-36],c10-[31-34]'
>>>>>>>>>
>>>>>>>>> If I run 32 tasks as follows (with 1.10.1rc1)
>>>>>>>>>
>>>>>>>>> mpirun --hetero-nodes --report-bindings --bind-to core -np 32 ./affinity
>>>>>>>>>
>>>>>>>>> I get the following error:
>>>>>>>>>
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> A request was made to bind to that would result in binding more
>>>>>>>>> processes than cpus on a resource:
>>>>>>>>>
>>>>>>>>> Bind to: CORE
>>>>>>>>> Node: c9-31
>>>>>>>>> #processes: 2
>>>>>>>>> #cpus: 1
>>>>>>>>>
>>>>>>>>> You can override this protection by adding the "overload-allowed"
>>>>>>>>> option to your binding directive.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> If I now use --bind-to core:overload-allowed, then openmpi starts and _most_ of the ranks are bound correctly (i.e., the map contains two logical cores in ALL cases), except for this case that required the overload flag:
>>>>>>>>>
>>>>>>>>> rank 15 @ compute-9-31.local 1, 17,
>>>>>>>>> rank 16 @ compute-9-31.local 11, 27,
>>>>>>>>> rank 17 @ compute-9-31.local 2, 18,
>>>>>>>>> rank 18 @ compute-9-31.local 12, 28,
>>>>>>>>> rank 19 @ compute-9-31.local 1, 17,
>>>>>>>>>
>>>>>>>>> Note that the pair 1,17 is used twice. The original SLURM-delivered map (no binding) on this node is
>>>>>>>>>
>>>>>>>>> rank 15 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>> rank 16 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>> rank 17 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>> rank 18 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>> rank 19 @ compute-9-31.local 1, 2, 11, 12, 13, 17, 18, 27, 28, 29,
>>>>>>>>>
>>>>>>>>> Why does openmpi use cores (1,17) twice instead of using core (13,29)? Clearly, the original SLURM-delivered map has 5 CPUs included, enough for 5 MPI processes.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Marcin
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Oct 3, 2015, at 7:12 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 10/03/2015 01:06 PM, Ralph Castain wrote:
>>>>>>>>>>>> Thanks Marcin. Looking at this, I'm guessing that Slurm may be treating HTs as "cores" - i.e., as independent cpus. Any chance that is true?
>>>>>>>>>>> Not to the best of my knowledge, and at least not intentionally. SLURM starts as many processes as there are physical cores, not threads. To verify this, consider this test case:
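(The test case Marcin refers to at the end of the quote above was trimmed from this reply. Separately, the Cpus_allowed_list check Gilles suggests near the top of the thread can also be done with a tiny C helper instead of grep; the sketch below is simply an equivalent of "grep Cpus_allowed_list /proc/self/status" on Linux. Launched through srun with --cpu_bind=none, as in Gilles' command, it shows the raw cpuset granted by slurm on each node; launched under mpirun it instead reflects whatever binding has already been applied.)

    #include <stdio.h>
    #include <string.h>

    /* Print this process's Cpus_allowed_list line, i.e. the cpus the
     * kernel (cgroup/cpuset) allows it to run on. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) {
            perror("/proc/self/status");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Cpus_allowed_list:", 18) == 0)
                fputs(line, stdout);
        fclose(f);
        return 0;
    }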