Hi Ralph,
I ran valgrind and found uninitialised-value errors. All of them occurred in opal_tree_add_child, as shown in the valgrind output below. As a quick fix, I put one line into "opal_tree.c", although it's not elegant:

void opal_tree_init(opal_tree_t *tree, opal_tree_comp_fn_t comp,
                    opal_tree_item_serialize_fn_t serialize,
                    opal_tree_item_deserialize_fn_t deserialize,
                    opal_tree_get_key_fn_t get_key)
{
    tree->comp = comp;
    tree->serialize = serialize;
    tree->deserialize = deserialize;
    tree->get_key = get_key;
    opal_tree_get_root(tree)->opal_tree_num_children = 0; /* added by tmishima */
}

With this change, all of the errors disappeared and Open MPI with lama worked fine.

As I told you before, I built Open MPI with PGI 13.10. As far as I checked, valgrind detected no errors with an Open MPI built by the GNU compiler, so the problem might be compiler-dependent. Anyway, I would like to ask you (or the Open MPI team) to continue the investigation.

Regards,
Tetsuya Mishima

valgrind -v --error-limit=no --leak-check=yes --show-reachable=no \
    mpirun -np 1 -mca rmaps lama -report-bindings -mca rmaps_base_verbose 100 \
    --display-map ~/Desktop/openmpi-1.7/demos/myprog 2>&1 | tee valgrind.log

....
==27313== Conditional jump or move depends on uninitialised value(s)
==27313== at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313== by 0x81E3314: rmaps_lama_convert_hwloc_subtree (rmaps_lama_max_tree.c:320)
==27313== by 0x81E321D: rmaps_lama_convert_hwloc_tree_to_opal_tree (rmaps_lama_max_tree.c:267)
==27313== by 0x81E2EE8: rmaps_lama_build_max_tree (rmaps_lama_max_tree.c:154)
==27313== by 0x81E0E58: orte_rmaps_lama_map_core (rmaps_lama_module.c:664)
==27313== by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
==27313== by 0x4C6468B: orte_rmaps_base_map_job (rmaps_base_map_job.c:204)
==27313== by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313== by 0x4F090D8: event_process_active (event.c:1434)
==27313== by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313== by 0x4079A6: orterun (orterun.c:1049)
==27313== by 0x40694A: main (main.c:13)
.....
==27313== Conditional jump or move depends on uninitialised value(s)
==27313== at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313== by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
==27313== by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
==27313== by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
==27313== by 0x81E2FF6: rmaps_lama_build_max_tree (rmaps_lama_max_tree.c:202)
==27313== by 0x81E0E58: orte_rmaps_lama_map_core (rmaps_lama_module.c:664)
==27313== by 0x81E02D7: orte_rmaps_lama_map (rmaps_lama_module.c:303)
==27313== by 0x4C6468B: orte_rmaps_base_map_job (rmaps_base_map_job.c:204)
==27313== by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313== by 0x4F090D8: event_process_active (event.c:1434)
==27313== by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313== by 0x4079A6: orterun (orterun.c:1049)
....
==27313== Conditional jump or move depends on uninitialised value(s)
==27313== at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313== by 0x4EC5D0E: deserialize_add_tree_item (opal_tree.c:496)
==27313== by 0x4EC5578: opal_tree_deserialize (opal_tree.c:524)
==27313== by 0x4EC5609: opal_tree_dup (opal_tree.c:544)
==27313== by 0x81E2FF6: ???
==27313== by 0x81E0E58: ???
==27313== by 0x81E02D7: ???
==27313== by 0x4C6468B: orte_rmaps_base_map_job (rmaps_base_map_job.c:204)
==27313== by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313== by 0x4F090D8: event_process_active (event.c:1434)
==27313== by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313== by 0x4079A6: orterun (orterun.c:1049)
.....
==27313== Conditional jump or move depends on uninitialised value(s)
==27313== at 0x4EC52A4: opal_tree_add_child (opal_tree.c:191)
==27313== by 0x81E3314: ???
==27313== by 0x81E321D: ???
==27313== by 0x81E2EE8: ???
==27313== by 0x81E0E58: ???
==27313== by 0x81E02D7: ???
==27313== by 0x4C6468B: orte_rmaps_base_map_job (rmaps_base_map_job.c:204)
==27313== by 0x4F094CC: event_process_active_single_queue (event.c:1366)
==27313== by 0x4F090D8: event_process_active (event.c:1434)
==27313== by 0x4F050FF: opal_libevent2021_event_base_loop (event.c:1645)
==27313== by 0x4079A6: orterun (orterun.c:1049)
==27313== by 0x40694A: main (main.c:13)
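Just to illustrate the class of error in isolation (this is not Open MPI code, only a toy program I wrote; the struct and function names are made up), valgrind prints "Conditional jump or move depends on uninitialised value(s)" whenever a branch reads a field that nothing ever initialised, which is the same situation as the opal_tree_num_children counter above:

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for opal_tree_item_t: only the field that matters here. */
struct toy_item {
    size_t num_children;
};

/* Rough analogue of opal_tree_add_child(): it branches on a counter
 * that the allocation below never initialised. */
static void toy_add_child(struct toy_item *parent)
{
    if (parent->num_children == 0) {   /* <-- valgrind flags this read */
        printf("adding first child\n");
    }
    parent->num_children++;
}

int main(void)
{
    /* malloc() leaves the memory uninitialised, like a constructor that
     * forgets to zero the field; calloc() or an explicit "= 0" (as in my
     * quick fix above) makes the warning go away. */
    struct toy_item *root = malloc(sizeof(*root));
    toy_add_child(root);
    free(root);
    return 0;
}

Because reading uninitialised memory gives whatever garbage happens to be there, the symptom can easily differ between a PGI build and a GNU build even though the underlying bug is the same, which would match what I observed.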
> Hi Ralph,
>
> Here is the output when I put "-mca rmaps_base_verbose 10 --display-map"
> and where it stopped(by gdb), which shows it stopped in a function of lama.
>
> I usually use PGI 13.10, so I tried to change it to gnu compiler.
> Then, it works. Therefore, this problem depends on compiler.
>
> That's all what I could find today.
>
> Regards,
> Tetsuya Mishima
>
> [mishima@manage ~]$ gdb
> GNU gdb (GDB) CentOS (7.0.1-42.el5.centos.1)
> ....
> (gdb) attach 14666
> ....
> 0x00002aaaab4c5c33 in rmaps_lama_prune_max_tree ()
>     at ./rmaps_lama_max_tree.c:814
>
> [mishima@manage demos]$ mpirun -np 2 -mca rmaps lama -report-bindings -mca
> rmaps_base_verbose 10 --display-map myprog
> [manage.cluster:21503] mca: base: components_register: registering rmaps
> components
> [manage.cluster:21503] mca: base: components_register: found loaded
> component lama
> [manage.cluster:21503] mca:rmaps:lama: Priority 0
> [manage.cluster:21503] mca:rmaps:lama: Map : NULL
> [manage.cluster:21503] mca:rmaps:lama: Bind : NULL
> [manage.cluster:21503] mca:rmaps:lama: MPPR : NULL
> [manage.cluster:21503] mca:rmaps:lama: Order : NULL
> [manage.cluster:21503] mca: base: components_register: component lama
> register function successful
> [manage.cluster:21503] mca: base: components_open: opening rmaps components
> [manage.cluster:21503] mca: base: components_open: found loaded component
> lama
> [manage.cluster:21503] mca:rmaps:select: checking available component lama
> [manage.cluster:21503] mca:rmaps:select: Querying component [lama]
> [manage.cluster:21503] [[23940,0],0]: Final mapper priorities
> [manage.cluster:21503] Mapper: lama Priority: 0
> [manage.cluster:21503] mca:rmaps: mapping job [23940,1]
> [manage.cluster:21503] mca:rmaps: creating new map for job [23940,1]
> [manage.cluster:21503] mca:rmaps: nprocs 2
> [manage.cluster:21503] mca:rmaps:lama: Mapping job [23940,1]
> [manage.cluster:21503] mca:rmaps:lama: Revised Parameters -----
> [manage.cluster:21503] mca:rmaps:lama: Map : csbnh
> [manage.cluster:21503] mca:rmaps:lama: Bind : 1c
> [manage.cluster:21503] mca:rmaps:lama: MPPR : (null)
> [manage.cluster:21503] mca:rmaps:lama: Order : s
> [manage.cluster:21503] mca:rmaps:lama: ---------------------------------
> [manage.cluster:21503] mca:rmaps:lama: ----- Binding : [1c]
> [manage.cluster:21503] mca:rmaps:lama: ----- Binding : 1 x Core
> [manage.cluster:21503] mca:rmaps:lama: ---------------------------------
> [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : [csbnh]
> [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : (0) Core (7
> vs 0) > [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : (1) Socket (3 > vs 1) > [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : (2) Board (1 > vs 3) > [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : (3) Machine (0 > vs 7) > [manage.cluster:21503] mca:rmaps:lama: ----- Mapping : (4) Hw. Thread (8 > vs 8) > [manage.cluster:21503] mca:rmaps:lama: --------------------------------- > [manage.cluster:21503] mca:rmaps:lama: ----- MPPR : [(null)] > [manage.cluster:21503] mca:rmaps:lama: --------------------------------- > [manage.cluster:21503] mca:rmaps:lama: ----- Ordering : [s] > [manage.cluster:21503] mca:rmaps:lama: ----- Ordering : Sequential > [manage.cluster:21503] mca:rmaps:lama: --------------------------------- > [manage.cluster:21503] AVAILABLE NODES FOR MAPPING: > [manage.cluster:21503] node: manage daemon: 0 > [manage.cluster:21503] mca:rmaps:lama: --------------------------------- > [manage.cluster:21503] mca:rmaps:lama: ----- Building the Max Tree... > [manage.cluster:21503] mca:rmaps:lama: --------------------------------- > [manage.cluster:21503] mca:rmaps:lama: ----- Converting Remote Tree: manage > > [mishima@manage demos]$ ompi_info | grep "C compiler family" > C compiler family name: GNU > [mishima@manage demos]$ mpirun -np 2 -mca rmaps lama myprog > Hello world from process 0 of 2 > Hello world from process 1 of 2 > > > > > On Dec 21, 2013, at 8:16 PM, tmish...@jcity.maeda.co.jp wrote: > > > > > > > > > > > Ralph, thanks. I'll try it on Tuseday. > > > > > > Let me confirm one thing. I don't put "-with-libevent" when I build > > > openmpi. > > > Is there any possibility to build with external libevent automatically? > > > > No - only happens if you direct it > > > > > > > > > > Tetsuya Mishima > > > > > > > > >> Not entirely sure - add "-mca rmaps_base_verbose 10 --display-map" to > > > your cmd line and let's see if it finishes the mapping. > > >> > > >> Unless you specifically built with an external libevent (which I > doubt), > > > there is no conflict. The connection issue is unlikely to be a factor > here > > > as it works when not using the lama mapper. > > >> > > >> > > >> On Dec 21, 2013, at 3:43 PM, tmish...@jcity.maeda.co.jp wrote: > > >> > > >>> > > >>> > > >>> Thank you, Ralph. > > >>> > > >>> Then, this problem should depend on our environment. > > >>> But, at least, inversion problem is not the cause because > > >>> node05 has normal hier order. > > >>> > > >>> I can not connect to our cluster now. Tuesday, going > > >>> back to my office, I'll send you further report. > > >>> > > >>> Before that, please let me know your configuration. I will > > >>> follow your configuation as much as possible. Our configuraion > > >>> is very simple, only -with-tm -with-ibverbs -disable-ipv6. > > >>> (on CentOS 5.8) > > >>> > > >>> The 1.7 series is a llite bit unstable on our cluster yet. > > >>> > > >>> Similar freezing(hang up) was observed with 1.7.3. At that > > >>> time, lama worked well but putting "-rank-by something" caused > > >>> same freezing (curiously, rank-by works with 1.7.4rc1). > > >>> I checked where it stopped using gdb, then I found that it > > >>> stopped to wait for event in a function of libevent(I can not > > >>> recall the name). > > >>> > > >>> Is this related to your "connection issue in the OOB > > >>> subsystem"? Or libevent version conflict? I guess these two > > >>> problems are related each other. 
They stopped at very early > > >>> stage before reaching mapping function because no message > > >>> appeared before freezing, which is my random guess. > > >>> > > >>> Could you give me any hint or comment? > > >>> > > >>> Regards, > > >>> Tetsuya Mishima > > >>> > > >>> > > >>>> It seems to be working fine for me: > > >>>> > > >>>> [rhc@bend001 tcp]$ mpirun -np 2 -host bend001 -report-bindings -mca > > >>> rmaps_lama_bind 1c -mca rmaps lama hostname > > >>>> bend001 > > >>>> [bend001:17005] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: > > >>> [../BB/../../../..][../../../../../..] > > >>>> [bend001:17005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: > > >>> [BB/../../../../..][../../../../../..] > > >>>> bend001 > > >>>> [rhc@bend001 tcp]$ > > >>>> > > >>>> (I also checked the internals using "-mca rmaps_base_verbose 10") so > > > it > > >>> could be your hier inversion causing problems again. Or it could be > > > that > > >>> you are hitting a connection issue we are seeing in > > >>>> some scenarios in the OOB subsystem - though if you are able to run > > > using > > >>> a non-lama mapper, that would seem unlikely. > > >>>> > > >>>> > > >>>> On Dec 20, 2013, at 8:09 PM, tmish...@jcity.maeda.co.jp wrote: > > >>>> > > >>>> > > >>>> > > >>>> Hi Ralph, > > >>>> > > >>>> Thank you very much. I tried many things such as: > > >>>> > > >>>> mpirun -np 2 -host node05 -report-bindings -mca rmaps lama -mca > > >>>> rmaps_lama_bind 1c myprog > > >>>> > > >>>> But every try failed. At least they were accepted by openmpi-1.7.3 > as > > > far > > >>>> as I remember. > > >>>> Anyway, please check it when you have a time, because using lama > comes > > >>> from > > >>>> my curiosity. > > >>>> > > >>>> Regards, > > >>>> Tetsuya Mishima > > >>>> > > >>>> > > >>>> I'll try to take a look at it - my expectation is that lama might > get > > >>>> stuck because you didn't tell it a pattern to map, and I doubt that > > > code > > >>>> path has seen much testing. > > >>>> > > >>>> > > >>>> On Dec 20, 2013, at 5:52 PM, tmish...@jcity.maeda.co.jp wrote: > > >>>> > > >>>> > > >>>> > > >>>> Hi Ralph, I'm glad to hear that, thanks. > > >>>> > > >>>> By the way, yesterday I tried to check how lama in 1.7.4rc treat > numa > > >>>> node. > > >>>> > > >>>> Then, even wiht this simple command line, it freezed without any > > >>>> massage: > > >>>> > > >>>> mpirun -np 2 -host node05 -mca rmaps lama myprog > > >>>> > > >>>> Could you check what happened? > > >>>> > > >>>> Is it better to open new thread or continue this thread? > > >>>> > > >>>> Regards, > > >>>> Tetsuya Mishima > > >>>> > > >>>> > > >>>> I'll make it work so that NUMA can be either above or below socket > > >>>> > > >>>> On Dec 20, 2013, at 2:57 AM, tmish...@jcity.maeda.co.jp wrote: > > >>>> > > >>>> > > >>>> > > >>>> Hi Brice, > > >>>> > > >>>> Thank you for your comment. I understand what you mean. > > >>>> > > >>>> My opinion was made just considering easy way to adjust the code for > > >>>> inversion of hierarchy in object tree. > > >>>> > > >>>> Tetsuya Mishima > > >>>> > > >>>> > > >>>> I don't think there's any such difference. > > >>>> Also, all these NUMA architectures are reported the same by hwloc, > > >>>> and > > >>>> therefore used the same in Open MPI. > > >>>> > > >>>> And yes, L3 and NUMA are topologically-identical on AMD Magny-Cours > > >>>> (and > > >>>> most recent AMD and Intel platforms). 
> > >>>> > > >>>> Brice > > >>>> > > >>>> > > >>>> > > >>>> Le 20/12/2013 11:33, tmish...@jcity.maeda.co.jp a écrit : > > >>>> > > >>>> Hi Ralph, > > >>>> > > >>>> The numa-node in AMD Mangy-Cours/Interlagos is so called cc(cache > > >>>> coherent)NUMA, > > >>>> which seems to be a little bit different from the traditional numa > > >>>> defined > > >>>> in openmpi. > > >>>> > > >>>> I notice that ccNUMA object is almost same as L3cache object. > > >>>> So "-bind-to l3cache" or "-map-by l3cache" is valid for what I want > > >>>> to > > >>>> do. > > >>>> Therefore, "do not touch it" is one of the solution, I think ... > > >>>> > > >>>> Anyway, mixing up these two types of numa is the problem. > > >>>> > > >>>> Regards, > > >>>> Tetsuya Mishima > > >>>> > > >>>> I can wait it'll be fixed in 1.7.5 or later, because putting > > >>>> "-bind-to > > >>>> numa" > > >>>> and "-map-by numa" at the same time works as a workaround. > > >>>> > > >>>> Thanks, > > >>>> Tetsuya Mishima > > >>>> > > >>>> Yeah, it will impact everything that uses hwloc topology maps, I > > >>>> fear. > > >>>> > > >>>> One side note: you'll need to add --hetero-nodes to your cmd > > >>>> line. > > >>>> If > > >>>> we > > >>>> don't see that, we assume that all the node topologies are > > >>>> identical > > >>>> - > > >>>> which clearly isn't true here. > > >>>> I'll try to resolve the hier inversion over the holiday - won't > > >>>> be > > >>>> for > > >>>> 1.7.4, but hopefully for 1.7.5 > > >>>> Thanks > > >>>> Ralph > > >>>> > > >>>> On Dec 18, 2013, at 9:44 PM, tmish...@jcity.maeda.co.jp wrote: > > >>>> > > >>>> > > >>>> I think it's normal for AMD opteron having 8/16 cores such as > > >>>> magny cours or interlagos. Because it usually has 2 numa nodes > > >>>> in a cpu(socket), numa-node can not include a socket. This type > > >>>> of hierarchy would be natural. > > >>>> > > >>>> (node03 is Dell PowerEdge R815 and maybe quite common, I guess) > > >>>> > > >>>> By the way, I think this inversion should affect rmaps_lama > > >>>> mapping. > > >>>> > > >>>> Tetsuya Mishima > > >>>> > > >>>> Ick - yeah, that would be a problem. I haven't seen that type > > >>>> of > > >>>> hierarchical inversion before - is node03 a different type of > > >>>> chip? > > >>>> Might take awhile for me to adjust the code to handle hier > > >>>> inversion... :-( > > >>>> On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote: > > >>>> > > >>>> > > >>>> Hi Ralph, > > >>>> > > >>>> I found the reason. I attached the main part of output with 32 > > >>>> core node(node03) and 8 core node(node05) at the bottom. > > >>>> > > >>>> From this information, socket of node03 includes numa-node. > > >>>> On the other hand, numa-node of node05 includes socket. > > >>>> The direction of object tree is opposite. > > >>>> > > >>>> Since "-map-by socket" may be assumed as default, > > >>>> for node05, "-bind-to numa and -map-by socket" means > > >>>> upward search. For node03, this should be downward. > > >>>> > > >>>> I guess that openmpi-1.7.4rc1 will always assume numa-node > > >>>> includes socket. Is it right? Then, upward search is assumed > > >>>> in orte_rmaps_base_compute_bindings even for node03 when I > > >>>> put "-bind-to numa and -map-by socket" option. 
> > >>>> > > >>>> [node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage > > >>>> [node03.cluster:15508] mca:rmaps: compute bindings for job > > >>>> [38286,1] > > >>>> with > > >>>> policy NUMA > > >>>> [node03.cluster:15508] mca:rmaps: bind upwards for job > > >>>> [38286,1] > > >>>> with > > >>>> bindings NUMA > > >>>> [node03.cluster:15508] [[38286,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Machine > > >>>> > > >>>> That's the reason of this trouble. Therefore, adding "-map-by > > >>>> core" > > >>>> works. > > >>>> (mapping pattern seems to be strange ...) > > >>>> > > >>>> [mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by > > >>>> core > > >>>> -report-bindings myprog > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode> >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> 
[node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Cache > > >>>> [node03.cluster:15885] [[38679,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> NUMANode > > >>>> [node03.cluster:15885] MCW rank 2 bound to socket 0[core 0[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > >>>> cket 0[core 3[hwt 0]]: > > >>>> > > >>>> > > >>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 3 bound to socket 0[core 0[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > >>>> cket 0[core 3[hwt 0]]: > > >>>> > > >>>> > > >>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 4 bound to socket 0[core 4[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > > >>>> cket 0[core 7[hwt 0]]: > > >>>> > > >>>> > > >>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 5 bound to socket 0[core 4[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > > >>>> cket 0[core 7[hwt 0]]: > > >>>> > > >>>> > > >>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 6 bound to socket 0[core 4[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > > >>>> cket 0[core 7[hwt 0]]: > > >>>> > > >>>> > > >>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 7 bound to socket 0[core 4[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so > > >>>> cket 0[core 7[hwt 0]]: > > >>>> > > >>>> > > >>>> [././././B/B/B/B][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 0 bound to socket 0[core 0[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > >>>> cket 0[core 3[hwt 0]]: > > >>>> > > >>>> > > >>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] > > >>>> [node03.cluster:15885] MCW rank 1 bound to socket 0[core 0[hwt > > >>>> 0]], > > >>>> socket > > >>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so > > >>>> cket 0[core 3[hwt 0]]: > > >>>> > > >>>> > > >>>> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.] 
> > >>>> Hello world from process 6 of 8 > > >>>> Hello world from process 5 of 8 > > >>>> Hello world from process 0 of 8 > > >>>> Hello world from process 7 of 8 > > >>>> Hello world from process 3 of 8 > > >>>> Hello world from process 4 of 8 > > >>>> Hello world from process 2 of 8 > > >>>> Hello world from process 1 of 8 > > >>>> > > >>>> Regards, > > >>>> Tetsuya Mishima > > >>>> > > >>>> [node03.cluster:15508] Type: Machine Number of child objects: > > >>>> 4 > > >>>> Name=NULL > > >>>> total=132358820KB > > >>>> Backend=Linux > > >>>> OSName=Linux > > >>>> OSRelease=2.6.18-308.16.1.el5 > > >>>> OSVersion="#1 SMP Tue Oct 2 22:01:43 EDT 2012" > > >>>> Architecture=x86_64 > > >>>> Cpuset: 0xffffffff > > >>>> Online: 0xffffffff > > >>>> Allowed: 0xffffffff > > >>>> Bind CPU proc: TRUE > > >>>> Bind CPU thread: TRUE > > >>>> Bind MEM proc: FALSE > > >>>> Bind MEM thread: TRUE > > >>>> Type: Socket Number of child objects: 2 > > >>>> Name=NULL > > >>>> total=33071780KB > > >>>> CPUModel="AMD Opteron(tm) Processor 6136" > > >>>> Cpuset: 0x000000ff > > >>>> Online: 0x000000ff > > >>>> Allowed: 0x000000ff > > >>>> Type: NUMANode Number of child objects: 1 > > >>>> > > >>>> > > >>>> [node05.cluster:21750] Type: Machine Number of child objects: > > >>>> 2 > > >>>> Name=NULL > > >>>> total=33080072KB > > >>>> Backend=Linux>>>> OSName=Linux > > >>>> OSRelease=2.6.18-308.16.1.el5 > > >>>> OSVersion="#1 SMP Tue Oct 2 22:01:43 EDT 2012" > > >>>> Architecture=x86_64 > > >>>> Cpuset: 0x000000ff > > >>>> Online: 0x000000ff > > >>>> Allowed: 0x000000ff > > >>>> Bind CPU proc: TRUE > > >>>> Bind CPU thread: TRUE > > >>>> Bind MEM proc: FALSE > > >>>> Bind MEM thread: TRUE > > >>>> Type: NUMANode Number of child objects: 1 > > >>>> Name=NULL > > >>>> local=16532232KB > > >>>> total=16532232KB > > >>>> Cpuset: 0x0000000f > > >>>> Online: 0x0000000f > > >>>> Allowed: 0x0000000f > > >>>> Type: Socket Number of child objects: 1 > > >>>> > > >>>> > > >>>> Hmm...try adding "-mca rmaps_base_verbose 10 -mca > > >>>> ess_base_verbose > > >>>> 5" > > >>>> to > > >>>> your cmd line and let's see what it thinks it found. > > >>>> > > >>>> On Dec 18, 2013, at 6:55 PM, tmish...@jcity.maeda.co.jp > > >>>> wrote: > > >>>> > > >>>> > > >>>> Hi, I report one more problem with openmpi-1.7.4rc1, > > >>>> which is more serious. > > >>>> > > >>>> For our 32 core nodes(AMD magny cours based) which has > > >>>> 8 numa-nodes, "-bind-to numa" does not work. Without > > >>>> this option, it works. For your infomation, at the > > >>>> bottom of this mail, I added the lstopo information > > >>>> of the node. > > >>>> > > >>>> Regards, > > >>>> Tetsuya Mishima > > >>>> > > >>>> [mishima@manage ~]$ qsub -I -l nodes=1:ppn=32>> qsub: waiting for > job > > > 8352.manage.cluster to start > > >>>> qsub: job 8352.manage.cluster ready > > >>>> > > >>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings > > >>>> -bind-to > > >>>> numa > > >>>> myprog > > >>>> [node03.cluster:15316] [[37582,0],0] bind:upward target > > >>>> NUMANode > > >>>> type > > >>>> Machine > > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>> > > > > -------------------------------------------------------------------------- > > >>>> A request was made to bind to NUMA, but an appropriate > > >>>> target > > >>>> could > > >>>> not > > >>>> be found on node node03. 
> > >>>> > > >>>> > > >>>> > > >>>> > > >>>> > > >>> > > > > -------------------------------------------------------------------------- > > >>>> [mishima@node03 ~]$ cd ~/Desktop/openmpi-1.7/demos/ > > >>>> [mishima@node03 demos]$ mpirun -np 8 -report-bindings myprog > > >>>> [node03.cluster:15282] MCW rank 2 bound to socket 1[core 8 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./././././././.][B/././././././.][./././././././.][ > > >>>> ./././././././.]>>>>>>>>>>>> [node03.cluster:15282] MCW rank > > >>>> 3 bound to socket 1[core 9[hwt > > >>>> 0]]: > > >>>> [./././././././.][./B/./././././.][./././././././.][ > > >>>> ./././././././.] > > >>>> [node03.cluster:15282] MCW rank 4 bound to socket 2[core 16 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./././././././.][./././././././.][B/././././././.] > > >>>> [./././././././.] > > >>>> [node03.cluster:15282] MCW rank 5 bound to socket 2[core 17 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./././././././.][./././././././.][./B/./././././.] > > >>>> [./././././././.] > > >>>> [node03.cluster:15282] MCW rank 6 bound to socket 3[core 24 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./././././././.][./././././././.][./././././././.] > > >>>> [B/././././././.] > > >>>> [node03.cluster:15282] MCW rank 7 bound to socket 3[core 25 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./././././././.][./././././././.][./././././././.] > > >>>> [./B/./././././.] > > >>>> [node03.cluster:15282] MCW rank 0 bound to socket 0[core 0 > > >>>> [hwt > > >>>> 0]]: > > >>>> [B/././././././.][./././././././.][./././././././.][ > > >>>> ./././././././.] > > >>>> [node03.cluster:15282] MCW rank 1 bound to socket 0[core 1 > > >>>> [hwt > > >>>> 0]]: > > >>>> [./B/./././././.][./././././././.][./././././././.][ > > >>>> ./././././././.] > > >>>> Hello world from process 2 of 8 > > >>>> Hello world from process 5 of 8 > > >>>> Hello world from process 4 of 8 > > >>>> Hello world from process 3 of 8>>>>>>>>>> Hello world from > > >>>> process 1 of 8 > > >>>> Hello world from process 7 of 8 > > >>>> Hello world from process 6 of 8 > > >>>> Hello world from process 0 of 8 > > >>>> [mishima@node03 demos]$ ~/opt/hwloc/bin/lstopo-no-graphics > > >>>> Machine (126GB) > > >>>> Socket L#0 (32GB) > > >>>> NUMANode L#0 (P#0 16GB) + L3 L#0 (5118KB) > > >>>> L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 > > >>>> + > > >>>> PU > > >>>> L#0 > > >>>> (P#0) > > >>>> L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 > > >>>> + > > >>>> PU > > >>>> L#1 > > >>>> (P#1) > > >>>> L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 > > >>>> + > > >>>> PU > > >>>> L#2 > > >>>> (P#2) > > >>>> L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 > > >>>> + > > >>>> PU > > >>>> L#3 > > >>>> (P#3) > > >>>> NUMANode L#1 (P#1 16GB) + L3 L#1 (5118KB) > > >>>> L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 > > >>>> + > > >>>> PU > > >>>> L#4 > > >>>> (P#4) > > >>>> L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 > > >>>> + > > >>>> PU > > >>>> L#5 > > >>>> (P#5) > > >>>> L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 > > >>>> + > > >>>> PU > > >>>> L#6 > > >>>> (P#6) > > >>>> L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 > > >>>> + > > >>>> PU>>>>>> L#7 > > >>>> (P#7) > > >>>> Socket L#1 (32GB) > > >>>> NUMANode L#2 (P#6 16GB) + L3 L#2 (5118KB) > > >>>> L2 L#8 (512KB) + L1d L#8 (64KB) + L1i L#8 (64KB) + Core L#8 > > >>>> + > > >>>> PU > > >>>> L#8 > > >>>> (P#8) > > >>>> L2 L#9 (512KB) + L1d L#9 (64KB) + L1i L#9 (64KB) + Core L#9 > > >>>> + > > >>>> PU > 
> >>>> L#9 > > >>>> (P#9) > > >>>> L2 L#10 (512KB) + L1d L#10 (64KB) + L1i L#10 (64KB) + Core > > >>>> L#10 > > >>>> + > > >>>> PU > > >>>> L#10 (P#10) > > >>>> L2 L#11 (512KB) + L1d L#11 (64KB) + L1i L#11 (64KB) + Core > > >>>> L#11 > > >>>> + > > >>>> PU > > >>>> L#11 (P#11) > > >>>> NUMANode L#3 (P#7 16GB) + L3 L#3 (5118KB) > > >>>> L2 L#12 (512KB) + L1d L#12 (64KB) + L1i L#12 (64KB) + Core > > >>>> L#12 > > >>>> + > > >>>> PU > > >>>> L#12 (P#12) > > >>>> L2 L#13 (512KB) + L1d L#13 (64KB) + L1i L#13 (64KB) + Core > > >>>> L#13 > > >>>> + > > >>>> PU > > >>>> L#13 (P#13) > > >>>> L2 L#14 (512KB) + L1d L#14 (64KB) + L1i L#14 (64KB) + Core > > >>>> L#14 > > >>>> + > > >>>> PU > > >>>> L#14 (P#14) > > >>>> L2 L#15 (512KB) + L1d L#15 (64KB) + L1i L#15 (64KB) + Core > > >>>> L#15 > > >>>> + > > >>>> PU > > >>>> L#15 (P#15) > > >>>> Socket L#2 (32GB) > > >>>> NUMANode L#4 (P#4 16GB) + L3 L#4 (5118KB) > > >>>> L2 L#16 (512KB) + L1d L#16 (64KB) + L1i L#16 (64KB) + Core > > >>>> L#16 > > >>>> + > > >>>> PU > > >>>> L#16 (P#16) > > >>>> L2 L#17 (512KB) + L1d L#17 (64KB) + L1i L#17 (64KB) + Core > > >>>> L#17 > > >>>> + > > >>>> PU > > >>>> L#17 (P#17)> >>>>> L2 L#18 (512KB) + L1d L#18 (64KB) + > > >>>> L1i > > >>>> L#18 (64KB) + Core L#18 > > >>>> + > > >>>> PU > > >>>> L#18 (P#18) > > >>>> L2 L#19 (512KB) + L1d L#19 (64KB) + L1i L#19 (64KB) + Core > > >>>> L#19 > > >>>> + > > >>>> PU > > >>>> L#19 (P#19) > > >>>> NUMANode L#5 (P#5 16GB) + L3 L#5 (5118KB) > > >>>> L2 L#20 (512KB) + L1d L#20 (64KB) + L1i L#20 (64KB) + Core > > >>>> L#20 > > >>>> + > > >>>> PU > > >>>> L#20 (P#20) > > >>>> L2 L#21 (512KB) + L1d L#21 (64KB) + L1i L#21 (64KB) + Core > > >>>> L#21 > > >>>> + > > >>>> PU > > >>>> L#21 (P#21) > > >>>> L2 L#22 (512KB) + L1d L#22 (64KB) + L1i L#22 (64KB) + Core > > >>>> L#22 > > >>>> + > > >>>> PU > > >>>> L#22 (P#22) > > >>>> L2 L#23 (512KB) + L1d L#23 (64KB) + L1i L#23 (64KB) + Core > > >>>> L#23 > > >>>> + > > >>>> PU > > >>>> L#23 (P#23) > > >>>> Socket L#3 (32GB) > > >>>> NUMANode L#6 (P#2 16GB) + L3 L#6 (5118KB) > > >>>> L2 L#24 (512KB) + L1d L#24 (64KB) + L1i L#24 (64KB) + Core > > >>>> L#24 > > >>>> + > > >>>> PU > > >>>> L#24 (P#24)>>>>> L2 L#25 (512KB) + L1d L#25 (64KB) + L1i > > >>>> L#25 > > >>>> (64KB) + Core L#25 + > > >>>> PU > > >>>> L#25 (P#25) > > >>>> L2 L#26 (512KB) + L1d L#26 (64KB) + L1i L#26 (64KB) + Core > > >>>> L#26 > > >>>> + > > >>>> PU > > >>>> L#26 (P#26) > > >>>> L2 L#27 (512KB) + L1d L#27 (64KB) + L1i L#27 (64KB) + Core > > >>>> L#27 > > >>>> + > > >>>> PU > > >>>> L#27 (P#27) > > >>>> NUMANode L#7 (P#3 16GB) + L3 L#7 (5118KB) > > >>>> L2 L#28 (512KB) + L1d L#28 (64KB) + L1i L#28 (64KB) + Core > > >>>> L#28 > > >>>> + > > >>>> PU > > >>>> L#28 (P#28) > > >>>> L2 L#29 (512KB) + L1d L#29 (64KB) + L1i L#29 (64KB) + Core > > >>>> L#29 > > >>>> + > > >>>> PU > > >>>> L#29 (P#29) > > >>>> L2 L#30 (512KB) + L1d L#30 (64KB) + L1i L#30 (64KB) + Core > > >>>> L#30 > > >>>> + > > >>>> PU > > >>>> L#30 (P#30) > > >>>> L2 L#31 (512KB) + L1d L#31 (64KB) + L1i L#31 (64KB) + Core > > >>>> L#31 > > >>>> + > > >>>> PU > > >>>> L#31 (P#31) > > >>>> HostBridge L#0 > > >>>> PCIBridge > > >>>> PCI 14e4:1639 > > >>>> Net L#0 "eth0" > > >>>> PCI 14e4:1639 > > >>>> Net L#1 "eth1" > > >>>> PCIBridge > > >>>> PCI 14e4:1639 > > >>>> Net L#2 "eth2" > > >>>> PCI 14e4:1639 > > >>>> Net L#3 "eth3" > > >>>> PCIBridge > > >>>> PCIBridge > > >>>> PCIBridge > > >>>> PCI 1000:0072 > > >>>> Block L#4 "sdb" > > >>>> Block L#5 "sda" > > >>>> PCI 1002:4390 > > >>>> Block L#6 "sr0" > > >>>> 
PCIBridge > > >>>> PCI 102b:0532 > > >>>> HostBridge L#7 > > >>>> PCIBridge > > >>>> PCI 15b3:6274 > > >>>> Net L#7 "ib0" > > >>>> OpenFabrics L#8 "mthca0" > > >>>>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users