> On Apr 30, 2015, at 1:16 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>
>> On Apr 30, 2015, at 12:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> The planning is pretty simple: at startup, mpirun launches a daemon on each
>> node. If --hetero-nodes is provided, each daemon returns the topology
>> discovered by hwloc - otherwise, only the first daemon does. Mpirun then
>> assigns procs to each node in a round-robin fashion (assuming you haven't
>> told it something different).
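For anyone who finds this thread later: both of the options that came up here are ordinary mpirun flags, so a minimal invocation along those lines would look roughly like the line below (the process count and executable name are just placeholders, not what I actually ran):

    mpirun --hetero-nodes --report-bindings -np 32 ./my_program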
Now that I’ve solved my problem, I thought I’d summarize it on the list as a cautionary tale. I’d also like to thank everyone who helped me.

Basically, the information from Ralph above should have clued me in, but didn’t. It turns out that the nodes were only _supposed_ to be identical, and since I assumed they actually were, I thought the --hetero-nodes option was irrelevant. As it happens, hyperthreading had been turned on on one of the nodes. When that node was the head node of the job, the binding mask assumed 32 (HT) cores, and when that mask was applied to a node with only 16 (real) cores and no HT, the binding failed. Turning off HT on that one rogue node fixed the problem.

Things that made this hard to debug:

1. The node that was problematic was not the one that failed. The node next to it (since our scheduler by default assigns adjacent nodes) was the one that reported the failure. This is just the nature of the problem.

2. Open MPI’s binding report appears to be printed only after binding completes, or perhaps the output simply isn’t flushed on failure, because I never got any output (with hwloc_report_binding) from the node that was actually failing to bind. And I didn’t know the format of the reported bindings, so I didn’t realize that “BB” meant both HT virtual cores were bound, and that the core numbers (on the HT node with 32 virtual cores) ran from 0-15, not 0-31.

Anyway, problem solved, and thanks again for the help.

Noam
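P.S. In case it saves someone else the same confusion: the binding report lines look roughly like the one below. This is an illustrative line, not copied from my run; the hostname and PID are placeholders, and the exact layout may differ between Open MPI versions.

    [node02:12345] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../..][../../../../../../../..]

Each slot between the slashes is a core, and a “B” marks a bound hardware thread, so “BB” means both hyperthreads of that core are in the binding mask; on the 32-thread HT node the core numbers still only run 0-15.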