This probably isn’t very helpful at this point, but FWIW: we added an automatic 
“fingerprint” capability in later OMPI versions precisely to detect things like 
this. If the fingerprint of a backend node doesn’t match the head node, we 
automatically assume hetero-nodes. It isn’t foolproof, but it would have picked 
this one up.
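
For anyone stuck on an older version, a quick manual check can substitute for 
the fingerprint. This is just a sketch (hostnames are placeholders, and it 
assumes passwordless ssh and lscpu on every node):

    # Compare core/thread counts across nodes; any node that differs
    # from the head node is suspect.
    for node in node01 node02 node03; do
        echo "== $node =="
        ssh "$node" 'lscpu | grep -E "^CPU\(s\)|Thread"'
    done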

Sorry you had that trouble.
Ralph


> On Jun 1, 2015, at 1:53 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> 
> wrote:
> 
>> On Apr 30, 2015, at 1:16 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> 
>> wrote:
>> 
>>> On Apr 30, 2015, at 12:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>> The planning is pretty simple: at startup, mpirun launches a daemon on each 
>>> node. If --hetero-nodes is provided, each daemon returns the topology 
>>> discovered by hwloc - otherwise, only the first daemon does. Mpirun then 
>>> assigns procs to each node in a round-robin fashion (assuming you haven’t 
>>> told it something different).
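>
> (A minimal sketch of that flag in use, with a placeholder executable and
> host list:
>
>     mpirun --hetero-nodes --report-bindings -np 32 -host node01,node02 ./a.out
>
> which makes every daemon report its own topology instead of inheriting the
> head node’s.)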
> 
> Now that I’ve solved my problem, I thought I’d summarize it on the list, as a 
> cautionary tale.  I’d like to thank everyone who helped me, too.
> 
> Basically, this information from Ralph should have clued me in, but it didn’t.  
> It turns out that the nodes were only _supposed_ to be identical, so I 
> thought the --hetero-nodes flag was irrelevant.  As it happens, 
> hyperthreading had been turned on on one of the nodes.  When that node was the 
> head node, the binding mask assumed 32 (HT) cores, and when the mask reached a 
> node that had only 16 (real) cores and no HT, binding failed.  Turning off HT 
> on that one rogue node fixed the problem.
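>
> (In hindsight, the quickest way to spot the rogue node is hwloc itself: on
> the HT-enabled node, lstopo shows two PUs under every core, roughly like
> this (numbering illustrative):
>
>     Core L#0
>       PU L#0 (P#0)
>       PU L#1 (P#16)
>
> while the healthy nodes show a single PU per core.)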
> 
> Things that helped make this hard to debug:
> 1. The node that was problematic was not the one that failed.  The node next 
> to it (since our scheduler by default assigns adjacent nodes) was the one that 
> appeared to fail.  This is just the nature of the problem.
> 2. Open MPI’s binding report appears to be printed only after binding is 
> complete, or maybe the output simply never gets flushed, since I never got 
> output (via hwloc_base_report_bindings) from the node that was actually 
> failing to bind.  And I didn’t know the format of the reported binding, so I 
> didn’t know that “BB” meant both HT virtual cores were bound, and that the 
> numbers (on the HT node with 32 virtual cores) ran 0-15, not 0-31.
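>
> For anyone else decoding that output: each bracketed group is a socket, each 
> slash-separated slot is a core, and each character is a hardware thread, with 
> “B” marking a bound thread.  A line looks roughly like this (illustrative, 
> for a 2-socket, 8-core-per-socket node with HT):
>
>     MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
>         [BB/../../../../../../..][../../../../../../../..]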
> 
> Anyway, problem solved, and thanks again for the help.
> 
>
> Noam
