> On Jun 1, 2015, at 5:09 PM, Ralph Castain wrote:
>
> This probably isn’t very helpful, but fwiw: we added an automatic
> “fingerprint” capability in the later OMPI versions just to detect things
> like this. If the fingerprint of a backend node doesn’t match the head node,
> we automatically assume hetero-nodes. It isn’t foolproof, but it would
> have …
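
As a rough manual stand-in for that fingerprint check, one can export each
node's hwloc topology to XML and diff the results (a sketch; "head01" and
"node07" are placeholder hostnames):

    ssh head01 lstopo --of xml > head01.xml
    ssh node07 lstopo --of xml > node07.xml
    diff head01.xml node07.xml   # any difference here is the kind of mismatch the fingerprint guards against
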
> On Apr 30, 2015, at 1:16 PM, Noam Bernstein wrote:
>
>> On Apr 30, 2015, at 12:03 PM, Ralph Castain wrote:
>>
>> The planning is pretty simple: at startup, mpirun launches a daemon on each
>> node. If --hetero-nodes is provided, each daemon returns the topology
>> discovered by hwloc - otherwise, only the first daemon does. Mpirun then
>> assigns procs to each node in a round-robin fashion (assuming you haven’t
>> told …
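
As a concrete (hypothetical) example of such a launch, with placeholder host
names and binary, the standard reporting options show where the resulting
procs land and what they get bound to:

    mpirun --hetero-nodes --bind-to core --report-bindings --display-map \
           -np 32 -host compute-1-1,compute-1-2 ./my_app

With --hetero-nodes, every node's daemon reports its own hwloc topology, so
the map printed by --display-map reflects per-node differences instead of
assuming all nodes look like the first one.
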
> On Apr 29, 2015, at 5:59 PM, Ralph Castain wrote:
>
> Try adding --hetero-nodes to the cmd line and see if that helps resolve the
> problem. Of course, if all the machines are identical, then it won’t …

They are identical, and the problem is new. That’s what’s most mysterious
about it.
Can …

> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote:
>
> Nothing wrong in that XML. I don't see what could be happening besides a
> node rebooting with hyper-threading enabled for random reasons.
> Please run "lstopo foo.xml" again on the node next time you get the OMPI
> failure (assuming you get …
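
A simple way to have that data ready when a failure happens (sketch only; the
output path and naming are arbitrary) is to dump the topology from the job
script on every node before the MPI launch:

    # run once per node, e.g. from a prologue or per-node wrapper script
    lstopo /tmp/topo-$(hostname)-$(date +%Y%m%d-%H%M%S).xml

lstopo writes XML when the filename ends in .xml, and the saved file can be
re-examined later with "lstopo --input <file>.xml" or attached to a report.
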
> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote:
>
> Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since
> 16 doesn't exist at all.
> Can you run "lstopo foo.xml" on one node where it failed, and send the
> foo.xml that got generated? Just want to make sure we don't have …
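
Since the offending cpuset was 0,16 on machines that should only have PUs
0-15, it is also worth checking what the OS itself exposes at the time of the
failure (standard Linux/hwloc queries, shown as a sketch):

    cat /sys/devices/system/cpu/online   # CPUs the kernel has online, e.g. 0-15 vs 0-31
    lstopo -p --only pu                  # physical PU indexes as hwloc sees them

If hyper-threading were somehow enabled (say, after a BIOS reset and reboot),
the online range would grow and a PU P#16 would appear as the sibling of P#0.
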
On 29/04/2015 14:53, Noam Bernstein wrote:
> They have dual 8-core processors, so the 16 cores are physical ones. lstopo
> output looks identical on nodes where this does happen, and nodes where it
> never does. My next step is to see if I can reproduce the behavior at will -
> I’m still …

> On Apr 28, 2015, at 4:54 PM, Brice Goglin wrote:
>
> Hello,
> Can you build hwloc and run lstopo on these nodes to check that everything
> looks similar?
> You have hyperthreading enabled on all nodes, and you're trying to bind
> processes to entire cores, right?
> Does 0,16 correspond to two hyperthreads within a single core on these
> nodes? (lstopo -p should …
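
On a given node, that last question can be answered with the lstopo call Brice
mentions plus a standard sysfs query (a sketch):

    lstopo -p        # with HT on, each core shows two PUs (physical indexes)
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
                     # prints "0" with HT off, something like "0,16" with HT on

If the sysfs file ever reports "0,16" on one of these nodes, hyper-threading
has been switched on there, which would explain OMPI suddenly treating 0,16 as
one core's cpuset.
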
Hi all - we’re having a new error, despite the fact that as far as I can tell
we haven’t changed anything recently, and I was wondering if anyone had any
ideas as to what might be going on.
The symptom is that we sometimes get an error when starting a new mpi job:
Open MPI tried to bind a new process …