Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Jun 1, 2015, at 5:09 PM, Ralph Castain wrote: > This probably isn’t very helpful, but fwiw: we added an automatic “fingerprint” capability in the later OMPI versions just to detect things like this. If the fingerprint of a backend node doesn’t match the head node, we automatically…

Re: [OMPI users] new hwloc error

2015-06-01 Thread Ralph Castain
This probably isn’t very helpful, but fwiw: we added an automatic “fingerprint” capability in the later OMPI versions just to detect things like this. If the fingerprint of a backend node doesn’t match the head node, we automatically assume hetero-nodes. It isn’t foolproof, but it would have pic…

Re: [OMPI users] new hwloc error

2015-06-01 Thread Noam Bernstein
> On Apr 30, 2015, at 1:16 PM, Noam Bernstein wrote: >> On Apr 30, 2015, at 12:03 PM, Ralph Castain wrote: >> The planning is pretty simple: at startup, mpirun launches a daemon on each node. If --hetero-nodes is provided, each daemon returns the topology discovered by hwloc - ot…

Re: [OMPI users] new hwloc error

2015-04-30 Thread Ralph Castain
The planning is pretty simple: at startup, mpirun launches a daemon on each node. If --hetero-nodes is provided, each daemon returns the topology discovered by hwloc - otherwise, only the first daemon does. Mpirun then assigns procs to each node in a round-robin fashion (assuming you haven’t told…
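
A minimal sketch of how that flag is passed on the command line; the process count, host file name, and executable below are placeholders, not taken from this thread:

    # ask every daemon to report its own hwloc topology instead of
    # assuming all backend nodes match the first one
    mpirun --hetero-nodes -np 32 --hostfile myhosts ./my_app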

Re: [OMPI users] new hwloc error

2015-04-30 Thread Noam Bernstein
> On Apr 29, 2015, at 5:59 PM, Ralph Castain wrote: > Try adding --hetero-nodes to the cmd line and see if that helps resolve the problem. Of course, if all the machines are identical, then it won’t. They are identical, and the problem is new. That’s what’s most mysterious about it. Can…

Re: [OMPI users] new hwloc error

2015-04-29 Thread Ralph Castain
Try adding --hetero-nodes to the cmd line and see if that helps resolve the problem. Of course, if all the machines are identical, then it won’t. > On Apr 29, 2015, at 1:43 PM, Brice Goglin wrote: > On 29/04/2015 22:25, Noam Bernstein wrote: >>> On Apr 29, 2015, at 4:09 PM, Brice Goglin wr…

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
On 29/04/2015 22:25, Noam Bernstein wrote: >> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote: >> Nothing wrong in that XML. I don't see what could be happening besides a node rebooting with hyper-threading enabled for random reasons. >> Please run "lstopo foo.xml" again on the node next…
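
One quick way to check whether a node silently came back up with hyper-threading enabled - not something suggested in the thread, just a generic cross-check - is to compare thread and core counts:

    # "Thread(s) per core" should be 1 if hyper-threading is off
    lscpu | egrep 'Thread|Core|Socket'
    # the same counts reported by hwloc itself
    hwloc-calc --number-of pu machine:0
    hwloc-calc --number-of core machine:0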

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 4:09 PM, Brice Goglin wrote: > Nothing wrong in that XML. I don't see what could be happening besides a node rebooting with hyper-threading enabled for random reasons. > Please run "lstopo foo.xml" again on the node next time you get the OMPI failure (assuming you get…

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
On 29/04/2015 18:55, Noam Bernstein wrote: >> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote: >> Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since 16 doesn't exist at all. >> Can you run "lstopo foo.xml" on one node where it failed, and send the foo.xml that go…
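
For reference, exporting and diffing the topologies the way Brice suggests could look roughly like this; the file names are placeholders:

    # on a node where the binding failure occurred
    lstopo failed-node.xml
    # on a node that has never shown the problem
    lstopo good-node.xml
    # any topology difference between the two nodes shows up here
    diff failed-node.xml good-node.xml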

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 29, 2015, at 12:47 PM, Brice Goglin wrote: > Thanks. It's indeed normal that OMPI failed to bind to cpuset 0,16 since 16 doesn't exist at all. > Can you run "lstopo foo.xml" on one node where it failed, and send the foo.xml that got generated? Just want to make sure we don't have…

Re: [OMPI users] new hwloc error

2015-04-29 Thread Brice Goglin
On 29/04/2015 14:53, Noam Bernstein wrote: > They’re dual 8-core processors, so the 16 cores are physical ones. lstopo output looks identical on nodes where this does happen, and nodes where it never does. My next step is to see if I can reproduce the behavior at will - I’m still n…

Re: [OMPI users] new hwloc error

2015-04-29 Thread Noam Bernstein
> On Apr 28, 2015, at 4:54 PM, Brice Goglin wrote: > Hello, > Can you build hwloc and run lstopo on these nodes to check that everything looks similar? > You have hyperthreading enabled on all nodes, and you're trying to bind processes to entire cores, right? > Does 0,16 correspond to two…

Re: [OMPI users] new hwloc error

2015-04-28 Thread Brice Goglin
Hello. Can you build hwloc and run lstopo on these nodes to check that everything looks similar? You have hyperthreading enabled on all nodes, and you're trying to bind processes to entire cores, right? Does 0,16 correspond to two hyperthreads within a single core on these nodes? (lstopo -p should…
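
A sketch of the checks being asked for here, with a Linux sysfs lookup added as an extra (assumed) cross-check:

    # show the topology with physical (OS) indexes, so the PU numbers
    # match the cpuset that Open MPI reports in its error
    lstopo -p
    # list the hardware threads sharing core 0; an answer of "0,16"
    # would mean PUs 0 and 16 are hyperthread siblings of one core
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list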

[OMPI users] new hwloc error

2015-04-28 Thread Noam Bernstein
Hi all - we’re having a new error, despite the fact that as far as I can tell we haven’t changed anything recently, and I was wondering if anyone had any ideas as to what might be going on. The symptom is that we sometimes get an error when starting a new MPI job: Open MPI tried to bind a new p…
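
Not part of the original report, but one common way to see exactly what Open MPI is trying to bind each rank to is to re-run with binding reports enabled; the process count and executable are placeholders:

    # print the cpuset each rank gets bound to on each node
    mpirun --report-bindings --bind-to core -np 16 ./my_app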