On 03/03/2014 05:06 PM, Brice Goglin wrote:
On 03/03/2014 23:02, Gus Correa wrote:
I rebooted the node and ran hwloc-gather-topology again.
This time it didn't throw any errors on the terminal window,
which may be a good sign.
[root@node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.
On 03/03/2014 23:02, Gus Correa wrote:
> I rebooted the node and ran hwloc-gather-topology again.
> This time it didn't throw any errors on the terminal window,
> which may be a good sign.
>
> [root@node14 ~]# hwloc-gather-topology /tmp/`date +"%Y%m%d%H%M"`.$(uname -n)
> Hierarchy gathered in
Hi Brice
Here are answers to your questions,
and my latest attempt to solve the problem:
1) Kernel version:
The nodes with new motherboards (node14 and node16) have the
same kernel as the nodes with original motherboards (e.g. node15),
as they were cloned from the same node image:
[root@node14
On 28/02/2014 21:30, Gus Correa wrote:
> Hi Brice
>
> The (pdf) output of lstopo shows one L1d (16k) for each core,
> and one L1i (64k) for each *pair* of cores.
> Is this wrong?
It's correct. AMD uses this "dual-core compute unit" where L2 and L1i
are shared but L1d isn't.
> BTW, if there are
On 28.02.2014 at 21:23, Brice Goglin wrote:
> OK, the problem is that node14's BIOS reports invalid NUMA info. It properly
> detects 2 sockets with 16 cores each. But it reports 2 NUMA nodes total,
> instead of 2 per socket (4 total). And hwloc warns because the cores
> contained in these NUMA
On 02/28/2014 03:32 AM, Brice Goglin wrote:
On 28/02/2014 02:48, Ralph Castain wrote:
Remember, hwloc doesn't actually "sense" hardware - it just parses files in the
/proc area. So if something is garbled in those files, hwloc will report errors. Doesn't
mean anything is wrong with the hard
OK, the problem is that node14's BIOS reports invalid NUMA info. It
properly detects 2 sockets with 16 cores each. But it reports 2 NUMA
nodes total, instead of 2 per socket (4 total). And hwloc warns because
the cores contained in these NUMA nodes are incompatible with sockets:
socket0 contains 0-
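The kind of mismatch described above can be illustrated with a minimal sketch. It is not hwloc's actual check; it only assumes that a socket and a NUMA node are "compatible" when their core sets are either disjoint or one fully contains the other. The function names and the 8-core layout are hypothetical and chosen for illustration; they are not node14's real numbering. The string format parsed here is the cpulist notation Linux uses in files such as /sys/devices/system/node/nodeN/cpulist.

```python
def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def find_conflicts(sockets, numa_nodes):
    """Return (socket, node) pairs whose core sets overlap only partially.

    Assumption for this sketch: two objects fit in one hierarchy when
    their core sets are disjoint or one is a subset of the other.
    """
    conflicts = []
    for s_name, s_cpus in sockets.items():
        for n_name, n_cpus in numa_nodes.items():
            overlap = s_cpus & n_cpus
            nested = s_cpus <= n_cpus or n_cpus <= s_cpus
            if overlap and not nested:
                conflicts.append((s_name, n_name))
    return conflicts


# Hypothetical toy machine: 2 sockets with 4 cores each.
sockets = {"socket0": parse_cpulist("0-3"), "socket1": parse_cpulist("4-7")}

# Expected layout: 2 NUMA nodes per socket, each nested in one socket.
good_numa = {
    "node0": parse_cpulist("0-1"),
    "node1": parse_cpulist("2-3"),
    "node2": parse_cpulist("4-5"),
    "node3": parse_cpulist("6-7"),
}

# Invalid layout: only 2 NUMA nodes, each straddling both sockets.
bad_numa = {
    "node0": parse_cpulist("0-1,4-5"),
    "node1": parse_cpulist("2-3,6-7"),
}

print(find_conflicts(sockets, good_numa))
print(find_conflicts(sockets, bad_numa))
```

With the nested layout no pair is reported, while in the straddling layout every socket conflicts with every NUMA node, which is the sort of inconsistency a topology builder cannot place in a single tree.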
You might also want to check the BIOS rev level on node14, Gus - as Brice
suggested, it could be that the board came with the wrong firmware.
On Feb 28, 2014, at 11:55 AM, Gus Correa wrote:
> Hi Brice and Ralph
>
> Many thanks for helping out with this!
>
> Yes, you are right about node15 bei
Hi Brice and Ralph
Many thanks for helping out with this!
Yes, you are right about node15 being OK.
Node15 was a red herring: along with node14, it was part of
the same job that failed.
However, after a closer look, I noticed that the failure reported
by hwloc was indeed on node14.
I attach both
On Feb 28, 2014, at 12:32 AM, Brice Goglin wrote:
> On 28/02/2014 02:48, Ralph Castain wrote:
>> Remember, hwloc doesn't actually "sense" hardware - it just parses files in
>> the /proc area. So if something is garbled in those files, hwloc will report
>> errors. Doesn't mean anything is wr
On 28/02/2014 02:48, Ralph Castain wrote:
> Remember, hwloc doesn't actually "sense" hardware - it just parses files in
> the /proc area. So if something is garbled in those files, hwloc will report
> errors. Doesn't mean anything is wrong with the hardware at all.
For the record, that's not
Hello Gus,
I'll need the tarball generated by gather-topology on node14 to debug
this. node15 doesn't have any issue.
We've seen issues on AMD machines because of buggy BIOS reporting
incompatible Socket and NUMA info. If node14 doesn't have the same BIOS
version as other nodes, that could explain
On Feb 27, 2014, at 4:39 PM, Gus Correa wrote:
> Thank you, Ralph!
>
> I did a bit more homework, and found out that all jobs that had
> the hwloc error involved one specific node (node14).
>
> The "report bindings" output in those jobs' stderr shows
> that node14 systematically failed to bi
Thank you, Ralph!
I did a bit more homework, and found out that all jobs that had
the hwloc error involved one specific node (node14).
The "report bindings" output in those jobs' stderr shows
that node14 systematically failed to bind the processes to the cores,
while other nodes on the same jo
The hwloc in 1.6.5 is very old (v1.3.2), so it's possible it is having trouble
with those data/instruction cache breakdowns. I don't know why it wouldn't have
shown up before, however, as this looks to be happening when we first try to
assemble the topology. To check that, what happens if you ju
15 matches