DJ Delorie wrote:
There's not much difference between multi-core and multi-cpu, and I've
been building multi-cpu for years.
Some multi-core processors come with less L2 cache than their multi-CPU
counterparts.
Also, multi-cpu itself comes in different varieties: Intel's Xeons go
for the classic SMP design with a single shared memory bus providing
uniform memory access (UMA), but also causing a bottleneck as you add
more processors, whereas the Opterons have non-uniform memory access
(NUMA), with each processor having its own memory bus.  That removes
the bottleneck of the shared memory bus, but the operating system must
allocate most memory locally to each CPU to avoid a bottleneck in the
cross-connect between the processors.
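
To make the "allocate locally" part concrete, here is a minimal
user-space sketch (assuming Linux with libnuma 2.x installed; build
with -lnuma; the 64 MB size is arbitrary) of asking which node the
current CPU sits on and allocating memory there, which is roughly what
the kernel's default policy tries to do for ordinary allocations anyway:

#define _GNU_SOURCE             /* for sched_getcpu() */
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();          /* CPU we are running on right now */
    int node = numa_node_of_cpu(cpu);   /* NUMA node that CPU belongs to   */
    printf("cpu %d is on node %d of %d\n",
           cpu, node, numa_num_configured_nodes());

    /* Ask for 64 MB preferring the local node; other nodes are only
     * used if the local one is exhausted. */
    size_t len = 64UL << 20;
    void *buf = numa_alloc_local(len);
    if (buf == NULL)
        return 1;

    /* ... touch and use buf; its pages are placed on the local node ... */
    numa_free(buf, len);
    return 0;
}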

The Athlon X2 and Opteron dual-core processors internally use a
crossbar switch to connect the two cores to the memory and the
inter-processor interconnect, so they have slightly higher memory
latencies, but communication between the two cores inside one
processor and its attached memory is better than between separate
Opteron processors with their attached memory.

I wonder: what does the Linux (or insert your favourite BSD here)
kernel do with dual-core Opterons?  Does it keep the physical memory
attached to a processor affine to the two cores of that processor?
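
I don't know the answer, but the kernel at least exports its idea of
the topology under /sys/devices/system/node, so you can see how it
groups cores and memory.  A rough sketch, assuming a kernel built with
CONFIG_NUMA and sysfs mounted (on older kernels the file is cpumap
rather than cpulist):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *d = opendir("/sys/devices/system/node");
    if (d == NULL) {
        perror("/sys/devices/system/node");
        return 1;
    }

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        /* node directories are named node0, node1, ... */
        if (strncmp(e->d_name, "node", 4) != 0 ||
            !isdigit((unsigned char)e->d_name[4]))
            continue;

        char path[256], cpus[256] = "?";
        snprintf(path, sizeof path,
                 "/sys/devices/system/node/%s/cpulist", e->d_name);

        FILE *f = fopen(path, "r");
        if (f != NULL) {
            if (fgets(cpus, sizeof cpus, f))
                cpus[strcspn(cpus, "\n")] = '\0';
            fclose(f);
        }
        printf("%s: cpus %s\n", e->d_name, cpus);
    }
    closedir(d);
    return 0;
}

On a dual-processor, dual-core Opteron I'd expect this to show two
nodes with two CPUs each, i.e. memory grouped per package rather than
per core.
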
Dual-processor, dual-core Opterons seem like a very cost-effective way
to get a 4-way machine, and each pair of cores shares two memory buses
(if the memory is appropriately installed), so memory bandwidth should
also be good.  But is there a penalty to pay because this machine is
neither quite classical SMP nor quite NUMA?
If the kernel pretends it's a fully NUMA machine (one node per core),
that would halve the local memory available per CPU - i.e. builds with
very asymmetric memory usage could get slower, since once the kernel
runs out of what it thinks is the memory local to one core, it has a
good chance (2/3 if you assume a random distribution) of grabbing
memory that is indeed not local to that core.
If it pretends the machine is a classic SMP machine with uniform
memory access, it's even worse, since then, irrespective of the size
of the working set, it will tend to grab memory from the wrong place
half of the time.
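
For what it's worth, on Linux a process can at least override whatever
default policy it gets: libnuma lets you ask for "prefer the local
node" or for interleaving pages across all nodes, which is more or
less the NUMA-vs-UMA choice above.  A rough sketch (assuming libnuma
2.x; build with -lnuma; the 64 MB buffers are arbitrary):

#define _GNU_SOURCE             /* sched_getcpu() */
#include <numa.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

static void prefer_local(void)
{
    /* Prefer the node this thread runs on; fall back to remote nodes
     * only when the local one is exhausted - the "NUMA" choice. */
    numa_set_preferred(numa_node_of_cpu(sched_getcpu()));
}

static void interleave_all(void)
{
    /* Spread pages round-robin over all nodes; average latency is
     * worse, but no single node becomes the hot spot - roughly what
     * a classic UMA/SMP machine gives you. */
    numa_set_interleave_mask(numa_all_nodes_ptr);
}

int main(void)
{
    if (numa_available() < 0)
        return 1;

    size_t len = 64UL << 20;

    prefer_local();
    char *a = malloc(len);
    memset(a, 0, len);          /* pages of a land on the local node  */

    interleave_all();
    char *b = malloc(len);
    memset(b, 0, len);          /* pages of b are spread across nodes */

    free(a);
    free(b);
    return 0;
}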

At a previous job where we were very interested in build times, our
rule of thumb was N+1 jobs for N cpus with local disk, or 2N+1 jobs
for N cpus with nfs-mounted disk.  That was for build farms working on
a 12 hour C++ compile.
Was that UMA or NUMA, and how far up could you scale N usefully?
Do you know if software RAID (not so much R as A of ID) is effective at
avoiding I/O bottlenecks, e.g. will two disks for four cores work as well as one
disk for two?
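
On the job-count rule above, just to make it concrete, a trivial
sketch that turns the online CPU count into a -j value (the 1x/2x
factors are your rule of thumb, nothing standard; _SC_NPROCESSORS_ONLN
is a glibc extension):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);     /* online CPUs/cores */
    int nfs = argc > 1 && strcmp(argv[1], "nfs") == 0;
    long jobs = nfs ? 2 * n + 1 : n + 1;        /* 2N+1 for NFS, N+1 local */
    printf("%ld\n", jobs);                      /* e.g.: make -j`./jobs`   */
    return 0;
}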
