On 1/29/2014 10:56 PM, Victor wrote:
Thanks for the insights Tim. I was aware that the CPUs will choke
beyond a certain point. From memory, on my machine this happens with 5
concurrent MPI jobs with the benchmark that I am using.
Regarding your mention of setting affinities and MPI ranks, do you have
specific (as in syntactically specific, since I am a novice and
easily ...) examples of how I might set affinities to get the
Westmere node performing better?
ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0,
Component v1.6.5)
I haven't worked with current OpenMPI on Intel Westmere, although I do
have a Westmere as my only dual-CPU platform. Ideally, the current
scheme OpenMPI uses for MPI/OpenMP hybrid affinity will make it easy to
allocate adjacent pairs of cores to ranks: [0,1], [2,3], [4,5], ....
hwloc will not be able to see whether cores [0,1] and [2,3] are actually
the pairs sharing an internal cache bus, and Intel never guaranteed it,
but that is the only way I've seen it done (presumably controlled by BIOS).
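Since you asked for something syntactically specific, here is a minimal
sketch against the 1.6-series mpirun options (the executable name is
only a placeholder, and these option names changed in later OpenMPI
releases, so check your own mpirun man page):

  # show how hwloc numbers the sockets, caches and cores on the node
  lstopo

  # 8 ranks, one per core, printing the chosen bindings at startup
  mpirun -np 8 --bycore --bind-to-core --report-bindings ./your_mpi_benchmark

--report-bindings is worth keeping in every experiment, since it shows
whether the mapping you asked for is the one you actually got.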
If you had a requirement to run 1 rank per CPU, with 4 threads per CPU,
you would pin a thread to each of the cores in the pairs [0,1] and [2,3]
(and [6,7], [8,9] on the second CPU). If required to run 8 threads per
CPU, using HyperThreading, you would pin 1 thread to each of the first 4
cores on each CPU and 2 threads each to the remaining cores (the ones
which don't share cache paths).
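One possible way to express the 1-rank-per-CPU hybrid case, as a sketch
only: the host name, rankfile name and binary are placeholders, the
slot syntax should be checked against the 1.6.5 mpirun man page, and
whether logical cores 0-3 of each socket really are the [0,1]/[2,3]
pairs is exactly the BIOS-dependent question above.

  # myrankfile: one rank per socket, restricted to 4 of its 6 cores
  rank 0=node01 slot=0:0-3
  rank 1=node01 slot=1:0-3

  # 2 ranks, 4 OpenMP threads each; pinning the threads to individual
  # cores inside each rank is left to the OpenMP runtime (for example
  # KMP_AFFINITY with Intel compilers, GOMP_CPU_AFFINITY with gcc)
  mpirun -np 2 -rf myrankfile -x OMP_NUM_THREADS=4 ./your_hybrid_app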
Likewise, when you are testing pure MPI scaling, you would take care not
to place a 2nd rank on a core pair which shares an internal bus until
you are using all 4 internal bus resources, and you would load up the 2
CPUs symmetrically. You might find that 8 ranks with optimized
placement give nearly the performance of 12 ranks, and that you need an
effective hybrid MPI/OpenMP code to get perhaps 25% additional
performance by using the remaining cores. I've never seen an automated
scheme that deals with this. If you ignored the placement requirements,
you would find that 8 ranks on the 12-core platform didn't perform as
well as on a similar 8-core platform.
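If you want to experiment with that, a hedged sketch of such a
placement for 8 ranks: the guess that [0,1] and [2,3] are the
bus-sharing pairs on each socket, and that the last two cores have
their own paths, is an assumption you would have to test, and the host
name, rankfile name and binary are again placeholders.

  # myrankfile: 4 ranks per socket, loading the two sockets symmetrically;
  # one core is taken from each assumed shared pair ([0,1], [2,3]) plus
  # the two cores assumed to have their own cache paths (4, 5)
  rank 0=node01 slot=0:0
  rank 1=node01 slot=1:0
  rank 2=node01 slot=0:2
  rank 3=node01 slot=1:2
  rank 4=node01 slot=0:4
  rank 5=node01 slot=1:4
  rank 6=node01 slot=0:5
  rank 7=node01 slot=1:5

  mpirun -np 8 -rf myrankfile --report-bindings ./your_mpi_benchmark

Comparing that against a plain --bycore run of 8 ranks would show
directly how much the placement matters for your benchmark.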
Needless to say, these special requirements of this CPU model have
eluded even experts, and have led to it not being used to full
effectiveness. The reason we got into this is your remark that it
seemed strange that you didn't gain performance when you added a
rank, presumably a 2nd rank on a core pair sharing an internal bus.
You seem to have the impression that MPI performance scaling could be
linear with the number of cores in use. Such an expectation is
unrealistic given that the point of multi-core platforms is to share
memory and other resources and support more ranks without a linear
increase in cost.
In your efforts to make an effective cluster out of nodes of dissimilar
performance levels, you may need to explore means of evening up the
performance per rank, such as more OpenMP threads per rank on the
lower-performance CPUs (a rough sketch follows). It really doesn't look
like a beginner's project.
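As a rough illustration only, mpirun's colon-separated application
contexts can give the slower node more OpenMP threads per rank; the
host names, rank counts and thread counts below are invented, and you
would want to verify that -x takes effect per context in 1.6.5.

  # more threads per rank on the slower node, so the work per rank
  # finishes in roughly the same time on both kinds of node
  mpirun -np 8 -host fastnode -x OMP_NUM_THREADS=2 ./your_hybrid_app : \
         -np 4 -host slownode -x OMP_NUM_THREADS=4 ./your_hybrid_app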
--
Tim Prince