On 1/29/2014 10:56 PM, Victor wrote:
Thanks for the insights Tim. I was aware that the CPUs will choke
beyond a certain point. From memory, on my machine this happens with 5
concurrent MPI jobs with the benchmark that I am using.
Regarding your mention of setting affinities and MPI ranks, do you have
specific (as in syntactically specific, since I am a novice and
easily ...) examples of how I might set affinities to get the
Westmere node performing better?
ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0,
Component v1.6.5)
I haven't worked with current OpenMPI on Intel Westmere, although I do
have a Westmere as my only dual-CPU platform. Ideally, the current
scheme OpenMPI uses for MPI/OpenMP hybrid affinity will make it easy to
allocate adjacent pairs of cores to ranks: [0,1], [2,3], [4,5], ....
hwloc will not be able to see whether cores [0,1] and [2,3] are actually
the pairs sharing an internal cache bus, and Intel never guaranteed it,
but that is the only way I've seen it done (presumably controlled by BIOS).
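Since you asked for something syntactically specific, here is a minimal
sketch against the 1.6-series mpirun options (the executable name is
only a placeholder, and these option names changed in later OpenMPI
releases, so check your own mpirun man page):

  # show how hwloc numbers the sockets, caches and cores on the node
  lstopo

  # 8 ranks, one per core, printing the chosen bindings at startup
  mpirun -np 8 --bycore --bind-to-core --report-bindings ./your_mpi_benchmark

--report-bindings is worth keeping in every experiment, since it shows
whether the mapping you asked for is the one you actually got.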
If you had a requirement to run 1 rank per CPU, with 4 threads per CPU,
you would pin a thread to each of the cores in the pairs [0,1] and [2,3]
(and [6,7], [8,9] on the second CPU). If required to run 8 threads per
CPU, using HyperThreading, you would pin 1 thread to each of the first 4
cores on each CPU and 2 threads each to the remaining cores (the ones
which don't share cache paths).
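One possible way to express the 1-rank-per-CPU hybrid case, as a sketch
only: the host name, rankfile name and binary are placeholders, the
slot syntax should be checked against the 1.6.5 mpirun man page, and
whether logical cores 0-3 of each socket really are the [0,1]/[2,3]
pairs is exactly the BIOS-dependent question above.

  # myrankfile: one rank per socket, restricted to 4 of its 6 cores
  rank 0=node01 slot=0:0-3
  rank 1=node01 slot=1:0-3

  # 2 ranks, 4 OpenMP threads each; pinning the threads to individual
  # cores inside each rank is left to the OpenMP runtime (for example
  # KMP_AFFINITY with Intel compilers, GOMP_CPU_AFFINITY with gcc)
  mpirun -np 2 -rf myrankfile -x OMP_NUM_THREADS=4 ./your_hybrid_app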
Likewise, when you are testing pure MPI scaling, you would take care not
to place a 2nd rank on a core pair which shares an internal bus until
you are using all 4 internal bus resources, and you would load up the 2
CPUs symmetrically. You might find that 8 ranks with optimized
placement give nearly the performance of 12 ranks, and that you need an
effective hybrid MPI/OpenMP code to get perhaps 25% additional
performance by using the remaining cores. I've never seen an automated
scheme that deals with this. If you ignored the placement requirements,
you would find that 8 ranks on the 12-core platform didn't perform as
well as on a similar 8-core platform.
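If you want to experiment with that, a hedged sketch of such a
placement for 8 ranks: the guess that [0,1] and [2,3] are the
bus-sharing pairs on each socket, and that the last two cores have
their own paths, is an assumption you would have to test, and the host
name, rankfile name and binary are again placeholders.

  # myrankfile: 4 ranks per socket, loading the two sockets symmetrically;
  # one core is taken from each assumed shared pair ([0,1], [2,3]) plus
  # the two cores assumed to have their own cache paths (4, 5)
  rank 0=node01 slot=0:0
  rank 1=node01 slot=1:0
  rank 2=node01 slot=0:2
  rank 3=node01 slot=1:2
  rank 4=node01 slot=0:4
  rank 5=node01 slot=1:4
  rank 6=node01 slot=0:5
  rank 7=node01 slot=1:5

  mpirun -np 8 -rf myrankfile --report-bindings ./your_mpi_benchmark

Comparing that against a plain --bycore run of 8 ranks would show
directly how much the placement matters for your benchmark.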
Needless to say, these special requirements of this CPU model have
eluded even experts, and have led to it not being used to full
effectiveness. The reason we got into this is your remark that it
seemed strange that you didn't gain performance when you added a
rank, presumably a 2nd rank on a core pair sharing an internal bus.
You seem to have the impression that MPI performance scaling could be
linear with the number of cores in use. Such an expectation is
unrealistic given that the point of multi-core platforms is to share
memory and other resources and support more ranks without a linear
increase in cost.
In your efforts to make an effective cluster out of nodes of dissimilar
performance levels, you may need to explore means of evening up the
performance per rank, such as more OpenMP threads per rank on the
lower-performance CPUs (a rough sketch follows). It really doesn't look
like a beginner's project.
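As a rough illustration only, mpirun's colon-separated application
contexts can give the slower node more OpenMP threads per rank; the
host names, rank counts and thread counts below are invented, and you
would want to verify that -x takes effect per context in 1.6.5.

  # more threads per rank on the slower node, so the work per rank
  # finishes in roughly the same time on both kinds of node
  mpirun -np 8 -host fastnode -x OMP_NUM_THREADS=2 ./your_hybrid_app : \
         -np 4 -host slownode -x OMP_NUM_THREADS=4 ./your_hybrid_app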
--
Tim Prince