Thank you for the very detailed reply, Ralph. I will try what you suggest. I will need to ask the developers whether, and how, the main solver process is threaded.
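To put numbers on the scaling I was asking about (the figures are the ones quoted further down, rounded here):

    Node1 alone, 4 ranks:         35.76 Msu/s
    Node2 alone, 4 ranks:         30.80 Msu/s
    ideal combined rate (sum):    ~66.6 Msu/s
    measured, 8 ranks over GbE:   47.35 Msu/s  (roughly 71% of the ideal sum)

So the two-node run loses nearly 30% relative to a naive sum of the single-node rates; whether that is simply the shared-memory-to-GbE transition Ralph describes below is what I want to pin down.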
On 30 January 2014 12:30, Ralph Castain <r...@open-mpi.org> wrote:

> On Jan 29, 2014, at 7:56 PM, Victor <victor.ma...@gmail.com> wrote:
>
> Thanks for the insights, Tim. I was aware that the CPUs will choke beyond a certain point. From memory, on my machine this happens with 5 concurrent MPI jobs with the benchmark that I am using.
>
> My primary question was about scaling between the nodes. I was not getting close to double the performance when running MPI jobs across two 4-core nodes. It may be better now that I have Open-MX in place, but I have not repeated the benchmarks yet since I need to get one simulation job done asap.
>
> Some of that may be due to the expected loss of performance when you switch from shared memory to inter-node transports. While it is true about saturation of the memory path, what you reported could be more consistent with that transition - i.e., it isn't unusual to see applications perform better when run on a single node, depending upon how they are written, up to a certain problem size (which your code may not be hitting).
>
> Regarding your mention of setting affinities and MPI ranks: do you have specific (as in syntactically specific, since I am a novice and easily confused...) examples of how I might set affinities to get the Westmere node performing better?
>
> mpirun --bind-to-core -cpus-per-rank 2 ...
>
> will bind each MPI rank to 2 cores. Note that this will definitely *not* be a good idea if you are running more than two threads in your process - if you are, then set --cpus-per-rank to the number of threads, keeping in mind that you want things to break evenly across the sockets. In other words, if you have two 6-core Westmere sockets in the node, then you either want to run 6 processes at cpus-per-rank=2 if each process runs 2 threads, or 4 processes with cpus-per-rank=3 if each process runs 3 threads, or 2 processes with no cpus-per-rank but --bind-to-socket instead of --bind-to-core for any thread count > 3.
>
> You would not want to run any other number of processes on the node, or else the binding pattern will cause a single process to split its threads across the sockets - which will definitely hurt performance.
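So that I get the syntax right when I try this, I think on the dual 6-core node the above translates to one of the following (cavity3d 400 is just my benchmark binary; which line applies depends on how many threads the solver actually runs per rank, which I still have to confirm with the developers):

    mpirun --bind-to-core -cpus-per-rank 2 -np 6 ./cavity3d 400    # 2 threads per rank
    mpirun --bind-to-core -cpus-per-rank 3 -np 4 ./cavity3d 400    # 3 threads per rank
    mpirun --bind-to-socket -np 2 ./cavity3d 400                   # more than 3 threads per rank

If I read the Open MPI documentation correctly, adding --report-bindings should print the resulting layout, so I can check that no rank ends up with cores on both sockets.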
> ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
>
> And finally, on hybridisation... in a week or so I will get 4 AMD A10-6800 machines with 8 GB each on loan and will attempt to make them work alongside the existing Intel nodes.
>
> Victor
>
> On 29 January 2014 22:03, Tim Prince <n...@aol.com> wrote:
>
>> On 1/29/2014 8:02 AM, Reuti wrote:
>>
>>> Quoting Victor <victor.ma...@gmail.com>:
>>>
>>>> Thanks for the reply Reuti,
>>>>
>>>> There are two machines: Node1 with 12 physical cores (dual 6-core Xeon) and
>>>
>>> Do you have this CPU?
>>>
>>> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>>>
>>> -- Reuti
>>
>> It's expected on the Xeon Westmere 6-core CPUs to see MPI performance saturate when all 4 of the internal bus paths are in use. For this reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so that each MPI rank has its own internal CPU bus, could out-perform plain MPI on those CPUs. That scheme of pairing cores on selected internal bus paths hasn't been repeated. Some influential customers learned to prefer the 4-core version of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
>>
>> If you want to talk about "downright strange," start thinking about the schemes to optimize performance of 8 threads with 2 threads assigned to each internal CPU bus on that CPU model. Or your scheme of trying to balance MPI performance between very different CPU models.
>>
>> Tim
>>
>>>> Node2 with 4 physical cores (i5-2400).
>>>>
>>>> Regarding scaling on the single 12-core node: no, it is also not linear. In fact it is downright strange. I do not remember the numbers right now, but 10 jobs are faster than 11, and 12 are the fastest, with a peak performance of approximately 66 Msu/s, which is also far from triple the 4-core performance. This odd non-linear behaviour also happens at lower job counts on that 12-core node. I understand the decrease in scaling with increasing core count on a single node, as memory bandwidth is an issue.
>>>>
>>>> On the 4-core machine the scaling is progressive, i.e. every additional job brings an increase in performance. A single core delivers 8.1 Msu/s while 4 cores deliver 30.8 Msu/s. This is almost linear.
>>>>
>>>> Since my original email I have also installed Open-MX and recompiled Open MPI to use it. This has resulted in approximately 10% better performance on the existing GbE hardware.
>>>>
>>>> On 29 January 2014 19:40, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>
>>>>> Am 29.01.2014 um 03:00 schrieb Victor:
>>>>>
>>>>> > I am running the CFD simulation benchmark cavity3d, available within http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>>>> >
>>>>> > It is a parallel-friendly lattice Boltzmann solver library.
>>>>> >
>>>>> > Palabos provides benchmark results for cavity3d on several different platforms and configurations here: http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>>>> >
>>>>> > The problem that I have is that the benchmark performance on my cluster does not scale even close to linearly.
>>>>> >
>>>>> > My cluster configuration:
>>>>> >
>>>>> > Node1: dual Xeon 5560, 48 GB RAM
>>>>> > Node2: i5-2400, 24 GB RAM
>>>>> >
>>>>> > Gigabit Ethernet connection on eth0
>>>>> >
>>>>> > Open MPI 1.6.5 on Ubuntu 12.04.3
>>>>> >
>>>>> > Hostfile:
>>>>> >
>>>>> > Node1 -slots=4 -max-slots=4
>>>>> > Node2 -slots=4 -max-slots=4
>>>>> >
>>>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>>>> >
>>>>> > Problem: cavity3d 400
>>>>> >
>>>>> > When I run mpirun -np 4 on Node1 I get 35.7615 mega site updates per second.
>>>>> > When I run mpirun -np 4 on Node2 I get 30.7972 mega site updates per second.
>>>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 mega site updates per second.
>>>>> >
>>>>> > I understand that there are latencies with GbE and that there is MPI overhead, but this scaling still seems very poor. Are my expectations of scaling naive, or is there actually something wrong and fixable that will improve the scaling? Optimistically, I would like each node to add to the cluster performance, not slow it down.
>>>>> >
>>>>> > Things get even worse if I run an asymmetric number of MPI jobs on each node. For instance, running -np 12 on Node1
>>>>>
>>>>> Isn't this overloading the machine with only 8 real cores in total?
>>>>>
>>>>> > is significantly faster than running -np 16 across Node1 and Node2; thus adding Node2 actually slows down the performance.
>>>>>
>>>>> The i5-2400 has only 4 cores and no hyper-threading.
>>>>>
>>>>> It depends on the algorithm how much data has to be exchanged between the processes, and this can indeed be worse when it goes across a network.
>>>>>
>>>>> Also: does the algorithm scale linearly when used on Node1 only, with 8 cores? When it is 35.7615 with 4 cores, what result do you get with 8 cores on this machine?
>>>>>
>>>>> -- Reuti
>>
>> --
>> Tim Prince
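One more note, mostly for myself: if the developers confirm that the solver is OpenMP-threaded, my reading of Tim's and Ralph's suggestions combined is a run on Node1 along the lines of

    export OMP_NUM_THREADS=2
    mpirun --bind-to-core -cpus-per-rank 2 -np 6 ./cavity3d 400

i.e. the same 6 x 2 layout as the first command line I noted above, with each rank's pair of threads kept on the same socket. OMP_NUM_THREADS is only my assumption about how the thread count would be set; the solver may well use a different mechanism, which is exactly what I need to ask the developers about.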