Thank you for the very detailed reply, Ralph. I will try what you suggest. I will need to ask the developers whether, and how, the main solver process is threaded.
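To put numbers on the scaling I was asking about (the figures are the ones quoted further down, rounded here):

    Node1 alone, 4 ranks:         35.76 Msu/s
    Node2 alone, 4 ranks:         30.80 Msu/s
    ideal combined rate (sum):    ~66.6 Msu/s
    measured, 8 ranks over GbE:   47.35 Msu/s  (roughly 71% of the ideal sum)

So the two-node run loses nearly 30% relative to a naive sum of the single-node rates; whether that is simply the shared-memory-to-GbE transition Ralph describes below is what I want to pin down.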
On 30 January 2014 12:30, Ralph Castain <r...@open-mpi.org> wrote:

> On Jan 29, 2014, at 7:56 PM, Victor <victor.ma...@gmail.com> wrote:
>
> Thanks for the insights, Tim. I was aware that the CPUs will choke beyond a certain point. From memory, on my machine this happens with 5 concurrent MPI jobs with the benchmark that I am using.
>
> My primary question was about scaling between the nodes. I was not getting close to double the performance when running MPI jobs across two 4-core nodes. It may be better now that I have Open-MX in place, but I have not repeated the benchmarks yet since I need to get one simulation job done asap.
>
> Some of that may be due to the expected loss of performance when you switch from shared memory to inter-node transports. While it is true about saturation of the memory path, what you reported could be more consistent with that transition - i.e., it isn't unusual to see applications perform better when run on a single node, depending upon how they are written, up to a certain problem size (which your code may not be hitting).
>
> Regarding your mention of setting affinities and MPI ranks: do you have specific (as in syntactically specific, since I am a novice and easily confused...) examples of how I might set affinities to get the Westmere node performing better?
>
> mpirun --bind-to-core -cpus-per-rank 2 ...
>
> will bind each MPI rank to 2 cores. Note that this will definitely *not* be a good idea if you are running more than two threads in your process - if you are, then set --cpus-per-rank to the number of threads, keeping in mind that you want things to break evenly across the sockets. In other words, if you have two 6-core Westmere sockets in the node, then you either want to run 6 processes at cpus-per-rank=2 if each process runs 2 threads, or 4 processes with cpus-per-rank=3 if each process runs 3 threads, or 2 processes with no cpus-per-rank but --bind-to-socket instead of --bind-to-core for any thread count > 3.
>
> You would not want to run any other number of processes on the node, or else the binding pattern will cause a single process to split its threads across the sockets - which will definitely hurt performance.
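So that I get the syntax right when I try this, I think on the dual 6-core node the above translates to one of the following (cavity3d 400 is just my benchmark binary; which line applies depends on how many threads the solver actually runs per rank, which I still have to confirm with the developers):

    mpirun --bind-to-core -cpus-per-rank 2 -np 6 ./cavity3d 400    # 2 threads per rank
    mpirun --bind-to-core -cpus-per-rank 3 -np 4 ./cavity3d 400    # 3 threads per rank
    mpirun --bind-to-socket -np 2 ./cavity3d 400                   # more than 3 threads per rank

If I read the Open MPI documentation correctly, adding --report-bindings should print the resulting layout, so I can check that no rank ends up with cores on both sockets.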
> ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
>
> And finally, on hybridisation... in a week or so I will get 4 AMD A10-6800 machines with 8 GB each on loan and will attempt to make them work alongside the existing Intel nodes.
>
> Victor
>
> On 29 January 2014 22:03, Tim Prince <n...@aol.com> wrote:
>
>> On 1/29/2014 8:02 AM, Reuti wrote:
>>
>>> Quoting Victor <victor.ma...@gmail.com>:
>>>
>>>> Thanks for the reply Reuti,
>>>>
>>>> There are two machines: Node1 with 12 physical cores (dual 6-core Xeon) and
>>>
>>> Do you have this CPU?
>>>
>>> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>>>
>>> -- Reuti
>>
>> It's expected on the Xeon Westmere 6-core CPUs to see MPI performance saturate when all 4 of the internal bus paths are in use. For this reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so that each MPI rank has its own internal CPU bus, could out-perform plain MPI on those CPUs. That scheme of pairing cores on selected internal bus paths hasn't been repeated. Some influential customers learned to prefer the 4-core version of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
>>
>> If you want to talk about "downright strange," start thinking about the schemes to optimize performance of 8 threads with 2 threads assigned to each internal CPU bus on that CPU model. Or your scheme of trying to balance MPI performance between very different CPU models.
>>
>> Tim
>>
>>>> Node2 with 4 physical cores (i5-2400).
>>>>
>>>> Regarding scaling on the single 12-core node: no, it is also not linear. In fact it is downright strange. I do not remember the numbers right now, but 10 jobs are faster than 11, and 12 are the fastest, with a peak performance of approximately 66 Msu/s, which is also far from triple the 4-core performance. This odd non-linear behaviour also happens at lower job counts on that 12-core node. I understand the decrease in scaling with increasing core count on a single node, as memory bandwidth is an issue.
>>>>
>>>> On the 4-core machine the scaling is progressive, i.e. every additional job brings an increase in performance. A single core delivers 8.1 Msu/s while 4 cores deliver 30.8 Msu/s. This is almost linear.
>>>>
>>>> Since my original email I have also installed Open-MX and recompiled Open MPI to use it. This has resulted in approximately 10% better performance on the existing GbE hardware.
>>>>
>>>> On 29 January 2014 19:40, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>
>>>>> Am 29.01.2014 um 03:00 schrieb Victor:
>>>>>
>>>>> > I am running the CFD simulation benchmark cavity3d, available within http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>>>> >
>>>>> > It is a parallel-friendly lattice Boltzmann solver library.
>>>>> >
>>>>> > Palabos provides benchmark results for cavity3d on several different platforms and configurations here: http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>>>> >
>>>>> > The problem that I have is that the benchmark performance on my cluster does not scale even close to linearly.
>>>>> >
>>>>> > My cluster configuration:
>>>>> >
>>>>> > Node1: dual Xeon 5560, 48 GB RAM
>>>>> > Node2: i5-2400, 24 GB RAM
>>>>> >
>>>>> > Gigabit Ethernet connection on eth0
>>>>> >
>>>>> > Open MPI 1.6.5 on Ubuntu 12.04.3
>>>>> >
>>>>> > Hostfile:
>>>>> >
>>>>> > Node1 -slots=4 -max-slots=4
>>>>> > Node2 -slots=4 -max-slots=4
>>>>> >
>>>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>>>> >
>>>>> > Problem: cavity3d 400
>>>>> >
>>>>> > When I run mpirun -np 4 on Node1 I get 35.7615 mega site updates per second.
>>>>> > When I run mpirun -np 4 on Node2 I get 30.7972 mega site updates per second.
>>>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 mega site updates per second.
>>>>> >
>>>>> > I understand that there are latencies with GbE and that there is MPI overhead, but this scaling still seems very poor. Are my expectations of scaling naive, or is there actually something wrong and fixable that will improve the scaling? Optimistically, I would like each node to add to the cluster performance, not slow it down.
>>>>> >
>>>>> > Things get even worse if I run an asymmetric number of MPI jobs on each node. For instance, running -np 12 on Node1
>>>>>
>>>>> Isn't this overloading the machine with only 8 real cores in total?
>>>>>
>>>>> > is significantly faster than running -np 16 across Node1 and Node2; thus adding Node2 actually slows down the performance.
>>>>>
>>>>> The i5-2400 has only 4 cores and no hyper-threading.
>>>>>
>>>>> It depends on the algorithm how much data has to be exchanged between the processes, and this can indeed be worse when it goes across a network.
>>>>>
>>>>> Also: does the algorithm scale linearly when used on Node1 only, with 8 cores? When it is 35.7615 with 4 cores, what result do you get with 8 cores on this machine?
>>>>>
>>>>> -- Reuti
>>
>> --
>> Tim Prince
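One more note, mostly for myself: if the developers confirm that the solver is OpenMP-threaded, my reading of Tim's and Ralph's suggestions combined is a run on Node1 along the lines of

    export OMP_NUM_THREADS=2
    mpirun --bind-to-core -cpus-per-rank 2 -np 6 ./cavity3d 400

i.e. the same 6 x 2 layout as the first command line I noted above, with each rank's pair of threads kept on the same socket. OMP_NUM_THREADS is only my assumption about how the thread count would be set; the solver may well use a different mechanism, which is exactly what I need to ask the developers about.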