Re: [OMPI users] Running on two nodes slower than running on one node

Victor Wed, 29 Jan 2014 08:03:42 -0500 (EST)

Sorry, typo. I have dual X5660s, not X5560s.
http://ark.intel.com/products/47921/Intel-Xeon-Processor-X5660-12M-Cache-2_80-GHz-6_40-GTs-Intel-QPI?q=x5660



On 29 January 2014 21:02, Reuti <re...@staff.uni-marburg.de> wrote:

> Quoting Victor <victor.ma...@gmail.com>:
>
>  Thanks for the reply Reuti,
>>
>> There are two machines: Node1 with 12 physical cores (dual 6 core Xeon)
>> and
>>
>
> Do you have this CPU?
>
> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>
> -- Reuti
>
>
>
>  Node2 with 4 physical cores (i5-2400).
>>
>> Regarding scaling on the single 12 core node: no, it is not linear either.
>> In
>> fact it is downright strange. I do not remember the numbers right now but
>> 10 jobs are faster than 11 and 12 are the fastest with peak performance of
>> approximately 66 Msu/s which is also far from triple the 4 core
>> performance. This odd non-linear behaviour also happens at the lower job
>> counts on that 12 core node. I understand the decrease in scaling with
>> increase in core count on the single node as the memory bandwidth is an
>> issue.
>>
>> On the 4 core machine the scaling is progressive, i.e. every additional job
>> brings an increase in performance. Single core delivers 8.1 Msu/s while 4
>> cores deliver 30.8 Msu/s. This is almost linear.
>>
>> Since my original email I have also installed Open-MX and recompiled
>> OpenMPI to use it. This has resulted in approximately 10% better
>> performance using the existing GbE hardware.
>>
>>
>> On 29 January 2014 19:40, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>  On 29.01.2014 at 03:00, Victor wrote:
>>>
>>> > I am running a CFD simulation benchmark cavity3d available within
>>> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>> >
>>> > It is a parallel-friendly Lattice Boltzmann solver library.
>>> >
>>> > Palabos provides benchmark results for the cavity3d on several
>>> different
>>> platforms and variables here:
>>> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>> >
>>> > The problem that I have is that the benchmark performance on my cluster
>>> does not come even close to scaling linearly.
>>> >
>>> > My cluster configuration:
>>> >
>>> > Node1: Dual Xeon 5560, 48 GB RAM
>>> > Node2: i5-2400, 24 GB RAM
>>> >
>>> > Gigabit ethernet connection on eth0
>>> >
>>> > OpenMPI 1.6.5 on Ubuntu 12.04.3
>>> >
>>> >
>>> > Hostfile:
>>> >
>>> > Node1 -slots=4 -max-slots=4
>>> > Node2 -slots=4 -max-slots=4
>>> >
>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>> >
>>> > Problem:
>>> >
>>> > cavity3d 400
>>> >
>>> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
>>> second
>>> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
>>> second
>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get  47.3538 Mega site
>>> updates per second
>>> >
>>> > I understand that there are latencies with GbE and that there is MPI
>>> overhead, but this performance scaling still seems very poor. Are my
>>> expectations of scaling naive, or is there actually something wrong and
>>> fixable that will improve the scaling? Optimistically I would like each
>>> node to add to the cluster performance, not slow it down.
>>> >
>>> > Things get even worse if I run an asymmetric number of MPI jobs on each
>>> node. For instance running -np 12 on Node1
>>>
>>> Isn't this overloading the machine with only 8 real cores in total?
>>>
>>>
>>> > is significantly faster than running -np 16 across Node1 and Node2,
>>> thus
>>> adding Node2 actually slows down the performance.
>>>
>>> The i5-2400 has only 4 cores and no threads.
>>>
>>> It depends on the algorithm how much data has to be exchanged between the
>>> processes, and this can indeed be worse when used across a network.
>>>
>>> Also: does the algorithm scale linearly when used on node1 only with 8
>>> cores? If it's "35.7615" with 4 cores, what result do you get with 8
>>> cores on this machine?
>>>
>>> -- Reuti
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>
>
>
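
The throughput figures quoted in this thread can be turned into speedup and parallel efficiency numbers, which makes the scaling question easier to reason about. The following is a minimal sketch in Python, not part of the original discussion. The function names are hypothetical, and treating the sum of the two single-node 4-process rates as the "ideal" two-node rate is an assumption made purely for illustration.

```python
# Rates in Mega site updates per second (Msu/s), taken from the thread.
NODE2_1_CORE = 8.1       # Node2 (i5-2400), 1 process
NODE2_4_CORES = 30.8     # Node2 (i5-2400), 4 processes
NODE1_NP4 = 35.7615      # Node1 alone, -np 4
NODE2_NP4 = 30.7972      # Node2 alone, -np 4
BOTH_NP8 = 47.3538       # Node1 + Node2 over GbE, -np 8


def speedup(baseline_rate, measured_rate):
    """How many times faster the measured run is than the baseline run."""
    return measured_rate / baseline_rate


def efficiency(baseline_rate, measured_rate, scale_factor):
    """Speedup divided by the factor by which resources were increased."""
    return speedup(baseline_rate, measured_rate) / scale_factor


# Scaling from 1 to 4 processes on Node2.
node2_efficiency = efficiency(NODE2_1_CORE, NODE2_4_CORES, 4)

# Assumption: with no communication cost, the two-node run would deliver
# the sum of what each node delivers on its own with 4 processes.
ideal_two_node_rate = NODE1_NP4 + NODE2_NP4
two_node_fraction = BOTH_NP8 / ideal_two_node_rate

# Gain from adding Node2 to a run that already uses 4 processes on Node1.
gain_over_node1 = speedup(NODE1_NP4, BOTH_NP8)

print(f"Node2 efficiency, 1 -> 4 processes: {node2_efficiency:.1%}")
print(f"Two-node rate as fraction of ideal: {two_node_fraction:.1%}")
print(f"Speedup over Node1 alone (-np 4):   {gain_over_node1:.2f}x")
```

With these inputs, Node2 scales at roughly 95% efficiency from 1 to 4 processes, while the two-node run reaches roughly 71% of the summed single-node rates, or about 1.32x the rate of Node1 alone.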
