Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Reuti
Am 29.01.2014 um 03:00 schrieb Victor:

> I am running a CFD simulation benchmark cavity3d available within 
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
> 
> It is a parallel-friendly Lattice Boltzmann solver library.
> 
> Palabos provides benchmark results for cavity3d on several different 
> platforms and configurations here: 
> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
> 
> The problem that I have is that the benchmark performance on my cluster does 
> not come even close to scaling linearly.
> 
> My cluster configuration:
> 
> Node1: Dual Xeon 5560, 48 GB RAM
> Node2: i5-2400, 24 GB RAM
> 
> Gigabit Ethernet connection on eth0
> 
> OpenMPI 1.6.5 on Ubuntu 12.04.3
> 
> 
> Hostfile:
> 
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
> 
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile 
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
> 
> Problem:
> 
> cavity3d 400
> 
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per second.
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per second.
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile 
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site 
> updates per second.
> 
> I understand that there are latencies with GbE and that there is MPI 
> overhead, but this performance scaling still seems very poor. Are my 
> expectations of scaling naive, or is there actually something wrong and 
> fixable that will improve the scaling? Optimistically I would like each node 
> to add to the cluster performance, not slow it down. 
> 
> Things get even worse if I run an asymmetric number of MPI jobs on each node. 
> For instance, running -np 12 on Node1

Isn't this overloading the machine with only 8 real cores in total?


> is significantly faster than running -np 16 across Node1 and Node2, thus 
> adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no Hyper-Threading.

How much data has to be exchanged between the processes depends on the 
algorithm, and this exchange can indeed be much more costly across a network.

Also: does the algorithm scale linearly when used on Node1 alone with 8 cores? 
When it's 35.7615 Msu/s with 4 cores, what result do you get with 8 cores on 
that machine?
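
E.g., run directly on Node1 so that no network transport is involved at all (a 
minimal sketch, using the same invocation style and binary name as in your mail):

mpirun -np 4 ./cavity3d 400    # the 35.7615 Msu/s case
mpirun -np 8 ./cavity3d 400    # same node, twice the ranks, shared memory only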

-- Reuti

Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Victor
Thanks for the reply Reuti,

There are two machines: Node1 with 12 physical cores (dual 6 core Xeon) and
Node2 with 4 physical cores (i5-2400).

Regarding scaling on the single 12-core node: no, it is also not linear. In
fact it is downright strange. I do not remember the exact numbers right now, but
10 jobs are faster than 11, and 12 is the fastest, with a peak performance of
approximately 66 Msu/s, which is also far from triple the 4-core
performance. This odd non-linear behaviour also happens at lower job
counts on that 12-core node. I understand the decrease in scaling with
increasing core count on a single node, since memory bandwidth becomes an
issue.

On the 4-core machine the scaling is progressive, i.e. every additional job
brings an increase in performance. A single core delivers 8.1 Msu/s while 4
cores deliver 30.8 Msu/s. This is almost linear.

Since my original email I have also installed Open-MX and recompiled
OpenMPI to use it. This has resulted in approximately 10% better
performance using the existing GbE hardware.
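
For anyone else who wants to try this, a rough sketch of the rebuild (the 
Open-MX install prefix below is just an assumption, adjust it to your system):

# in the Open MPI 1.6.5 source tree; assumes Open-MX lives under /opt/open-mx
./configure --with-mx=/opt/open-mx --prefix=$HOME/openmpi-1.6.5-omx
make -j4 && make install
# then select the MX transport explicitly at run time
mpirun --mca pml cm --mca mtl mx -np 8 --hostfile ~/.mpi_hostfile ./cavity3d 400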


On 29 January 2014 19:40, Reuti  wrote:

> Am 29.01.2014 um 03:00 schrieb Victor:
>
> > I am running a CFD simulation benchmark cavity3d available within
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
> >
> > It is a parallel-friendly Lattice Boltzmann solver library.
> >
> > Palabos provides benchmark results for the cavity3d on several different
> platforms and variables here:
> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
> >
> > The problem that I have is that the benchmark performance on my cluster
> does not scale even close to a linear scale.
> >
> > My cluster configuration:
> >
> > Node1: Dual Xeon 5560 48 Gb RAM
> > Node2: i5-2400 24 Gb RAM
> >
> > Gigabit ethernet connection on eth0
> >
> > OpenMPI 1.6.5 on Ubuntu 12.04.3
> >
> >
> > Hostfile:
> >
> > Node1 -slots=4 -max-slots=4
> > Node2 -slots=4 -max-slots=4
> >
> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
> >
> > Problem:
> >
> > cavity3d 400
> >
> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
> second
> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
> second
> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get  47.3538 Mega site
> updates per second
> >
> > I understand that there are latencies with GbE and that there is MPI
> overhead, but this performance scaling still seems very poor. Are my
> expectations of scaling naive, or is there actually something wrong and
> fixable that will improve the scaling? Optimistically I would like each
> node to add to the cluster performance, not slow it down.
> >
> > Things get even worse if I run asymmetric number of mpi jobs in each
> node. For instance running -np 12 on Node1
>
> Isn't this overloading the machine with only 8 real cores in total?
>
>
> > is significantly faster than running -np 16 across Node1 and Node2, thus
> adding Node2 actually slows down the performance.
>
> The i5-2400 has only 4 cores and no threads.
>
> It depends on the algorithm how much data has to be exchanged between the
> processes, and this can indeed be worse when used across a network.
>
> Also: is the algorithm scaling linear when used on node1 only with 8
> cores? When it's "35.7615 " with 4 cores, what result do you get with 8
> cores on this machine.
>
> -- Reuti
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Reuti

Quoting Victor :


Thanks for the reply Reuti,

There are two machines: Node1 with 12 physical cores (dual 6 core Xeon) and


Do you have this CPU?

http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI

-- Reuti



Node2 with 4 physical cores (i5-2400).

Regarding scaling on the single 12 core node, not it is also not linear. In
fact it is downright strange. I do not remember the numbers right now but
10 jobs are faster than 11 and 12 are the fastest with peak performance of
approximately 66 Msu/s which is also far from triple the 4 core
performance. This odd non-linear behaviour also happens at the lower job
counts on that 12 core node. I understand the decrease in scaling with
increase in core count on the single node as the memory bandwidth is an
issue.

On the 4 core machine the scaling is progressive, ie. every additional job
brings an increase in performance. Single core delivers 8.1 Msu/s while 4
cores deliver 30.8 Msu/s. This is almost linear.

Since my original email I have also installed Open-MX and recompiled
OpenMPI to use it. This has resulted in approximately 10% better
performance using the existing GbE hardware.


On 29 January 2014 19:40, Reuti  wrote:


Am 29.01.2014 um 03:00 schrieb Victor:

> I am running a CFD simulation benchmark cavity3d available within
http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>
> It is a parallel-friendly Lattice Boltzmann solver library.
>
> Palabos provides benchmark results for the cavity3d on several different
platforms and variables here:
http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>
> The problem that I have is that the benchmark performance on my cluster
does not scale even close to a linear scale.
>
> My cluster configuration:
>
> Node1: Dual Xeon 5560 48 Gb RAM
> Node2: i5-2400 24 Gb RAM
>
> Gigabit ethernet connection on eth0
>
> OpenMPI 1.6.5 on Ubuntu 12.04.3
>
>
> Hostfile:
>
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
>
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
/home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>
> Problem:
>
> cavity3d 400
>
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
second
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
second
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
/home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get  47.3538 Mega site
updates per second
>
> I understand that there are latencies with GbE and that there is MPI
overhead, but this performance scaling still seems very poor. Are my
expectations of scaling naive, or is there actually something wrong and
fixable that will improve the scaling? Optimistically I would like each
node to add to the cluster performance, not slow it down.
>
> Things get even worse if I run asymmetric number of mpi jobs in each
node. For instance running -np 12 on Node1

Isn't this overloading the machine with only 8 real cores in total?


> is significantly faster than running -np 16 across Node1 and Node2, thus
adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no threads.

It depends on the algorithm how much data has to be exchanged between the
processes, and this can indeed be worse when used across a network.

Also: is the algorithm scaling linear when used on node1 only with 8
cores? When it's "35.7615 " with 4 cores, what result do you get with 8
cores on this machine.

-- Reuti
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Victor
Sorry, typo: I have dual X5660s, not X5560s.
http://ark.intel.com/products/47921/Intel-Xeon-Processor-X5660-12M-Cache-2_80-GHz-6_40-GTs-Intel-QPI?q=x5660


On 29 January 2014 21:02, Reuti  wrote:

> Quoting Victor :
>
>  Thanks for the reply Reuti,
>>
>> There are two machines: Node1 with 12 physical cores (dual 6 core Xeon)
>> and
>>
>
> Do you have this CPU?
>
> http://ark.intel.com/de/products/37109/Intel-Xeon-
> Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>
> -- Reuti
>
>
>
>  Node2 with 4 physical cores (i5-2400).
>>
>> Regarding scaling on the single 12 core node, not it is also not linear.
>> In
>> fact it is downright strange. I do not remember the numbers right now but
>> 10 jobs are faster than 11 and 12 are the fastest with peak performance of
>> approximately 66 Msu/s which is also far from triple the 4 core
>> performance. This odd non-linear behaviour also happens at the lower job
>> counts on that 12 core node. I understand the decrease in scaling with
>> increase in core count on the single node as the memory bandwidth is an
>> issue.
>>
>> On the 4 core machine the scaling is progressive, ie. every additional job
>> brings an increase in performance. Single core delivers 8.1 Msu/s while 4
>> cores deliver 30.8 Msu/s. This is almost linear.
>>
>> Since my original email I have also installed Open-MX and recompiled
>> OpenMPI to use it. This has resulted in approximately 10% better
>> performance using the existing GbE hardware.
>>
>>
>> On 29 January 2014 19:40, Reuti  wrote:
>>
>>  Am 29.01.2014 um 03:00 schrieb Victor:
>>>
>>> > I am running a CFD simulation benchmark cavity3d available within
>>> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>>> >
>>> > It is a parallel-friendly Lattice Boltzmann solver library.
>>> >
>>> > Palabos provides benchmark results for the cavity3d on several
>>> different
>>> platforms and variables here:
>>> http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>>> >
>>> > The problem that I have is that the benchmark performance on my cluster
>>> does not scale even close to a linear scale.
>>> >
>>> > My cluster configuration:
>>> >
>>> > Node1: Dual Xeon 5560 48 Gb RAM
>>> > Node2: i5-2400 24 Gb RAM
>>> >
>>> > Gigabit ethernet connection on eth0
>>> >
>>> > OpenMPI 1.6.5 on Ubuntu 12.04.3
>>> >
>>> >
>>> > Hostfile:
>>> >
>>> > Node1 -slots=4 -max-slots=4
>>> > Node2 -slots=4 -max-slots=4
>>> >
>>> > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>>> >
>>> > Problem:
>>> >
>>> > cavity3d 400
>>> >
>>> > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
>>> second
>>> > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
>>> second
>>> > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
>>> /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get  47.3538 Mega site
>>> updates per second
>>> >
>>> > I understand that there are latencies with GbE and that there is MPI
>>> overhead, but this performance scaling still seems very poor. Are my
>>> expectations of scaling naive, or is there actually something wrong and
>>> fixable that will improve the scaling? Optimistically I would like each
>>> node to add to the cluster performance, not slow it down.
>>> >
>>> > Things get even worse if I run asymmetric number of mpi jobs in each
>>> node. For instance running -np 12 on Node1
>>>
>>> Isn't this overloading the machine with only 8 real cores in total?
>>>
>>>
>>> > is significantly faster than running -np 16 across Node1 and Node2,
>>> thus
>>> adding Node2 actually slows down the performance.
>>>
>>> The i5-2400 has only 4 cores and no threads.
>>>
>>> It depends on the algorithm how much data has to be exchanged between the
>>> processes, and this can indeed be worse when used across a network.
>>>
>>> Also: is the algorithm scaling linear when used on node1 only with 8
>>> cores? When it's "35.7615 " with 4 cores, what result do you get with 8
>>> cores on this machine.
>>>
>>> -- Reuti
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Tim Prince


On 1/29/2014 8:02 AM, Reuti wrote:

Quoting Victor :


Thanks for the reply Reuti,

There are two machines: Node1 with 12 physical cores (dual 6 core 
Xeon) and


Do you have this CPU?

http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI 



-- Reuti

It's expected on the 6-core Xeon Westmere CPUs to see MPI performance 
saturate when all 4 of the internal bus paths are in use.  For this 
reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set 
so that each MPI rank has its own internal CPU bus, could outperform 
plain MPI on those CPUs.
That scheme of pairing cores on selected internal bus paths hasn't been 
repeated in later CPU generations.  Some influential customers learned to 
prefer the 4-core version of that CPU, given a reluctance to adopt hybrid 
MPI/OpenMP with affinity.
If you want to talk about "downright strange," start thinking about the 
schemes needed to optimize performance of 8 threads with 2 threads assigned to 
each internal CPU bus on that CPU model.  Or your scheme of trying to 
balance MPI performance between very different CPU models.

Tim



Node2 with 4 physical cores (i5-2400).

Regarding scaling on the single 12 core node, not it is also not 
linear. In
fact it is downright strange. I do not remember the numbers right now 
but
10 jobs are faster than 11 and 12 are the fastest with peak 
performance of

approximately 66 Msu/s which is also far from triple the 4 core
performance. This odd non-linear behaviour also happens at the lower job
counts on that 12 core node. I understand the decrease in scaling with
increase in core count on the single node as the memory bandwidth is an
issue.

On the 4 core machine the scaling is progressive, ie. every 
additional job
brings an increase in performance. Single core delivers 8.1 Msu/s 
while 4

cores deliver 30.8 Msu/s. This is almost linear.

Since my original email I have also installed Open-MX and recompiled
OpenMPI to use it. This has resulted in approximately 10% better
performance using the existing GbE hardware.


On 29 January 2014 19:40, Reuti  wrote:


Am 29.01.2014 um 03:00 schrieb Victor:

> I am running a CFD simulation benchmark cavity3d available within
http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
>
> It is a parallel-friendly Lattice Boltzmann solver library.
>
> Palabos provides benchmark results for the cavity3d on several 
different

platforms and variables here:
http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
>
> The problem that I have is that the benchmark performance on my 
cluster

does not scale even close to a linear scale.
>
> My cluster configuration:
>
> Node1: Dual Xeon 5560 48 Gb RAM
> Node2: i5-2400 24 Gb RAM
>
> Gigabit ethernet connection on eth0
>
> OpenMPI 1.6.5 on Ubuntu 12.04.3
>
>
> Hostfile:
>
> Node1 -slots=4 -max-slots=4
> Node2 -slots=4 -max-slots=4
>
> MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
/home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
>
> Problem:
>
> cavity3d 400
>
> When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
second
> When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
second
> When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
/home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega 
site

updates per second
>
> I understand that there are latencies with GbE and that there is MPI
overhead, but this performance scaling still seems very poor. Are my
expectations of scaling naive, or is there actually something wrong and
fixable that will improve the scaling? Optimistically I would like each
node to add to the cluster performance, not slow it down.
>
> Things get even worse if I run asymmetric number of mpi jobs in each
node. For instance running -np 12 on Node1

Isn't this overloading the machine with only 8 real cores in total?


> is significantly faster than running -np 16 across Node1 and 
Node2, thus

adding Node2 actually slows down the performance.

The i5-2400 has only 4 cores and no threads.

It depends on the algorithm how much data has to be exchanged 
between the

processes, and this can indeed be worse when used across a network.

Also: is the algorithm scaling linear when used on node1 only with 8
cores? When it's "35.7615 " with 4 cores, what result do you get with 8
cores on this machine.

-- Reuti
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Tim Prince



[OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Jiri Kraus
Hi,

I am trying to compile OpenMPI 1.7.3 with pgc++ (14.1) as the C++ compiler. During 
configure it fails with:

checking if C and C++ are link compatible... no

The error from config.log is:

configure:18205: checking if C and C++ are link compatible
configure:18230: pgcc -c -DNDEBUG -fast  conftest_c.c
configure:18237: $? = 0
configure:18268: pgc++ -o conftest -DNDEBUG -fast   conftest.cpp conftest_c.o  
>&5
conftest.cpp:
"conftest.cpp", line 21: error: "_GNU_SOURCE" is predefined; attempted
  redefinition ignored
  #define _GNU_SOURCE 1
  ^

"conftest.cpp", line 86: error: "_GNU_SOURCE" is predefined; attempted
  redefinition ignored
  #define _GNU_SOURCE 1
  ^

"conftest.cpp", line 167: warning: statement is unreachable
return 0;
^

2 errors detected in the compilation of "conftest.cpp".

When I use pgcpp instead of pgc++ OpenMPI configures and builds.

I am using

CXX=pgcpp|pgc++ CC=pgcc FC=pgfortran F77=pgfortran CFLAGS=-fast FCFLAGS=-fast 
FFLAGS=-fast CXXFLAGS=-fast ./configure 
--with-hwloc=/shared/apps/rhel-6.2/tools/hwloc-1.7.1 --enable-hwloc-pci 
--with-cuda --prefix=/home-2/jkraus/local/openmpi-1.7.3/pgi-14.1/cuda-5.5.22

to configure OpenMPI. Any idea what caused the errors with pgc++?

Thanks

Jiri


NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
Managing Director: Karen Theresa Burns

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Jeff Squyres (jsquyres)
That sounds about right.

What's happening is that OMPI has learned a bunch about the C compiler before 
it does this C++ link test.  In your first case (which is assumedly with gcc), 
it determines that it needs _GNU_SOURCE set -- or some other test has caused 
that to be set.  Then it uses that with pgc++ and runs into the error you show 
below.
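
You can reproduce what configure is doing by hand-running the two steps from 
your config.log; a simplified sketch (the function name below is made up, the 
real conftest is longer):

# C side, compiled the same way config.log shows
cat > conftest_c.c <<'EOF'
int c_side(int x) { return x + 1; }
EOF
pgcc -c -DNDEBUG -fast conftest_c.c

# C++ side: configure prepends everything it has learned so far, including
# "#define _GNU_SOURCE 1" -- which pgc++ already predefines, hence the
# "attempted redefinition ignored" errors you see
cat > conftest.cpp <<'EOF'
#define _GNU_SOURCE 1
extern "C" int c_side(int x);
int main() { return c_side(41) == 42 ? 0 : 1; }
EOF
pgc++ -o conftest -DNDEBUG -fast conftest.cpp conftest_c.o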

Is there a reason you want to mix gcc and pgc++?  It's usually simpler/better 
to use a single compiler suite for the whole thing.


On Jan 29, 2014, at 10:54 AM, Jiri Kraus  wrote:

> Hi,
>  
> I am trying to compile OpenMPI 1.7.3 with pgc++ (14.1) as C++ compiler. 
> During configure it fails with
>  
> checking if C and C++ are link compatible... no
>  
> The error from config.log is:
>  
> configure:18205: checking if C and C++ are link compatible
> configure:18230: pgcc -c -DNDEBUG -fast  conftest_c.c
> configure:18237: $? = 0
> configure:18268: pgc++ -o conftest -DNDEBUG -fast   conftest.cpp conftest_c.o 
>  >&5
> conftest.cpp:
> "conftest.cpp", line 21: error: "_GNU_SOURCE" is predefined; attempted
>   redefinition ignored
>   #define _GNU_SOURCE 1
>   ^
>  
> "conftest.cpp", line 86: error: "_GNU_SOURCE" is predefined; attempted
>   redefinition ignored
>   #define _GNU_SOURCE 1
>   ^
>  
> "conftest.cpp", line 167: warning: statement is unreachable
> return 0;
> ^
>  
> 2 errors detected in the compilation of "conftest.cpp".
>  
> When I use pgcpp instead of pgc++ OpenMPI configures and builds.
>  
> I am using
>  
> CXX=pgcpp|pgc++ CC=pgcc FC=pgfortran F77=pgfortran CFLAGS=-fast FCFLAGS=-fast 
> FFLAGS=-fast CXXFLAGS=-fast ./configure 
> --with-hwloc=/shared/apps/rhel-6.2/tools/hwloc-1.7.1 --enable-hwloc-pci 
> --with-cuda --prefix=/home-2/jkraus/local/openmpi-1.7.3/pgi-14.1/cuda-5.5.22
>  
> to configure OpenMPI. Any Idea what caused the errors with pgc++?
>  
> Thanks
>  
> Jiri
>  
> ---
> Nvidia GmbH
> Würselen
> Amtsgericht Aachen
> HRB 8361
> Managing Director: Karen Theresa Burns
> 
> ---
> This email message is for the sole use of the intended recipient(s) and may 
> contain
> confidential information.  Any unauthorized review, use, disclosure or 
> distribution
> is prohibited.  If you are not the intended recipient, please contact the 
> sender by
> reply email and destroy all copies of the original message.
> ---
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Jiri Kraus
Hi Jeff,

thanks for taking a look. I don't want to mix compiler toolchains. I have just 
double-checked my configure line, and I am passing

CXX=pgc++ CC=pgcc FC=pgfortran F77=pgfortran ...

so there are only PGI compilers used.

Thanks

Jiri

> Date: Wed, 29 Jan 2014 16:24:08 +
> From: "Jeff Squyres (jsquyres)" 
> To: Open MPI Users 
> Subject: Re: [OMPI users] Compiling OpenMPI with PGI pgc++
> Message-ID: 
> Content-Type: text/plain; charset="us-ascii"
> 
> That sounds about right.
> 
> What's happening is that OMPI has learned a bunch about the C compiler
> before it does this C++ link test.  In your first case (which is assumedly 
> with
> gcc), it determines that it needs _GNU_SOURCE set -- or some other test has
> caused that to be set.  Then it uses that with pgc++ and runs into the error 
> you
> show below.
> 
> Is there a reason you want to mix gcc and pgc++?  It's usually simpler/better 
> to
> use a single compiler suite for the whole thing.
> 
> 
> On Jan 29, 2014, at 10:54 AM, Jiri Kraus  wrote:
> 
> > Hi,
> >
> > I am trying to compile OpenMPI 1.7.3 with pgc++ (14.1) as C++ compiler.
> During configure it fails with
> >
> > checking if C and C++ are link compatible... no
> >
> > The error from config.log is:
> >
> > configure:18205: checking if C and C++ are link compatible
> > configure:18230: pgcc -c -DNDEBUG -fast  conftest_c.c
> > configure:18237: $? = 0
> > configure:18268: pgc++ -o conftest -DNDEBUG -fast   conftest.cpp
> conftest_c.o  >&5
> > conftest.cpp:
> > "conftest.cpp", line 21: error: "_GNU_SOURCE" is predefined; attempted
> >   redefinition ignored
> >   #define _GNU_SOURCE 1
> >   ^
> >
> > "conftest.cpp", line 86: error: "_GNU_SOURCE" is predefined; attempted
> >   redefinition ignored
> >   #define _GNU_SOURCE 1
> >   ^
> >
> > "conftest.cpp", line 167: warning: statement is unreachable
> > return 0;
> > ^
> >
> > 2 errors detected in the compilation of "conftest.cpp".
> >
> > When I use pgcpp instead of pgc++ OpenMPI configures and builds.
> >
> > I am using
> >
> > CXX=pgcpp|pgc++ CC=pgcc FC=pgfortran F77=pgfortran CFLAGS=-fast
> FCFLAGS=-fast FFLAGS=-fast CXXFLAGS=-fast ./configure --with-
> hwloc=/shared/apps/rhel-6.2/tools/hwloc-1.7.1 --enable-hwloc-pci --with-cuda -
> -prefix=/home-2/jkraus/local/openmpi-1.7.3/pgi-14.1/cuda-5.5.22
> >
> > to configure OpenMPI. Any Idea what caused the errors with pgc++?
> >
> > Thanks
> >
> > Jiri
> >
> > ---
> > Nvidia GmbH
> > Würselen
> > Amtsgericht Aachen
> > HRB 8361
> > Managing Director: Karen Theresa Burns
> >
> > ---
> > This email message is for the sole use of the intended recipient(s) and may
> contain
> > confidential information.  Any unauthorized review, use, disclosure or
> distribution
> > is prohibited.  If you are not the intended recipient, please contact the 
> > sender
> by
> > reply email and destroy all copies of the original message.
> > ---
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 
> --
> 
> Subject: Digest Footer
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> --
> 
> End of users Digest, Vol 2796, Issue 1
> **


Re: [OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Jeff Squyres (jsquyres)
Oh, I'm sorry -- I misread your initial mail (I thought that when you did use 
all the PGI compilers, it worked).

I don't know the difference between pgc++ and pgcpp, unfortunately.

Do you have the latest version of your PGI compiler suite in that series?


On Jan 29, 2014, at 12:10 PM, Jiri Kraus  wrote:

> Hi Jeff,
> 
> thanks for taking a look. I don't want to mix compiler tool chains. I have 
> just double checked my configure line and I am passing
> 
>   CXX=pgc++ CC=pgcc FC=pgfortran F77=pgfortran ...
> 
> so there are only PGI compilers used.
> 
> Thanks
> 
> Jiri
> 
>> Date: Wed, 29 Jan 2014 16:24:08 +
>> From: "Jeff Squyres (jsquyres)" 
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] Compiling OpenMPI with PGI pgc++
>> Message-ID: 
>> Content-Type: text/plain; charset="us-ascii"
>> 
>> That sounds about right.
>> 
>> What's happening is that OMPI has learned a bunch about the C compiler
>> before it does this C++ link test.  In your first case (which is assumedly 
>> with
>> gcc), it determines that it needs _GNU_SOURCE set -- or some other test has
>> caused that to be set.  Then it uses that with pgc++ and runs into the error 
>> you
>> show below.
>> 
>> Is there a reason you want to mix gcc and pgc++?  It's usually 
>> simpler/better to
>> use a single compiler suite for the whole thing.
>> 
>> 
>> On Jan 29, 2014, at 10:54 AM, Jiri Kraus  wrote:
>> 
>>> Hi,
>>> 
>>> I am trying to compile OpenMPI 1.7.3 with pgc++ (14.1) as C++ compiler.
>> During configure it fails with
>>> 
>>> checking if C and C++ are link compatible... no
>>> 
>>> The error from config.log is:
>>> 
>>> configure:18205: checking if C and C++ are link compatible
>>> configure:18230: pgcc -c -DNDEBUG -fast  conftest_c.c
>>> configure:18237: $? = 0
>>> configure:18268: pgc++ -o conftest -DNDEBUG -fast   conftest.cpp
>> conftest_c.o  >&5
>>> conftest.cpp:
>>> "conftest.cpp", line 21: error: "_GNU_SOURCE" is predefined; attempted
>>>  redefinition ignored
>>>  #define _GNU_SOURCE 1
>>>  ^
>>> 
>>> "conftest.cpp", line 86: error: "_GNU_SOURCE" is predefined; attempted
>>>  redefinition ignored
>>>  #define _GNU_SOURCE 1
>>>  ^
>>> 
>>> "conftest.cpp", line 167: warning: statement is unreachable
>>>return 0;
>>>^
>>> 
>>> 2 errors detected in the compilation of "conftest.cpp".
>>> 
>>> When I use pgcpp instead of pgc++ OpenMPI configures and builds.
>>> 
>>> I am using
>>> 
>>> CXX=pgcpp|pgc++ CC=pgcc FC=pgfortran F77=pgfortran CFLAGS=-fast
>> FCFLAGS=-fast FFLAGS=-fast CXXFLAGS=-fast ./configure --with-
>> hwloc=/shared/apps/rhel-6.2/tools/hwloc-1.7.1 --enable-hwloc-pci --with-cuda 
>> -
>> -prefix=/home-2/jkraus/local/openmpi-1.7.3/pgi-14.1/cuda-5.5.22
>>> 
>>> to configure OpenMPI. Any Idea what caused the errors with pgc++?
>>> 
>>> Thanks
>>> 
>>> Jiri
>>> 
>>> ---
>>> Nvidia GmbH
>>> Würselen
>>> Amtsgericht Aachen
>>> HRB 8361
>>> Managing Director: Karen Theresa Burns
>>> 
>>> ---
>>> This email message is for the sole use of the intended recipient(s) and may
>> contain
>>> confidential information.  Any unauthorized review, use, disclosure or
>> distribution
>>> is prohibited.  If you are not the intended recipient, please contact the 
>>> sender
>> by
>>> reply email and destroy all copies of the original message.
>>> ---
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> 
>> --
>> 
>> Subject: Digest Footer
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> --
>> 
>> End of users Digest, Vol 2796, Issue 1
>> **
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Reuti
Am 29.01.2014 um 18:24 schrieb Jeff Squyres (jsquyres):

> Oh, I'm sorry -- I mis-read your initial mail (I thought when you did use all 
> the PGI compilers, it worked).
> 
> I don't know the difference between pgc++ and pgcpp, unfortunately.

It's a matter of the ABI:

http://www.pgroup.com/lit/articles/insider/v4n1a2.htm

pgc++ uses the new ABI.
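
A quick way to see the difference is to compare the name mangling the two 
drivers emit for the same source (just an illustration; the exact legacy 
mangled name depends on the compiler release):

cat > abi_check.cpp <<'EOF'
void flag_function(int) {}
EOF
pgcpp -c abi_check.cpp -o abi_old.o
pgc++ -c abi_check.cpp -o abi_new.o
nm abi_old.o | grep flag_function    # legacy PGI C++ mangling
nm abi_new.o | grep flag_function    # GNU-compatible ABI: _Z13flag_functioni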

-- Reuti


> Do you have the latest version of your PGI compiler suite in that series?
> 
> 
> On Jan 29, 2014, at 12:10 PM, Jiri Kraus  wrote:
> 
>> Hi Jeff,
>> 
>> thanks for taking a look. I don't want to mix compiler tool chains. I have 
>> just double checked my configure line and I am passing
>> 
>>  CXX=pgc++ CC=pgcc FC=pgfortran F77=pgfortran ...
>> 
>> so there are only PGI compilers used.
>> 
>> Thanks
>> 
>> Jiri
>> 
>>> Date: Wed, 29 Jan 2014 16:24:08 +
>>> From: "Jeff Squyres (jsquyres)" 
>>> To: Open MPI Users 
>>> Subject: Re: [OMPI users] Compiling OpenMPI with PGI pgc++
>>> Message-ID: 
>>> Content-Type: text/plain; charset="us-ascii"
>>> 
>>> That sounds about right.
>>> 
>>> What's happening is that OMPI has learned a bunch about the C compiler
>>> before it does this C++ link test.  In your first case (which is assumedly 
>>> with
>>> gcc), it determines that it needs _GNU_SOURCE set -- or some other test has
>>> caused that to be set.  Then it uses that with pgc++ and runs into the 
>>> error you
>>> show below.
>>> 
>>> Is there a reason you want to mix gcc and pgc++?  It's usually 
>>> simpler/better to
>>> use a single compiler suite for the whole thing.
>>> 
>>> 
>>> On Jan 29, 2014, at 10:54 AM, Jiri Kraus  wrote:
>>> 
 Hi,
 
 I am trying to compile OpenMPI 1.7.3 with pgc++ (14.1) as C++ compiler.
>>> During configure it fails with
 
 checking if C and C++ are link compatible... no
 
 The error from config.log is:
 
 configure:18205: checking if C and C++ are link compatible
 configure:18230: pgcc -c -DNDEBUG -fast  conftest_c.c
 configure:18237: $? = 0
 configure:18268: pgc++ -o conftest -DNDEBUG -fast   conftest.cpp
>>> conftest_c.o  >&5
 conftest.cpp:
 "conftest.cpp", line 21: error: "_GNU_SOURCE" is predefined; attempted
 redefinition ignored
 #define _GNU_SOURCE 1
 ^
 
 "conftest.cpp", line 86: error: "_GNU_SOURCE" is predefined; attempted
 redefinition ignored
 #define _GNU_SOURCE 1
 ^
 
 "conftest.cpp", line 167: warning: statement is unreachable
   return 0;
   ^
 
 2 errors detected in the compilation of "conftest.cpp".
 
 When I use pgcpp instead of pgc++ OpenMPI configures and builds.
 
 I am using
 
 CXX=pgcpp|pgc++ CC=pgcc FC=pgfortran F77=pgfortran CFLAGS=-fast
>>> FCFLAGS=-fast FFLAGS=-fast CXXFLAGS=-fast ./configure --with-
>>> hwloc=/shared/apps/rhel-6.2/tools/hwloc-1.7.1 --enable-hwloc-pci 
>>> --with-cuda -
>>> -prefix=/home-2/jkraus/local/openmpi-1.7.3/pgi-14.1/cuda-5.5.22
 
 to configure OpenMPI. Any Idea what caused the errors with pgc++?
 
 Thanks
 
 Jiri
 
 ---
 Nvidia GmbH
 Würselen
 Amtsgericht Aachen
 HRB 8361
 Managing Director: Karen Theresa Burns
 
 ---
 This email message is for the sole use of the intended recipient(s) and may
>>> contain
 confidential information.  Any unauthorized review, use, disclosure or
>>> distribution
 is prohibited.  If you are not the intended recipient, please contact the 
 sender
>>> by
 reply email and destroy all copies of the original message.
 ---
 
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> 
>>> 
>>> 
>>> --
>>> 
>>> Subject: Digest Footer
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> --
>>> 
>>> End of users Digest, Vol 2796, Issue 1
>>> **
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Compiling OpenMPI with PGI pgc++

2014-01-29 Thread Jeff Squyres (jsquyres)
On Jan 29, 2014, at 12:35 PM, Reuti  wrote:

>> I don't know the difference between pgc++ and pgcpp, unfortunately.
> 
> It's a matter of the ABI:
> 
> http://www.pgroup.com/lit/articles/insider/v4n1a2.htm
> 
> pgc++ uses the new ABI.


Must be more than that -- this is a compile issue, not a link issue.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Victor
Thanks for the insights, Tim. I was aware that the CPUs will choke beyond a
certain point. From memory, on my machine this happens with 5 concurrent MPI
jobs with the benchmark that I am using.

My primary question was about scaling between the nodes. I was not getting
close to double the performance when running MPI jobs across two 4-core
nodes. It may be better now that I have Open-MX in place, but I have not
repeated the benchmarks yet since I need to get one simulation job done
ASAP.

Regarding your mention of setting affinities and MPI ranks, do you have
specific (as in syntactically specific, since I am a novice and easily
confused...) examples of how I might set affinities to get the Westmere
node performing better?

ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component
v1.6.5)
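
I assume the layout that hwloc sees can be checked with its lstopo tool, if the 
hwloc utilities are installed alongside:

lstopo --no-io    # prints the socket/core/cache hierarchy of the node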

And finally, on to hybridisation... in a week or so I will get 4 AMD A10-6800
machines with 8 GB each on loan and will attempt to make them work alongside the
existing Intel nodes.

Victor


On 29 January 2014 22:03, Tim Prince  wrote:

>
> On 1/29/2014 8:02 AM, Reuti wrote:
>
>> Quoting Victor :
>>
>>  Thanks for the reply Reuti,
>>>
>>> There are two machines: Node1 with 12 physical cores (dual 6 core Xeon)
>>> and
>>>
>>
>> Do you have this CPU?
>>
>> http://ark.intel.com/de/products/37109/Intel-Xeon-
>> Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>>
>> -- Reuti
>>
>>  It's expected on the Xeon Westmere 6-core CPUs to see MPI performance
> saturating when all 4 of the internal buss paths are in use.  For this
> reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so
> that each MPI rank has its own internal CPU buss, could out-perform plain
> MPI on those CPUs.
> That scheme of pairing cores on selected internal buss paths hasn't been
> repeated.  Some influential customers learned to prefer the 4-core version
> of that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
> If you want to talk about "downright strange," start thinking about the
> schemes to optimize performance of 8 threads with 2 threads assigned to
> each internal CPU buss on that CPU model.  Or your scheme of trying to
> balance MPI performance between very different CPU models.
> Tim
>
>
>>  Node2 with 4 physical cores (i5-2400).
>>>
>>> Regarding scaling on the single 12 core node, not it is also not linear.
>>> In
>>> fact it is downright strange. I do not remember the numbers right now but
>>> 10 jobs are faster than 11 and 12 are the fastest with peak performance
>>> of
>>> approximately 66 Msu/s which is also far from triple the 4 core
>>> performance. This odd non-linear behaviour also happens at the lower job
>>> counts on that 12 core node. I understand the decrease in scaling with
>>> increase in core count on the single node as the memory bandwidth is an
>>> issue.
>>>
>>> On the 4 core machine the scaling is progressive, ie. every additional
>>> job
>>> brings an increase in performance. Single core delivers 8.1 Msu/s while 4
>>> cores deliver 30.8 Msu/s. This is almost linear.
>>>
>>> Since my original email I have also installed Open-MX and recompiled
>>> OpenMPI to use it. This has resulted in approximately 10% better
>>> performance using the existing GbE hardware.
>>>
>>>
>>> On 29 January 2014 19:40, Reuti  wrote:
>>>
>>>  Am 29.01.2014 um 03:00 schrieb Victor:

 > I am running a CFD simulation benchmark cavity3d available within
 http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
 >
 > It is a parallel-friendly Lattice Boltzmann solver library.
 >
 > Palabos provides benchmark results for the cavity3d on several
 different
 platforms and variables here:
 http://wiki.palabos.org/plb_wiki:benchmark:cavity_n400
 >
 > The problem that I have is that the benchmark performance on my
 cluster
 does not scale even close to a linear scale.
 >
 > My cluster configuration:
 >
 > Node1: Dual Xeon 5560 48 Gb RAM
 > Node2: i5-2400 24 Gb RAM
 >
 > Gigabit ethernet connection on eth0
 >
 > OpenMPI 1.6.5 on Ubuntu 12.04.3
 >
 >
 > Hostfile:
 >
 > Node1 -slots=4 -max-slots=4
 > Node2 -slots=4 -max-slots=4
 >
 > MPI command: mpirun --mca btl_tcp_if_include eth0 --hostfile
 /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400
 >
 > Problem:
 >
 > cavity3d 400
 >
 > When I run mpirun -np 4 on Node1 I get 35.7615 Mega site updates per
 second
 > When I run mpirun -np 4 on Node2 I get 30.7972 Mega site updates per
 second
 > When I run mpirun --mca btl_tcp_if_include eth0 --hostfile
 /home/mpiuser/.mpi_hostfile -np 8 ./cavity3d 400 I get 47.3538 Mega site
 updates per second
 >
 > I understand that there are latencies with GbE and that there is MPI
 overhead, but this performance scaling still seems very poor. Are my
 expectations of scaling naive, or is there actually something wrong and
 fixable t

Re: [OMPI users] Running on two nodes slower than running on one node

2014-01-29 Thread Ralph Castain

On Jan 29, 2014, at 7:56 PM, Victor  wrote:

> Thanks for the insights Tim. I was aware that the CPUs will choke beyond a 
> certain point. From memory on my machine this happens with 5 concurrent MPI 
> jobs with that benchmark that I am using.
> 
> My primary question was about scaling between the nodes. I was not getting 
> close to double the performance when running MPI jobs acros two 4 core nodes. 
> It may be better now since I have Open-MX in place, but I have not repeated 
> the benchmarks yet since I need to get one simulation job done asap.

Some of that may be due to the expected loss of performance when you switch from 
shared memory to an inter-node transport. While the point about saturation of 
the memory path is true, what you reported could be more consistent with that 
transition - i.e., it isn't unusual to see applications perform better when run 
on a single node, depending upon how they are written, up to a certain problem 
size (which your code may not be hitting).

> 
> Regarding your mention of setting affinities and MPI ranks do you have a 
> specific (as in syntactically specific since I am a novice and easily 
> confused...) examples how I may want to set affinities to get the Westmere 
> node performing better?

mpirun --bind-to-core -cpus-per-rank 2 ...

will bind each MPI rank to 2 cores. Note that this will definitely *not* be a 
good idea if you are running more than two threads per process - if you 
are, then set --cpus-per-rank to the number of threads, keeping in mind that 
you want things to break evenly across the sockets. In other words, if you have 
two 6-core Westmere sockets in the node, then you either want to run 6 processes 
at cpus-per-rank=2 if each process runs 2 threads, or 4 processes with 
cpus-per-rank=3 if each process runs 3 threads, or 2 processes with no 
cpus-per-rank but --bind-to-socket instead of --bind-to-core for any 
thread count greater than 3.

You would not want to run any other number of processes on the node or else the 
binding pattern will cause a single process to split its threads across the 
sockets - which will definitely hurt performance.
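
Spelled out for one node with two 6-core Westmere sockets, those combinations 
look roughly like this (./hybrid_app is a placeholder for a hybrid MPI/OpenMP 
binary, and the OMP_NUM_THREADS values assume OpenMP threading - adjust them to 
whatever threading your code actually uses):

# 6 ranks x 2 threads each, every rank bound to its own pair of cores
mpirun -np 6 --bind-to-core -cpus-per-rank 2 -x OMP_NUM_THREADS=2 ./hybrid_app
# 4 ranks x 3 threads each
mpirun -np 4 --bind-to-core -cpus-per-rank 3 -x OMP_NUM_THREADS=3 ./hybrid_app
# 2 ranks with more than 3 threads each: bind to the socket instead
mpirun -np 2 --bind-to-socket -x OMP_NUM_THREADS=6 ./hybrid_app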


> 
> ompi_info returns this: MCA paffinity: hwloc (MCA v2.0, API v2.0, Component 
> v1.6.5)
> 
> And finally to hybridisation... in a week or so I will get 4 AMD A10-6800 
> machines with 8Gb each on loan and will attempt to make them work along the 
> existing Intel nodes. 
> 
> Victor
> 
> 
> On 29 January 2014 22:03, Tim Prince  wrote:
> 
> On 1/29/2014 8:02 AM, Reuti wrote:
> Quoting Victor :
> 
> Thanks for the reply Reuti,
> 
> There are two machines: Node1 with 12 physical cores (dual 6 core Xeon) and
> 
> Do you have this CPU?
> 
> http://ark.intel.com/de/products/37109/Intel-Xeon-Processor-X5560-8M-Cache-2_80-GHz-6_40-GTs-Intel-QPI
>  
> 
> -- Reuti
> 
> It's expected on the Xeon Westmere 6-core CPUs to see MPI performance 
> saturating when all 4 of the internal buss paths are in use.  For this 
> reason, hybrid MPI/OpenMP with 2 cores per MPI rank, with affinity set so 
> that each MPI rank has its own internal CPU buss, could out-perform plain MPI 
> on those CPUs.
> That scheme of pairing cores on selected internal buss paths hasn't been 
> repeated.  Some influential customers learned to prefer the 4-core version of 
> that CPU, given a reluctance to adopt MPI/OpenMP hybrid with affinity.
> If you want to talk about "downright strange," start thinking about the 
> schemes to optimize performance of 8 threads with 2 threads assigned to each 
> internal CPU buss on that CPU model.  Or your scheme of trying to balance MPI 
> performance between very different CPU models.
> Tim
> 
> 
> Node2 with 4 physical cores (i5-2400).
> 
> Regarding scaling on the single 12 core node, not it is also not linear. In
> fact it is downright strange. I do not remember the numbers right now but
> 10 jobs are faster than 11 and 12 are the fastest with peak performance of
> approximately 66 Msu/s which is also far from triple the 4 core
> performance. This odd non-linear behaviour also happens at the lower job
> counts on that 12 core node. I understand the decrease in scaling with
> increase in core count on the single node as the memory bandwidth is an
> issue.
> 
> On the 4 core machine the scaling is progressive, ie. every additional job
> brings an increase in performance. Single core delivers 8.1 Msu/s while 4
> cores deliver 30.8 Msu/s. This is almost linear.
> 
> Since my original email I have also installed Open-MX and recompiled
> OpenMPI to use it. This has resulted in approximately 10% better
> performance using the existing GbE hardware.
> 
> 
> On 29 January 2014 19:40, Reuti  wrote:
> 
> Am 29.01.2014 um 03:00 schrieb Victor:
> 
> > I am running a CFD simulation benchmark cavity3d available within
> http://www.palabos.org/images/palabos_releases/palabos-v1.4r1.tgz
> >
> > It is a parallel-friendly Lattice Boltzmann solver library.
> >
> > Palabos provides benchmark results for the cavity3d on several