amjad ali wrote:
Hi,
Suppose we run a parallel MPI code with 64 processes on a cluster of, say,
16 nodes. The cluster nodes have multicore CPUs, say 4 cores per node, so
all 64 cores on the cluster are each running one process. The program is
SPMD, meaning all processes have the same workload.
Now, if we enable auto-vectorization while compiling the code (for
example with the Intel compilers), will there be any benefit
(efficiency/scalability improvement) from the auto-vectorized code? Or
will we get the same performance as without auto-vectorization in this
example case?
That is, if we do not have free CPU cores in a PC or cluster (all
cores are running MPI processes), is auto-vectorization still
beneficial? Or is it beneficial only if we have some free CPU cores
locally?
How can we really get a performance benefit from auto-vectorization?
Auto-vectorization should give a similar performance benefit under MPI
to what it gives in a single process. That's about all that can be said when you
say nothing about the nature of your application. This assumes that
your MPI domain decomposition, which may not be highly vectorizable,
doesn't take up too large a fraction of elapsed time. By the same
token, auto-vectorization techniques aren't specific to MPI
applications, so an in-depth treatment isn't topical here.
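As a rough illustration with made-up numbers: if a fraction f of each
process's run time is spent in vectorizable loops and vectorization speeds
those loops up by a factor s, Amdahl's law bounds the per-process speedup at

    1 / ((1 - f) + f / s)

so with f = 0.8 and s = 3 you get about 2.1x per process, regardless of how
many MPI ranks are running; if f is small because decomposition and
communication dominate, the gain shrinks accordingly.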
I'll just mention that we are well into the era of three levels of
parallel programming: vectorization, threaded parallelism (e.g.
OpenMP), and process parallelism (e.g. MPI).
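As a minimal sketch of how the three levels coexist (this is not your
application; the array names, sizes, and compile command are made-up
illustrations, and a pure-MPI run on all cores would simply use one rank per
core instead of OpenMP threads):

    /* Level 3: MPI splits the data across ranks; level 2: OpenMP splits a
     * rank's share across cores; level 1: the compiler auto-vectorizes the
     * unit-stride loop body.  Typical build: mpicc -std=c99 -O3 -fopenmp levels.c
     * (exact flags vary by compiler). */
    #include <mpi.h>
    #include <stdlib.h>

    #define N_LOCAL 1000000   /* elements owned by each rank (illustrative) */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *a = malloc(N_LOCAL * sizeof *a);
        double *b = malloc(N_LOCAL * sizeof *b);
        for (int i = 0; i < N_LOCAL; i++)
            b[i] = i + rank;

        double local_sum = 0.0, global_sum = 0.0;

        /* OpenMP threads divide this rank's portion across cores; the body is
         * a unit-stride, independent update that auto-vectorizers handle well. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < N_LOCAL; i++) {
            a[i] = 2.0 * b[i] + 1.0;
            local_sum += a[i];
        }

        /* MPI combines the per-rank partial results. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        free(a); free(b);
        MPI_Finalize();
        return 0;
    }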
For an application I work on, 8 nodes with auto-vectorization give
about the performance of 12 nodes without it, so compilers lacking
auto-vectorization capability fell by the wayside for such applications
a decade ago. This application gains a significant benefit from cache
blocking, so vectorization has more opportunity to pay off than in
applications with less memory locality.
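To show what I mean by cache blocking, here is a generic blocked matrix
multiply (not the application above; N and the block size are made-up
values):

    #define N  1024      /* matrix dimension (illustrative) */
    #define BS 64        /* block size chosen so the tiles stay in cache */

    /* C += A * B, processed one BS x BS block at a time so the tiles of
     * A, B and C are reused while still resident in cache. */
    void blocked_matmul(const double A[N][N], const double B[N][N],
                        double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double aik = A[i][k];
                            /* Unit-stride innermost loop: this is what the
                             * auto-vectorizer turns into SIMD instructions. */
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += aik * B[k][j];
                        }
    }

The blocking keeps the data the vector units consume in cache, which is why
vectorization pays off more here than in codes with poor memory locality.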
I have not seen an application that was effectively vectorized and
also gained from HyperThreading, but the gain from vectorization should
be significantly greater than anything HyperThreading could contribute. It's
also common for vectorization to gain more on lower-clock-speed/cheaper
CPU models (of the same architecture), enabling lower purchase cost
or power consumption, but that's true of all forms of parallelization.
Some applications can be vectorized effectively by any of the popular
auto-vectorizing compilers, including recent GNU compilers, while others
show much more gain with certain compilers, such as Intel, PGI, or Open64.
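Whichever compiler you use, check its vectorization report rather than
assuming a loop was vectorized; the flag names differ between compilers and
versions (for example GCC has had -ftree-vectorizer-verbose and later
-fopt-info-vec, and the Intel compiler has had -vec-report), but the report
will distinguish cases like these (function and array names are just
illustrative):

    /* Independent, unit-stride iterations: most auto-vectorizers handle
     * this, provided the compiler can see (e.g. via restrict) that a and b
     * don't alias. */
    void saxpy_like(double *restrict a, const double *restrict b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i] + 1.0;
    }

    /* Loop-carried dependence: a[i] needs the just-computed a[i-1], so this
     * is typically reported as not vectorizable. */
    void prefix_sum(double *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] += a[i-1];
    }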