amjad ali wrote:
Hi,
Suppose we run a parallel MPI code with 64 processes on a cluster of, say, 16 nodes. Each cluster node has a multicore CPU, say 4 cores per node.

Now all 64 cores on the cluster are each running a process. The program is SPMD, meaning all processes have the same workload.

Now, if we enable auto-vectorization while compiling the code (for example with the Intel compilers), will there be any benefit (efficiency/scalability improvement) from having the code auto-vectorized? Or will we get the same performance as without auto-vectorization in this case? In other words, if we do not have any free CPU cores in a PC or cluster (all cores are already running MPI processes), is auto-vectorization still beneficial? Or is it beneficial only if we have some free CPU cores locally?


How can we really get a performance improvement from auto-vectorization?

Auto-vectorization should give a similar performance benefit under MPI as it does in a single process. That's about all that can be said when you say nothing about the nature of your application. This assumes that your MPI domain decomposition, which may not be highly vectorizable, doesn't take up too large a fraction of elapsed time. By the same token, auto-vectorization techniques aren't specific to MPI applications, so an in-depth treatment isn't topical here.

I'll just mention that we are well into the era of 3 levels of programming parallelization: vectorization, threaded parallel (e.g. OpenMP), and process parallel (e.g. MPI). For an application which I work on, 8 nodes with auto-vectorization give about the performance of 12 nodes without, so compilers without auto-vectorization capability for such applications fell by the wayside a decade ago. This application gains significant benefit from cache blocking, so vectorization has more opportunity to gain than for applications which have less memory locality.

I have not seen an application which was effectively vectorized which also gained from HyperThreading, but the gain from vectorization should be significantly greater than what could be gained from HyperThreading. It's also common that vectorization gains more on lower clock speed/cheaper CPU models (of the same architecture), enabling lower cost of purchase or power consumption, but that's true of all forms of parallelization.

Some applications can be vectorized effectively by any of the popular auto-vectorizing compilers, including recent gnu compilers, while others show much more gain with certain compilers, such as Intel, PGI, or Open64.
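To make the point concrete, here is a minimal sketch (my own hypothetical example, not from the original post) of the situation the question describes: every MPI rank runs the same simple stride-1 kernel on its own data, so the compiler can vectorize the inner loop on each core independently of how many other cores are busy with other ranks. No free cores are needed; the SIMD units sit inside each core. The build commands are only typical examples and the exact flags vary by compiler version:

  mpicc  -O3 -fopt-info-vec vec_demo.c -o vec_demo                          (gcc wrapper)
  mpiicc -O3 -qopt-report=2 -qopt-report-phase=vec vec_demo.c -o vec_demo   (Intel wrapper)

/* vec_demo.c: each rank runs the same vectorizable kernel on its own data. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { x[i] = (double)i; y[i] = 2.0 * i; }

    /* Stride-1 loop with no data dependences: a prime candidate for
     * auto-vectorization; it runs on whichever core hosts this rank. */
    double a = 1.5;
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    /* Reduce a checksum so the compiler cannot discard the work. */
    double local = 0.0, global = 0.0;
    for (int i = 0; i < N; i++)
        local += y[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("checksum = %f (from %d ranks)\n", global, size);

    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}

The vectorization report requested by the flags above tells you which loops the compiler actually vectorized; whether that translates into wall-clock gains then depends on memory locality, as noted above for cache-blocked applications.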
