Hi,

Here's a bit more explanation, hopefully a bit more practical, to give you and others a better view of what's going on under mdrun's hood.
thread-MPI, in other contexts also referred to as "thread_mpi" or abbreviated as "tMPI", is functionally equivalent to the standard MPI you'd use on a cluster. The difference is that tMPI implements MPI "ranks" using platform-native threads (i.e. pthreads on Linux, Mac OS, etc., and Windows threads on Windows), while most standard MPI implementations use processes for ranks. Note that threads always belong to a single process. From a practical point of view, the implication is that you need neither an external MPI library (thread-MPI comes with the GROMACS source) nor the MPI launcher (mpirun) if you want to run on a *single machine* - a single mdrun process's MPI threads can only span a single machine's cores, not multiple machines. In other words, running the following two is equivalent:

mpirun -np 4 mdrun_mpi
mdrun -ntmpi 4

except that the former uses four MPI processes while the latter uses the same number of MPI threads. In both cases you'll have four domains, typically with a 4x1x1 domain decomposition, each domain assigned to an MPI process/thread.

While thread-MPI does essentially provide multi-threading capability to mdrun, with efficient multi-core optimized communication and synchronization, it does not allow exploiting the efficient core-to-core data sharing (through the cache) of multi-core CPUs, as it relies on domain decomposition, which localizes data to threads (each running on a single core). To achieve data sharing we use OpenMP multi-threading, where the way the data is decomposed is practically equivalent to particle decomposition. To illustrate this, let's take as an example the update (integration) of N particles running on four cores: the first core will compute the new coordinates for particles [0..N/4), the second for [N/4..N/2), etc. (there is a small code sketch below that makes this concrete). In contrast, with domain decomposition four *spatial* domains are assigned to the four cores. The two parallelization schemes can, of course, be combined, and one can use domain decomposition together with this particle-decomposition-like parallelization. In terms of implementation this means (t)MPI + OpenMP parallelization (= hybrid or multi-level parallelization).

Now, if you have a machine with two sockets, i.e. two CPUs, these communicate through a fast link (QPI on Intel, HyperTransport on AMD) which allows inter-core communication across sockets and can therefore be used without explicit (MPI) communication. Hence, you can run OpenMP-only parallelization, that is 2x4x2=16 threads, on your machine* by:

mdrun -ntmpi 1 -ntomp 16

as well as tMPI-only:

mdrun -ntmpi 16 -ntomp 1

or mixed:

mdrun -ntmpi 2 -ntomp 8

Mixing MPI+OpenMP has a non-negligible overhead, and with our current algorithms/implementation it only becomes considerably useful with GPUs, in highly parallel runs where communication (mostly PME) becomes a bottleneck, or when the domain decomposition limits the parallelization (e.g. by halving the number of domains needed with -ntomp 2). Note that Intel machines, especially Sandy Bridge, have much more efficient inter-core/socket communication than AMD. On a 2x8-core Sandy Bridge machine one can typically run 16 OpenMP threads with much better performance than 16 thread-MPI ranks, but on AMD more than 4-8 OpenMP threads becomes inefficient and running OpenMP across sockets is out of the question. As noted before, except with GPUs and/or at high parallelization, mixing MPI and OpenMP is rarely advantageous.

We tried to come up with a decent level of automation for the mdrun launch configuration (i.e. the number of threads).
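To make the OpenMP data decomposition mentioned above a bit more concrete, here is a minimal stand-alone sketch in plain C with OpenMP - not GROMACS source; the array names x and v, the particle count N and the time step dt are purely illustrative. It shows how a parallel loop over particles gives each thread a contiguous chunk of the particle range, exactly the [0..N/4), [N/4..N/2), ... splitting described above:

#include <stdio.h>
#include <omp.h>

#define N 1000000            /* number of particles (toy value) */

int main(void)
{
    static float x[N], v[N]; /* toy 1-D coordinates and velocities */
    const float dt = 0.002f; /* illustrative time step */
    int i;

    /* With schedule(static) OpenMP hands each thread one contiguous
     * chunk of the iteration range: with 4 threads, thread 0 updates
     * roughly [0..N/4), thread 1 [N/4..N/2), and so on. Each thread
     * only touches "its" particles, all within the same process. */
#pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
    {
        x[i] += v[i] * dt;   /* trivial position update */
    }

    printf("updated %d particles using up to %d OpenMP threads\n",
           N, omp_get_max_threads());
    return 0;
}

Compile with something like "gcc -fopenmp update_sketch.c" and all threads work on the same arrays in shared memory; with (t)MPI/domain decomposition, in contrast, each rank holds only its own spatial domain's data and boundary information has to be communicated explicitly.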
So the automated launch configuration should, in most cases, give the best performance, and mdrun will always try to use all cores by "filling up" the machine with OpenMP threads: e.g. mdrun -ntmpi 2 on a four-core machine will result in 2 tMPI ranks x 2 OpenMP threads and is equivalent to mdrun -ntmpi 2 -ntomp 2, but (for the aforementioned reasons) mdrun without arguments will be equivalent to mdrun -ntomp 4. Note that with MPI you need to specify a fixed number of processes, so if you request 4x8-core nodes but only 8 processes (mpirun -np 8), you'll get 4 OpenMP threads started per process (which gives 8x4=32 threads).

NOTE: Full OpenMP parallelization is available only with the Verlet scheme; the group scheme supports OpenMP only for PME, which can be used to improve scaling at high parallelization on clusters.

Coming back to your machine: depending on the Intel CPU generation, OpenMP can be the default. On "Nehalem" (e.g. Westmere Xeon) CPUs we use OpenMP-only by default up to a maximum of 12 threads; above that, tMPI is used. You can certainly try -ntomp 16; for you it *could* be faster than -ntmpi 16.

I hope this helps. I'll try to add some more clarification to the acceleration/parallelization page later.

Cheers,
--
Szilárd

On Mon, Jan 21, 2013 at 11:50 PM, Brad Van Oosten <bv0...@brocku.ca> wrote:
> I have been lost in the sea of terminology for installing gromacs with
> multi-processors. The plan is to upgrade from 4.5.5 to the 4.6 and I want
> the optimal install for my system. There is a nice explanation at
> http://www.gromacs.org/Documentation/Acceleration_and_parallelization but
> the number of different options and terminology has confused me.
>
> I currently have one computer with 2 processor sockets, each with 4 cores,
> each with 2 threads. A mouthful which in the end allows for 16 processes
> at once (2*4*2).
>
> The way I read the documentation is that MPI is needed for the talk
> between the 2 physical processors, OpenMP does the talk between the 4 cores
> in each processor, and thread-MPI does the threading? Or does thread-MPI do
> everything?
>
> What parallelization scheme is required?