Hi,

Here's a bit more explanation, hopefully a bit more practical, to give you and others a better view of what's going on under mdrun's hood.
thread-MPI, in other contexts also referred to as "thread_mpi" or abbreviated as "tMPI", is functionally equivalent to the standard MPI you'd use on a cluster. The difference is that tMPI implements MPI "ranks" using platform-native threads (i.e. pthreads on Linux, Mac OS, etc., and Windows threads on Windows), while most standard MPI implementations use processes for ranks. Note that threads always belong to a single process. From a practical point of view, the implication is that you need neither an external MPI library (thread-MPI comes with the GROMACS source) nor the MPI launcher (mpirun) if you want to run on a *single machine* - a single mdrun process's MPI threads can only span a single machine's cores, not multiple machines. In other words, running the following two is equivalent:

mpirun -np 4 mdrun_mpi
mdrun -ntmpi 4

except that the former uses four MPI processes while the latter uses the same number of MPI threads. In both cases you'll have four domains, typically with a 4x1x1 domain decomposition, each domain assigned to an MPI process/thread.

While thread-MPI does essentially provide multi-threading capability to mdrun, with efficient multi-core optimized communication and synchronization, it does not allow exploiting the efficient core-to-core data sharing (through the cache) of multi-core CPUs, as it relies on domain decomposition, which localizes data to threads (each running on a single core). To achieve data sharing we use OpenMP multi-threading, where the way the data is decomposed is practically equivalent to particle decomposition. To illustrate this, let's take as an example the update (integration) of N particles running on four cores: the first core will compute the new coordinates for particles [0..N/4), the second for [N/4..N/2), etc. (there is a small code sketch below that makes this concrete). In contrast, with domain decomposition four *spatial* domains are assigned to the four cores. The two parallelization schemes can, of course, be combined, and one can use domain decomposition together with this particle-decomposition-like parallelization. In terms of implementation this means (t)MPI + OpenMP parallelization (= hybrid or multi-level parallelization).

Now, if you have a machine with two sockets, i.e. two CPUs, these communicate through a fast link (QPI on Intel, HyperTransport on AMD) which allows inter-core communication across sockets and can therefore be used without explicit (MPI) communication. Hence, you can run OpenMP-only parallelization, that is 2x4x2=16 threads, on your machine* by:

mdrun -ntmpi 1 -ntomp 16

as well as tMPI-only:

mdrun -ntmpi 16 -ntomp 1

or mixed:

mdrun -ntmpi 2 -ntomp 8

Mixing MPI+OpenMP has a non-negligible overhead, and with our current algorithms/implementation it only becomes considerably useful with GPUs, in highly parallel runs where communication (mostly PME) becomes a bottleneck, or when the domain decomposition limits the parallelization (e.g. by halving the number of domains needed with -ntomp 2). Note that Intel machines, especially Sandy Bridge, have much more efficient inter-core/socket communication than AMD. On a 2x8-core Sandy Bridge machine one can typically run 16 OpenMP threads with much better performance than 16 thread-MPI ranks, but on AMD more than 4-8 OpenMP threads becomes inefficient and running OpenMP across sockets is out of the question. As noted before, except with GPUs and/or at high parallelization, mixing MPI and OpenMP is rarely advantageous.

We tried to come up with a decent level of automation for the mdrun launch configuration (i.e. the number of threads).
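To make the OpenMP data decomposition mentioned above a bit more concrete, here is a minimal stand-alone sketch in plain C with OpenMP - not GROMACS source; the array names x and v, the particle count N and the time step dt are purely illustrative. It shows how a parallel loop over particles gives each thread a contiguous chunk of the particle range, exactly the [0..N/4), [N/4..N/2), ... splitting described above:

#include <stdio.h>
#include <omp.h>

#define N 1000000            /* number of particles (toy value) */

int main(void)
{
    static float x[N], v[N]; /* toy 1-D coordinates and velocities */
    const float dt = 0.002f; /* illustrative time step */
    int i;

    /* With schedule(static) OpenMP hands each thread one contiguous
     * chunk of the iteration range: with 4 threads, thread 0 updates
     * roughly [0..N/4), thread 1 [N/4..N/2), and so on. Each thread
     * only touches "its" particles, all within the same process. */
#pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++)
    {
        x[i] += v[i] * dt;   /* trivial position update */
    }

    printf("updated %d particles using up to %d OpenMP threads\n",
           N, omp_get_max_threads());
    return 0;
}

Compile with something like "gcc -fopenmp update_sketch.c" and all threads work on the same arrays in shared memory; with (t)MPI/domain decomposition, in contrast, each rank holds only its own spatial domain's data and boundary information has to be communicated explicitly.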
So the automated launch configuration should, in most cases, give the best performance, and mdrun will always try to use all cores by "filling up" the machine with OpenMP threads: e.g. mdrun -ntmpi 2 on a four-core machine will result in 2 tMPI ranks x 2 OpenMP threads and is equivalent to mdrun -ntmpi 2 -ntomp 2, but (for the aforementioned reasons) mdrun without arguments will be equivalent to mdrun -ntomp 4. Note that with MPI you need to specify a fixed number of processes, so if you request 4x8-core nodes but only 8 processes (mpirun -np 8), you'll get 4 OpenMP threads started per process (which gives 8x4=32 threads).

NOTE: Full OpenMP parallelization is available only with the Verlet scheme; the group scheme supports OpenMP only for PME, which can be used to improve scaling at high parallelization on clusters.

Coming back to your machine: depending on the Intel CPU generation, OpenMP can be the default. On "Nehalem" (e.g. Westmere Xeon) CPUs we use OpenMP-only by default up to a maximum of 12 threads; above that, tMPI is used. You can certainly try -ntomp 16; for you it *could* be faster than -ntmpi 16.

I hope this helps. I'll try to add some more clarification to the acceleration/parallelization page later.

Cheers,
--
Szilárd

On Mon, Jan 21, 2013 at 11:50 PM, Brad Van Oosten <bv0...@brocku.ca> wrote:
> I have been lost in the sea of terminology for installing gromacs with
> multi-processors. The plan is to upgrade from 4.5.5 to the 4.6 and I want
> the optimal install for my system. There is a nice explanation at
> http://www.gromacs.org/Documentation/Acceleration_and_parallelization but
> the number of different options and terminology has confused me.
>
> I currently have one computer with 2 processor sockets, each with 4 cores,
> each with 2 threads. A mouthful which in the end allows for 16 processes
> at once (2*4*2).
>
> The way I read the documentation is that MPI is needed for the talk
> between the 2 physical processors, OpenMP does the talk between the 4 cores
> in each processor, and thread-MPI does the threading? Or does thread-MPI do
> everything?
>
> What parallelization scheme is required?