On Tue, Jun 4, 2013 at 5:48 PM, Mark Abraham <mark.j.abra...@gmail.com> wrote:
> On Tue, Jun 4, 2013 at 4:50 PM, Jianguo Li <ljg...@yahoo.com.sg> wrote:
>
>> Thank you, Mark and Xavier.
>>
>> The thing is that the cluster manager set the minimum number of cores for
>> each job on BlueGene/Q to 128, so I cannot use 64 cores. But judging by the
>> performance, 512 cores on BlueGene are roughly equivalent to 64 cores on
>> another cluster. Since there are 16 cores in each compute card, the total
>> number of cores I used on BlueGene/Q is num_cards times 16. So in my test I
>> actually ran simulations using different numbers of cards, from 8 to 256.
>>
>> The following is the script I submitted to BlueGene using 128 compute cards:
>>
>> #!/bin/sh
>> #SBATCH --nodes=128
>> # Use 128 compute cards (1 compute card = 16 cores, 128x16 = 2048 cores)
>> #SBATCH --job-name="128x16x2"
>> # Job name
>> #SBATCH --output="first-job-sample"
>> # Output file
>> #SBATCH --partition="training"
>>
>> srun --ntasks-per-node=32 --overcommit \
>>     /scratch/home/biilijg/package/gromacs-461/bin/mdrun -s box_md1.tpr \
>>     -c box_md1.gro -x box_md1.xtc -g md1.log >& job_md1
>>
>> Since BlueGene/Q accepts up to 4 tasks per core, I used 32 MPI tasks for
>> each card (2 tasks per core). I tried --ntasks-per-node=64, but the
>> simulations get much slower. Is there an optimal number for
>> --ntasks-per-node?
>
> The threads-per-core thing will surely be useless for GROMACS. Even our
> unoptimized kernels will saturate the available flops. There is simply
> nothing to overlap, so you net lose from the extra overhead. You should aim
> for 16 threads per node, one for each A2 core. Each of those 16 need not be
> an MPI process, however.
>
> There's some general background info here:
> http://www.gromacs.org/Documentation/Acceleration_and_parallelization.
> Relevant to BG/Q is that you will be using real MPI and should use OpenMP
> and the Verlet kernels (see
> http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Multi-level_parallelization.3a_MPI.2fthread-MPI_.2b_OpenMP).
> Finding the right balance of OpenMP threads per MPI process is hardware-
> and problem-dependent, so you will need to experiment there.

Thought I'd clarify further. A BG/Q node has 16 A2 cores, and some mix of MPI
processes and OpenMP threads across those will be right for GROMACS. Each core
can run up to four "hardware threads." The processor in each core can issue
only two instructions per cycle, one flop and one non-flop, and only to two
different hardware threads. So there is a theoretical speedup from using more
than one hardware thread, since you get to take advantage of more
instruction-issue opportunities. But doing so with more MPI processes incurs
other overhead (e.g. from PME global communication, as well as pure-MPI
overhead). Even if you map the extra hardware threads to OpenMP threads, you
will only get some fraction of that speedup, depending on the available
registers and the bandwidth from cache (and you still pay some extra overhead
for the OpenMP). How big these effects are depends on whether you are running
PME, and on which kernels you are actually executing. So it might be worth
investigating 2 hardware threads per core using OpenMP, but don't expect to
want to write home about the results! :-)

Cheers,
Mark
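For concreteness, here is a sketch of what an MPI + OpenMP submission along the
lines Mark describes could look like, based on the quoted script above. The
4-ranks-by-4-threads split per node is only an illustrative assumption (any
split that gives 16 threads per node is a reasonable starting point); the file
names and mdrun path are taken from the quoted script, and mdrun's -ntomp
option and the OMP_NUM_THREADS variable are the standard ways to set the
OpenMP thread count in GROMACS 4.6. Using the Verlet kernels additionally
requires cutoff-scheme = Verlet in the .mdp used to build the .tpr.

    #!/bin/sh
    #SBATCH --nodes=128                  # 128 compute cards, 16 A2 cores each
    #SBATCH --job-name="128x4x4"         # illustrative name: 128 nodes x 4 ranks x 4 threads
    #SBATCH --output="first-job-sample"
    #SBATCH --partition="training"

    # 4 MPI ranks per node x 4 OpenMP threads per rank = 16 threads per node,
    # i.e. one thread per A2 core. To try 2 hardware threads per core via
    # OpenMP, as discussed above, raise OMP_NUM_THREADS and -ntomp to 8.
    export OMP_NUM_THREADS=4
    srun --ntasks-per-node=4 --cpus-per-task=4 \
        /scratch/home/biilijg/package/gromacs-461/bin/mdrun -ntomp 4 \
        -s box_md1.tpr -c box_md1.gro -x box_md1.xtc -g md1.log > job_md1 2>&1

As Mark notes, the best rank/thread balance is hardware- and problem-dependent,
so variants such as 8x2 or 16x1 per node are equally worth benchmarking.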