Hello,

I am currently testing a large system on a POWER6 cluster. I have compiled GROMACS 4.0.4 successfully, and it appears to work fine on up to 64 "cores" (sic, see below). First, I notice that it runs at approximately half the speed that it reaches on some older Opterons, which is unfortunate but acceptable. Second, I run into some strange issues when I use more than 64 cores. Each node has 32 cores with simultaneous multithreading, which yields 64 tasks inside one box, so the failing runs are the ones that span more than one node, and I realize that these problems could be MPI related.

Some background:
This test system is stable for more than 100 ns on an Opteron, so I am quite confident that I do not have a problem with my topology or starting structure.

Compilation was successful with -O2 only when I modified the ./configure file as follows; otherwise I got a stray ')' and a linking error:
[cne...@tcs-f11n05]$ diff configure.000 configure
5052a5053
ac_cv_f77_libs="-L/scratch/cneale/exe/fftw-3.1.2_aix/exec/lib -lxlf90 -L/usr/lpp/xlf/lib -lxlopt -lxlf -lxlomp_ser -lpthreads -lm -lc"

The error messages:
For N=1, 2, 4, 8, 16, 32, and 64, the system runs properly.
For N=200, I get the error: "ERROR: 0032-103 Invalid count (-8388608) in MPI_Recv, task 37"
For N=196, my system explodes via regular SETTLE/LINCS warnings followed by a crash.

Here are the log file and stderr snippets:

On 200 cores:

The log file appears normal but is truncated.

## stderr:
...
Will use 112 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file

NOTE: For optimal PME load balancing at high parallelization
      PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)

Making 3D domain decomposition 4 x 7 x 4

starting mdrun 'Big Box'
500 steps,      1.0 ps.
ERROR: 0032-103 Invalid count  (-8388608) in MPI_Recv, task 37
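
The numbers in the 200-task output above can be sanity-checked with a few lines of Python. This is only a sketch: every value is copied from the output, and the `divisors` helper is a hypothetical name introduced here, not anything from GROMACS or MPI. It checks that the node counts are self-consistent, shows which PME node counts would divide the 175-point grid that the NOTE complains about, and shows that the rejected count is exactly -(2**23). What that last fact implies about the cause is not something this check can establish.

```python
# Values copied from the 200-task stderr output above.
n_tasks = 200
n_pp, n_pme = 112, 88          # particle-particle and PME-only nodes
dd_grid = (4, 7, 4)            # 3D domain decomposition
pme_grid_x = 175               # PME grid_x (grid_y is also 175)
bad_count = -8388608           # count rejected by MPI_Recv


def divisors(n):
    """Return all positive divisors of n in ascending order (hypothetical helper)."""
    return [d for d in range(1, n + 1) if n % d == 0]


# PP nodes plus PME-only nodes should account for every task.
assert n_pp + n_pme == n_tasks

# The domain decomposition grid should have one cell per PP node.
assert dd_grid[0] * dd_grid[1] * dd_grid[2] == n_pp

# The NOTE in the output: 175 is not divisible by 88.
print("175 mod 88 =", pme_grid_x % n_pme)
print("PME node counts that divide 175:", divisors(pme_grid_x))

# The invalid count is a negative power of two.
print("bad count == -(2**23):", bad_count == -(2 ** 23))
```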


#####################

On 196 cores, the log file shows:

...
Initializing Domain Decomposition on 196 nodes
Dynamic load balancing: auto
Will sort the charge groups at every domain (re)decomposition
Initial maximum inter charge-group distances:
    two-body bonded interactions: 0.556 nm, LJ-14, atoms 25035 25038
  multi-body bonded interactions: 0.556 nm, Proper Dih., atoms 25035 25038
Minimum cell size due to bonded interactions: 0.612 nm
Maximum distance for 5 constraints, at 120 deg. angles, all-trans: 0.820 nm
Estimated maximum distance required for P-LINCS: 0.820 nm
This distance will limit the DD cell size, you can override this with -rcon
Guess for relative PME load: 0.37
Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file
Using 88 separate PME nodes
Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
Optimizing the DD grid for 108 cells with a minimum initial size of 1.025 nm
The maximum allowed number of cells is: X 16 Y 16 Z 14
Domain decomposition grid 6 x 6 x 3, separate PME nodes 88
Interleaving PP and PME nodes
This is a particle-particle only node

Domain decomposition nodeid 0, coordinates 0 0 0

Using two step summing over 4 groups of on average 27.0 processes
...
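
The domain decomposition setup reported in this log can be reproduced arithmetically. The sketch below uses only values copied from the log. It assumes, as a reading of the log rather than of the GROMACS source, that the initial minimum cell size is the larger of the bonded and P-LINCS limits divided by the -dds value.

```python
# Values copied from the 196-task log file above.
n_tasks = 196
n_pp, n_pme = 108, 88          # particle-particle and PME-only nodes
dd_grid = (6, 6, 3)            # domain decomposition grid
bonded_limit = 0.612           # nm, minimum cell size due to bonded interactions
plincs_limit = 0.820           # nm, estimated maximum distance for P-LINCS
dds = 0.8                      # option -dds
n_groups = 4                   # groups used for two step summing

# Assumption: the larger of the two limits sets the minimum cell size,
# which is then scaled by 1/dds = 1.25.
min_cell = max(bonded_limit, plincs_limit) / dds
print("minimum initial cell size: %.3f nm" % min_cell)

# One DD cell per PP node, and PP plus PME nodes cover all tasks.
cells = dd_grid[0] * dd_grid[1] * dd_grid[2]
assert cells == n_pp
assert n_pp + n_pme == n_tasks

# Average number of processes per summing group.
print("processes per group: %.1f" % (n_pp / n_groups))
```

Under these assumptions the results match the log (1.025 nm, 108 cells, 27.0 processes per group), so the decomposition that mdrun reports is at least internally consistent.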


##### And to stderr, I get:
...
Back Off! I just backed up temp.log to ./#temp.log.2#
Reading file temp.tpr, VERSION 4.0.4 (single precision)

Will use 108 particle-particle and 88 PME only nodes
This is a guess, check the performance at the end of the log file

NOTE: For optimal PME load balancing at high parallelization
      PME grid_x (175) and grid_y (175) should be divisible by #PME_nodes (88)

Making 3D domain decomposition 6 x 6 x 3

starting mdrun 'Big Box'
500 steps,      1.0 ps.

Step 61, time 0.122 (ps)  LINCS WARNING
relative constraint deviation after LINCS:
rms 0.002765, max 0.028338 (between atoms 46146 and 46145)
bonds that rotated more than 30 degrees:
 atom 1 atom 2  angle  previous, current, constraint length
  46148  46146   89.9    0.1480   0.1499      0.1480
  46050  46049   32.0    0.1470   0.1475      0.1470

t = 0.122 ps: Water molecule starting at atom 62389 can not be settled.
Check for bad contacts and/or reduce the timestep.

t = 0.122 ps: Water molecule starting at atom 706505 can not be settled.

...

### And the system then proceeds to explode.

###################

I am happy to provide more information, and apologize if what I have posted here is incomplete. These log files are large though, and I tried to keep this first post as short as possible.

Thanks for any assistance,
Chris.
