I now have some (hopefully) clarifying comments on my previous post:

First, to answer your question regarding pme.c: my compilation was done from v. 1.125.
------------
Line 1037-
            if ((kx>0) || (ky>0)) {
                kzstart = 0;
            } else {
                kzstart = 1;
                p0++;
            }
------
As you can see the p0++; line is there.

Now here are some additional points:

On Mon, 29 Sep 2008, Bjørn Steen Sæthre wrote:

The only error message I can find is the rather cryptic:

NOTE: Turning on dynamic load balancing

_pmii_daemon(SIGCHLD): PE 4 exit signal Killed
[NID 1412]Apid 159787: initiated application termination

There are no errors apart from that.

Furthermore I can now report that this error is endemic in all my sims
using harmonic position restraints in GROMACS 4.0_beta1 and GMX
4.0_rc1.

About core dumps. I will talk to our HPC staff, and get back to you with
something more substantial I hope.


OK, I have gotten some info from our HPC staff. They checked another job of
mine which crashed in exactly the same fashion, with exactly the same starting
run topology and node configuration.
They found some more info in the admin's log:

Hi,
this job got an OOM (out of memory), which is only recorded in the
system logs, not available directly to users:

[2008-09-29 17:18:18][c11-0c0s1n0]Out of memory: Killed process 8888
(parmdrun).

I can also add that I have been able to stabilize the engine by altering the
cut-offs and lowering the total PME load of the run, at the cost of far
lower computational efficiency.

That is, I went from unstable (<) to stable (>), as in the following diff of the mdp files:
-----------------------------
21c21
< rlist                    = 0.9
---
> rlist                    = 1.0
24c24
< rcoulomb                 = 0.9
---
> rcoulomb                 = 1.0
26c26
< rvdw                     = 0.9
---
> rvdw                     = 1.0
28,30c28,31
< fourier_nx             = 60
< fourier_ny             = 40
< fourier_nz             = 40
---
> fourier_nx             = 48
> fourier_ny             = 32
> fourier_nz             = 32
35c36
------------------------------
That is, the PME workload went from 1/2 of the nodes to 1/3 of them, since I was
using exactly the same startup configuration.
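
For reference, here is a small sketch of the arithmetic behind these numbers. It uses only values quoted in this post: the fourier_nx/ny/nz grids from the diff, -n 48 and -npme 16 from the aprun line in my startup script, and the 8 x 4 x 1 decomposition from the log. The helper names are my own and purely illustrative, and the raw grid-point count is only a rough proxy for the PME mesh work (an assumption, not a measured cost).

```python
from fractions import Fraction


def grid_points(nx, ny, nz):
    """Number of points in an nx x ny x nz PME grid (hypothetical helper)."""
    return nx * ny * nz


# Grid sizes taken from the mdp diff.
unstable_grid = grid_points(60, 40, 40)
stable_grid = grid_points(48, 32, 32)
grid_ratio = stable_grid / unstable_grid

# Node split from "aprun -n 48 ... -npme 16".
total_ranks = 48
pme_ranks = 16
pp_ranks = total_ranks - pme_ranks
pme_share = Fraction(pme_ranks, total_ranks)

# The log reports a domain decomposition of 8 x 4 x 1 for the PP ranks.
dd_cells = 8 * 4 * 1

print(f"grid points: {unstable_grid} -> {stable_grid} (ratio {grid_ratio:.3f})")
print(f"PME share of ranks: {pme_share}, PP ranks: {pp_ranks}, DD cells: {dd_cells}")
```

So the coarser grid has roughly half as many points as the original one, and the 32 remaining PP ranks match the 8 x 4 x 1 decomposition reported in the log.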

However, while this enhanced stability, the output rate slowed down
appreciably. As the log output shows, the reason is clear:
------------------------------------------------------------
Making 2D domain decomposition 8 x 4 x 1
starting mdrun 'Propane-hydrate prism (2x2x3 UC)'
2000000 steps,   4000.0 ps.
Step 726095: Run time exceeded 3.960 hours, will terminate the run

Step 726100: Run time exceeded 3.960 hours, will terminate the run

 Average load imbalance: 26.7 %
 Part of the total run time spent waiting due to load imbalance: 1.5 %
 Average PME mesh/force load: 9.369
 Part of the total run time spent waiting due to PP/PME imbalance: 57.5 %

NOTE: 57.5 % performance was lost because the PME nodes
      had more work to do than the PP nodes.
      You might want to increase the number of PME nodes
      or increase the cut-off and the grid spacing.


        Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)
       Time:   5703.000   5703.000    100.0
                       1h35:03
               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)
Performance:     29.593      8.566     60.600      0.396

gcq#0: Thanx for Using GROMACS - Have a Nice Day
-----------------------------------------------


One more thing is odd here, though.
In the startup script I allocated 4 hours and set -maxh 4:

-----------------------------------------------
#PBS -l walltime=4:00:00,mppwidth=48,mppnppn=4
cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K_2nd
source $HOME/gmx_latest_290908/bin/GMXRC
aprun -n 48 parmdrun -s topol.tpr -maxh 4 -npme 16
exit $?
-----------------------

Why the wallclock inconsistency? (That is, the reported wallclock time is 1h35:03,
which does not correspond to the note that 3.96 hours had been exceeded.)
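
To make the mismatch concrete, here is a minimal sketch of the conversion, using the 5703 s from the timing table and the -maxh 4 from my script. As far as I understand the mdrun help text, -maxh terminates the run after 0.99 times the given number of hours, which I take to be where the 3.960 comes from; that reading is my assumption. The hms helper is my own, not anything from GROMACS.

```python
def hms(seconds):
    """Format a duration in seconds as 'XhMM:SS' (hypothetical helper)."""
    seconds = int(round(seconds))
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}:{secs:02d}"


reported_seconds = 5703.0        # "Time:" line in the log
maxh = 4.0                       # value passed to -maxh
limit_hours = 0.99 * maxh        # assumed origin of "3.960 hours"
limit_seconds = limit_hours * 3600

reported = hms(reported_seconds)
gap = hms(limit_seconds - reported_seconds)

print(f"reported wallclock: {reported}")
print(f"-maxh limit:        {limit_hours:.3f} h = {hms(limit_seconds)}")
print(f"unaccounted time:   {gap}")
```

In other words, the timing table accounts for well under half of the time that the termination note says had elapsed.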



I hope this is helpful in resolving the issue brought up originally. (Might
there be a possible memory leak somewhere?)

Regards
Bjørn


PhD-student
Institute of Physics & Tech.- University of Bergen
Allegt. 55,
5007 Bergen
Norway

Tel(office): +47 55582869
Cell:        +47 99253386