Re: [gmx-users] Why does the -append option exist?

Mark Abraham Sat, 04 Jun 2011 18:10:35 -0700

On 5/06/2011 3:11 AM, Dimitar Pachov wrote:

On Fri, Jun 3, 2011 at 9:24 PM, Mark Abraham <mark.abra...@anu.edu.au> wrote:


    On 4/06/2011 8:26 AM, Dimitar Pachov wrote:


    If this is true, then it wants fixing, and fast, and will get it
    :-) However, it would be surprising for such a problem to exist
    and not have been reported up to now. This feature has been in the
    code for a year now, and while some minor issues have been fixed
    since the 4.5 release, it would surprise me greatly if your claim
    was true.

    You're saying the equivalent of the steps below can occur:
    1. Simulation wanders along normally and writes a checkpoint at
    step 1003
    2. Random crash happens at step 1106
    3. An -append restart from the old .tpr and the recent .cpt file
    will restart from step 1003
    4. Random crash happens at step 1059
    5. Now a restart doesn't restart from step 1003, but some other step

    and most importantly, the most important piece of data, that
    being the trajectory file, could be completely lost! I don't know
    the code behind the checkpointing & appending, but I can see how
    easily one can overwrite 100ns trajectories, for example, and
    "obtain" the same trajectories of size .... 0.


    I don't see how easy that is, without a concrete example, where
    user error is not possible.


Here is an example:

========================
[dpachov]$ ll -rth run1*  \#run1*
-rw-rw-r-- 1 dpachov dpachov  11K May  2 02:59 run1.po.mdp
-rw-rw-r-- 1 dpachov dpachov 4.6K May  2 02:59 run1.grompp.out
-rw-rw-r-- 1 dpachov dpachov 3.5M May 13 19:09 run1.gro
-rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1.tpr
-rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1-i.tpr
-rw-rw-r-- 1 dpachov dpachov    0 May 29 21:53 run1.trr
-rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1.cpt
-rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1_prev.cpt
-rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.xtc
-rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.edr
-rw-rw-r-- 1 dpachov dpachov  15M Jun  3 17:03 run1.log
========================

Submitted by:
========================
ii=1
ifmpi="mpirun -np $NSLOTS"
--------
   if [ ! -f run${ii}-i.tpr ];then
      cp run${ii}.tpr run${ii}-i.tpr
      tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
   fi

   k=`ls md-${ii}*.out | wc -l`
   outfile="md-${ii}-$k.out"
   if [[ -f run${ii}.cpt ]]; then

$ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile 2>&1


   fi
=========================

This script is not using mdrun -append. Your original post suggested the use of -append was a problem. Why aren't we seeing a script with mdrun -append? Also, please provide the full script - it looks like there might be a loop around your tpbconv-then-mdrun fragment.

Note that a useful trouble-shooting technique can be to construct your command line in a shell variable, echo it to stdout (redirected as suitable) and then execute the contents of the variable. Now, nobody has to parse a shell script to know what command line generated what output, and it can be co-located with the command's stdout.
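
A minimal sketch of that technique, assuming bash. The helper name run_logged and the output file name are hypothetical, and the command is passed as separate arguments (via "$@") rather than as one string variable, so that arguments containing spaces survive intact:

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the exact command line at the top of the
# output file, then run the command with stdout and stderr appended.
run_logged() {
    local outfile=$1
    shift
    # "$*" joins the arguments with spaces; it is used for display only.
    printf 'Command line: %s\n' "$*" > "$outfile"
    # "$@" executes the same arguments, each one kept as a separate word.
    "$@" >> "$outfile" 2>&1
}

# Possible use with the flags from the fragment above (not executed here):
# run_logged "$outfile" mpirun -np "$NSLOTS" mdrun -s run1.tpr \
#     -cpi run1.cpt -v -deffnm run1 -npme 0
```

Each md-*.out file then begins with the command that produced it.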

From the end of run1.log:
=========================
Started mdrun on node 0 Tue May 31 10:28:52 2011

           Step           Time         Lambda
       51879390   103758.78000        0.00000

   Energies (kJ/mol)
            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
    8.37521e+03    4.52303e+03    4.78633e+02   -1.23174e+03    2.87366e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.02277e+04    9.48267e+04   -3.88596e+03   -7.43902e+05   -8.36436e+04
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -6.91359e+05    1.29016e+05   -5.62342e+05    3.00159e+02   -1.24746e+02
 Pressure (bar)   Constr. rmsd
   -2.43143e+00    0.00000e+00

DD  step 51879399 load imb.: force 225.5%

<snip>

Writing checkpoint, step 51879590 at Tue May 31 10:45:22 2011
   Energies (kJ/mol)
            U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
    8.33208e+03    4.72300e+03    5.31983e+02   -1.21532e+03    2.89586e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    3.00900e+04    9.31785e+04   -3.87790e+03   -7.40841e+05   -8.36838e+04
      Potential    Kinetic En.   Total Energy    Temperature Pres. DC (bar)
   -6.89867e+05    1.28721e+05   -5.61146e+05    2.99472e+02   -1.24229e+02
 Pressure (bar)   Constr. rmsd
   -1.03491e+02    2.99840e-05
====================================


So the -append restart looks like it did fine here.

Last output files from restarts:
====================================
[dpachov]$ ll -rth md-1-*out | tail -10
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:40 md-1-2428.out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:44 md-1-2429.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:46 md-1-2430.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:48 md-1-2431.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:50 md-1-2432.out
-rw-rw-r-- 1 dpachov dpachov    0 Jun  3 16:52 md-1-2433.out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:55 md-1-2434.out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:58 md-1-2435.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 17:03 md-1-2436.out
*-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 17:04 md-1-2437.out*
====================================
+ around the time when the run1.xtc file seems to have been saved:
====================================
[dpachov]$ ll -rth md-1-23[5-6][0-9]*out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:37 md-1-2350.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:39 md-1-2351.out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:43 md-1-2352.out
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:45 md-1-2353.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 13:46 md-1-2354.out
-rw-rw-r-- 1 dpachov dpachov    0 Jun  3 13:47 md-1-2355.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:49 md-1-2356.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:52 md-1-2357.out
-rw-rw-r-- 1 dpachov dpachov  12K Jun  3 13:57 md-1-2358.out
*-rw-rw-r-- 1 dpachov dpachov  12K Jun  3 14:02 md-1-2359.out*
*-rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:03 md-1-2360.out*
-rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 14:06 md-1-2361.out
-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:09 md-1-2362.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:10 md-1-2363.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:11 md-1-2364.out
-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:12 md-1-2365.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:13 md-1-2366.out
-rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:14 md-1-2367.out
-rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:17 md-1-2368.out
-rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:18 md-1-2369.out
====================================

I don't understand why you have so many restarts only a minute or two apart. Checkpoints are only written (by default) every 15 minutes, and no job seems to run that long, so all of these will start from the same point. If they're running simultaneously then it's conceivable that multiple processes trying to use the same output file could be a problem, as suggested by Jussi. You say that's not the case. So why are there so many restarts?
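
If simultaneous runs are a possibility, one way to rule them out in a restart script is a lock taken before mdrun starts. This is only a sketch, assuming bash: with_run_lock and the lock path are hypothetical names, it relies on mkdir as the atomic step rather than on any GROMACS feature, and a job killed by the queue will leave a stale lock directory behind that has to be removed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical guard: run a command only while holding a lock directory.
# mkdir either creates the directory or fails, so two jobs that race for
# the same lock cannot both succeed.
with_run_lock() {
    local lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "lock $lockdir is held; refusing to start" >&2
        return 1
    fi
    local status=0
    "$@" || status=$?
    # Release the lock and pass on the exit status of the command.
    rmdir "$lockdir"
    return "$status"
}

# Possible use in a restart script (not executed here):
# with_run_lock run1.lock mdrun -s run1.tpr -cpi run1.cpt -deffnm run1
```

A second job that finds the lock held exits immediately instead of opening the same output files.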

From md-1-2359.out:
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)

Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011


Loaded with Money

Making 2D domain decomposition 4 x 2 x 1

WARNING: This run will generate roughly 4915 Mb of data

starting mdrun 'run1'
100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
step 51879590, will finish Wed Aug 17 14:21:59 2011
imb F 44%
NOTE: Turning on dynamic load balancing

step 51879600, will finish Fri Jul 15 14:00:00 2011
vol 0.64  imb F  0% step 51879700, will finish Mon Jun 27 02:19:09 2011
vol 0.63  imb F  0% step 51879800, will finish Sat Jun 25 15:14:01 2011
vol 0.64  imb F  1% step 51879900, will finish Sat Jun 25 02:11:53 2011
vol 0.64  imb F  0% step 51880000, will finish Fri Jun 24 19:48:54 2011
vol 0.64  imb F  1% step 51880100, will finish Fri Jun 24 15:55:19 2011
::::::
vol 0.67  imb F  0% step 51886400, will finish Fri Jun 24 02:51:45 2011
vol 0.66  imb F  0% step 51886500, will finish Fri Jun 24 02:48:10 2011
vol 0.66  imb F  0% step 51886600, will finish Fri Jun 24 02:47:33 2011
=====================================

From md-1-2360.out:
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)

Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011


Loaded with Money

Making 2D domain decomposition 4 x 2 x 1

WARNING: This run will generate roughly 4915 Mb of data

starting mdrun 'run1'
100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
=====================================

These aren't showing anything other than that the restart is coming from the same point each time.

And from the last generated output md-1-2437.out (I think I killed the job at that point because of the above observed behavior):
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)
=====================================

I have at least 5-6 additional examples like this one. In some of them the *xtc file does have size greater than zero yet still very small, but it starts from some random frame (for example, in one of the cases it contains frames from ~91000ps to ~104000ps, but all frames before 91000ps are missing).

I think that demonstrating a problem requires that the set of output files were fine before one particular restart, and weird afterwards. I don't think we've seen that yet.
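
One way to collect that evidence is to record the size of each output file immediately before every restart, so that the first restart after which a file shrank can be identified. A rough sketch, assuming bash; snapshot_sizes and the log file name are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one line per file to a log, holding a
# timestamp, the size in bytes (or "missing"), and the file name.
snapshot_sizes() {
    local logfile=$1
    shift
    local f size stamp
    stamp=$(date '+%Y-%m-%dT%H:%M:%S')
    for f in "$@"; do
        if [ -e "$f" ]; then
            # wc -c counts bytes; tr strips padding that some wc versions add.
            size=$(wc -c < "$f" | tr -d '[:space:]')
        else
            size=missing
        fi
        printf '%s %s %s\n' "$stamp" "$size" "$f" >> "$logfile"
    done
}

# Possible use just before each mdrun call (not executed here):
# snapshot_sizes run1.sizes.log run1.xtc run1.trr run1.edr run1.log run1.cpt
```

Comparing consecutive entries for run1.xtc would show which restart, if any, truncated the file.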

I realize there might be another problem, but the bottom line is that there is no mechanism that can prevent this from happening if many restarts are required, and particularly if the timing between these restarts is prone to be small (distributed computing could easily satisfy this condition).

Any suggestions, particularly related to the minimum resistance path to regenerate the missing data? :)

    Using the checkpoint capability & appending make sense when many
    restarts are expected, but unfortunately it is exactly then when
    these options completely fail! As a new user of Gromacs, I must
    say I am disappointed, and would like to obtain an explanation of
    why the usage of these options is clearly stated to be safe when
    it is not, and why the append option is the default, and why at
    least a single warning has not been posted anywhere in the docs &
    manuals?

    I can understand and sympathize with your frustration if you've
    experienced the loss of a simulation. Do be careful when
    suggesting that others' actions are blame-worthy, however.


I have never suggested this. As a user, I am entitled to ask.

Sure. However, talking about something that can "completely fail" which makes you "disappointed" and wanting to "obtain an explanation" about why something doesn't work as stated and lacks "a single warning" suggests that someone has done something less than appropriate, and so blame-worthy. It also assumes that the actions of a new user were correct, and the actions of a developer with long experience were not. This may or may not prove to be true. Starting such a discussion from a conciliatory (rather than antagonistic) stance is usually more productive. The shared objective should be to fix the problem, not prove that someone did something wrong.


An alternative way of wording your paragraph could have been:

"Using the checkpoint capability & appending make sense when many restarts are expected, however I observe that under such circumstances this capability can fail. I am a new user of GROMACS; might I have been using it incorrectly? Are the developers aware of any situations under which the capability is unreliable? If so, should the default behaviour be different, and should this issue be documented somewhere?"

And since my questions were not clearly answered, I will repeat them in a structured way:
1. Why is the usage of these options (-cpi and -append) clearly stated to be safe when in fact it is not?

Because they are believed to be safe. Jussi's suggestion about file locking may have merit.

2. Why have you made the -append option the default in the most current GMX versions?


Because it's the most convenient mode of operation.

3. Why has not a single warning been posted anywhere in the docs & manuals? (this question is somewhat clear - because you did not know about such a problem, but people say "ignorance of the law excuses no one", which means ignoring to put a warning for something that you were not 100% certain it would be error-free could not be an excuse)


Because no-one is aware of a problem to warn about.

I am blame-worthy - for blindly believing what was written in the manual without taking the necessary precautions. Lesson learned.

    However, developers' time rarely permits addressing "feature X
    doesn't work, why not?" in a productive way. Solving bugs can be
    hard, but will be easier (and solved faster!) if the user who
    thinks a problem exists follows good procedure. See
    http://www.chiark.greenend.org.uk/~sgtatham/bugs.html

Implying that I did not follow a certain procedure related to a certain problem without you knowing what my initial intention was is just a speculation.

I don't follow your point. If your intent is to get the problem fixed, the advice on that web page is useful. If your intent is to prove someone else did something wrong then it's time to stop the discussion :-)


Cheers,

Mark

