On Sat, Jun 4, 2011 at 1:50 PM, Rossen Apostolov <ros...@kth.se> wrote:
> Hi,
>
> On Jun 4, 2011, at 19:11, Dimitar Pachov <dpac...@brandeis.edu> wrote:
>
> By the way, is this ever reviewed:
>
> "Your mail to 'gmx-users' with the subject
>
>     Re: [gmx-users] Why does the -append option exist?
>
> Is being held until the list moderator can review it for approval."
>
> This message usually comes when e.g. one sends mails larger than 50K, which
> are eventually discarded. If you need to send big attachments, post a
> download link instead.

If they are eventually discarded, why doesn't the message say so? It's
confusing. My message with all the quotes was 52 KB; no attachments. Anyway,
I resent it, but it appeared as a quote.

"*So you are referring to the case where you have multiple, independent
processes all using the same trajectory file. Yes, this will probably
lead to problems, unless the trajectory file is somehow locked.*"

I don't think so - during a restart the old processes should be killed and
new ones started. If that's not the case, then you might be right.

Thanks,
Dimitar

> Cheers,
> Rossen
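For illustration only, here is a minimal sketch of how a job script could enforce that assumption with an advisory lock (using the standard util-linux flock utility; the lock file name is invented, and the $ifmpi/$outfile variables follow the submission script quoted further below), so that a stale mdrun from an earlier submission can never append to the same files as a new one:

========================
# Sketch only: take an exclusive lock before launching mdrun, so two
# instances can never write to the same output files at the same time.
# Requires util-linux "flock"; run${ii}.lock is a hypothetical name.
(
    flock -n 9 || { echo "run${ii} is locked by another job, exiting"; exit 1; }
    $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt \
        -v -deffnm run${ii} -npme 0 > $outfile 2>&1
) 9> run${ii}.lock
========================

This does not change anything inside mdrun's own appending code; it only rules out the multiple-writer scenario quoted above.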
> On Fri, Jun 3, 2011 at 9:24 PM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>
>> On 4/06/2011 8:26 AM, Dimitar Pachov wrote:
>>
>> If this is true, then it wants fixing, and fast, and will get it :-)
>> However, it would be surprising for such a problem to exist and not have
>> been reported up to now. This feature has been in the code for a year now,
>> and while some minor issues have been fixed since the 4.5 release, it would
>> surprise me greatly if your claim was true.
>>
>> You're saying the equivalent of the steps below can occur:
>> 1. Simulation wanders along normally and writes a checkpoint at step 1003
>> 2. Random crash happens at step 1106
>> 3. An -append restart from the old .tpr and the recent .cpt file will
>>    restart from step 1003
>> 4. Random crash happens at step 1059
>> 5. Now a restart doesn't restart from step 1003, but some other step
>>
>> and most importantly, the most important piece of data - the trajectory
>> file - could be completely lost! I don't know the code behind the
>> checkpointing & appending, but I can see how easily one can overwrite
>> 100 ns trajectories, for example, and "obtain" the same trajectories of
>> size .... 0.
>>
>> I don't see how easy that is, without a concrete example, where user error
>> is not possible.
>
> Here is an example:
>
> ========================
> [dpachov]$ ll -rth run1* \#run1*
> -rw-rw-r-- 1 dpachov dpachov  11K May  2 02:59 run1.po.mdp
> -rw-rw-r-- 1 dpachov dpachov 4.6K May  2 02:59 run1.grompp.out
> -rw-rw-r-- 1 dpachov dpachov 3.5M May 13 19:09 run1.gro
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1.tpr
> -rw-rw-r-- 1 dpachov dpachov 2.3M May 14 00:40 run1-i.tpr
> -rw-rw-r-- 1 dpachov dpachov    0 May 29 21:53 run1.trr
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1.cpt
> -rw-rw-r-- 1 dpachov dpachov 1.2M May 31 10:45 run1_prev.cpt
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.xtc
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 14:03 run1.edr
> -rw-rw-r-- 1 dpachov dpachov  15M Jun  3 17:03 run1.log
> ========================
>
> Submitted by:
> ========================
> ii=1
> ifmpi="mpirun -np $NSLOTS"
> --------
> # on the first submission only, extend the target .tpr to 200000 ps
> if [ ! -f run${ii}-i.tpr ]; then
>     cp run${ii}.tpr run${ii}-i.tpr
>     tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
> fi
>
> # number the stdout/stderr file after the restarts done so far
> k=`ls md-${ii}*.out | wc -l`
> outfile="md-${ii}-$k.out"
>
> if [[ -f run${ii}.cpt ]]; then
>     $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt \
>         -v -deffnm run${ii} -npme 0 > $outfile 2>&1
> fi
> =========================
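Purely as a sketch (not part of the script above): before every restart attempt one could snapshot the files that mdrun -cpi will append to, so that a failed append costs at most the work since the last checkpoint instead of the whole trajectory. The backup/ directory name is invented; everything else reuses the variables from the script above.

========================
# Sketch: copy the append-mode outputs into a timestamped backup before
# each restart. "backup/" is a hypothetical directory; the file names and
# variables follow the submission script above.
mkdir -p backup
stamp=$(date +%Y%m%d-%H%M%S)
for f in run${ii}.cpt run${ii}.xtc run${ii}.trr run${ii}.edr run${ii}.log; do
    [ -s "$f" ] && cp -p "$f" "backup/${f}.${stamp}"
done

if [[ -f run${ii}.cpt ]]; then
    $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt \
        -v -deffnm run${ii} -npme 0 > $outfile 2>&1
fi
========================

Old snapshots can be pruned periodically; the point is only that an interrupted append can no longer leave a zero-size run1.xtc as the sole copy of the data.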
> From the end of run1.log:
> =========================
> Started mdrun on node 0 Tue May 31 10:28:52 2011
>
>            Step           Time         Lambda
>        51879390   103758.78000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.37521e+03    4.52303e+03    4.78633e+02   -1.23174e+03    2.87366e+03
>      Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.02277e+04    9.48267e+04   -3.88596e+03   -7.43902e+05   -8.36436e+04
>       Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>    -6.91359e+05    1.29016e+05   -5.62342e+05    3.00159e+02   -1.24746e+02
>  Pressure (bar)   Constr. rmsd
>    -2.43143e+00    0.00000e+00
>
> DD step 51879399 load imb.: force 225.5%
>
> Writing checkpoint, step 51879590 at Tue May 31 10:45:22 2011
>
> -----------------------------------------------------------
> Restarting from checkpoint, appending to previous log file.
>
> Log file opened on Fri Jun 3 17:03:20 2011
> Host: compute-1-13.local  pid: 337  nodeid: 0  nnodes: 8
> The Gromacs distribution was built Tue Mar 22 09:26:37 EDT 2011 by
> dpachov@login-0-0.local (Linux 2.6.18-194.17.1.el5xen x86_64)
>
> :::
> :::
> :::
>
> Grid: 13 x 15 x 11 cells
> Initial temperature: 301.137 K
>
> Started mdrun on node 0 Fri Jun 3 13:58:07 2011
>
>            Step           Time         Lambda
>        51879590   103759.18000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.47435e+03    4.61654e+03    3.99388e+02   -1.16765e+03    2.93920e+03
>      Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)   Coul. recip.
>     2.99294e+04    9.42035e+04   -3.87927e+03   -7.43250e+05   -8.35872e+04
>       Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>    -6.91322e+05    1.29433e+05   -5.61889e+05    3.01128e+02   -1.24317e+02
>  Pressure (bar)   Constr. rmsd
>    -2.18259e+00    0.00000e+00
>
> DD step 51879599 load imb.: force 43.7%
>
> At step 51879600 the performance loss due to force load imbalance is 17.5 %
>
> NOTE: Turning on dynamic load balancing
>
> DD step 51879999 vol min/aver 0.643 load imb.: force 0.4%
>
> ::
> ::
> ::
>
> DD step 51884999 vol min/aver 0.647 load imb.: force 0.3%
>
>            Step           Time         Lambda
>        51885000   103770.00000        0.00000
>
>    Energies (kJ/mol)
>             U-B    Proper Dih.  Improper Dih.      CMAP Dih.          LJ-14
>     8.33208e+03    4.72300e+03    5.31983e+02   -1.21532e+03    2.89586e+03
>      Coulomb-14        LJ (SR)   Disper. corr.   Coulomb (SR)   Coul. recip.
>     3.00900e+04    9.31785e+04   -3.87790e+03   -7.40841e+05   -8.36838e+04
>       Potential    Kinetic En.   Total Energy    Temperature  Pres. DC (bar)
>    -6.89867e+05    1.28721e+05   -5.61146e+05    2.99472e+02   -1.24229e+02
>  Pressure (bar)   Constr. rmsd
>    -1.03491e+02    2.99840e-05
> ====================================
>
> Last output files from restarts:
> ====================================
> [dpachov]$ ll -rth md-1-*out | tail -10
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:40 md-1-2428.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:44 md-1-2429.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:46 md-1-2430.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 16:48 md-1-2431.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 16:50 md-1-2432.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 16:52 md-1-2433.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:55 md-1-2434.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 16:58 md-1-2435.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 17:03 md-1-2436.out
> *-rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 17:04 md-1-2437.out*
> ====================================
> + around the time when the run1.xtc file seems to have been saved:
> ====================================
> [dpachov]$ ll -rth md-1-23[5-6][0-9]*out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:37 md-1-2350.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:39 md-1-2351.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:43 md-1-2352.out
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 13:45 md-1-2353.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 13:46 md-1-2354.out
> -rw-rw-r-- 1 dpachov dpachov    0 Jun  3 13:47 md-1-2355.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:49 md-1-2356.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 13:52 md-1-2357.out
> -rw-rw-r-- 1 dpachov dpachov  12K Jun  3 13:57 md-1-2358.out
> *-rw-rw-r-- 1 dpachov dpachov  12K Jun  3 14:02 md-1-2359.out*
> *-rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:03 md-1-2360.out*
> -rw-rw-r-- 1 dpachov dpachov 6.2K Jun  3 14:06 md-1-2361.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:09 md-1-2362.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:10 md-1-2363.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:11 md-1-2364.out
> -rw-rw-r-- 1 dpachov dpachov 5.8K Jun  3 14:12 md-1-2365.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:13 md-1-2366.out
> -rw-rw-r-- 1 dpachov dpachov 6.1K Jun  3 14:14 md-1-2367.out
> -rw-rw-r-- 1 dpachov dpachov 6.0K Jun  3 14:17 md-1-2368.out
> -rw-rw-r-- 1 dpachov dpachov 5.9K Jun  3 14:18 md-1-2369.out
> ====================================
>
> From md-1-2359.out:
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>
> Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
>
> Loaded with Money
>
> Making 2D domain decomposition 4 x 2 x 1
>
> WARNING: This run will generate roughly 4915 Mb of data
>
> starting mdrun 'run1'
> 100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
> step 51879590, will finish Wed Aug 17 14:21:59 2011
> imb F 44%
> NOTE: Turning on dynamic load balancing
> step 51879600, will finish Fri Jul 15 14:00:00 2011
> vol 0.64  imb F 0% step 51879700, will finish Mon Jun 27 02:19:09 2011
> vol 0.63  imb F 0% step 51879800, will finish Sat Jun 25 15:14:01 2011
> vol 0.64  imb F 1% step 51879900, will finish Sat Jun 25 02:11:53 2011
> vol 0.64  imb F 0% step 51880000, will finish Fri Jun 24 19:48:54 2011
> vol 0.64  imb F 1% step 51880100, will finish Fri Jun 24 15:55:19 2011
> ::::::
> vol 0.67  imb F 0% step 51886400, will finish Fri Jun 24 02:51:45 2011
> vol 0.66  imb F 0% step 51886500, will finish Fri Jun 24 02:48:10 2011
> vol 0.66  imb F 0% step 51886600, will finish Fri Jun 24 02:47:33 2011
> =====================================
>
> From md-1-2360.out:
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
>
> Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
>
> Loaded with Money
>
> Making 2D domain decomposition 4 x 2 x 1
>
> WARNING: This run will generate roughly 4915 Mb of data
>
> starting mdrun 'run1'
> 100000000 steps, 200000.0 ps (continuing from step 51879590, 103759.2 ps).
> =====================================
>
> And from the last generated output, md-1-2437.out (I think I killed the job
> at that point because of the behavior observed above):
> =====================================
> :::::::
> Getting Loaded...
> Reading file run1.tpr, VERSION 4.5.4 (single precision)
> =====================================
>
> I have at least 5-6 additional examples like this one. In some of them the
> .xtc file does have a size greater than zero, yet still very small, and it
> starts from some seemingly random frame (for example, in one case it
> contains frames from ~91000 ps to ~104000 ps, but all frames before
> 91000 ps are missing).
>
> I realize there might be another problem, but the bottom line is that there
> is no mechanism that prevents this from happening when many restarts are
> required, particularly when the time between restarts tends to be short
> (distributed computing could easily satisfy this condition).
>
> Any suggestions, particularly on the path of least resistance for
> regenerating the missing data? :)
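On the prevention side, one workaround seems worth noting: mdrun in GROMACS 4.5 has a -noappend option that writes each continuation to new .partNNNN files instead of reopening the existing ones, and the pieces can be joined afterwards with trjcat/eneconv. A minimal sketch, assuming the same variables as the submission script above (the run1_full.* output names are invented):

========================
# Sketch: avoid append mode entirely. Each restart writes new
# run${ii}.partNNNN.* files; nothing already on disk is reopened.
# The "_full" output names are made up for this example.
if [[ -f run${ii}.cpt ]]; then
    $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -noappend \
        -v -deffnm run${ii} -npme 0 > $outfile 2>&1
fi

# After the final part has finished, concatenate the pieces:
trjcat  -f run${ii}.xtc run${ii}.part*.xtc -o run${ii}_full.xtc
eneconv -f run${ii}.edr run${ii}.part*.edr -o run${ii}_full.edr
========================

The cost is a pile of part files to manage, but an interrupted restart can then only affect the newest part, never the already-written history.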
>> Using the checkpoint capability & appending makes sense when many restarts
>> are expected, but unfortunately it is exactly then that these options
>> completely fail! As a new user of Gromacs, I must say I am disappointed,
>> and would like to obtain an explanation of why the usage of these options
>> is clearly stated to be safe when it is not, why the append option is the
>> default, and why not a single warning has been posted anywhere in the
>> docs & manuals.
>>
>> I can understand and sympathize with your frustration if you've
>> experienced the loss of a simulation. Do be careful when suggesting that
>> others' actions are blame-worthy, however.
>
> I have never suggested this. As a user, I am entitled to ask. And since my
> questions were not clearly answered, I will repeat them in a structured way:
>
> 1. Why is the usage of these options (-cpi and -append) clearly stated to
>    be safe when in fact it is not?
> 2. Why have you made the -append option the default in the most recent GMX
>    versions?
> 3. Why has not a single warning been posted anywhere in the docs & manuals?
>    (This question is somewhat clear - because you did not know about such a
>    problem - but people say "ignorance of the law excuses no one", which
>    means that omitting a warning about something you were not 100% certain
>    was error-free is not an excuse.)
>
> I am blame-worthy - for blindly believing what was written in the manual
> without taking the necessary precautions. Lesson learned.
>
>> However, developers' time rarely permits addressing "feature X doesn't
>> work, why not?" in a productive way. Solving bugs can be hard, but will be
>> easier (and solved faster!) if the user who thinks a problem exists follows
>> good procedure. See http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
>
> Implying that I did not follow a certain procedure related to a certain
> problem, without knowing what my initial intention was, is just speculation.
>
> In any case, I do appreciate the time everybody unselfishly devotes to
> communicating with people experiencing problems.
>
> Thanks,
> Dimitar

--
=====================================================
*Dimitar V Pachov*
PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department      Phone: (781) 736-2326
Brandeis University, MS 057         Email: dpac...@brandeis.edu
=====================================================
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists