Hi, yes, that helps a lot. One more question: which filesystem on Hopper 2 are you using for this test (home, scratch, or proj, so we can tell whether it is Lustre or GPFS)? And are you running the test on the login node or on a compute node?
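In case it is useful, a quick way to check the filesystem type of a directory is df -T or GNU stat; the paths below are only placeholders for wherever you ran the test:

====
# Print the filesystem type (e.g. lustre, gpfs, nfs, ext3) for candidate directories
df -T $HOME /scratch/$USER 2>/dev/null
# GNU coreutils stat reports the same thing per path
stat -f -c 'filesystem type of %n: %T' $HOME
====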
Thanks,
Roland

On Wed, Jun 8, 2011 at 1:17 PM, Dimitar Pachov <dpac...@brandeis.edu> wrote:
> Hello,
>
> On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pr...@cbr.su.se> wrote:
>> Hi Dimitar,
>>
>> Thanks for the bug report. Would you mind trying the test program I
>> attached on the same file system on which you get the truncated files?
>>
>> Compile it with: gcc testje.c -o testio
>
> Yes, but it shows no problem:
>
> ====
> [dpachov@login-0-0 NEWTEST]$ ./testio
> TEST PASSED: ftell gives: 46
> ====
>
> As for the other questions:
>
> HPC OS version:
> ====
> [dpachov@login-0-0 NEWTEST]$ uname -a
> Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
> [dpachov@login-0-0 NEWTEST]$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
> ====
>
> GROMACS 4.5.4 was built with:
> ====
> module purge
> module load INTEL/intel-12.0
> module load OPENMPI/1.4.3_INTEL_12.0
> module load FFTW/2.1.5-INTEL_12.0   # not needed
>
> #####
> # GROMACS settings
>
> export CC=mpicc
> export F77=mpif77
> export CXX=mpic++
> export FC=mpif90
> export F90=mpif90
>
> make distclean
>
> echo "XXXXXXX building single prec XXXXXX"
>
> ./configure --prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
>     --enable-mpi \
>     --enable-shared \
>     --program-prefix="" --program-suffix="" \
>     --enable-float --disable-fortran \
>     --with-fft=mkl \
>     --with-external-blas \
>     --with-external-lapack \
>     --with-gsl \
>     --without-x \
>     CFLAGS="-O3 -funroll-all-loops" \
>     FFLAGS="-O3 -funroll-all-loops" \
>     CPPFLAGS="-I${MPI_INCLUDE} -I${MKL_INCLUDE}" \
>     LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5"
>
> make -j 8 && make install
> ====
>
> I just did the same test on Hopper 2
> (http://www.nersc.gov/users/computational-systems/hopper/)
> with their GROMACS 4.5.3 build (gromacs/4.5.3(default)), and the result was the
> same as reported earlier. You could do the test there as well, if you have
> access, and see what you get.
>
> Hope that helps a bit.
>
> Thanks,
> Dimitar
>
>> On Jun 7, 2011, at 23:21, Dimitar Pachov wrote:
>>
>> Hello,
>>
>> Just a quick update after a few short tests we (my colleague and I) quickly did.
>> First, the suggestion
>>
>> "*You can emulate this yourself by calling "sleep 10s" before mdrun and see if
>> that's long enough to solve the latency issue in your case.*"
>>
>> doesn't work, for a few reasons: mainly because this doesn't seem to be a latency
>> issue, but also because the load on a node is not affected by "sleep".
>>
>> However, you can reproduce the behavior I have observed pretty easily. It seems
>> to be related to the values of the pointers to the *xtc, *trr, *edr, etc. files
>> written at the end of the checkpoint file after abrupt crashes AND to the
>> frequency of access (opening) of those files. How to test:
>>
>> 1. In your input *mdp file, set a high frequency for saving coordinates to, say,
>> the *xtc file (10, for example) and a low frequency for the *trr file (10,000,
>> for example).
>> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>> 3. Kill the run abruptly shortly after that (say, after 10-100 steps).
>> 4. You should have a few frames written to the *xtc file, and only one frame
>> (the first) in the *trr file. The *cpt file should have non-zero values of
>> "file_offset_low" for all of these files (the pointers have been updated).
>> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>> 6. Kill the run abruptly shortly after that (say, after 10-100 steps). Note that
>> the frequency for accessing/writing the *trr file has not been reached.
>> 7. You should have a few additional frames written to the *xtc file, while the
>> *trr file will still have only one frame (the first). The *cpt file has again
>> updated all of the "file_offset_low" pointer values, BUT the pointer to the *trr
>> file has acquired a value of 0. Obviously, we already know what will happen if we
>> restart again from this last *cpt file.
>> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>> 9. Kill it.
>> 10. The *trr file now has size zero.
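For anyone who wants to try this quickly, here is a rough, hypothetical sketch of the recipe above as a shell session. The file names, output frequencies, and kill timing are only illustrative, and gmxdump -cp is used on the assumption that it prints the offsets stored in the checkpoint:

====
# run.mdp fragment: write compressed coordinates often, full-precision trr rarely
#   nstxtcout = 10      ; xtc frame every 10 steps
#   nstxout   = 10000   ; trr frame every 10000 steps
#   nstvout   = 10000

mdrun -s run.tpr -v -cpi -deffnm run &   # start, or restart from run.cpt if present
pid=$!
sleep 30                                 # long enough for a few xtc frames, but no new trr frame
kill -9 $pid                             # simulate an abrupt crash

ls -l run.xtc run.trr                    # run.xtc grows; run.trr still holds only the first frame
gmxdump -cp run.cpt | grep file_offset   # inspect the offsets recorded in the checkpoint

# Repeating the restart/kill cycle should show the *trr offset dropping to 0,
# after which the next restart truncates run.trr.
====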
>> Therefore, if a run is killed before the files are accessed for writing (which
>> depends on the chosen frequencies), the file offset values recorded in the *cpt
>> file do not seem to be updated accordingly, and hence a new restart inevitably
>> leads to overwritten output files.
>>
>> Do you think this is fixable?
>>
>> Thanks,
>> Dimitar
>>
>> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote:
>>> Two comments about the discussion:
>>>
>>> 1) I agree that buffered output (kernel buffers, not application buffers) should
>>> not affect I/O. If it does, it should be filed as a bug against the OS. Maybe
>>> someone can write a short test application that tries to reproduce this: write
>>> to a file from one node, kill that test program, and immediately afterwards
>>> write to the same file from another node.
>>>
>>> 2) We do lock files, but only the log file. The idea is that we only need to
>>> guarantee that the set of files is accessed by one application at a time. This
>>> seems safe, but if anyone sees a way in which the trajectory could be opened
>>> without the log file also being opened, please file a bug.
>>>
>>> Roland
>>>
>>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>>>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>> Dear Dimitar,
>>>> I'm following the debate regarding:
>>>>
>>>> The point was not "why" I was getting the restarts, but the fact itself that I
>>>> was getting restarts close in time, as I stated in my first post. I actually
>>>> also don't know whether jobs are deleted or suspended. I had thought that a job
>>>> returned to the queue would basically start from the beginning when it is later
>>>> moved to an empty slot ... so I don't understand the difference from that
>>>> perspective.
>>>>
>>>> In the second mail you say:
>>>>
>>>> Submitted by:
>>>> ========================
>>>> ii=1
>>>> ifmpi="mpirun -np $NSLOTS"
>>>> --------
>>>> if [ ! -f run${ii}-i.tpr ]; then
>>>>     cp run${ii}.tpr run${ii}-i.tpr
>>>>     tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>>> fi
>>>>
>>>> k=`ls md-${ii}*.out | wc -l`
>>>> outfile="md-${ii}-$k.out"
>>>> if [[ -f run${ii}.cpt ]]; then
>>>>     $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile 2>&1
>>>> fi
>>>> =========================
>>>>
>>>> If I understand correctly, you are submitting the SERIAL mdrun. This means that
>>>> multiple instances of mdrun are running at the same time. Each instance of mdrun
>>>> is an INDEPENDENT instance. Therefore checkpoint files, one for each instance
>>>> (i.e. one for each CPU), are written at the same time.
>>>>
>>>> Good thought, but Dimitar's stdout excerpts from early in the thread do indicate
>>>> the presence of multiple execution threads. Dynamic load balancing gets turned
>>>> on, and the DD is 4x2x1 for his 8 processors. Conventionally, and by default in
>>>> the installation process, the MPI-enabled binaries get an "_mpi" suffix, but it
>>>> isn't enforced - or enforceable :-)
>>>>
>>>> Mark
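As a quick sanity check on which mdrun binary a job script actually picks up, and whether it was built with MPI, something along these lines may help (it only works for dynamically linked builds, and the library names are just the common ones):

====
# Which mdrun does the job environment resolve?
which mdrun
# A dynamically linked MPI build normally pulls in an MPI library:
ldd $(which mdrun) | grep -i -E 'libmpi|libmpich'
# No match does not prove a serial build (the binary could be statically linked),
# but a match is a strong hint that this is the parallel binary.
====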
--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists