Hi,

On Thu, Jun 9, 2011 at 2:55 AM, Roland Schulz <rol...@utk.edu> wrote:
> Hi,
>
> yes that helps a lot. One more question. What filesystem on hopper 2 are
> you using for this test (home, scratch or proj, to see if it is Lustre or
> GPFS)?

I used home.

> And are you running the test on the login node or on the compute node?

I did the test on the debug queue, so it was a compute node. Let me know if
you need more info.

Best,
Dimitar

> Thanks
> Roland
>
>
> On Wed, Jun 8, 2011 at 1:17 PM, Dimitar Pachov <dpac...@brandeis.edu> wrote:
>
>> Hello,
>>
>> On Wed, Jun 8, 2011 at 4:21 AM, Sander Pronk <pr...@cbr.su.se> wrote:
>>
>>> Hi Dimitar,
>>>
>>> Thanks for the bug report. Would you mind trying the test program I
>>> attached on the same file system that you get the truncated files on?
>>>
>>> compile it with gcc testje.c -o testio
>>
>> Yes, but no problem:
>>
>> ====
>> [dpachov@login-0-0 NEWTEST]$ ./testio
>> TEST PASSED: ftell gives: 46
>> ====
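(For anyone following along without the attachment: judging by its output,
testje.c apparently writes a short line to a file on the suspect file system
and checks that ftell() afterwards matches the number of bytes written. The
sketch below is only my guess at that logic, not Sander's actual program; the
file name testio.dat is made up.)

/* Minimal ftell sanity check in the spirit of testje.c (an assumption about
 * what it verifies, not the real code): write a short string, then confirm
 * that ftell() reports exactly the number of bytes written. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *msg = "checking that ftell matches what was written\n";
    FILE       *fp;
    size_t      nwritten;
    long        pos;

    fp = fopen("testio.dat", "w");
    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }
    nwritten = fwrite(msg, 1, strlen(msg), fp);
    pos      = ftell(fp);               /* file position right after the write */
    fclose(fp);

    if (pos == (long) nwritten)
    {
        printf("TEST PASSED: ftell gives: %ld\n", pos);
        return 0;
    }
    printf("TEST FAILED: wrote %lu bytes but ftell gives %ld\n",
           (unsigned long) nwritten, pos);
    return 1;
}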
>> As for the other questions:
>>
>> HPC OS version:
>> ====
>> [dpachov@login-0-0 NEWTEST]$ uname -a
>> Linux login-0-0.local 2.6.18-194.17.1.el5xen #1 SMP Mon Sep 20 07:20:39 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
>> [dpachov@login-0-0 NEWTEST]$ cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 5.2 (Tikanga)
>> ====
>>
>> GROMACS 4.5.4 built with:
>> ====
>> module purge
>> module load INTEL/intel-12.0
>> module load OPENMPI/1.4.3_INTEL_12.0
>> module load FFTW/2.1.5-INTEL_12.0   # not needed
>>
>> #####
>> # GROMACS settings
>>
>> export CC=mpicc
>> export F77=mpif77
>> export CXX=mpic++
>> export FC=mpif90
>> export F90=mpif90
>>
>> make distclean
>>
>> echo "XXXXXXX building single prec XXXXXX"
>>
>> ./configure --prefix=/home/dpachov/mymodules/GROMACS/EXEC/4.5.4-INTEL_12.0/SINGLE \
>>   --enable-mpi \
>>   --enable-shared \
>>   --program-prefix="" --program-suffix="" \
>>   --enable-float --disable-fortran \
>>   --with-fft=mkl \
>>   --with-external-blas \
>>   --with-external-lapack \
>>   --with-gsl \
>>   --without-x \
>>   CFLAGS="-O3 -funroll-all-loops" \
>>   FFLAGS="-O3 -funroll-all-loops" \
>>   CPPFLAGS="-I${MPI_INCLUDE} -I${MKL_INCLUDE}" \
>>   LDFLAGS="-L${MPI_LIB} -L${MKL_LIB} -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5"
>>
>> make -j 8 && make install
>> ====
>>
>> I just did the same test on Hopper 2:
>> http://www.nersc.gov/users/computational-systems/hopper/
>> with their GROMACS 4.5.3 build (gromacs/4.5.3(default)), and the result
>> was the same as reported earlier. You could do the test there as well, if
>> you have access, and see what you get.
>>
>> Hope that helps a bit.
>>
>> Thanks,
>> Dimitar
>>
>>> Sander
>>>
>>> On Jun 7, 2011, at 23:21, Dimitar Pachov wrote:
>>>
>>> Hello,
>>>
>>> Just a quick update after a few short tests we (my colleague and I)
>>> quickly did. First, using
>>>
>>> "You can emulate this yourself by calling "sleep 10s" before mdrun and
>>> see if that's long enough to solve the latency issue in your case."
>>>
>>> doesn't work, for a few reasons: mainly because it doesn't seem to be a
>>> latency issue, but also because the load on a node is not affected by
>>> "sleep".
>>>
>>> However, you can reproduce the behaviour I have observed pretty easily.
>>> It seems to be related to the values of the pointers to the *xtc, *trr,
>>> *edr, etc. files written at the end of the checkpoint file after abrupt
>>> crashes, AND to the frequency of access to (opening of) those files.
>>> How to test:
>>>
>>> 1. In your input *mdp file, set a high frequency of saving coordinates
>>> to, say, the *xtc (10, for example) and a low frequency for the *trr
>>> file (10,000, for example).
>>> 2. Run GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>>> 3. Abruptly kill the run shortly after that (say, after 10-100 steps).
>>> 4. You should have a few frames written in the *xtc file, and only one
>>> (the first) in the *trr file. The *cpt file should have nonzero values
>>> of "file_offset_low" for all of these files (the pointers have been
>>> updated).
>>> 5. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>>> 6. Abruptly kill the run shortly after that (say, after 10-100 steps).
>>> Note that the frequency for accessing/writing the *trr has not been
>>> reached.
>>> 7. You should have a few additional frames written in the *xtc file,
>>> while the *trr will still have only 1 frame (the first). The *cpt file
>>> has now updated all the "file_offset_low" pointer values, BUT the
>>> pointer to the *trr has acquired a value of 0. Obviously, we already
>>> know what will happen if we restart again from this last *cpt file.
>>> 8. Restart GROMACS (mdrun -s run.tpr -v -cpi -deffnm run).
>>> 9. Kill it.
>>> 10. The *trr file has size zero.
>>>
>>> Therefore, if a run is killed before the files are accessed for writing
>>> (depending on the chosen frequency), the file offset values recorded in
>>> the *cpt file do not seem to be updated accordingly, and hence a new
>>> restart inevitably leads to overwritten output files.
>>>
>>> Do you think this is fixable?
>>>
>>> Thanks,
>>> Dimitar
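To make steps 5-10 above concrete: the hazard is that an appending restart
rolls each output file back to whatever offset the checkpoint recorded, so an
offset that was never refreshed for the *trr (the zero "file_offset_low"
above) wipes it completely. The toy program below mimics only that failure
mode with an ordinary file; it is not GROMACS code, and the file name
mock_run.trr is made up.

/* Illustration of the failure mode from steps 5-10 above: an appending
 * restart truncates the output file back to the offset stored in the
 * checkpoint, so a stale offset of 0 discards everything already written.
 * NOT GROMACS code; just a mock of the logic. */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *fn    = "mock_run.trr";
    const char *frame = "FRAME-0 ...pretend this is one trajectory frame...\n";
    off_t       offset_from_checkpoint = 0;  /* the "file_offset_low = 0" case */
    FILE       *fp;

    /* "First run": one frame reaches the file before the job is killed. */
    fp = fopen(fn, "wb");
    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }
    fwrite(frame, 1, strlen(frame), fp);
    fclose(fp);

    /* "Appending restart": roll the file back to the checkpointed offset
     * before continuing.  Because the checkpoint was written before this
     * file was ever touched again, the stored offset is still 0, and the
     * existing frame is discarded (step 10 above). */
    truncate(fn, offset_from_checkpoint);

    fp = fopen(fn, "ab");
    fseek(fp, 0, SEEK_END);
    printf("size after restart: %ld bytes\n", ftell(fp));   /* prints 0 */
    fclose(fp);
    return 0;
}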
>>> On Sun, Jun 5, 2011 at 6:20 PM, Roland Schulz <rol...@utk.edu> wrote:
>>>
>>>> Two comments about the discussion:
>>>>
>>>> 1) I agree that buffered output (kernel buffers - not application
>>>> buffers) should not affect I/O. If it does, it should be filed as a bug
>>>> against the OS. Maybe someone can write a short test application which
>>>> tries to reproduce this: write to a file from one node and, immediately
>>>> after that test program is killed, write to the same file from some
>>>> other node.
>>>>
>>>> 2) We do lock files, but only the log file. The idea is that we only
>>>> need to guarantee that the set of files is accessed by one application
>>>> at a time. This seems safe, but if someone sees a way in which the
>>>> trajectory could be opened without the log file being opened, please
>>>> file a bug.
>>>>
>>>> Roland
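In case someone wants to try Roland's point 1), a bare-bones starting point
might look like the sketch below: start it in "write" mode on one node, kill
it part-way through, then run it on a second node against the same shared
file system and compare what each node sees. This is only my sketch of the
idea, not an existing test program; the file name shared_io_test.dat is
invented.

/* Sketch for Roland's point 1): "./nodetest write" appends host-tagged lines
 * until killed; running it again (in either mode) on another node shows
 * whether the earlier writes survived on the shared file system. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *fn = "shared_io_test.dat";
    char        host[256] = "unknown";
    FILE       *fp;
    int         i;

    gethostname(host, sizeof(host));

    if (argc > 1 && strcmp(argv[1], "write") == 0)
    {
        /* Append one numbered, host-tagged line per second; after a kill and
         * a restart on another node it is easy to see which writes survived. */
        fp = fopen(fn, "ab");
        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }
        for (i = 0; ; i++)
        {
            fprintf(fp, "%s wrote line %d\n", host, i);
            fflush(fp);          /* hand each line to the kernel immediately */
            sleep(1);
        }
    }

    /* Default mode: just report the file size as seen from this node. */
    fp = fopen(fn, "rb");
    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }
    fseek(fp, 0, SEEK_END);
    printf("%s sees %ld bytes\n", host, ftell(fp));
    fclose(fp);
    return 0;
}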
>>>> On Sun, Jun 5, 2011 at 10:13 AM, Mark Abraham <mark.abra...@anu.edu.au> wrote:
>>>>
>>>>> On 5/06/2011 11:08 PM, Francesco Oteri wrote:
>>>>>
>>>>> Dear Dimitar,
>>>>> I'm following the debate regarding:
>>>>>
>>>>> The point was not "why" I was getting the restarts, but the fact
>>>>> itself that I was getting restarts close in time, as I stated in my
>>>>> first post. I actually also don't know whether jobs are deleted or
>>>>> suspended. I thought that a job returned to the queue would basically
>>>>> start from the beginning when later moved to an empty slot ... so I
>>>>> don't understand the difference from that perspective.
>>>>>
>>>>> In the second mail you say:
>>>>>
>>>>> Submitted by:
>>>>> ========================
>>>>> ii=1
>>>>> ifmpi="mpirun -np $NSLOTS"
>>>>> --------
>>>>> if [ ! -f run${ii}-i.tpr ]; then
>>>>>    cp run${ii}.tpr run${ii}-i.tpr
>>>>>    tpbconv -s run${ii}-i.tpr -until 200000 -o run${ii}.tpr
>>>>> fi
>>>>>
>>>>> k=`ls md-${ii}*.out | wc -l`
>>>>> outfile="md-${ii}-$k.out"
>>>>> if [[ -f run${ii}.cpt ]]; then
>>>>>    $ifmpi `which mdrun` -s run${ii}.tpr -cpi run${ii}.cpt -v -deffnm run${ii} -npme 0 > $outfile 2>&1
>>>>> fi
>>>>> =========================
>>>>>
>>>>> If I understand well, you are submitting the SERIAL mdrun. This means
>>>>> that multiple instances of mdrun are running at the same time.
>>>>> Each instance of mdrun is an INDEPENDENT instance. Therefore checkpoint
>>>>> files, one for each instance (i.e. one for each CPU), are written at
>>>>> the same time.
>>>>>
>>>>> Good thought, but Dimitar's stdout excerpts from early in the thread do
>>>>> indicate the presence of multiple execution threads. Dynamic load
>>>>> balancing gets turned on, and the DD is 4x2x1 for his 8 processors.
>>>>> Conventionally, and by default in the installation process, the
>>>>> MPI-enabled binaries get an "_mpi" suffix, but it isn't enforced - or
>>>>> enforceable :-)
>>>>>
>>>>> Mark

--
=====================================================
Dimitar V Pachov

PhD Physics
Postdoctoral Fellow
HHMI & Biochemistry Department      Phone: (781) 736-2326
Brandeis University, MS 057         Email: dpac...@brandeis.edu
=====================================================
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists