Hi, all -

Have you tried running with constraints = hbonds? That might eliminate some of the constraint issues. LINCS is much less likely to break, or to run into DD trouble, if only the bonds to hydrogen are constrained, and 2 fs is not that big a deal for the remaining heavy-atom bonds.

Best,
Michael
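As a minimal sketch, Michael's suggestion amounts to a one-line change in the bond-constraint section of the .mdp file Justin posts at the bottom of this thread; the other constraint settings carry over unchanged:

; options for bonds
constraints = hbonds ; constrain only bonds involving hydrogen (was all-bonds)
; Type of constraint algorithm
constraint-algorithm = lincs
; Constrain the starting configuration
; since we are continuing from NPT
continuation = yes
; Highest order in the expansion of the constraint coupling matrix
lincs-order = 4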
On Thu, Mar 10, 2011 at 8:04 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
>
> Hi Matt,
>
> Thanks for the extensive explanation and tips. I'll work through things and report back. It will take a while to get things going, though (unless one of the early solutions works!), since I have no admin access to install new compilers, libraries, etc., and for some reason the only thing I can ever get to work in my home directory is Gromacs itself. The joys of an aging cluster.
>
> We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS X, so there's at least a bit of hope for one partition.
>
> Thanks again.
>
> -Justin
>
> Matthew Zwier wrote:
>>
>> Hi Justin,
>>
>> I should have specified that the segfault happened for us after we got similar warnings and errors (DD and/or LINCS), so the segfault may have been tangential. Given that everything about your system worked before GROMACS 4.5, it's possible that your older compilers are generating code that's incompatible with the GROMACS assembly loops (which you are likely running with, as they are the default option on most mainstream processors). The bug you mentioned in your original post also has my antennae twitching about bad machine code.
>>
>> If that's indeed happening, it's almost certainly some bizarre alignment issue: something like half of a float getting overwritten on the way into or out of the assembly code, and that kind of corruption would trigger the results you describe. It's also distantly possible that GROMACS is working fine, but your copy of FFTW or BLAS/LAPACK (more likely the latter) has alignment problems. One final possibility (which would explain the failure on YellowDog but unfortunately not the failure on OS X) is that GCC is generating badly-aligned code for auto-vectorized Altivec loops, which is still a problem for Intel's SIMD instructions on 32-bit x86 architectures even with GCC 4.4. I've also seen MPI gather/reduce operations foul up alignment (or rigidly enforce it where badly compiled code is relying on broken alignment) under exceedingly rare circumstances, usually involving different libraries compiled with different compilers (which is generally a bad idea for scientific code anyway).
>>
>> Okay... so, all of that said, there are a few things to try:
>>
>> 1) Recompile GROMACS using -O2 instead of -O3; that will turn off the automatic vectorizer (on Yellow Dog) and various other relatively risky optimizations (on both platforms). CFLAGS="-O2 -march=powerpc" in the environment AND on the configure command line would do that. Check your build logs to make sure it took, though, because if you don't do it exactly right, configure will ignore your directives and merrily set up GROMACS to compile with -O3, which is the most likely culprit for badly-aligned code.
>>
>> 2) Recompile GROMACS specifying a forced alignment flag. I have no experience with PowerPC, but -malign-natural and -malign-power look like good initial guesses. That's probably going to cause more problems than it solves, but if you have a screwy BLAS/LAPACK or MPI, it might help. I only suggest it because, if you've already tried #1, it will only take another half hour or hour of your time to recompile GROMACS again. Other than that, tinkering with alignment flags is a really easy way to REALLY break code, so you might consider skipping this and moving straight on to #3.
>>
>> 3) Snag GCC 4.4.4 or 4.4.5, compile it, and use that to compile GROMACS, again with -O2. GCC takes forever to compile, but beyond that, it's not as difficult as it could be. There's nothing preventing you from installing it in your home directory, either, assuming you set PATH and LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly. You might need to snag a new copy of binutils as well, if gcc refuses to compile with the system ld. This option would also probably get you threading, since you certainly have hardware support for it.
>>
>> 4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK, MPI, and GROMACS itself, with the same compiler (preferably the GCC from #3) and the same compiler options (which again should be -O2, and definitely NOT any sort of alignment flag). Put them in their own tree (like "/opt/sci"), and definitely not in /usr (which is generally managed by the system) or /usr/local (which tends to accumulate cruft). ATLAS is a good choice for BLAS, and there are directions on the ATLAS website for building a complete and optimized LAPACK based on BLAS.
>>
>> In practice, I've found I've had to do #4 for every piece of scientific software our group uses, because pretty much nothing works right with OS-installed versions of compilers, BLAS/LAPACK, and MPI. It takes forever, and it pretty much defines the phrase "learning experience," but it also essentially *never* breaks once it works (because OS updates never overwrite anything you've hand-tuned to run correctly). But with luck, option #1 will fix things quickly enough to get you running without devoting two days to rebuilding your software stack from scratch.
>>
>> Hope that helps,
>> Matt Z.
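A rough sketch of what #1 can look like for an autoconf build of GROMACS 4.5.x, folding in the PATH/LD_LIBRARY_PATH setup from #3 in case a home-built GCC is used. The GCC location, install prefix, and program suffix below are illustrative guesses, not Justin's actual layout:

# Only if a home-built GCC 4.4.x is in play (#3); point these at wherever it was installed:
export PATH=$HOME/opt/gcc-4.4.5/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/gcc-4.4.5/lib:$LD_LIBRARY_PATH   # DYLD_LIBRARY_PATH on OS X

# Force -O2 both in the environment and on the configure line, as #1 describes.
# The flags are copied from #1; some PowerPC gcc builds prefer -mcpu over -march.
export CC=gcc
export CFLAGS="-O2 -march=powerpc"
make distclean                       # only if reconfiguring a previously built tree
./configure CC=gcc CFLAGS="-O2 -march=powerpc" --enable-mpi \
    --program-suffix=_mpi --prefix=$HOME/opt/gromacs-4.5.3
make && make install

Per Matt's warning, the important part is then checking config.log and the compile lines in the make output to confirm that -O3 has not crept back in.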
>>
>> On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
>>>
>>> Hi Matt,
>>>
>>> Thanks for the reply. I can't trace the problem to a specific compiler. We have a PowerPC cluster with two partitions - one running Mac OS X 10.3 with gcc-3.3, the other running YellowDog Linux with gcc-4.2.2. The problem happens on both partitions. There are no seg faults; the runs just exit (MPI_ABORT) after the fatal error (either "too many LINCS warnings" or the DD-related error I posted before).
>>>
>>> We are using MPI: mpich-1.2.5 on OS X and OpenMPI-1.2.3 on Linux. All of the above has been the same since my successful 3.3.3 TI calculations (as well as all of my simulations with Gromacs, ever). Our hardware and compilers are somewhat (very) outdated, so threading is not supported; we always use MPI.
>>>
>>> Gromacs was compiled in single precision using standard options through autoconf. The cmake build system still does not work on our cluster due to several outstanding bugs.
>>>
>>> -Justin
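Given the two MPI stacks Justin describes (mpich-1.2.5 on OS X, OpenMPI-1.2.3 on Linux), one cheap sanity check before rebuilding anything is to ask each MPI compiler wrapper which compiler and flags it actually invokes; the option spelling differs between the two implementations:

# MPICH (the OS X partition): print the underlying compile command without running it
mpicc -show

# Open MPI (the Linux partition): same idea, different option name
mpicc --showme

# Compare against the compiler GROMACS itself was configured with
gcc --version

If the wrappers report a different gcc than the one used to build GROMACS, that is exactly the mixed-compiler situation Matt cautions against above.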
>>>
>>> Matthew Zwier wrote:
>>>>
>>>> Dear Justin,
>>>>
>>>> We recently experienced a similar problem (LINCS errors, step*.pdb files), and then GROMACS usually segfaulted. The cause was a miscompiled copy of GROMACS. Another member of our group had compiled GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the copy without modification on an AMD Magny Cours machine. Recompilation with the correct subarchitecture type (-march=amdfam10) fixed the problem. Don't really know why it didn't die with SIGILL or SIGBUS instead of SIGSEGV, but that's probably a question for the hardware gurus.
>>>>
>>>> So... are you observing segfaults? What compiler are you using (and on what OS)? What were the compilation parameters for 4.5.3? Also, are you really running across nodes with MPI, or running on the same node with MPI?
>>>>
>>>> Cheers,
>>>> Matt Zwier
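On Matt's last question (one node or many), a quick way to see where the MPI ranks actually land is to launch a trivial command the same way mdrun is launched; the -np value here is just an example, and the command works with either MPI mentioned later in the thread:

# Each rank prints the host it runs on; four identical names means a single-node run
mpirun -np 4 hostname

Combined with the Core2-versus-Magny Cours story above, this also tells you which hosts' CPUs a recompiled binary actually has to match.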
>>>>
>>>> On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I've been troubleshooting a problem for some time now, and I wanted to report it here and solicit some feedback before I submit a bug report, to see if there's anything else I can try.
>>>>>
>>>>> Here's the situation: I ran some free energy calculations (thermodynamic integration) a long time ago using version 3.3.3 to determine the hydration free energy of a series of small molecules. Results were good and they ended up as part of a paper, so I'm trying to reproduce the methodology with 4.5.3 (using BAR) to see if I understand the workflow completely. The problem is that my systems are crashing. The runs simply stop randomly (usually within a few hundred ps) with lots of LINCS warnings and step*.pdb files being written.
>>>>>
>>>>> I know the parameters are good and produce stable trajectories, since I spent months on them some years ago. The system prep is steepest descents EM to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at 298 K/1 bar for 100 ps, then 5 ns of data collection under NPT conditions. Here's the rundown of what I'm seeing:
>>>>>
>>>>> 1. All LJ transformations work fine. The problem only comes when I have a molecule with full LJ interaction and I am "charging" it (i.e., introducing charges to the partially-interacting species).
>>>>>
>>>>> 2. Simulations at lambda=1 (full interaction) work fine.
>>>>>
>>>>> 3. Simulations with the free energy code off entirely work fine under all conditions.
>>>>>
>>>>> 4. I cannot run in serial due to http://redmine.gromacs.org/issues/715. The bug seems to affect other systems and is not specifically related to my free energy calculations.
>>>>>
>>>>> 5. Running with DD fails because my system is relatively small (more on this in a moment).
>>>>>
>>>>> 6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value of lambda != 1.
>>>>>
>>>>> 7. I created a larger system (instead of a 3x3x3-nm cube of water with my molecule, I used 4x4x4) and ran on 4 CPUs with DD (lambda = 0, i.e. full vdW, no intermolecular Coulombic interactions - .mdp file is below). This run also crashed with some warnings about DD cell size:
>>>>>
>>>>> DD load balancing is limited by minimum cell size in dimension X
>>>>> DD step 329999 vol min/aver 0.748! load imb.: force 31.5%
>>>>>
>>>>> ...and then the actual crash:
>>>>>
>>>>> -------------------------------------------------------
>>>>> Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
>>>>> Source code file: domdec_con.c, line: 693
>>>>>
>>>>> Fatal error:
>>>>> DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order or use the -rcon option of mdrun.
>>>>> For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
>>>>> -------------------------------------------------------
>>>>>
>>>>> Watching the trajectory doesn't seem to give any useful information. The small molecule of interest is at a periodic boundary when the crash happens, but there are several crossings prior to the crash without incident, so I don't know whether the issue is related to PBC; it appears not to be.
>>>>>
>>>>> 8. I initially thought the problem might be related to the barostat, but switching from P-R to Berendsen does not alleviate the problem, nor does increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0 - all crash). Longer tau_p simply delays the crash, but does not prevent it.
>>>>>
>>>>> So after all that, I'm wondering (1) whether anyone has seen the same, and (2) whether there's anything else I can try (environment variables, hidden tricks, etc.) to get to the bottom of this before I give up and file a bug report.
>>>>>
>>>>> If you made it this far, thanks for reading my novel, and hopefully someone can give me some ideas. The .mdp file I'm using is below, but it is just one of many that I've tried. In theory, it should work, since the parameters are the same as in my successful 3.3.3 runs, with the exception of the new free energy features in 4.5.3 and obvious keyword changes related to the difference in version.
>>>>>
>>>>> -Justin
>>>>>
>>>>> --- .mdp file ---
>>>>>
>>>>> ; Run control
>>>>> integrator = sd ; Langevin dynamics
>>>>> tinit = 0
>>>>> dt = 0.002
>>>>> nsteps = 2500000 ; 5 ns
>>>>> nstcomm = 100
>>>>> ; Output control
>>>>> nstxout = 500
>>>>> nstvout = 500
>>>>> nstfout = 0
>>>>> nstlog = 500
>>>>> nstenergy = 500
>>>>> nstxtcout = 0
>>>>> xtc-precision = 1000
>>>>> ; Neighborsearching and short-range nonbonded interactions
>>>>> nstlist = 5
>>>>> ns_type = grid
>>>>> pbc = xyz
>>>>> rlist = 0.9
>>>>> ; Electrostatics
>>>>> coulombtype = PME
>>>>> rcoulomb = 0.9
>>>>> ; van der Waals
>>>>> vdw-type = cutoff
>>>>> rvdw = 1.4
>>>>> ; Apply long range dispersion corrections for Energy and Pressure
>>>>> DispCorr = EnerPres
>>>>> ; Spacing for the PME/PPPM FFT grid
>>>>> fourierspacing = 0.12
>>>>> ; EWALD/PME/PPPM parameters
>>>>> pme_order = 4
>>>>> ewald_rtol = 1e-05
>>>>> epsilon_surface = 0
>>>>> optimize_fft = no
>>>>> ; Temperature coupling
>>>>> ; tcoupl is implicitly handled by the sd integrator
>>>>> tc_grps = system
>>>>> tau_t = 1.0
>>>>> ref_t = 298
>>>>> ; Pressure coupling is on for NPT
>>>>> Pcoupl = Berendsen
>>>>> tau_p = 2.0
>>>>> compressibility = 4.5e-05
>>>>> ref_p = 1.0
>>>>> ; Free energy control stuff
>>>>> free_energy = yes
>>>>> init_lambda = 0.00
>>>>> delta_lambda = 0
>>>>> foreign_lambda = 0.05
>>>>> sc-alpha = 0
>>>>> sc-power = 1.0
>>>>> sc-sigma = 0
>>>>> couple-moltype = MOR ; name of moleculetype to couple
>>>>> couple-lambda0 = vdw ; vdW interactions
>>>>> couple-lambda1 = vdw-q ; turn on everything
>>>>> couple-intramol = no
>>>>> dhdl_derivatives = yes ; this line (and the next two) are defaults
>>>>> separate_dhdl_file = yes ; included only for pedantry
>>>>> nstdhdl = 10
>>>>> ; Do not generate velocities
>>>>> gen_vel = no
>>>>> ; options for bonds
>>>>> constraints = all-bonds
>>>>> ; Type of constraint algorithm
>>>>> constraint-algorithm = lincs
>>>>> ; Constrain the starting configuration
>>>>> ; since we are continuing from NPT
>>>>> continuation = yes
>>>>> ; Highest order in the expansion of the constraint coupling matrix
>>>>> lincs-order = 4
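On the DD fatal error quoted above, the message itself names three knobs: fewer DD cells, a lower lincs-order, or mdrun's -rcon option. A hedged sketch of the first and third for the 4-core run Justin describes; the -deffnm name, the -np count, and the numeric values are illustrative only:

# Force a coarser DD grid (2x2x1; -npme 0 keeps all four ranks on particle-particle work):
mpirun -np 4 mdrun_4.5.3_gcc_mpi -deffnm charge_l0 -npme 0 -dd 2 2 1

# Or keep the automatic grid but enlarge the constraint communication distance (in nm):
mpirun -np 4 mdrun_4.5.3_gcc_mpi -deffnm charge_l0 -rcon 0.6

# The lincs-order route is an .mdp-side change (a lower order shortens the required
# communication range, at some cost in constraint accuracy) and needs a new .tpr from grompp.

Whether any of these actually helps is unclear given the suspicion of miscompiled binaries upthread; they address the DD symptom, not the LINCS blow-ups at other lambda values.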
>
> --
> ========================================
>
> Justin A. Lemkul
> Ph.D. Candidate
> ICTAS Doctoral Scholar
> MILES-IGERT Trainee
> Department of Biochemistry
> Virginia Tech
> Blacksburg, VA
> jalemkul[at]vt.edu | (540) 231-9080
> http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
>
> ========================================

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists