Hi All,
Unfortunately, my positive news was a false alarm. Intermediate values of
lambda are sporadically failing at different times. Some jobs exit at time
zero, others last somewhat longer (several hundred ps) before failing. I'm
going to keep working on the suggestions I got before about compilation options
and such, but for now I will post the error output to see if it provides any
clues about what's wrong:
Warning: Only triclinic boxes with the first vector parallel to the x-axis and
the second vector in the xy-plane are supported.
Warning: Only triclinic boxes with the first vector parallel to the x-axis and
the second vector in the xy-plane are supported.
Box (3x3):
Box[ 0]={ nan, 0.00000e+00, 0.00000e+00}
Box[ 1]={ nan, nan, nan}
Box[ 2]={ nan, nan, nan}
Can not fix pbc.
Warning: Only triclinic boxes with the first vector parallel to the x-axis and
the second vector in the xy-plane are supported.
Box (3x3):
Box[ 0]={ nan, 0.00000e+00, 0.00000e+00}
Box[ 1]={ nan, nan, nan}
Box[ 2]={ nan, nan, nan}
Can not fix pbc.
Box (3x3):
Box[ 0]={ nan, 0.00000e+00, 0.00000e+00}
Box[ 1]={ nan, nan, nan}
Box[ 2]={ nan, nan, nan}
Can not fix pbc.
Warning: Only triclinic boxes with the first vector parallel to the x-axis and
the second vector in the xy-plane are supported.
Box (3x3):
Box[ 0]={ nan, 0.00000e+00, 0.00000e+00}
Box[ 1]={ nan, nan, nan}
Box[ 2]={ nan, nan, nan}
Can not fix pbc.
[n921:01432] *** Process received signal ***
[n921:01432] Signal: Segmentation fault (11)
[n921:01432] Signal code: Address not mapped (1)
[n921:01432] Failing at address: 0xfffffffe10af11c0
[n921:01432] [ 0] [0x100428]
[n921:01432] [ 1] [0xfffffcc01f4]
[n921:01432] [ 2] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(gmx_pme_do-0x51df30) [0x1010c0e8]
[n921:01432] [ 3] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(do_force_lowlevel-0x560cd0) [0x100c88c8]
[n921:01432] [ 4] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(do_force-0x50b084) [0x1011f594]
[n921:01432] [ 5] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(do_md-0x5bbb78) [0x1006cac0]
[n921:01432] [ 6] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(mdrunner-0x5c26f8) [0x10065e98]
[n921:01432] [ 7] /home/rdiv1001/gromacs-4.5.3-linux_gcc445_fftw322/bin/mdrun_4.5.3_gcc_mpi(main-0x5b7964) [0x10070cec]
[n921:01432] [ 8] /lib64/libc.so.6 [0x80aaed8858]
[n921:01432] [ 9] /lib64/libc.so.6(__libc_start_main-0x1471f8) [0x80aaed8ad0]
[n921:01432] *** End of error message ***
I have not yet compiled with debugging enabled, but I can if that will be
useful.
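If I do go the debugging route, I'd build something along these lines; this is only a sketch, and the install prefix, program suffix, and MPI wrapper are assumptions rather than our exact setup:

  # debug rebuild sketch (paths and suffix are placeholders)
  export CC=mpicc
  export CFLAGS="-g -O0"
  ./configure --prefix=$HOME/opt/gromacs-4.5.3-dbg \
              --enable-mpi \
              --program-suffix=_4.5.3_gcc_mpi_dbg
  make && make install

That should at least let gdb or addr2line turn the raw addresses in a trace like the one above into file/line locations.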
-Justin
Justin A. Lemkul wrote:
Hi All,
I believe I have a resolution to all of this. It comes down to
compilers and FFTW. I had always used the same compilers to build all
my Gromacs versions (gcc-4.2.2 on Linux, gcc-3.3 on OS X) with the same
version of FFTW (3.0.1), primarily for reasons of continuity. Bug 715
that I posted before was due to a problem with FFTW, not Gromacs;
upgrading to 3.2.2 solves the issue. Likewise, the combination of
gcc-4.4.5 + FFTW 3.2.2 allows for successful free energy runs in my
tests thus far on Linux. Dependency issues on the OS X partition
prevent me from doing any upgrades, so we're out of luck there,
unfortunately. I will report back if any other weird behavior results,
but I believe the problem is solved.
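For anyone wanting to do the same on a cluster without admin access, the upgrade can live entirely in a home directory. A rough sketch (the prefixes are just examples, and I'm assuming a single-precision GROMACS build):

  # build single-precision FFTW 3.2.2 under $HOME, then point the GROMACS configure at it
  tar xzf fftw-3.2.2.tar.gz && cd fftw-3.2.2
  ./configure --enable-float --prefix=$HOME/opt/fftw-3.2.2
  make && make install
  cd ../gromacs-4.5.3
  export CPPFLAGS=-I$HOME/opt/fftw-3.2.2/include
  export LDFLAGS=-L$HOME/opt/fftw-3.2.2/lib
  ./configure --prefix=$HOME/opt/gromacs-4.5.3 --enable-mpi
  make && make install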
I am no compiler/library guru, but it would seem to me that we've
reached a point where certain minimum versions must be required for
certain prerequisites. I would suggest that prior to the next release,
we come up with a list of what should be considered stable minimum
requirements (compiler versions, FFTW, anything else) to post on the
website and perhaps the manual. As it stands now, the prerequisites for
installation would seem to be satisfied by any C compiler and any
version of FFTW. Gromacs happily compiles under most conditions, but
shows some weird behavior if one is not using optimal dependencies.
Thanks to all for their input!
-Justin
Justin A. Lemkul wrote:
Michael Shirts wrote:
Hi, all-
Have you tried running with constraints = hbonds? That might eliminate
some of the constraint issues. It's much less likely for LINCS to break
or to have DD issues if only the hbonds are constrained; 2 fs is not
that big a deal for the heteroatom bonds.
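(Relative to the .mdp further down in this thread, that change would look something like the lines below; I believe the value is spelled h-bonds in the 4.5 mdp options:)

  ; options for bonds
  constraints = h-bonds ; constrain only bonds to hydrogen, instead of all-bonds
  constraint-algorithm = lincs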
I haven't yet, but I'll add it to my to-do list. I was trying to keep
as many things consistent between my 3.3.3 and 4.5.3 input files as
possible, so I could diagnose any issues, but at this point, anything
is worth a shot.
Thanks!
-Justin
Best,
Michael
On Thu, Mar 10, 2011 at 8:04 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
Hi Matt,
Thanks for the extensive explanation and tips. I'll work through things and report back. It will take a while to get things going through (unless one of the early solutions works!) since I have no admin access to install new compilers, libraries, etc. and for some reason the only thing I can ever get to work in my home directory is Gromacs itself. The joys of an aging cluster. We recently got access to gcc-4.4.5 on Linux, but we're stuck with 3.3 on OS X, so there's at least a bit of hope for one partition.
Thanks again.
-Justin
Matthew Zwier wrote:
Hi Justin,
I should have specified that the segfault happened for us after we got
similar warnings and errors (DD and/or LINCS), so the segfault may
have been tangential. Given that everything about your system worked
before GROMACS 4.5, it's possible that your older compilers are
generating code that's incompatible with the GROMACS assembly loops
(which you are likely running with, as they are the default option on
most mainstream processors). The bug you mentioned in your original
post also has my antennae twitching about bad machine code.
If that's indeed happening, it's almost certainly some bizarre
alignment issue: something like half of a float getting overwritten
on the way into or out of the assembly code, and that kind of corruption
would trigger the results you describe. It's also distantly possible that
GROMACS is working fine, but your copy of FFTW or BLAS/LAPACK (more
likely the latter) has alignment problems. One final possibility
(which would explain the failure on YellowDog but unfortunately not
the failure on OS X) is that GCC is generating badly-aligned code for
auto-vectorized Altivec loops, which is still a problem for Intel's
SIMD instructions on 32-bit x86 architectures even with GCC 4.4. I've
also observed MPI gather/reduce operations to foul up alignment (or
rigidly enforce it where badly compiled code is relying on broken
alignment) under exceedingly rare circumstances, usually involving
different libraries compiled with different compilers (which is
generally a bad idea for scientific code anyway).
Okay...so all of that said, there are a few things to try:
1) Recompile GROMACS using -O2 instead of -O3; that'll turn off the
automatic vectorizer (on Yellow Dog) and various other relatively
risky optimizations (on both platforms). CFLAGS="-O2 -march=powerpc"
in the environment AND on the configure command line would do that
(see the sketch after this list). Check your build logs to make sure
it took, though, because if you don't do it exactly right, configure
will ignore your directives and merrily set up GROMACS to compile with
-O3, which is the most likely culprit for badly-aligned code.
2) Recompile GROMACS specifying a forced alignment flag. I have no
experience with PowerPC, but -malign-natural and -malign-power look
like good initial guesses. That's probably going to cause more
problems than it solves, but if you have a screwy BLAS/LAPACK or MPI,
it might help. I only suggest it because if you've already tried #1,
it will only take another half hour or hour of your time to recompile
GROMACS again. Other than that, tinkering with alignment flags is a
really easy way to REALLY break code, so you might consider skipping
this and moving straight on to #3.
3) Snag GCC 4.4.4 or 4.4.5 and compile it, and use that to compile
GROMACS, again with -O2. GCC takes forever to compile, but beyond
that, it's not as difficult as it could be. Nothing preventing you
from installing it in your home directory, either, assuming you set
PATH and LD_LIBRARY_PATH (or DYLD_LIBRARY_PATH on OS X) properly. You
might need to snag a new copy of binutils as well, if gcc refuses to
compile with the system ld. This option would also probably get you
threading, since you certainly have hardware support for it.
4) Rebuild your entire GROMACS stack, including FFTW, BLAS/LAPACK,
MPI, and GROMACS itself with the same compiler (preferably GCC from
#3) and the same compiler options (which again should be -O2, and
definitely NOT any sort of alignment flag). Put them in their own
tree (like "/opt/sci"), and definitely not in /usr (which is generally
managed by the system) or /usr/local (which tends to accumulate
cruft). ATLAS is a good choice for BLAS, and there are directions on
the ATLAS website for building a complete and optimized LAPACK based
on BLAS.
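To make #1 and #3 concrete, here's the kind of thing I mean; the paths, the GCC install location, and the MPI wrapper are placeholders for whatever your cluster actually uses, so treat this as a sketch rather than a recipe:

  # assumed: a locally built GCC 4.4.x installed under $HOME/opt/gcc-4.4
  export PATH=$HOME/opt/gcc-4.4/bin:$PATH
  export LD_LIBRARY_PATH=$HOME/opt/gcc-4.4/lib:$LD_LIBRARY_PATH   # DYLD_LIBRARY_PATH on OS X
  cd gromacs-4.5.3
  make distclean                          # only if the tree was configured before
  export CFLAGS="-O2 -march=powerpc"
  ./configure --prefix=$HOME/opt/gromacs-4.5.3 --enable-mpi CFLAGS="$CFLAGS"
  make 2>&1 | tee build.log               # grep build.log for -O3 afterwards
  make install

If -O3 shows up anywhere in build.log, configure ignored the flags and you're back to the badly-aligned-code suspect.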
In practice, I've found I've had to do #4 for every piece of
scientific software our group uses, because pretty much nothing works
right with OS-installed versions of compilers, BLAS/LAPACK, and MPI.
It takes forever, and it pretty much defines the phrase "learning
experience," but it also essentially *never* breaks once it works
(because OS updates never overwrite anything you've hand-tuned to run
correctly). But...with luck option #1 will fix things quickly enough
to get you running without devoting two days to rebuilding your
software stack from scratch.
Hope that helps,
Matt Z.
On Thu, Mar 10, 2011 at 8:54 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
Hi Matt,
Thanks for the reply. I can't trace the problem to a specific compiler. We have a PowerPC cluster with two partitions - one running Mac OS X 10.3 with gcc-3.3, the other running YellowDog Linux with gcc-4.2.2. The problem happens on both partitions. There are no seg faults; the runs just exit (MPI_ABORT) after the fatal error (either "too many LINCS warnings" or the DD-related error I posted before).
We are using MPI: mpich-1.2.5 on OS X and OpenMPI-1.2.3 on Linux. All of the above has been the same since my successful 3.3.3 TI calculations (as well as all of my simulations with Gromacs, ever). Our hardware and compilers are somewhat (very) outdated, so threading is not supported; we always use MPI.
Gromacs was compiled in single precision using standard options through autoconf. The cmake build system still does not work on our cluster due to several outstanding bugs.
-Justin
Matthew Zwier wrote:
Dear Justin,
We recently experienced a similar problem (LINCS errors, step*.pdb files), and then GROMACS usually segfaulted. The cause was a miscompiled copy of GROMACS. Another member of our group had compiled GROMACS on an Intel Core2 quad (gcc -march=core2) and tried to run the copy without modification on an AMD Magny Cours machine. Recompilation with the correct subarchitecture type (-march=amdfam10) fixed the problem. I don't really know why it didn't die with SIGILL or SIGBUS instead of SIGSEGV, but that's probably a question for the hardware gurus.
So...are you observing segfaults? What compiler are you using (and on what OS)? What were the compilation parameters for 4.5.3? Also, are you really running across nodes with MPI, or running on the same node with MPI?
Cheers,
Matt Zwier
On Thu, Mar 10, 2011 at 1:55 PM, Justin A. Lemkul <jalem...@vt.edu> wrote:
Hi All,
I've been troubleshooting a problem for some time now and I wanted to report it here and solicit some feedback before I submit a bug report, to see if there's anything else I can try.
Here's the situation: I ran some free energy calculations (thermodynamic integration) a long time ago using version 3.3.3 to determine the hydration free energy of a series of small molecules. Results were good and they ended up as part of a paper, so I'm trying to reproduce the methodology with 4.5.3 (using BAR) to see if I understand the workflow completely. The problem is my systems are crashing. The runs simply stop randomly (usually within a few hundred ps) with lots of LINCS warnings and step*.pdb files being written.
I know the parameters are good, and produce stable trajectories, since I spent months on them some years ago. The system prep is steepest descents EM to Fmax < 100 (always achieved), NVT at 298 K for 100 ps, NPT at 298 K/1 bar for 100 ps, then 5 ns of data collection under NPT conditions. Here's the rundown of what I'm seeing:
1. All LJ transformations work fine. The problem only comes when I have a molecule with full LJ interaction and I am "charging" it (i.e., introducing charges to the partially-interacting species).
2. Simulations at lambda=1 (full interaction) work fine.
3. Simulations with the free energy code off entirely work fine under all conditions.
4. I cannot run in serial due to http://redmine.gromacs.org/issues/715. The bug seems to affect other systems and is not specifically related to my free energy calculations.
5. Running with DD fails because my system is relatively small (more on this in a moment).
6. Running with mdrun -pd 2 works, but mdrun -pd 4 crashes for any value of lambda != 1.
7. I created a larger system (instead of a 3x3x3-nm cube of water with my molecule, I used 4x4x4) and ran on 4 CPU's with DD (lambda = 0, i.e. full vdW, no intermolecular Coulombic interactions - .mdp file is below). This run also crashed with some warnings about DD cell size:
DD load balancing is limited by minimum cell size in dimension X
DD step 329999 vol min/aver 0.748! load imb.: force 31.5%
...and then the actual crash:
-------------------------------------------------------
Program mdrun_4.5.3_gcc_mpi, VERSION 4.5.3
Source code file: domdec_con.c, line: 693
Fatal error:
DD cell 0 0 0 could only obtain 14 of the 15 atoms that are connected via constraints from the neighboring cells. This probably means your constraint lengths are too long compared to the domain decomposition cell size. Decrease the number of domain decomposition grid cells or lincs-order or use the -rcon option of mdrun.
For more information and tips for troubleshooting, please check the GROMACS website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------
Watching the trajectory doesn't seem to give any useful information. The small molecule of interest is at a periodic boundary when the crash happens, but there are several crosses prior to the crash without incident, so I don't know if the issue is related to PBC or not, but it appears not. (The options the error message refers to are sketched just after point 8 below.)
8. I initially thought the problem might be related to the barostat, but switching from P-R to Berendsen does not alleviate the problem, nor does increasing tau_p (tested 0.5, 1.0, 2.0, and 5.0 - all crash). Longer tau_p simply delays the crash, but does not prevent it.
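For concreteness, the workarounds that fatal error names translate into commands roughly like these (the rank counts, -deffnm name, and -rcon value are placeholders, not values from my runs):

  # fewer DD cells: run on 2 ranks instead of 4
  mpirun -np 2 mdrun_4.5.3_gcc_mpi -deffnm prod_lambda0
  # or keep 4 ranks and set -rcon explicitly (value in nm; 0.7 is only a placeholder)
  mpirun -np 4 mdrun_4.5.3_gcc_mpi -deffnm prod_lambda0 -rcon 0.7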
So after all that, I'm wondering if (1) anyone has seen the same, or (2) if there's anything else I can try (environment variables, hidden tricks, etc.) that I can use to get to the bottom of this before I give up and file a bug report.
If you made it this far, thanks for reading my novel and hopefully someone can give me some ideas. The .mdp file I'm using is below, but it is just one of many that I've tried. In theory, it should work, since the parameters are the same as my successful 3.3.3 runs, with the exception of the new free energy features in 4.5.3 and obvious keyword changes related to the difference in version.
-Justin
--- .mdp file ---
; Run control
integrator = sd ; Langevin dynamics
tinit = 0
dt = 0.002
nsteps = 2500000 ; 5 ns
nstcomm = 100
; Output control
nstxout = 500
nstvout = 500
nstfout = 0
nstlog = 500
nstenergy = 500
nstxtcout = 0
xtc-precision = 1000
; Neighborsearching and short-range nonbonded interactions
nstlist = 5
ns_type = grid
pbc = xyz
rlist = 0.9
; Electrostatics
coulombtype = PME
rcoulomb = 0.9
; van der Waals
vdw-type = cutoff
rvdw = 1.4
; Apply long range dispersion corrections for Energy and Pressure
DispCorr = EnerPres
; Spacing for the PME/PPPM FFT grid
fourierspacing = 0.12
; EWALD/PME/PPPM parameters
pme_order = 4
ewald_rtol = 1e-05
epsilon_surface = 0
optimize_fft = no
; Temperature coupling
; tcoupl is implicitly handled by the sd integrator
tc_grps = system
tau_t = 1.0
ref_t = 298
; Pressure coupling is on for NPT
Pcoupl = Berendsen
tau_p = 2.0
compressibility = 4.5e-05
ref_p = 1.0
; Free energy control stuff
free_energy = yes
init_lambda = 0.00
delta_lambda = 0
foreign_lambda = 0.05
sc-alpha = 0
sc-power = 1.0
sc-sigma = 0
couple-moltype = MOR ; name of moleculetype to couple
couple-lambda0 = vdw ; vdW interactions
couple-lambda1 = vdw-q ; turn on everything
couple-intramol = no
dhdl_derivatives = yes ; this line (and the next two) are defaults
separate_dhdl_file = yes ; included only for pedantry
nstdhdl = 10
; Do not generate velocities
gen_vel = no
; options for bonds
constraints = all-bonds
; Type of constraint algorithm
constraint-algorithm = lincs
; Constrain the starting configuration
; since we are continuing from NPT
continuation = yes
; Highest order in the expansion of the constraint coupling matrix
lincs-order = 4
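(For completeness: with separate_dhdl_file and foreign_lambda set as above, my understanding is that each lambda window writes its own dhdl.xvg, and the BAR estimate comes from handing all of them to g_bar; the filenames here are placeholders:)

  # one dhdl.xvg per lambda window, collected from the individual run directories
  g_bar -f lambda_*/dhdl.xvg -o bar.xvg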
--
========================================
Justin A. Lemkul
Ph.D. Candidate
ICTAS Doctoral Scholar
MILES-IGERT Trainee
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
========================================
--
gmx-users mailing list gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists