Mark,

Since I was working with 4.6.2, I built 4.6.3 to see if this was the result of a bug in 4.6.2. It isn't: I get the same error with 4.6.3. Still, 4.6.3 is the version I'll be working with from now on, since it's the latest; if the problem occurs with both versions, we might as well try to fix it in the latest one, right?

I compiled 4.6.3 with the following options to include debugging information:

cmake .. \
  -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
  -DBUILD_SHARED_LIBS=OFF \
  -DGMX_MPI=ON \
  -DCMAKE_C_FLAGS="-O0 -g -qstrict -qarch=450 -qtune=450" \
  -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.3 \
  -DGMX_CPU_ACCELERATION=None \
  -DGMX_THREAD_MPI=OFF \
  -DGMX_OPENMP=OFF \
  -DGMX_DEFAULT_SUFFIX=ON \
  -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
  2>&1 | tee cmake.log

For -qarch, I removed the 'd' from the end so that the double FPU isn't used, since that can cause problems if the data isn't aligned correctly. -qstrict prevents certain optimizations from being performed; it should be superfluous at optimization levels below 3, but I threw it in just to be safe, along with -O0. (Of course, I think -g turns off all optimizations anyway.)

On the BG/P, I had to install FFTW 3 separately, and it wasn't built with debugging enabled, so there are no symbols for FFTW.
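If the FFTW frames turn out to matter, I can rebuild it with symbols too. Roughly what that would look like (just a sketch, not something I've run yet; the compiler wrapper, host triplet, and the -debug install prefix are my guesses for our setup):

# Rebuild FFTW 3.3.2 with debug info for the BG/P compute nodes.
# bgxlc_r, the host triplet, and the install prefix are assumptions.
./configure CC=bgxlc_r \
  CFLAGS="-O0 -g -qarch=450 -qtune=450" \
  --host=powerpc-bgp-linux \
  --enable-float \
  --enable-static --disable-shared \
  --prefix=/scratch/bgapps/fftw-3.3.2-debug
make
make install

Gromacs would then just need -DCMAKE_PREFIX_PATH pointed at the new prefix.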

One of my coworkers wrote a script that converts BG/P core files to stack traces. In all the core files I've looked at so far (9 out of 64), the stack ends in a vfprintf call. For example:

-------------------------------------------------------------

/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/resolv/res_init.c:414
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/wgenops.c:419
/scratch/pbisbal/build/gromacs-4.6.3/src/gmxlib/nonbonded/nb_kernel_c/nb_kernel_ElecRFCut_VdwBhamSh_GeomW4P1_c.c:673
??:0
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/sys/dcmf/../ccmi/executor/Broadcast.h:83
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/reduce/reduce_algorithms.c:69
/bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/bcast/bcast_algorithms.c:227
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:779
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:762
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:374
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/calcmu.c:88
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/mdrun.c:113
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1492
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
../stdio-common/printf_fphex.c:335
../stdio-common/printf_fphex.c:452
??:0
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819

-----------------------------------------------------------------

Another node with a different stack looks like this:

---------------------------------------------------------------

/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/genops.c:982
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/string/memcpy.c:159
/scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/ns.c:423
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1646
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
/scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
../stdio-common/printf_fphex.c:335
../stdio-common/printf_fphex.c:452
??:0
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
/bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819

---------------------------------------------------------------

All the stacks look like one of these two.
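In case it helps, the conversion itself isn't anything exotic. This is my own sketch of the idea, not my coworker's actual script: the BG/P lightweight core files are plain text, and the stack section is just a list of hex addresses that addr2line can translate against the binary. The section markers, core file name, and cross-toolchain addr2line path below are assumptions for our driver level:

CORE=core.0
BINARY=/scratch/bgapps/gromacs-4.6.3/bin/mdrun_mpi
ADDR2LINE=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-addr2line

# Pull the hex addresses out of the STACK section of the lightweight
# core file and translate each one to file:line against mdrun_mpi.
# (Addresses that don't point at code show up as ??:0.)
sed -n '/+++STACK/,/---STACK/p' "$CORE" \
  | grep -o '0x[0-9a-fA-F]*' \
  | xargs "$ADDR2LINE" -e "$BINARY"

Since the build used -g, the Gromacs frames resolve to the file:line entries shown above.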

Is any of this information useful? My coworker, who has a lot of experience developing for Blue Gene/P systems, says this looks like an I/O problem, but he doesn't have time to dig into the Gromacs source code for us. I'm willing to do some digging myself, but some guidance from someone who knows the code well would be very helpful.

Prentice



On 08/06/2013 08:19 PM, Mark Abraham wrote:
That all looks fine so far. The core file processor won't help unless
you've compiled with -g. Hopefully cmake -DCMAKE_BUILD_TYPE=Debug will
do that, but I haven't actually checked that really works. If not, you
might have to hack cmake/Platform/BlueGeneP-static-XL-C.cmake.

Anyway, if you can compile with -g, then the core file will tell us in
what function it is dying, which might help locate the problem.

Mark

On Tue, Aug 6, 2013 at 11:43 PM, Prentice Bisbal
<prentice.bis...@rutgers.edu> wrote:
Dear GMX-users,

I need some assistance running Gromacs 4.6.3 on a Blue Gene/P. Although I
have a background in Chemistry, I'm an experienced professional HPC admin
who's relatively new to supporting Blue Genes and Gromacs. My first Gromacs
user is having trouble running Gromacs on our BG/P. His jobs die and dump
core, with no obvious signs (not to me, at least) of where the problem lies.

I compiled Gromacs 4.6.3 with the following options:

------------------------------------------snip-------------------------------------------

cmake .. \
-DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
   -DBUILD_SHARED_LIBS=OFF \
   -DGMX_MPI=ON \
   -DCMAKE_C_FLAGS="-O3 -qarch=450d -qtune=450" \
   -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.2 \
   -DGMX_CPU_ACCELERATION=None \
   -DGMX_THREAD_MPI=OFF \
   -DGMX_OPENMP=OFF \
   -DGMX_DEFAULT_SUFFIX=ON \
   -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
    2>&1 | tee cmake.log

------------------------------------------snip-------------------------------------------

When one of my users submits a job, it dumps core. My scheduler is
LoadLeveler, and I used this JCF file to replicate the problem. I added the
'-debug 1' flag after searching the gmx-users archives:

------------------------------------------snip-------------------------------------------

#!/bin/bash
# @ job_name = xiang
# @ job_type = bluegene
# @ bg_size = 64
# @ class = small
# @ wall_clock_limit = 01:00:00,00:50:00
# @ error = job.$(Cluster).$(Process).err
# @ output = job.$(Cluster).$(Process).out
# @ environment = COPY_ALL;
# @ queue

source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash

/bgsys/drivers/ppcfloor/bin/mpirun /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 -v -dlb yes -debug 1

------------------------------------------snip-------------------------------------------

The stderr file shows this at the bottom, which isn't too helpful:

------------------------------------------snip-------------------------------------------

Reading file sbm-b_dyn3.tpr, VERSION 4.6.2 (single precision)

Will use 48 particle-particle and 16 PME only nodes
This is a guess, check the performance at the end of the log file
Using 64 MPI processes
<Aug 06 17:25:55.303879> BE_MPI (ERROR): The error message in the job record
is as follows:
<Aug 06 17:25:55.303940> BE_MPI (ERROR):   "killed with signal 6"

-----------------------------------------snip-----------------------------------------------

I have a bunch of core files which I can analyze with the IBM Core File
Processor, and I also have a bunch of debug files from mdrun. I went through
about 12 of the 64, and didn't see anything that looked like an error.

Can anyone offer suggestions on what to look for, or on additional debugging
steps I can take? Please keep in mind that I'm the system administrator and
not an expert user of Gromacs, so I'm not sure whether the inputs are
correct, or appropriate for my BG/P configuration. Any help will be greatly
appreciated.

Thanks,
Prentice

