Re-sending to list; original bounced when we had some issues with gmx-users over the weekend.
Mark

---------- Forwarded message ----------
From: Mark Abraham <mark.j.abra...@gmail.com>
Date: Sat, Aug 10, 2013 at 11:49 AM
Subject: Re: [gmx-users] Assistance needed running gromacs 4.6.3 on Blue Gene/P
To: prentice.bis...@rutgers.edu, Discussion list for GROMACS users <gmx-users@gromacs.org>

On Fri, Aug 9, 2013 at 6:03 PM, Prentice Bisbal
<prentice.bis...@rutgers.edu> wrote:
> Mark,
>
> Since I was working with 4.6.2, I built 4.6.3 to see if this was the result
> of a bug in 4.6.2. It isn't; I get the same error with 4.6.3, but that is
> the version I'll be working with from now on, since it's the latest. Since
> the problem occurs with both versions, we might as well try to fix it in
> the latest version, right?

Yep.

> I compiled 4.6.3 with the following options to include debugging
> information:
>
> cmake .. \
> -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
> -DBUILD_SHARED_LIBS=OFF \
> -DGMX_MPI=ON \
> -DCMAKE_C_FLAGS="-O0 -g -qstrict -qarch=450 -qtune=450" \
> -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.3 \
> -DGMX_CPU_ACCELERATION=None \
> -DGMX_THREAD_MPI=OFF \
> -DGMX_OPENMP=OFF \
> -DGMX_DEFAULT_SUFFIX=ON \
> -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
> 2>&1 | tee cmake.log
>
> For qarch, I removed the 'd' from the end, so that the double-FPU isn't
> used, which can cause problems if the data isn't aligned correctly. The
> -qstrict makes sure certain optimizations aren't performed. It should be
> superfluous with optimization levels below 3, but I threw it in just to be
> safe, and set -O0. (Of course, I think -g turns off all optimizations
> anyway.)

Mostly true, but mostly fine and immaterial :-)

> On the BG/P, I had to install FFTW3 separately, and that wasn't installed
> with debugging active, so there are no symbols for FFTW.

Yeah, that won't be a problem.

> One of my coworkers wrote a script that converts BG/P core files to stack
> traces. In all the kernels I've looked at so far (9 out of 64), the stack
> ends at a vfprintf call. For example:

Functions like vfprintf with va_list arguments use a macro that was not
implemented correctly on BG/L and BG/P. This has caused problems with
GROMACS before. See http://www-01.ibm.com/support/docview.wss?uid=swg1LI73769
for details. If this turns out to be the problem, then compiling just the
files that use va_list with -O0 should help (starting with
src/gmxlib/gmx_fatal.c). Or perhaps update the compiler, if IBM really did
fix this at some point, and/or file a support request with IBM.
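If it comes to that, one low-effort way to apply the per-file -O0 without
dropping the whole build to -O0 is a per-source COMPILE_FLAGS override. This
is only an untested sketch; gmx_fatal.c is the known starting point, any
other file names you would confirm yourself by grepping for va_list, and the
build-directory name here is just a placeholder:

# From the unpacked source tree, find the va_list users first:
cd gromacs-4.6.3
grep -rl va_list src/gmxlib src/mdlib src/kernel

# For each such file, append a per-file flag override to the CMakeLists.txt
# in its directory, e.g. for gmx_fatal.c:
echo 'set_source_files_properties(gmx_fatal.c PROPERTIES COMPILE_FLAGS "-O0 -g")' \
  >> src/gmxlib/CMakeLists.txt

# Re-run cmake in the existing optimized build directory ("build-O3" is a
# placeholder name) and rebuild; the trailing -O0 should take precedence
# over the -O3 from CMAKE_C_FLAGS for just those objects.
cd ../build-O3 && cmake . && make && make install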
However...

> -------------------------------------------------------------
>
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/resolv/res_init.c:414
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/wgenops.c:419
> /scratch/pbisbal/build/gromacs-4.6.3/src/gmxlib/nonbonded/nb_kernel_c/nb_kernel_ElecRFCut_VdwBhamSh_GeomW4P1_c.c:673
> ??:0
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/sys/dcmf/../ccmi/executor/Broadcast.h:83
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/reduce/reduce_algorithms.c:69
> /bghome/bgbuild/V1R4M2_200_2010-100508P/ppc/bgp/comm/lib/dev/mpich2/src/mpid/dcmfd/src/coll/bcast/bcast_algorithms.c:227
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:779
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:762
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/nbnxn_atomdata.c:374
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/calcmu.c:88
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/mdrun.c:113
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1492
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
> ../stdio-common/printf_fphex.c:335
> ../stdio-common/printf_fphex.c:452
> ??:0
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
>
> -----------------------------------------------------------------

This is the kind of thing I wanted to see, but it looks like you are
analysing a core file with an executable that was not the one that
generated the core file. The above does not make sense as a stack trace.
You will need to run the debug-enabled code and look at the stack trace
with the same executable. If the problem is a va_list one, you might see
that the last function is gmx_fatal: mdrun was trying to exit gracefully
from some other normal error condition, and ran into the above
implementation error while trying to issue the error message.

> Another node with a different stack looks like this:
>
> ---------------------------------------------------------------
>
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/libio/genops.c:982
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/string/memcpy.c:159
> /scratch/pbisbal/build/gromacs-4.6.3/src/mdlib/ns.c:423
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/runner.c:1646
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/genalg.c:467
> /scratch/pbisbal/build/gromacs-4.6.3/src/kernel/calc_verletbuf.c:266
> ../stdio-common/printf_fphex.c:335
> ../stdio-common/printf_fphex.c:452
> ??:0
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
> /bgsys/drivers/V1R4M2_200_2010-100508P/ppc/toolchain/gnu/glibc-2.4/stdio-common/vfprintf.c:1819
>
> ---------------------------------------------------------------
>
> All the stacks look like one of these two.

Same problem.
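To make the "same executable" point concrete: if you want to cross-check
your coworker's script by hand, something like the following should work
once you have core files from the debug-enabled mdrun_mpi. This is only a
sketch; the cross-toolchain addr2line path and the +++STACK/---STACK section
markers are from memory of the BG/P driver layout, so adjust to whatever
your system actually has.

# Resolve a BG/P lightweight (text) core file against the *same* mdrun_mpi
# binary that was running when it was produced.
EXE=/scratch/bgapps/gromacs-4.6.3/bin/mdrun_mpi
ADDR2LINE=/bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-addr2line

# Pull the saved instruction addresses out of the core's stack section and
# map them to function names and file:line in the debug build.
sed -n '/+++STACK/,/---STACK/p' core.0 \
  | grep -o '0x[0-9a-f]*' \
  | $ADDR2LINE -e "$EXE" -f -C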
> Is any of this information useful? My coworker, who has a lot of
> experience developing for Blue Gene/Ps, says this looks like an I/O
> problem, but he doesn't have the time to dig into the Gromacs source code
> for us. I'm willing to do some digging, but some guidance from someone who
> knows the code well would be very helpful.

You've prompted me to remember what the issue probably is, but we won't
actually have identified it until we have a proper stack trace.
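In the meantime, it might also be worth recording which XL compiler level
built GROMACS, so you can compare it against the fix list for that APAR if
you end up talking to IBM. The driver names below are the usual BG/P ones;
substitute whatever your site's wrappers are actually called:

# Print the XL C compiler level used for the build (names may vary by site).
bgxlc -qversion
mpixlc -qversion   # if the toolchain file used the MPI wrapper instead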
Mark

> Prentice
>
> On 08/06/2013 08:19 PM, Mark Abraham wrote:
>>
>> That all looks fine so far. The core file processor won't help unless
>> you've compiled with -g. Hopefully cmake -DCMAKE_BUILD_TYPE=Debug will
>> do that, but I haven't actually checked that it really works. If not, you
>> might have to hack cmake/Platform/BlueGeneP-static-XL-C.cmake.
>>
>> Anyway, if you can compile with -g, then the core file will tell us in
>> what function it is dying, which might help locate the problem.
>>
>> Mark
>>
>> On Tue, Aug 6, 2013 at 11:43 PM, Prentice Bisbal
>> <prentice.bis...@rutgers.edu> wrote:
>>>
>>> Dear GMX-users,
>>>
>>> I need some assistance running Gromacs 4.6.3 on a Blue Gene/P. Although
>>> I have a background in chemistry, I'm an experienced professional HPC
>>> admin who's relatively new to supporting Blue Genes and Gromacs. My
>>> first Gromacs user is having trouble running Gromacs on our BG/P. His
>>> jobs die and dump core, with no obvious signs (not to me, at least) of
>>> where the problem lies.
>>>
>>> I compiled Gromacs 4.6.3 with the following options:
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> cmake .. \
>>> -DCMAKE_TOOLCHAIN_FILE=../cmake/Platform/BlueGeneP-static-XL-C.cmake \
>>> -DBUILD_SHARED_LIBS=OFF \
>>> -DGMX_MPI=ON \
>>> -DCMAKE_C_FLAGS="-O3 -qarch=450d -qtune=450" \
>>> -DCMAKE_INSTALL_PREFIX=/scratch/bgapps/gromacs-4.6.2 \
>>> -DGMX_CPU_ACCELERATION=None \
>>> -DGMX_THREAD_MPI=OFF \
>>> -DGMX_OPENMP=OFF \
>>> -DGMX_DEFAULT_SUFFIX=ON \
>>> -DCMAKE_PREFIX_PATH=/scratch/bgapps/fftw-3.3.2 \
>>> 2>&1 | tee cmake.log
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> When one of my users submits a job, it dumps core. My scheduler is
>>> LoadLeveler, and I used this JCF file to replicate the problem. I added
>>> the '-debug 1' flag after searching the gmx-users archives:
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> #!/bin/bash
>>> # @ job_name = xiang
>>> # @ job_type = bluegene
>>> # @ bg_size = 64
>>> # @ class = small
>>> # @ wall_clock_limit = 01:00:00,00:50:00
>>> # @ error = job.$(Cluster).$(Process).err
>>> # @ output = job.$(Cluster).$(Process).out
>>> # @ environment = COPY_ALL;
>>> # @ queue
>>>
>>> source /scratch/bgapps/gromacs-4.6.2/bin/GMXRC.bash
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> /bgsys/drivers/ppcfloor/bin/mpirun /scratch/bgapps/gromacs-4.6.2/bin/mdrun_mpi -pin off -deffnm sbm-b_dyn3 -v -dlb yes -debug 1
>>>
>>> The stderr file shows this at the bottom, which isn't too helpful:
>>>
>>> ------------------------------------------snip-------------------------------------------
>>>
>>> Reading file sbm-b_dyn3.tpr, VERSION 4.6.2 (single precision)
>>>
>>> Will use 48 particle-particle and 16 PME only nodes
>>> This is a guess, check the performance at the end of the log file
>>> Using 64 MPI processes
>>> <Aug 06 17:25:55.303879> BE_MPI (ERROR): The error message in the job
>>> record is as follows:
>>> <Aug 06 17:25:55.303940> BE_MPI (ERROR): "killed with signal 6"
>>>
>>> -----------------------------------------snip-----------------------------------------------
>>>
>>> I have a bunch of core files which I can analyze with the IBM core file
>>> processor, and I also have a bunch of debug files from mdrun. I went
>>> through about 12 of the 64, and didn't see anything that looked like an
>>> error.
>>>
>>> Can anyone offer me any suggestions of what to look for, or additional
>>> debugging steps I can take? Please keep in mind that I'm the system
>>> administrator and not an expert user of Gromacs, so I'm not sure whether
>>> the inputs are correct, or are at all correct for my BG/P configuration.
>>> Any help will be greatly appreciated.
>>>
>>> Thanks,
>>> Prentice