Oh right. I had forgotten about cuda-memcheck. Thanks for reminding me. It has never saved me, yet, so it has not been etched in my brain like valgrind :)
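For anyone landing on this thread later, here is a minimal CUDA sketch of the kind of bug discussed below (dereferencing a device pointer in host code) and one common way around it. The kernel `fill`, the array size, and all variable names are hypothetical and are not taken from landau.kokkos.cxx, which goes through Kokkos rather than raw cudaMalloc; treat it as an illustration under those assumptions, not as the actual fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a device array with its own indices.
__global__ void fill(int *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] = i;
}

int main()
{
  const int n = 8;
  int *d_data = nullptr;

  // d_data holds a device address after this call.
  if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return 1;
  fill<<<1, n>>>(d_data, n);

  // BUG (left commented out): d_data is not a valid host address, so
  // reading it from host code is an invalid access and typically ends
  // in a segfault.
  // printf("%d\n", d_data[0]);

  // One fix: copy the data to a host buffer first, then read the copy.
  int h_data[n];
  if (cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
  printf("h_data[%d] = %d\n", n - 1, h_data[n - 1]);

  cudaFree(d_data);
  return 0;
}
```

Assuming the file is saved as sketch.cu (a made-up name), something like `nvcc sketch.cu -o sketch` followed by `cuda-memcheck ./sketch` is one way to build it and run it under the tool.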
On Sun, May 30, 2021 at 11:53 AM Jacob Faibussowitsch <[email protected]> wrote:
> The problem was that I was accessing a device pointer on the host.
>
> Maybe the fact that valgrind did not print a source code line (it was in
> host code) is a hint that you are accessing a device pointer?
>
> ==77820== Invalid read of size 4
> ==77820== at 0x7E69068: LandauKokkosJacobian (in /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>
>
> When in doubt, use cuda-memcheck whenever doing any debugging with GPUs;
> it's the CUDA version of valgrind and I cannot recommend it enough. Not
> directly related, but it also comes with a suite of other useful GPU-related
> tools that catch race conditions, uninitialized memory accesses and
> deadlocks.
>
> https://docs.nvidia.com/cuda/cuda-memcheck/index.html
>
> Best regards,
>
> Jacob Faibussowitsch
> (Jacob Fai - booss - oh - vitch)
>
> On May 30, 2021, at 09:06, Mark Adams <[email protected]> wrote:
>
> The problem was that I was accessing a device pointer on the host.
>
> Maybe the fact that valgrind did not print a source code line (it was in
> host code) is a hint that you are accessing a device pointer?
>
> ==77820== Invalid read of size 4
> ==77820== at 0x7E69068: LandauKokkosJacobian (in /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>
> This access is in landau.kokkos.cxx but no source line number.
>
> Thanks,
>
>
> On Sun, May 30, 2021 at 12:48 AM Mark Adams <[email protected]> wrote:
>
>>
>>
>> On Sun, May 30, 2021 at 12:08 AM Barry Smith <[email protected]> wrote:
>>
>>>
>>> Try without Valgrind, put a CHKMEMQ; just before the call to
>>> LandauKokkosJacobian and as its first line. And run with -malloc_debug.
>>> This is a less optimal way to find memory corruption but may be more useful
>>> in this case.
>>>
>>
>> I don't seem to get anything with this, but I now see that the segv is on
>> the 2nd call to LandauKokkosJacobian, which adds the mass matrix, with
>> shift. I am working on the mass matrix part now. Let me try adding print
>> statements in LandauKokkosJacobian. (DDT failed to trace into that method,
>> but let's see).
>>
>> Thanks,
>>
>> CHKMEMQ;
>> PetscPrintf(PETSC_COMM_SELF,"call LandauKokkosJacobian\n");
>> ierr = LandauKokkosJacobian(ctx->plex,Nq,Eq_m,IPf,N,xdata,ctx->SData_d,ctx->subThreadBlockSize,shift,ctx->events,JacP);CHKERRQ(ierr);
>>
>> 00:37 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/tutorials$ make PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 -f mymake tiny EXTRA='-dm_mat_type aijkokkos -dm_vec_type kokkos -malloc_debug' DEVICE=kokkos
>> jsrun -n 1 -c 1 -g 1 ./ex2 -dim 2 -ex2_test_type none -dm_landau_Ez 0 -petscspace_degree 3 -dm_preallocate_only -dm_landau_type p4est -dm_landau_ion_masses 1 -dm_landau_ion_charges 1 -dm_landau_thermal_temps 4,4 -dm_landau_n 1,1 -ts_monitorx -snes_rtol 1.e-14 -snes_stol 1.e-14 -snes_monitor -snes_converged_reason -snes_max_it 14 -ts_type beuler -ts_exact_final_time stepover -ts_max_snes_failures 1 -ts_rtol 5e-1 -ts_dt .5 -ts_max_steps 1 -pc_type lu -ksp_type preonly -dm_landau_amr_levels_max 13 -dm_landau_device_type kokkos -dm_mat_type aijkokkos -dm_vec_type kokkos* -malloc_debug*
>>
>>
>> [0]FormLandau: 1280 IPs, 80 cells, totDim=32, Nb=16, Nq=16, elemMatSize=1024, dim=2, Tab: Nb=16 Nf=2 Np=16 cdim=2 N=1406 shift=0.
>>
>> *call LandauKokkosJacobian* 0 SNES Function norm 4.974994975313e-03
>>
>> *call LandauKokkosJacobian*[0]PETSC ERROR: ------------------------------------------------------------------------
>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>> [0]PETSC ERROR: likely location of problem given in stack below
>> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
>> [0]PETSC ERROR: instead the line number of the start of the function is given.
>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>> [0]PETSC ERROR: #2 LandauFormJacobian_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:66
>> [0]PETSC ERROR: #3 LandauIJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/plexland.c:2093
>> [0]PETSC ERROR: #4 TS user implicit Jacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:933
>> [0]PETSC ERROR: #5 TSComputeIJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:916
>> [0]PETSC ERROR: #6 SNESTSFormJacobian_Theta() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:1000
>> [0]PETSC ERROR: #7 SNESTSFormJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:4407
>> [0]PETSC ERROR: #8 SNES user Jacobian function() at /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2823
>> [0]PETSC ERROR: #9 SNESComputeJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:2782
>> [0]PETSC ERROR: #10 SNESSolve() at /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4653
>> [0]PETSC ERROR: #11 TSTheta_SNESSolve() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:184
>> [0]PETSC ERROR: #12 TSStep_Theta() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/impls/implicit/theta/theta.c:200
>> [0]PETSC ERROR: #13 TSStep() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/interface/ts.c:3548
>> [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>
>>
>>>
>>> On May 29, 2021, at 10:46 PM, Junchao Zhang <[email protected]> wrote:
>>>
>>> try gcc/6.4.0
>>> --Junchao Zhang
>>>
>>>
>>> On Sat, May 29, 2021 at 9:50 PM Mark Adams <[email protected]> wrote:
>>>
>>>> And I get grief using gcc-8.1.1 and get this error:
>>>>
>>>> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347): error: identifier "__ieee128" is undefined
>>>>
>>>> Any ideas?
>>>>
>>>> On Sat, May 29, 2021 at 10:39 PM Mark Adams <[email protected]> wrote:
>>>>
>>>>> And valgrind sees this. I think the jump to the function is going to
>>>>> the wrong place.
>>>>> I'm giving up on PGI but can try newer versions of GCC. (what is the
>>>>> deal with the range of major releases, 4-10?)
>>>>> (as I said this looks like an error that a user is getting so I'd like
>>>>> to figure it out).
>>>>>
>>>>> 0 SNES Function norm 4.974994975313e-03
>>>>> ==77820== Invalid read of size 4
>>>>> ==77820== at 0x7E69068: LandauKokkosJacobian (in /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
>>>>> ==77820== by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>>>>> ==77820== by 0x7C728D3: LandauIJacobian (plexland.c:2107)
>>>>> ==77820== by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
>>>>> ==77820== by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
>>>>> ==77820== by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
>>>>> ==77820== by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
>>>>> ==77820== by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
>>>>> ==77820== by 0x7AF336F: SNESSolve (snes.c:4769)
>>>>> ==77820== by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
>>>>> ==77820== by 0x7E1A8B7: TSStep_Theta (theta.c:223)
>>>>> ==77820== by 0x7CB093F: TSStep (ts.c:3571)
>>>>> ==77820== Address 0x96fff690 is in a --- anonymous segment
>>>>> ==77820==
>>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>> [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>>>> [0]PETSC ERROR: likely location of problem given in stack below
>>>>> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
>>>>> [0]PETSC ERROR: instead the line number of the start of the function is given.
>>>>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>>>>>
>>>>> On Sat, May 29, 2021 at 8:46 PM Mark Adams <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, May 29, 2021 at 7:48 PM Barry Smith <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> I don't see why it is not running the Kokkos check. Here is the
>>>>>>> rule right below the CUDA rule that is apparently running.
>>>>>>>
>>>>>>> check_build:
>>>>>>>         -@echo "Running check examples to verify correct installation"
>>>>>>>         -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>>>>>>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} clean-legacy
>>>>>>>         +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} testex19
>>>>>>>         +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>>>>>>          fi;
>>>>>>>         +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>>>>>>          fi;
>>>>>>>         +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>>>>>>           cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>>>>>>          fi;
>>>>>>>
>>>>>>> Regarding the debugging, if it is just one MPI rank (or even more)
>>>>>>> with GDB it will trap the error and show the exact line of source code
>>>>>>> where the error occurred and you can poke around at variables to see if
>>>>>>> they look corrupt or wrong (for example crazy address in a pointer), I
>>>>>>> don't know why your debugger is not giving more useful information.
>>>>>>>
>>>>>>>
>>>>>> This is what I did (in DDT). It stopped at the function call and the
>>>>>> data looked fine. I stepped into the call, but didn't get to it. The
>>>>>> signal handler was called and I was dead.
>>>>>> Maybe I did something in my branch. Can't see what, but I keep
>>>>>> probing,
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>
>>>>>>> > On May 29, 2021, at 2:16 PM, Mark Adams <[email protected]> wrote:
>>>>>>> >
>>>>>>> > I am running on Summit with Kokkos-CUDA and I am getting a segv
>>>>>>> > that looks like some sort of a compile/link mismatch. I also have a user
>>>>>>> > with a C++ code that is getting strange segvs when calling MatSetValues
>>>>>>> > with CUDA (I know MatSetValues is not a cuSPARSE method, but that is the
>>>>>>> > report that I have). I have no idea if these are related but they both
>>>>>>> > involve C -- C++ calls ...
>>>>>>> >
>>>>>>> > I started with a clean build (attached) and I ran in DDT. DDT
>>>>>>> > stopped at the call in plexland.c to the KokkosLandau operator. I stepped
>>>>>>> > into this function and then took this screenshot of the stack, with the
>>>>>>> > Kokkos call and PETSc signal handler.
>>>>>>> >
>>>>>>> > Make check does not seem to be running Kokkos tests:
>>>>>>> >
>>>>>>> > 15:02 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc$ make PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>>>>>>> > Running check examples to verify correct installation
>>>>>>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>>>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>>>>>> > Completed test examples
>>>>>>> >
>>>>>>> > Also, I ran this AM with another branch that had not been rebased
>>>>>>> > with main as recently as this branch (adams/landau-mass-opt).
>>>>>>> >
>>>>>>> > Any ideas?
>>>>>>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
