Hi Marcos, could you build PETSc in debug mode and then copy and paste the whole error stack message?
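In case it helps, a debug reconfigure might look something like the lines below. This is only a sketch; the exact configure options, compilers, and paths depend on your installation, so adapt it to whatever you used for your current build:

    ./configure --with-debugging=1 --with-cuda=1 \
        --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90
    make all

With --with-debugging=1, PETSc's error handlers should print a full error stack when the CUDA failure occurs, instead of only the thrust exception.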
Thanks
--Junchao Zhang

On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <[email protected]> wrote:

> Hi, I'm trying to run a parallel matrix-vector build and linear solve
> with PETSc on 2 MPI processes + one V100 GPU. I verified that the matrix
> build and solve succeed on CPUs only. I'm using CUDA 11.5, a CUDA-enabled
> OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I get the
> following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress:
> an illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in Slurm that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds \
>     -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this, please let me
> know.
> Thanks!
> Marcos
