That is a good question. Looking at https://slurm.schedmd.com/gres.html#GPU_Management, I was wondering if you could share the output of your job so we can search for CUDA_VISIBLE_DEVICES and see how the GPUs were allocated.
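For example (a minimal sketch, assuming srun exports SLURM_PROCID for each task and that SLURM sets CUDA_VISIBLE_DEVICES when GPUs are requested via --gres), you could run this under the same allocation and check whether each rank sees one device or all four (see also the PS at the end of this message):

srun -N 2 -n 16 bash -c \
  'echo "host=$(hostname) rank=${SLURM_PROCID} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"'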
--Junchao Zhang


On Mon, Aug 21, 2023 at 2:38 PM Vanella, Marcos (Fed) <[email protected]> wrote:

> Ok thanks Junchao, so is GPU 0 actually allocating memory for the meshes
> of all 8 MPI processes, but only doing compute work for 2 of them? The
> nvidia-smi output above shows it has allocated 2.4GB.
> Best,
> Marcos
>
> ------------------------------
> *From:* Junchao Zhang <[email protected]>
> *Sent:* Monday, August 21, 2023 3:29 PM
> *To:* Vanella, Marcos (Fed) <[email protected]>
> *Cc:* PETSc users list <[email protected]>; Guan, Collin X. (Fed) <[email protected]>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU
>
> Hi, Marcos,
>   If you look at the PIDs in the nvidia-smi output, you will only find 8
> unique PIDs, which is expected since you allocated 8 MPI ranks per node.
> The duplicate PIDs are usually threads spawned by the MPI runtime (for
> example, progress threads in the MPI implementation). So your job script
> and output are all good.
>
>   Thanks.
>
> On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <[email protected]> wrote:
>
> Hi Junchao, something I'm noticing when running the CUDA-enabled linear
> solvers (CG+HYPRE, CG+GAMG) in multi-CPU, multi-GPU calculations is that
> GPU 0 in the node appears to take the sub-matrices corresponding to all
> of the MPI processes in the node. This is the output of the nvidia-smi
> command on a node with 8 MPI processes (each advancing the same number
> of unknowns in the calculation) and 4 V100 GPUs:
>
> Mon Aug 21 14:36:07 2023
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
> |-----------------------------------------+----------------------+----------------------+
> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
> |                                         |                      |               MIG M. |
> |=========================================+======================+======================|
> |   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
> | N/A   34C    P0             63W / 300W  |   2488MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
> | N/A   38C    P0             56W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
> | N/A   35C    P0             52W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
> |   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
> | N/A   38C    P0             53W / 300W  |    638MiB / 16384MiB |      0%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
>
> +---------------------------------------------------------------------------------------+
> | Processes:                                                                            |
> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
> |        ID   ID                                                             Usage      |
> |=======================================================================================|
> |    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
> |    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> |    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
> +---------------------------------------------------------------------------------------+
>
> You can see that GPU 0 is connected to all 8 MPI processes, each taking
> about 300MiB on it, whereas GPUs 1, 2, and 3 are each working with only
> 2 MPI processes. I'm wondering if this is expected, or if there are
> changes I need to make to my submission script/runtime parameters.
> This is the script in this case (2 nodes, 8 MPI processes/node, 4
> GPUs/node):
>
> #!/bin/bash
> # ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
> #SBATCH -J test
> #SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
> #SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
> #SBATCH --partition=gpu
> #SBATCH --ntasks=16
> #SBATCH --ntasks-per-node=8
> #SBATCH --cpus-per-task=1
> #SBATCH --nodes=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:4
>
> export OMP_NUM_THREADS=1
>
> # modules
> module load cuda/11.7
> module load gcc/11.2.1/toolset
> module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7
>
> cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
>
> srun -N 2 -n 16 \
>   /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux \
>   test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
>
> Thank you for the advice,
> Marcos
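PS: if the check above shows that every rank sees all four devices, a common workaround is to pin each rank to one GPU before the executable starts. A minimal sketch, assuming srun exports SLURM_LOCALID (the node-local rank id, which it does by default) and 4 GPUs per node; the wrapper file name gpu_bind.sh is hypothetical:

#!/bin/bash
# gpu_bind.sh: restrict each MPI rank to a single GPU, chosen round-robin
# from the node-local rank id that srun provides, then exec the real command.
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
exec "$@"

It would be used by prefixing the existing launch line, e.g.

srun -N 2 -n 16 ./gpu_bind.sh \
  /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux \
  test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda

Whether this is actually needed depends on what the CUDA_VISIBLE_DEVICES output shows; once each rank sees only its own device, the memory on GPU 0 should be spread across all four GPUs.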
