Hi Junchao, something I'm noticing when running with CUDA-enabled linear
solvers (CG+HYPRE, CG+GAMG) in multi-CPU/multi-GPU calculations is that
GPU 0 on the node seems to be taking the sub-matrices corresponding to
all the MPI processes on that node. This is the output of the nvidia-smi
command on a node with 8 MPI processes (each advancing the same number of
unknowns in the calculation) and 4 V100 GPUs:
Mon Aug 21 14:36:07 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000004:04:00.0 Off |                    0 |
| N/A   34C    P0              63W / 300W |   2488MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000004:05:00.0 Off |                    0 |
| N/A   38C    P0              56W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000035:03:00.0 Off |                    0 |
| N/A   35C    P0              52W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000035:04:00.0 Off |                    0 |
| N/A   38C    P0              53W / 300W |    638MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                           GPU Memory  |
|        ID   ID                                                            Usage       |
|=======================================================================================|
|    0   N/A  N/A    214626      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    0   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    0   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    0   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    0   N/A  N/A    214630      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    0   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    0   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    0   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     308MiB  |
|    1   N/A  N/A    214627      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    1   N/A  N/A    214631      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    2   N/A  N/A    214628      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    2   N/A  N/A    214632      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    3   N/A  N/A    214629      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
|    3   N/A  N/A    214633      C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux     318MiB  |
+---------------------------------------------------------------------------------------+
You can see that GPU 0 is connected to all 8 MPI processes, each taking about
300 MiB on it, whereas GPUs 1, 2, and 3 are each working with only 2 MPI
processes. I'm wondering whether this is expected, or whether I need to make
changes to my submission script or runtime parameters.
This is the script in this case (2 nodes, 8 MPI processes/node, 4 GPUs/node):
#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2 -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=gpu
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:4
export OMP_NUM_THREADS=1
# modules
module load cuda/11.7
module load gcc/11.2.1/toolset
module load openmpi/4.1.4/gcc-11.2.1-cuda-11.7
cd /home/mnv/Firemodels_fork/fds/Issues/PETSc
srun -N 2 -n 16 \
  /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds \
  -pc_type gamg -mat_type aijcusparse -vec_type cuda
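For reference, here is a sketch of the kind of per-rank binding wrapper I imagine could force each process onto one device before CUDA initializes; the script name and the round-robin modulo-4 mapping are just my illustration, not something I've confirmed works for this solver stack:

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): restrict each MPI rank to a single GPU.
# SLURM_LOCALID is the rank's index within its node (0..7 with 8 ranks/node),
# so with 4 GPUs per node the modulo maps two ranks to each GPU.
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
# Replace this process with the real command so the environment is inherited.
exec "$@"
```

It would be invoked by placing the wrapper in front of the binary, e.g.
srun -N 2 -n 16 ./bind_gpu.sh /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds ... , so each rank only sees its assigned device as device 0.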
Thank you for the advice,
Marcos