For the problem size you are running, this is very unexpected. The reductions
should only start to dominate at thousands of MPI ranks, not 2.

First, I recommend running the STREAMS benchmark to see how memory bandwidth
scales with the number of ranks. Then check the binding that MPI is doing for
the two processes: you want them bound to cores in different NUMA regions. It
could be that it is binding both processes to cores that share the same cache.
The MatSolve should be embarrassingly parallel, but you are getting almost no
speedup in it, so something very "wrong" is happening.
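A minimal sketch of those checks (assuming PETSC_DIR points at the PETSc
installation and MPICH's Hydra `mpiexec`; binding flags differ for other
launchers):

```shell
# PETSc's STREAMS target: measures memory bandwidth as ranks are added
cd "$PETSC_DIR" && make streams

# Ask Hydra to bind the ranks to separate NUMA domains and rerun the test
mpiexec -n 2 -bind-to numa ./petsc-testing -log_view

# Inspect the core/cache/NUMA topology of the machine (hwloc)
lstopo --no-io
```

If `make streams` shows the total bandwidth barely increasing from 1 to 2
ranks, the machine itself limits the speedup you can expect.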
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s

1 rank:
VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318

2 ranks:
VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
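A quick comparison of the excerpted events (using the "Max" times in seconds
from the log output above; the iteration counts differ slightly between the
two runs, so the ratios are approximate):

```python
# Max times (seconds) taken from the 1-rank and 2-rank -log_view excerpts above
t1 = {"MatSolve": 1.0882, "VecDot": 0.19998, "VecNorm": 0.14790}  # 1 rank
t2 = {"MatSolve": 0.89666, "VecDot": 2.1037, "VecNorm": 1.3131}   # 2 ranks

for event in t1:
    print(f"{event:8s} speedup on 2 ranks: {t1[event] / t2[event]:5.2f}x")
```

MatSolve gains only about 1.2x, while VecDot and VecNorm, the operations that
carry the global reductions, run roughly 10x slower on 2 ranks. That pattern
points at memory-bandwidth saturation or very slow MPI reductions rather than
at the solver itself.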
> On Sep 8, 2023, at 5:22 PM, Chris Hewson <[email protected]> wrote:
>
> Thanks for the quick response.
>
> The links to the log view files are below:
> 2 ranks:
> https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link
>
> 1 rank:
> https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link
>
> I'll also attach them to this email:
> **************************** 1 RANK ******************************
> ---------------------------------------------------------------- PETSc Performance Summary: ----------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 1 processor, by chewson Fri
> Sep 8 15:16:51 2023
> Using Petsc Release Version 3.19.5, unknown
>
> Max Max/Min Avg Total
> Time (sec): 2.746e+00 1.000 2.746e+00
> Objects: 2.100e+01 1.000 2.100e+01
> Flops: 4.117e+09 1.000 4.117e+09 4.117e+09
> Flops/sec: 1.499e+09 1.000 1.499e+09 1.499e+09
> MPI Msg Count: 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Msg Len (bytes): 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Reductions: 0.000e+00 0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flops
> and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 2.7458e+00 100.0% 4.1167e+09 100.0% 0.000e+00 0.0%
> 0.000e+00 0.0% 0.000e+00 0.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 182 1.0 8.0351e-01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 35 0 0 0 29 35 0 0 0 1775
> MatSolve 183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318
> MatLUFactorNum 1 1.0 1.3892e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 934
> MatILUFactorSym 1 1.0 2.1567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 1 1.0 1.0420e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 1 1.0 6.9049e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ 1 1.0 3.8500e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 1.7026e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLoad 1 1.0 6.6749e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
> VecDot 182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 7 5 0 0 0 7 5 0 0 0 1129
> VecDotNorm2 91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 5 0 0 0 2 5 0 0 0 3409
> VecNorm 92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 5 3 0 0 0 5 3 0 0 0 771
> VecCopy 2 1.0 6.8473e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 3 1.0 1.3256e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPBYCZ 182 1.0 1.6542e-01 1.0 4.51e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 6 11 0 0 0 6 11 0 0 0 2729
> VecWAXPY 182 1.0 1.4476e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 5 5 0 0 0 5 5 0 0 0 1559
> VecLoad 2 1.0 1.0104e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 1 1.0 9.9204e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 2.6210e+00 1.0 4.10e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 95 100 0 0 0 95 100 0 0 0 1566
> PCSetUp 1 1.0 3.7232e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 349
> PCApply 183 1.0 1.0885e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type Creations Destructions. Reports information only for
> process 0.
>
> --- Event Stage 0: Main Stage
>
> Viewer 4 1
> Matrix 3 1
> Vector 9 1
> Krylov Solver 1 0
> Preconditioner 1 0
> Index Set 3 0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.51e-08
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
> --download-mumps --download-scalapack --with-mpi=1
> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings
> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
> -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall
> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
> -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> ************************* 2 RANKS ************************************
> ---------------------------------------------------------------- PETSc Performance Summary: ----------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 2 processors, by chewson Fri
> Sep 8 15:16:43 2023
> Using Petsc Release Version 3.19.5, unknown
>
> Max Max/Min Avg Total
> Time (sec): 6.167e+00 1.001 6.164e+00
> Objects: 3.200e+01 1.000 3.200e+01
> Flops: 2.233e+09 1.000 2.233e+09 4.467e+09
> Flops/sec: 3.625e+08 1.001 3.623e+08 7.247e+08
> MPI Msg Count: 2.050e+02 1.000 2.050e+02 4.100e+02
> MPI Msg Len (bytes): 3.437e+07 1.000 1.676e+05 6.874e+07
> MPI Reductions: 4.580e+02 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flops
> and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 6.1642e+00 100.0% 4.4670e+09 100.0% 4.100e+02 100.0%
> 1.676e+05 100.0% 4.400e+02 96.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 1 1.0 3.1824e-05 1.0 0.00e+00 0.0 2.0e+00 4.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 198 1.0 1.3480e+00 1.4 7.76e+08 1.0 4.0e+02 9.4e+04
> 0.0e+00 19 35 97 54 0 19 35 97 54 0 1151
> MatSolve 199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 13 35 0 0 0 13 35 0 0 0 1729
> MatLUFactorNum 1 1.0 7.1852e-03 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 1789
> MatILUFactorSym 1 1.0 1.0472e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 1 1.0 9.8700e-07 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 1 1.0 6.8341e-03 1.1 0.00e+00 0.0 4.0e+00 2.3e+04
> 5.0e+00 0 0 1 0 1 0 0 1 0 1 0
> MatGetRowIJ 1 1.0 1.9930e-06 6.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 7.4472e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLoad 1 1.0 9.8562e-02 1.0 0.00e+00 0.0 1.0e+01 2.7e+06
> 1.7e+01 2 0 2 39 4 2 0 2 39 4 0
> VecDot 198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00
> 2.0e+02 33 5 0 0 43 33 5 0 0 45 117
> VecDotNorm2 99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00
> 9.9e+01 7 5 0 0 22 7 5 0 0 22 489
> VecNorm 100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00
> 1.0e+02 21 3 0 0 22 21 3 0 0 23 94
> VecCopy 2 1.0 7.4971e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 202 1.0 8.0035e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecAXPBYCZ 198 1.0 1.2889e-01 1.5 2.46e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 11 0 0 0 2 11 0 0 0 3811
> VecWAXPY 198 1.0 9.1526e-02 1.0 1.23e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 5 0 0 0 1 5 0 0 0 2683
> VecLoad 2 1.0 9.8983e-03 1.0 0.00e+00 0.0 4.0e+00 1.2e+06
> 1.6e+01 0 0 1 7 3 0 0 1 7 4 0
> VecScatterBegin 198 1.0 1.2941e-03 1.0 0.00e+00 0.0 4.0e+02 9.4e+04
> 0.0e+00 0 0 97 54 0 0 0 97 54 0 0
> VecScatterEnd 198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0
> SFSetGraph 1 1.0 7.6630e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFSetUp 1 1.0 1.2410e-04 1.0 0.00e+00 0.0 4.0e+00 2.3e+04
> 1.0e+00 0 0 1 0 0 0 0 1 0 0 0
> SFPack 198 1.0 5.1814e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFUnpack 198 1.0 3.8273e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 2 1.0 4.7077e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 6.0344e+00 1.0 2.23e+09 1.0 4.0e+02 9.4e+04
> 4.0e+02 98 100 97 54 87 98 100 97 54 90 738
> PCSetUp 2 1.0 1.8496e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 695
> PCSetUpOnBlocks 1 1.0 1.8435e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 697
> PCApply 199 1.0 9.5103e-01 1.1 7.75e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 15 35 0 0 0 15 35 0 0 0 1630
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type Creations Destructions. Reports information only for
> process 0.
>
> --- Event Stage 0: Main Stage
>
> Viewer 4 1
> Matrix 5 1
> Vector 13 2
> Index Set 5 2
> Star Forest Graph 1 0
> Krylov Solver 2 0
> Preconditioner 2 0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 2.47e-08
> Average time for MPI_Barrier(): 4.406e-07
> Average time for zero size MPI_Send(): 4.769e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
> --download-mumps --download-scalapack --with-mpi=1
> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings
> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
> -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall
> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
> -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> Chris
>
>
> On Fri, Sep 8, 2023 at 3:00 PM Barry Smith <[email protected]
> <mailto:[email protected]>> wrote:
>>
>> It would be very helpful if you could run on 1 and 2 ranks with -log_view
>> and send all the output.
>>
>>
>>
>>> On Sep 8, 2023, at 4:52 PM, Chris Hewson <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Hi There,
>>>
>>> I am trying to solve a linear problem and am having an issue where
>>> KSPSolve slows down considerably the more MPI processes I add.
>>>
>>> The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I am
>>> using PETSc version 3.19.5 and have tried a couple of different versions
>>> of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and v3.3.2 w/
>>> ch3:sock).
>>>
>>> In testing, I've noticed the following timing trend for the KSPSolve
>>> call:
>>> 1 core: 4042 ms
>>> 2 core: 7085 ms
>>> 4 core: 26573 ms
>>> 8 core: 65745 ms
>>> 16 core: 149283 ms
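The reported timings translate into the following slowdowns and parallel
efficiencies (a small script over the numbers above):

```python
# KSPSolve wall times (ms) reported above, by core count
times_ms = {1: 4042, 2: 7085, 4: 26573, 8: 65745, 16: 149283}
t1 = times_ms[1]

for p, t in times_ms.items():
    # ideal scaling would give t1 / p; efficiency = t1 / (p * t)
    print(f"{p:2d} cores: {t / t1:6.2f}x the 1-core time, "
          f"efficiency {t1 / (p * t):.1%}")
```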
>>>
>>> This was all done on a single node machine w/ 16 non-hyperthreaded cores.
>>> We solve quite a few different matrices with PETSc using MPI and haven't
>>> noticed an impact like this on performance before.
>>>
>>> I am very confused and a little stumped as to why this is happening. I've
>>> been using the KSPBCGS solver and have tried multiple different solvers
>>> and preconditioners (we usually don't use a preconditioner for this part
>>> of our code).
>>>
>>> Using the pipelined BCGS solver (KSPPIPEBCGS) did seem to improve the
>>> parallel speed slightly (maybe 15%), but it still doesn't come close to
>>> the single-rank speed.
>>>
>>> I'll attach a link to a folder that contains the specific A matrix and the
>>> x and b vectors for this problem, as well as a main.cpp file that I was
>>> using for testing.
>>>
>>> https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing
>>>
>>> I was testing this in our main code base (not included here) and observed
>>> very similar speed results to the ones above. We use Metis for graph
>>> partitioning in our own code and checked the vector and matrix
>>> partitioning, which all made sense. I could be doing the partitioning
>>> incorrectly in the example (I'm not 100% sure how it works with the
>>> viewer/load functions).
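One way to inspect what the viewer/load path is doing is to fix the local
sizes before loading and print each rank's ownership range. A sketch using
petsc4py rather than the poster's main.cpp (the file name "A.dat" and the
even PETSC_DECIDE split are assumptions, not taken from the example):

```python
from petsc4py import PETSc

comm = PETSc.COMM_WORLD

# Open the binary file previously written with MatView() on a binary viewer
viewer = PETSc.Viewer().createBinary("A.dat", "r", comm=comm)

A = PETSc.Mat().create(comm=comm)
# Setting sizes BEFORE load() controls the row partitioning;
# PETSc.DECIDE splits the 620100 rows evenly across the ranks.
A.setSizes(((PETSc.DECIDE, 620100), (PETSc.DECIDE, 620100)))
A.setFromOptions()
A.load(viewer)

# Print which contiguous row block each rank owns, in rank order
rstart, rend = A.getOwnershipRange()
PETSc.Sys.syncPrint(f"rank {comm.getRank()}: rows [{rstart}, {rend})")
PETSc.Sys.syncFlush()
```

Comparing these ranges against the partitioning your own code produces with
Metis would show whether the example's load path distributes the rows the
same way.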
>>>
>>> Any insight or thoughts on this would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Chris Hewson
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>