For the problem size you are running, this is very unexpected. The reductions
should only start to dominate at thousands of MPI ranks, not 2.

First, I recommend running the STREAMS benchmark to see how memory bandwidth
scales with the number of ranks. Then check the binding that MPI is doing for
the two processes: you want them bound to cores in different NUMA regions. It
could be that it is binding both processes to cores that share the same cache.
The MatSolve should be embarrassingly parallel, but you are getting almost no
speedup in it, so something very "wrong" is happening.
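A minimal sketch of those checks (assuming PETSC_DIR points at the PETSc
installation and MPICH's Hydra `mpiexec`; binding flags differ for other
launchers):

```shell
# PETSc's STREAMS target: measures memory bandwidth as ranks are added
cd "$PETSC_DIR" && make streams

# Ask Hydra to bind the ranks to separate NUMA domains and rerun the test
mpiexec -n 2 -bind-to numa ./petsc-testing -log_view

# Inspect the core/cache/NUMA topology of the machine (hwloc)
lstopo --no-io
```

If `make streams` shows the total bandwidth barely increasing from 1 to 2
ranks, the machine itself limits the speedup you can expect.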
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s

1 rank:
VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318

2 ranks:
VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
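A quick comparison of the excerpted events (using the "Max" times in seconds
from the log output above; the iteration counts differ slightly between the
two runs, so the ratios are approximate):

```python
# Max times (seconds) taken from the 1-rank and 2-rank -log_view excerpts above
t1 = {"MatSolve": 1.0882, "VecDot": 0.19998, "VecNorm": 0.14790}  # 1 rank
t2 = {"MatSolve": 0.89666, "VecDot": 2.1037, "VecNorm": 1.3131}   # 2 ranks

for event in t1:
    print(f"{event:8s} speedup on 2 ranks: {t1[event] / t2[event]:5.2f}x")
```

MatSolve gains only about 1.2x, while VecDot and VecNorm, the operations that
carry the global reductions, run roughly 10x slower on 2 ranks. That pattern
points at memory-bandwidth saturation or very slow MPI reductions rather than
at the solver itself.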
> On Sep 8, 2023, at 5:22 PM, Chris Hewson <[email protected]> wrote:
>
> Thanks for the quick response.
>
> The links to the log view files are below:
> 2 ranks:
> https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link
>
> 1 rank:
> https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link
>
> I'll also attach them to this email:
> **************************** 1 RANK ******************************
> ---------------------------------------------------------------- PETSc Performance Summary: ----------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 1 processor, by chewson Fri
> Sep 8 15:16:51 2023
> Using Petsc Release Version 3.19.5, unknown
>
> Max Max/Min Avg Total
> Time (sec): 2.746e+00 1.000 2.746e+00
> Objects: 2.100e+01 1.000 2.100e+01
> Flops: 4.117e+09 1.000 4.117e+09 4.117e+09
> Flops/sec: 1.499e+09 1.000 1.499e+09 1.499e+09
> MPI Msg Count: 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Msg Len (bytes): 0.000e+00 0.000 0.000e+00 0.000e+00
> MPI Reductions: 0.000e+00 0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flops
> and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 2.7458e+00 100.0% 4.1167e+09 100.0% 0.000e+00 0.0%
> 0.000e+00 0.0% 0.000e+00 0.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult 182 1.0 8.0351e-01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 29 35 0 0 0 29 35 0 0 0 1775
> MatSolve 183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318
> MatLUFactorNum 1 1.0 1.3892e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 934
> MatILUFactorSym 1 1.0 2.1567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> MatAssemblyBegin 1 1.0 1.0420e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 1 1.0 6.9049e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetRowIJ 1 1.0 3.8500e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 1.7026e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLoad 1 1.0 6.6749e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
> VecDot 182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 7 5 0 0 0 7 5 0 0 0 1129
> VecDotNorm2 91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 5 0 0 0 2 5 0 0 0 3409
> VecNorm 92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 5 3 0 0 0 5 3 0 0 0 771
> VecCopy 2 1.0 6.8473e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 3 1.0 1.3256e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecAXPBYCZ 182 1.0 1.6542e-01 1.0 4.51e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 6 11 0 0 0 6 11 0 0 0 2729
> VecWAXPY 182 1.0 1.4476e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 5 5 0 0 0 5 5 0 0 0 1559
> VecLoad 2 1.0 1.0104e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 1 1.0 9.9204e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 2.6210e+00 1.0 4.10e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 95 100 0 0 0 95 100 0 0 0 1566
> PCSetUp 1 1.0 3.7232e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 349
> PCApply 183 1.0 1.0885e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00
> 0.0e+00 40 35 0 0 0 40 35 0 0 0 1318
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type Creations Destructions. Reports information only for
> process 0.
>
> --- Event Stage 0: Main Stage
>
> Viewer 4 1
> Matrix 3 1
> Vector 9 1
> Krylov Solver 1 0
> Preconditioner 1 0
> Index Set 3 0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.51e-08
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
> --download-mumps --download-scalapack --with-mpi=1
> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings
> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
> -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall
> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
> -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> ************************* 2 RANKS ************************************
> ---------------------------------------------------------------- PETSc Performance Summary: ----------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 2 processors, by chewson Fri
> Sep 8 15:16:43 2023
> Using Petsc Release Version 3.19.5, unknown
>
> Max Max/Min Avg Total
> Time (sec): 6.167e+00 1.001 6.164e+00
> Objects: 3.200e+01 1.000 3.200e+01
> Flops: 2.233e+09 1.000 2.233e+09 4.467e+09
> Flops/sec: 3.625e+08 1.001 3.623e+08 7.247e+08
> MPI Msg Count: 2.050e+02 1.000 2.050e+02 4.100e+02
> MPI Msg Len (bytes): 3.437e+07 1.000 1.676e+05 6.874e+07
> MPI Reductions: 4.580e+02 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flops
> and VecAXPY() for complex vectors of length N -->
> 8N flops
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 6.1642e+00 100.0% 4.4670e+09 100.0% 4.100e+02 100.0%
> 1.676e+05 100.0% 4.400e+02 96.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 1 1.0 3.1824e-05 1.0 0.00e+00 0.0 2.0e+00 4.0e+00
> 1.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatMult 198 1.0 1.3480e+00 1.4 7.76e+08 1.0 4.0e+02 9.4e+04
> 0.0e+00 19 35 97 54 0 19 35 97 54 0 1151
> MatSolve 199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 13 35 0 0 0 13 35 0 0 0 1729
> MatLUFactorNum 1 1.0 7.1852e-03 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 1789
> MatILUFactorSym 1 1.0 1.0472e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyBegin 1 1.0 9.8700e-07 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatAssemblyEnd 1 1.0 6.8341e-03 1.1 0.00e+00 0.0 4.0e+00 2.3e+04
> 5.0e+00 0 0 1 0 1 0 0 1 0 1 0
> MatGetRowIJ 1 1.0 1.9930e-06 6.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatGetOrdering 1 1.0 7.4472e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> MatLoad 1 1.0 9.8562e-02 1.0 0.00e+00 0.0 1.0e+01 2.7e+06
> 1.7e+01 2 0 2 39 4 2 0 2 39 4 0
> VecDot 198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00
> 2.0e+02 33 5 0 0 43 33 5 0 0 45 117
> VecDotNorm2 99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00
> 9.9e+01 7 5 0 0 22 7 5 0 0 22 489
> VecNorm 100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00
> 1.0e+02 21 3 0 0 22 21 3 0 0 23 94
> VecCopy 2 1.0 7.4971e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> VecSet 202 1.0 8.0035e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 1 0 0 0 0 0
> VecAXPBYCZ 198 1.0 1.2889e-01 1.5 2.46e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 2 11 0 0 0 2 11 0 0 0 3811
> VecWAXPY 198 1.0 9.1526e-02 1.0 1.23e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 1 5 0 0 0 1 5 0 0 0 2683
> VecLoad 2 1.0 9.8983e-03 1.0 0.00e+00 0.0 4.0e+00 1.2e+06
> 1.6e+01 0 0 1 7 3 0 0 1 7 4 0
> VecScatterBegin 198 1.0 1.2941e-03 1.0 0.00e+00 0.0 4.0e+02 9.4e+04
> 0.0e+00 0 0 97 54 0 0 0 97 54 0 0
> VecScatterEnd 198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0
> SFSetGraph 1 1.0 7.6630e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFSetUp 1 1.0 1.2410e-04 1.0 0.00e+00 0.0 4.0e+00 2.3e+04
> 1.0e+00 0 0 1 0 0 0 0 1 0 0 0
> SFPack 198 1.0 5.1814e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> SFUnpack 198 1.0 3.8273e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSetUp 2 1.0 4.7077e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+00 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1 1.0 6.0344e+00 1.0 2.23e+09 1.0 4.0e+02 9.4e+04
> 4.0e+02 98 100 97 54 87 98 100 97 54 90 738
> PCSetUp 2 1.0 1.8496e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 695
> PCSetUpOnBlocks 1 1.0 1.8435e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 697
> PCApply 199 1.0 9.5103e-01 1.1 7.75e+08 1.0 0.0e+00 0.0e+00
> 0.0e+00 15 35 0 0 0 15 35 0 0 0 1630
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type Creations Destructions. Reports information only for
> process 0.
>
> --- Event Stage 0: Main Stage
>
> Viewer 4 1
> Matrix 5 1
> Vector 13 2
> Index Set 5 2
> Star Forest Graph 1 0
> Krylov Solver 2 0
> Preconditioner 2 0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 2.47e-08
> Average time for MPI_Barrier(): 4.406e-07
> Average time for zero size MPI_Send(): 4.769e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5
> --download-mumps --download-scalapack --with-mpi=1
> --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings
> -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow
> -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall
> -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch
> -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib
> -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib
> -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11
> -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread
> -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm
> -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> Chris
>
>
> On Fri, Sep 8, 2023 at 3:00 PM Barry Smith <[email protected]
> <mailto:[email protected]>> wrote:
>>
>> It would be very helpful if you could run on 1 and 2 ranks with -log_view
>> and send all the output.
>>
>>
>>
>>> On Sep 8, 2023, at 4:52 PM, Chris Hewson <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Hi There,
>>>
>>> I am trying to solve a linear problem and am having an issue where
>>> KSPSolve slows down considerably the more MPI processes I add.
>>>
>>> The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I am
>>> using PETSc version 3.19.5 and have tried a couple of different versions
>>> of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and v3.3.2 w/
>>> ch3:sock).
>>>
>>> In testing, I've noticed the following timing trend for the KSPSolve
>>> call:
>>> 1 core: 4042 ms
>>> 2 core: 7085 ms
>>> 4 core: 26573 ms
>>> 8 core: 65745 ms
>>> 16 core: 149283 ms
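The reported timings translate into the following slowdowns and parallel
efficiencies (a small script over the numbers above):

```python
# KSPSolve wall times (ms) reported above, by core count
times_ms = {1: 4042, 2: 7085, 4: 26573, 8: 65745, 16: 149283}
t1 = times_ms[1]

for p, t in times_ms.items():
    # ideal scaling would give t1 / p; efficiency = t1 / (p * t)
    print(f"{p:2d} cores: {t / t1:6.2f}x the 1-core time, "
          f"efficiency {t1 / (p * t):.1%}")
```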
>>>
>>> This was all done on a single node machine w/ 16 non-hyperthreaded cores.
>>> We solve quite a few different matrices with PETSc using MPI and haven't
>>> noticed an impact like this on performance before.
>>>
>>> I am very confused and a little stumped as to why this is happening. I've
>>> been using the KSPBCGS solver and have tried multiple different solvers
>>> and preconditioners (we usually don't use a preconditioner for this part
>>> of our code).
>>>
>>> Using the pipelined BCGS solver (KSPPIPEBCGS) did seem to improve the
>>> parallel speed slightly (maybe 15%), but it still doesn't come close to
>>> the single-rank speed.
>>>
>>> I'll attach a link to a folder that contains the specific A matrix and the
>>> x and b vectors for this problem, as well as a main.cpp file that I was
>>> using for testing.
>>>
>>> https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing
>>>
>>> I was testing this in our main code base (not included here) and observed
>>> very similar speed results to the ones above. We use Metis for graph
>>> partitioning in our own code and checked the vector and matrix
>>> partitioning, which all made sense. I could be doing the partitioning
>>> incorrectly in the example (I'm not 100% sure how it works with the
>>> viewer/load functions).
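One way to inspect what the viewer/load path is doing is to fix the local
sizes before loading and print each rank's ownership range. A sketch using
petsc4py rather than the poster's main.cpp (the file name "A.dat" and the
even PETSC_DECIDE split are assumptions, not taken from the example):

```python
from petsc4py import PETSc

comm = PETSc.COMM_WORLD

# Open the binary file previously written with MatView() on a binary viewer
viewer = PETSc.Viewer().createBinary("A.dat", "r", comm=comm)

A = PETSc.Mat().create(comm=comm)
# Setting sizes BEFORE load() controls the row partitioning;
# PETSc.DECIDE splits the 620100 rows evenly across the ranks.
A.setSizes(((PETSc.DECIDE, 620100), (PETSc.DECIDE, 620100)))
A.setFromOptions()
A.load(viewer)

# Print which contiguous row block each rank owns, in rank order
rstart, rend = A.getOwnershipRange()
PETSc.Sys.syncPrint(f"rank {comm.getRank()}: rows [{rstart}, {rend})")
PETSc.Sys.syncFlush()
```

Comparing these ranges against the partitioning your own code produces with
Metis would show whether the example's load path distributes the rows the
same way.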
>>>
>>> Any insight or thoughts on this would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Chris Hewson
>>> Senior Reservoir Simulation Engineer
>>> ResFrac
>>> +1.587.575.9792
>>