Hi Barry,

Looking at the streams output, that definitely seems to be the culprit. I get the following output log:
Running streams with '/opt/anl/mpich/bin/mpiexec ' using 'NPMAX=16'
 1  33795.3914   Rate (MB/s)
 2  37155.5788   Rate (MB/s) 1.09943
 3  35549.7490   Rate (MB/s) 1.05191
 4  35045.2962   Rate (MB/s) 1.03698
 5  42029.1432   Rate (MB/s) 1.24364
 6  38167.6186   Rate (MB/s) 1.12937
 7  39850.2228   Rate (MB/s) 1.17916
 8  36602.9043   Rate (MB/s) 1.08307
 9  36740.0487   Rate (MB/s) 1.08713
10  34839.9674   Rate (MB/s) 1.03091
11  31107.6189   Rate (MB/s) 0.920469
12  35057.6326   Rate (MB/s) 1.03735
13  35030.9879   Rate (MB/s) 1.03656
14  33619.9223   Rate (MB/s) 0.994808
15  33503.3583   Rate (MB/s) 0.991359
16  27556.2016   Rate (MB/s) 0.815383

[image: MPIscaling.png]

Chris

On Fri, Sep 8, 2023 at 3:53 PM Barry Smith <[email protected]> wrote:

>
>    For the size problem you are running this is very unexpected. The
> reductions should only start to dominate for thousands of MPI ranks, not 2.
>
>    First thing I recommend is to run the streams benchmark. Then check the
> binding that MPI is doing for the two processes. You want to bind to cores
> in different NUMA regions. It could be that it is binding both processes to
> cores that share the same cache. The MatSolve should be embarrassingly
> parallel, but you are getting almost no speedup in it, so something very
> "wrong" is happening.
>
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                         Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>
> VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
> VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
> VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
>
> VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
> VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
> VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
> VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
>
> MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
> MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
>
>
> On Sep 8, 2023, at 5:22 PM, Chris Hewson <[email protected]> wrote:
>
> Thanks for the quick response.
>
> The links to the log view files are below:
>
> 2 ranks:
> https://drive.google.com/file/d/1PGRsiHypWtN5h3uxdJBKy9WzEkE0mUgO/view?usp=drive_link
>
> 1 rank:
> https://drive.google.com/file/d/1hB2XyoNtLMHseZUT7jCuiixTQeBi_tjJ/view?usp=drive_link
>
> I'll also attach them to this email:
>
> **************************** 1 RANK ******************************
> ------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 1 processor, by chewson Fri Sep 8 15:16:51 2023
> Using Petsc Release Version 3.19.5, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           2.746e+00     1.000   2.746e+00
> Objects:              2.100e+01     1.000   2.100e+01
> Flops:                4.117e+09     1.000   4.117e+09  4.117e+09
> Flops/sec:            1.499e+09     1.000   1.499e+09  1.499e+09
> MPI Msg Count:        0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Msg Len (bytes):  0.000e+00     0.000   0.000e+00  0.000e+00
> MPI Reductions:       0.000e+00     0.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 2.7458e+00 100.0%  4.1167e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                         Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult              182 1.0 8.0351e-01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 29 35  0  0  0  29 35  0  0  0  1775
> MatSolve             183 1.0 1.0882e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
> MatLUFactorNum         1 1.0 1.3892e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   934
> MatILUFactorSym        1 1.0 2.1567e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatAssemblyBegin       1 1.0 1.0420e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 6.9049e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            1 1.0 3.8500e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 1.7026e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLoad                1 1.0 6.6749e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
> VecDot               182 1.0 1.9998e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  7  5  0  0  0   7  5  0  0  0  1129
> VecDotNorm2           91 1.0 6.6214e-02 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  5  0  0  0   2  5  0  0  0  3409
> VecNorm               92 1.0 1.4790e-01 1.0 1.14e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  3  0  0  0   5  3  0  0  0   771
> VecCopy                2 1.0 6.8473e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet                 3 1.0 1.3256e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPBYCZ           182 1.0 1.6542e-01 1.0 4.51e+08 1.0 0.0e+00 0.0e+00 0.0e+00  6 11  0  0  0   6 11  0  0  0  2729
> VecWAXPY             182 1.0 1.4476e-01 1.0 2.26e+08 1.0 0.0e+00 0.0e+00 0.0e+00  5  5  0  0  0   5  5  0  0  0  1559
> VecLoad                2 1.0 1.0104e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               1 1.0 9.9204e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 2.6210e+00 1.0 4.10e+09 1.0 0.0e+00 0.0e+00 0.0e+00 95 100 0  0  0  95 100 0  0  0  1566
> PCSetUp                1 1.0 3.7232e-02 1.0 1.30e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   349
> PCApply              183 1.0 1.0885e+00 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 40 35  0  0  0  40 35  0  0  0  1318
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type          Creations   Destructions. Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Viewer     4              1
>               Matrix     3              1
>               Vector     9              1
>        Krylov Solver     1              0
>       Preconditioner     1              0
>            Index Set     3              0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 1.51e-08
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5 --download-mumps --download-scalapack --with-mpi=1 --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> ************************* 2 RANKS ************************************
> ------------------------------------------------------------------ PETSc Performance Summary: ------------------------------------------------------------------
>
> ./petsc-testing on a named ubuntu-office with 2 processors, by chewson Fri Sep 8 15:16:43 2023
> Using Petsc Release Version 3.19.5, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           6.167e+00     1.001   6.164e+00
> Objects:              3.200e+01     1.000   3.200e+01
> Flops:                2.233e+09     1.000   2.233e+09  4.467e+09
> Flops/sec:            3.625e+08     1.001   3.623e+08  7.247e+08
> MPI Msg Count:        2.050e+02     1.000   2.050e+02  4.100e+02
> MPI Msg Len (bytes):  3.437e+07     1.000   1.676e+05  6.874e+07
> MPI Reductions:       4.580e+02     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>  0:      Main Stage: 6.1642e+00 100.0%  4.4670e+09 100.0%  4.100e+02 100.0%  1.676e+05      100.0%  4.400e+02  96.1%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                         Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided          1 1.0 3.1824e-05 1.0 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult              198 1.0 1.3480e+00 1.4 7.76e+08 1.0 4.0e+02 9.4e+04 0.0e+00 19 35 97 54  0  19 35 97 54  0  1151
> MatSolve             199 1.0 8.9666e-01 1.2 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 13 35  0  0  0  13 35  0  0  0  1729
> MatLUFactorNum         1 1.0 7.1852e-03 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  1789
> MatILUFactorSym        1 1.0 1.0472e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin       1 1.0 9.8700e-07 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyEnd         1 1.0 6.8341e-03 1.1 0.00e+00 0.0 4.0e+00 2.3e+04 5.0e+00  0  0  1  0  1   0  0  1  0  1     0
> MatGetRowIJ            1 1.0 1.9930e-06 6.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 7.4472e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLoad                1 1.0 9.8562e-02 1.0 0.00e+00 0.0 1.0e+01 2.7e+06 1.7e+01  2  0  2 39  4   2  0  2 39  4     0
> VecDot               198 1.0 2.1037e+00 1.1 1.23e+08 1.0 0.0e+00 0.0e+00 2.0e+02 33  5  0  0 43  33  5  0  0 45   117
> VecDotNorm2           99 1.0 5.0169e-01 1.2 1.23e+08 1.0 0.0e+00 0.0e+00 9.9e+01  7  5  0  0 22   7  5  0  0 22   489
> VecNorm              100 1.0 1.3131e+00 1.0 6.20e+07 1.0 0.0e+00 0.0e+00 1.0e+02 21  3  0  0 22  21  3  0  0 23    94
> VecCopy                2 1.0 7.4971e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet               202 1.0 8.0035e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> VecAXPBYCZ           198 1.0 1.2889e-01 1.5 2.46e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2 11  0  0  0   2 11  0  0  0  3811
> VecWAXPY             198 1.0 9.1526e-02 1.0 1.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  2683
> VecLoad                2 1.0 9.8983e-03 1.0 0.00e+00 0.0 4.0e+00 1.2e+06 1.6e+01  0  0  1  7  3   0  0  1  7  4     0
> VecScatterBegin      198 1.0 1.2941e-03 1.0 0.00e+00 0.0 4.0e+02 9.4e+04 0.0e+00  0  0 97 54  0   0  0 97 54  0     0
> VecScatterEnd        198 1.0 7.6160e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0
> SFSetGraph             1 1.0 7.6630e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetUp                1 1.0 1.2410e-04 1.0 0.00e+00 0.0 4.0e+00 2.3e+04 1.0e+00  0  0  1  0  0   0  0  1  0  0     0
> SFPack               198 1.0 5.1814e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFUnpack             198 1.0 3.8273e-05 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetUp               2 1.0 4.7077e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve               1 1.0 6.0344e+00 1.0 2.23e+09 1.0 4.0e+02 9.4e+04 4.0e+02 98 100 97 54 87  98 100 97 54 90   738
> PCSetUp                2 1.0 1.8496e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   695
> PCSetUpOnBlocks        1 1.0 1.8435e-02 1.0 6.43e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   697
> PCApply              199 1.0 9.5103e-01 1.1 7.75e+08 1.0 0.0e+00 0.0e+00 0.0e+00 15 35  0  0  0  15 35  0  0  0  1630
>
> --- Event Stage 1: Unknown
>
> ------------------------------------------------------------------------------------------------------------------------
>
> Object Type          Creations   Destructions. Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Viewer     4              1
>               Matrix     5              1
>               Vector    13              2
>            Index Set     5              2
>    Star Forest Graph     1              0
>        Krylov Solver     2              0
>       Preconditioner     2              0
>
> --- Event Stage 1: Unknown
>
> ========================================================================================================================
> Average time to get PetscTime(): 2.47e-08
> Average time for MPI_Barrier(): 4.406e-07
> Average time for zero size MPI_Send(): 4.769e-06
> #PETSc Option Table entries:
> -log_view # (source: command line)
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-debugging=0 --prefix=/opt/anl/petsc-3.19.5 --download-mumps --download-scalapack --with-mpi=1 --with-mpi-dir=/opt/anl/mpich COPTFLAGS=-O2 CXXOPTFLAGS=-O2 FOPTFLAGS=-O2
> -----------------------------------------
> Libraries compiled on 2023-09-08 16:27:49 on ubuntu-office
> Machine characteristics: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
> Using PETSc directory: /opt/anl/petsc-3.19.5
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /opt/anl/mpich/bin/mpicc -fPIC -Wall -Wwrite-strings -Wno-unknown-pragmas -Wno-lto-type-mismatch -Wno-stringop-overflow -fstack-protector -fvisibility=hidden -O2
> Using Fortran compiler: /opt/anl/mpich/bin/mpif90 -fPIC -Wall -ffree-line-length-none -ffree-line-length-0 -Wno-lto-type-mismatch -Wno-unused-dummy-argument -O2
> -----------------------------------------
>
> Using include paths: -I/opt/anl/petsc-3.19.5/include -I/opt/anl/mpich/include
> -----------------------------------------
>
> Using C linker: /opt/anl/mpich/bin/mpicc
> Using Fortran linker: /opt/anl/mpich/bin/mpif90
> Using libraries: -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -lpetsc -Wl,-rpath,/opt/anl/petsc-3.19.5/lib -L/opt/anl/petsc-3.19.5/lib -Wl,-rpath,/opt/anl/mpich/lib -L/opt/anl/mpich/lib -Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/11 -L/usr/lib/gcc/x86_64-linux-gnu/11 -ldmumps -lmumps_common -lpord -lpthread -lscalapack -llapack -lblas -lm -lX11 -lmpifort -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -lquadmath
> -----------------------------------------
>
> Chris
>
>
> On Fri, Sep 8, 2023 at 3:00 PM Barry Smith <[email protected]> wrote:
>
>>
>>    It would be very helpful if you could run on 1 and 2 ranks with
>> -log_view and send all the output.
>>
>>
>> On Sep 8, 2023, at 4:52 PM, Chris Hewson <[email protected]> wrote:
>>
>> Hi There,
>>
>> I am trying to solve a linear problem and am having an issue when I use
>> more MPI processes: the KSPSolve slows down considerably the more
>> processes I add.
>>
>> The matrix itself is 620100 x 620100 with ~5 million non-zero entries. I
>> am using PETSc version 3.19.5 and have tried a couple of different
>> versions of MPICH, getting the same behavior (v4.1.2 w/ device ch4:ofi and
>> v3.3.2 w/ ch3:sock).
>>
>> In testing, I've noticed the following trend in speed for the KSPSolve
>> function call:
>> 1 core:   4042 ms
>> 2 cores:  7085 ms
>> 4 cores:  26573 ms
>> 8 cores:  65745 ms
>> 16 cores: 149283 ms
>>
>> This was all done on a single-node machine w/ 16 non-hyperthreaded cores.
>> We solve quite a few different matrices with PETSc using MPI and haven't
>> noticed an impact like this on performance before.
>>
>> I am very confused by this and a little stumped at the moment as to why
>> this is happening. I've been using the KSPBCGS solver to solve the
>> problem. I have tried multiple different solvers and preconditioners
>> (we usually don't use a preconditioner for this part of our code).
>>
>> Using the pipelined BCGS solver did seem to help improve the parallel
>> speed slightly (maybe 15%), but it still doesn't come close to the
>> single-threaded speed.
>>
>> I'll attach a link to a folder that contains the specific A, x and b
>> matrices for this problem, as well as a main.cpp file that I was using for
>> testing:
>>
>> https://drive.google.com/drive/folders/1CEDinKxu8ZbKpLtwmqKqP1ZIDG7JvDI1?usp=sharing
>>
>> I was testing this in our main code base, but don't include that here,
>> and observe very similar speed results to the ones above. We do use Metis
>> to graph-partition in our own code and checked the vector and matrix
>> partitioning, and that all made sense. I could be doing the partitioning
>> incorrectly in the example (not 100% sure how it works with the viewer/load
>> functions).
>>
>> Any insight or thoughts on this would be greatly appreciated.
>>
>> Thanks,
>>
>> *Chris Hewson*
>> Senior Reservoir Simulation Engineer
>> ResFrac
>> +1.587.575.9792
>>
>>
>>
>
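A note on reproducing the checks Barry suggests: the streams numbers at the top of the thread come from PETSc's bundled benchmark, and the MPI binding can be inspected from the command line. The sketch below is only a starting point; it assumes a PETSc source tree (the 'streams' make target is not part of a --prefix install) and MPICH's Hydra mpiexec, whose binding-flag syntax can differ between versions.

    # Run the PETSc MPI streams benchmark (from the PETSc source directory)
    cd $PETSC_DIR
    make streams NPMAX=16

    # Show which cores each rank is allowed to run on (Linux)
    /opt/anl/mpich/bin/mpiexec -n 2 grep Cpus_allowed_list /proc/self/status

    # Re-run the test with each rank pinned to a different NUMA domain
    # (-bind-to is a Hydra option; check your mpiexec's help output for exact syntax)
    /opt/anl/mpich/bin/mpiexec -n 2 -bind-to numa ./petsc-testing -log_view

If the two ranks report overlapping core lists, they are competing for the same cores or the same cache, which would be consistent with Barry's suspicion and with the nearly flat streams scaling shown above.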
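For anyone wanting to reproduce the test itself: a minimal driver in the spirit of the main.cpp described above might look like the sketch below (the binary file names "A.bin" and "b.bin" are placeholders; the actual files and program are in the linked Drive folder). MatLoad/VecLoad distribute the rows across ranks automatically, which is why no explicit partitioning appears in the example.

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat         A;
      Vec         b, x;
      KSP         ksp;
      PetscViewer viewer;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

      /* Load the matrix from a PETSc binary file; MatLoad splits the rows
         across the ranks using the default layout. */
      PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "A.bin", FILE_MODE_READ, &viewer));
      PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
      PetscCall(MatSetType(A, MATAIJ));
      PetscCall(MatLoad(A, viewer));
      PetscCall(PetscViewerDestroy(&viewer));

      /* Load the right-hand side the same way and reuse its layout for x. */
      PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, "b.bin", FILE_MODE_READ, &viewer));
      PetscCall(VecCreate(PETSC_COMM_WORLD, &b));
      PetscCall(VecLoad(b, viewer));
      PetscCall(PetscViewerDestroy(&viewer));
      PetscCall(VecDuplicate(b, &x));

      /* BiCGStab; KSPSetFromOptions() lets the type and preconditioner be
         overridden from the command line. */
      PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
      PetscCall(KSPSetOperators(ksp, A, A));
      PetscCall(KSPSetType(ksp, KSPBCGS));
      PetscCall(KSPSetFromOptions(ksp));
      PetscCall(KSPSolve(ksp, b, x));

      PetscCall(KSPDestroy(&ksp));
      PetscCall(MatDestroy(&A));
      PetscCall(VecDestroy(&b));
      PetscCall(VecDestroy(&x));
      PetscCall(PetscFinalize());
      return 0;
    }

With KSPSetFromOptions() in place, the pipelined variant Chris mentions can be tried without recompiling, e.g.

    /opt/anl/mpich/bin/mpiexec -n 2 ./petsc-testing -ksp_type pipebcgs -ksp_view -log_view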
