You are not using hypre; you are using block Jacobi with ILU on
the blocks.
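
As a side note, here is a minimal sketch of the runtime options involved
(assuming your code calls KSPSetFromOptions() so that command-line options
take effect). The parallel default you are currently getting is equivalent to

    -pc_type bjacobi -sub_pc_type ilu

while actually using hypre's BoomerAMG would be

    -pc_type hypre -pc_hypre_type boomeramg

You can run with -ksp_view to confirm which preconditioner is really
being used.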
The number of iterations goes from around 4000 with 4 processes to
around 5000 with 8 (compare the MatMult counts in the two logs). Block
Jacobi gets weaker as the number of blocks grows, and this growth in
iterations is why you do not see a better speedup.
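
If you want to see the growth directly, a small sketch in C (assuming a
KSP object ksp and vectors b and x as in your existing code, with the
usual PetscErrorCode ierr error checking):

    PetscInt its;
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
    /* Number of iterations the last solve needed; comparing this on
       4 and 8 processes shows the increase described above. */
    ierr = KSPGetIterationNumber(ksp, &its);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD, "KSP iterations: %D\n", its);CHKERRQ(ierr);

Running with -ksp_monitor or -ksp_converged_reason gives the same
information without changing the code.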
Barry
On Jun 6, 2008, at 8:07 PM, Ben Tay wrote:
> Hi,
>
> I have coded in parallel using PETSc and Hypre. I found that going
> from 1 to 4 processors gives an almost 4-fold speedup, but going
> from 4 to 8 processors improves performance only by a factor of
> 1.2-1.5 instead of 2.
>
> Is the slowdown due to the matrix not being large enough? Currently
> I am using a 600x2160 matrix for the benchmark. Even when I increase
> the matrix size to 900x3240 or 1200x2160, the performance gain is
> not much better. Is it possible to use -log_summary to find out the
> problem? I have attached the -log_summary output for the 4- and
> 8-processor runs. Some events, like VecScatterEnd, VecNorm and
> MatAssemblyBegin, have much higher ratios. Does that indicate
> something? Another strange thing is that MatAssemblyBegin has a much
> higher ratio for the 4-processor run than for the 8-processor run.
> I thought there should be less communication in the 4-processor
> case, so the ratio should be lower. Does it mean there is a
> communication problem at that point?
>
> Thank you very much.
>
> Regards
>
>
> ************************************************************************************************************************
> ***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document              ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c43 with 4 processors, by g0306332 Fri Jun 6 17:29:26 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
> Max Max/Min Avg Total
> Time (sec): 1.750e+03 1.00043 1.750e+03
> Objects: 4.200e+01 1.00000 4.200e+01
> Flops: 6.961e+10 1.00074 6.959e+10 2.784e+11
> Flops/sec: 3.980e+07 1.00117 3.978e+07 1.591e+08
> MPI Messages: 8.168e+03 2.00000 6.126e+03 2.450e+04
> MPI Message Lengths: 5.525e+07 2.00000 6.764e+03 1.658e+08
> MPI Reductions: 3.203e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total    counts  %Total     Avg        %Total    counts  %Total
>  0: Main Stage: 1.7495e+03 100.0%  2.7837e+11 100.0%  2.450e+04 100.0%  6.764e+03 100.0%  1.281e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event               Count     Time (sec)       Flops/sec                        --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max        Ratio  Max      Ratio  Mess    Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             4082 1.0 8.2037e+01 1.5 4.67e+08 1.5 2.4e+04 6.8e+03 0.0e+00  4 37100100  0   4 37100100  0  1240
> MatSolve            1976 1.0 1.3250e+02 1.5 2.52e+08 1.5 0.0e+00 0.0e+00 0.0e+00  6 31  0  0  0   6 31  0  0  0   655
> MatLUFactorNum       300 1.0 3.8260e+01 1.2 2.07e+08 1.2 0.0e+00 0.0e+00 0.0e+00  2  9  0  0  0   2  9  0  0  0   668
> MatILUFactorSym        1 1.0 2.2550e-01 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatConvert             1 1.0 2.9182e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin     301 1.0 1.0776e+02 1228.9 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  4  0  0  0  5   4  0  0  0  5     0
> MatAssemblyEnd       301 1.0 9.6146e+00 1.1 0.00e+00 0.0 1.2e+01 3.6e+03 3.1e+02  1  0  0  0  2   1  0  0  0  2     0
> MatGetRow         324000 1.0 1.2161e-01 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            3 1.0 5.0068e-06 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 2.1279e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetup             601 1.0 2.5108e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve             600 1.0 1.2353e+03 1.0 5.64e+07 1.0 2.4e+04 6.8e+03 8.3e+03 71100100100 65  71100100100 65   225
> PCSetUp              601 1.0 4.0116e+01 1.2 1.96e+08 1.2 0.0e+00 0.0e+00 5.0e+00  2  9  0  0  0   2  9  0  0  0   637
> PCSetUpOnBlocks      300 1.0 3.8513e+01 1.2 2.06e+08 1.2 0.0e+00 0.0e+00 3.0e+00  2  9  0  0  0   2  9  0  0  0   664
> PCApply             4682 1.0 1.0566e+03 1.0 2.12e+07 1.0 0.0e+00 0.0e+00 0.0e+00 59 31  0  0  0  59 31  0  0  0    82
> VecDot              4812 1.0 8.2762e+00 1.1 4.00e+08 1.1 0.0e+00 0.0e+00 4.8e+03  0  4  0  0 38   0  4  0  0 38  1507
> VecNorm             3479 1.0 9.2739e+01 8.3 3.15e+08 8.3 0.0e+00 0.0e+00 3.5e+03  4  5  0  0 27   4  5  0  0 27   152
> VecCopy              900 1.0 2.0819e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              5882 1.0 9.4626e+00 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY             5585 1.0 1.5397e+01 1.5 4.67e+08 1.5 0.0e+00 0.0e+00 0.0e+00  1  7  0  0  0   1  7  0  0  0  1273
> VecAYPX             2879 1.0 1.0303e+01 1.6 4.45e+08 1.6 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  1146
> VecWAXPY            2406 1.0 7.7902e+00 1.6 3.14e+08 1.6 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0   801
> VecAssemblyBegin    1200 1.0 8.4259e+00 3.8 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+03  0  0  0  0 28   0  0  0  0 28     0
> VecAssemblyEnd      1200 1.0 2.4173e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin     4082 1.0 1.2512e-01 1.5 0.00e+00 0.0 2.4e+04 6.8e+03 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       4082 1.0 2.0954e+01 53.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory   Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 7 7 321241092 0
> Krylov Solver 3 3 8 0
> Preconditioner 3 3 528 0
> Index Set 7 7 7785600 0
> Vec 20 20 46685344 0
> Vec Scatter 2 2 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.90735e-07
> Average time for MPI_Barrier(): 1.45912e-05
> Average time for zero size MPI_Send(): 7.27177e-06
> OptionTable: -log_summary test4_600
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> -----------------------------------------
> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O
> Using Fortran compiler: mpif90 -I. -fPIC -O
> -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O
> Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich
> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread
> -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------
> ************************************************************************************************************************
> ***        WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document              ***
> ************************************************************************************************************************
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> ./a.out on a atlas3-mp named atlas3-c18 with 8 processors, by g0306332 Fri Jun 6 17:23:25 2008
> Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG revision: 414581156e67e55c761739b0deb119f7590d0f4b
>
> Max Max/Min Avg Total
> Time (sec): 1.140e+03 1.00019 1.140e+03
> Objects: 4.200e+01 1.00000 4.200e+01
> Flops: 4.620e+10 1.00158 4.619e+10 3.695e+11
> Flops/sec: 4.053e+07 1.00177 4.051e+07 3.241e+08
> MPI Messages: 9.954e+03 2.00000 8.710e+03 6.968e+04
> MPI Message Lengths: 7.224e+07 2.00000 7.257e+03 5.057e+08
> MPI Reductions: 1.716e+03 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:  ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                       Avg     %Total     Avg     %Total    counts  %Total     Avg        %Total    counts  %Total
>  0: Main Stage: 1.1402e+03 100.0%  3.6953e+11 100.0%  6.968e+04 100.0%  7.257e+03 100.0%  1.372e+04 100.0%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops/sec: Max - maximum over all processors
>                        Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
>
>
> ##########################################################
> # #
> # WARNING!!! #
> # #
> # This code was run without the PreLoadBegin() #
> # macros. To get timing results we always recommend #
> # preloading. otherwise timing numbers may be #
> # meaningless. #
> ##########################################################
>
>
> Event               Count     Time (sec)       Flops/sec                        --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max        Ratio  Max      Ratio  Mess    Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> MatMult             4975 1.0 7.8154e+01 1.9 4.19e+08 1.9 7.0e+04 7.3e+03 0.0e+00  5 38100100  0   5 38100100  0  1798
> MatSolve            2855 1.0 1.0870e+02 1.8 2.57e+08 1.8 0.0e+00 0.0e+00 0.0e+00  7 34  0  0  0   7 34  0  0  0  1153
> MatLUFactorNum       300 1.0 2.3238e+01 1.5 2.07e+08 1.5 0.0e+00 0.0e+00 0.0e+00  2  7  0  0  0   2  7  0  0  0  1099
> MatILUFactorSym        1 1.0 6.1973e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatConvert             1 1.0 1.4168e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin     301 1.0 6.9683e+01 8.6 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+02  4  0  0  0  4   4  0  0  0  4     0
> MatAssemblyEnd       301 1.0 6.2247e+00 1.2 0.00e+00 0.0 2.8e+01 3.6e+03 3.1e+02  0  0  0  0  2   0  0  0  0  2     0
> MatGetRow         162000 1.0 6.0330e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetRowIJ            3 1.0 9.0599e-06 3.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatGetOrdering         1 1.0 5.6710e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSetup             601 1.0 1.5631e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve             600 1.0 8.1668e+02 1.0 5.66e+07 1.0 7.0e+04 7.3e+03 9.2e+03 72100100100 67  72100100100 67   452
> PCSetUp              601 1.0 2.4372e+01 1.5 1.93e+08 1.5 0.0e+00 0.0e+00 5.0e+00  2  7  0  0  0   2  7  0  0  0  1048
> PCSetUpOnBlocks      300 1.0 2.3303e+01 1.5 2.07e+08 1.5 0.0e+00 0.0e+00 3.0e+00  2  7  0  0  0   2  7  0  0  0  1096
> PCApply             5575 1.0 6.5344e+02 1.1 2.57e+07 1.1 0.0e+00 0.0e+00 0.0e+00 55 34  0  0  0  55 34  0  0  0   192
> VecDot              4840 1.0 6.8932e+00 1.3 3.07e+08 1.3 0.0e+00 0.0e+00 4.8e+03  1  3  0  0 35   1  3  0  0 35  1820
> VecNorm             4365 1.0 1.2250e+02 3.6 6.82e+07 3.6 0.0e+00 0.0e+00 4.4e+03  8  5  0  0 32   8  5  0  0 32   153
> VecCopy              900 1.0 1.4297e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSet              6775 1.0 8.1405e+00 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY             6485 1.0 1.0003e+01 1.9 5.73e+08 1.9 0.0e+00 0.0e+00 0.0e+00  1  7  0  0  0   1  7  0  0  0  2420
> VecAYPX             3765 1.0 7.8289e+00 2.0 5.17e+08 2.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  2092
> VecWAXPY            2420 1.0 3.8504e+00 1.9 3.80e+08 1.9 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   0  2  0  0  0  1629
> VecAssemblyBegin    1200 1.0 9.2808e+00 3.4 0.00e+00 0.0 0.0e+00 0.0e+00 3.6e+03  1  0  0  0 26   1  0  0  0 26     0
> VecAssemblyEnd      1200 1.0 2.3313e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin     4975 1.0 2.2727e-01 2.6 0.00e+00 0.0 7.0e+04 7.3e+03 0.0e+00  0  0100100  0   0  0100100  0     0
> VecScatterEnd       4975 1.0 2.7557e+01 68.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions   Memory   Descendants' Mem.
>
> --- Event Stage 0: Main Stage
>
> Matrix 7 7 160595412 0
> Krylov Solver 3 3 8 0
> Preconditioner 3 3 528 0
> Index Set 7 7 3897600 0
> Vec 20 20 23357344 0
> Vec Scatter 2 2 0 0
> ========================================================================================================================
> Average time to get PetscTime(): 1.19209e-07
> Average time for MPI_Barrier(): 2.10285e-05
> Average time for zero size MPI_Send(): 7.59959e-06
> OptionTable: -log_summary test8_600
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8
> Configure run at: Tue Jan 8 22:22:08 2008
> Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 --with-batch=1 --with-mpi-shared=0 --with-mpi-include=/usr/local/topspin/mpi/mpich/include --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0
> -----------------------------------------
> Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01
> Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8
> Using PETSc arch: atlas3-mpi
> -----------------------------------------
> Using C compiler: mpicc -fPIC -O
> Using Fortran compiler: mpif90 -I. -fPIC -O
> -----------------------------------------
> Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include -I/home/enduser/g0306332/lib/hypre/include -I/usr/local/topspin/mpi/mpich/include
> ------------------------------------------
> Using C linker: mpicc -fPIC -O
> Using Fortran linker: mpif90 -I. -fPIC -O
> Using libraries: -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi
> -lpetscts -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib -L/usr/local/topspin/mpi/mpich/lib -lmpich
> -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread
> -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/
> -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport -lifcore -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lstdc++ -lcxaguard
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64
> -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs -libumad -lpthread -lrt
> -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc
> ------------------------------------------