How do you set the variable?

$ MKL_VERBOSE=1 ./ex1 -ksp_converged_reason
MKL_VERBOSE oneMKL 2024.0 Update 1 Product build 20240215 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.80GHz lp64 intel_thread
MKL_VERBOSE DDOT(10,0x22127c0,1,0x22127c0,1) 2.02ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DSCAL(10,0x7ffc9fb4ff08,0x22127c0,1) 12.67us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x22127c0,1,0x2212840,1) 1.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DDOT(10,0x2212840,1,0x2212840,1) 167ns CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
[...]
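A note on reading this output: the NThr field at the end of each MKL_VERBOSE line reports how many threads MKL actually used for that call, so NThr:1 above means these BLAS calls ran on a single thread. As a minimal sketch of how one might request and then verify a specific thread count, reusing the example binary above (MKL_NUM_THREADS is a standard oneMKL environment variable; the value 8 is illustrative):

$ MKL_NUM_THREADS=8 MKL_VERBOSE=1 ./ex1 -ksp_converged_reason

If NThr stays at 1 even when more threads are requested, the routines in question are effectively running sequentially; note also that with Dyn:1 (dynamic thread adjustment on) MKL may still choose fewer threads for very small operands.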
> On 21 Jun 2024, at 7:37 PM, Yongzhong Li <[email protected]> wrote:
>
> Hello all,
>
> I set MKL_VERBOSE=1 but observed no print output specific to the use of MKL. Does PETSc enable this verbose output?
>
> Best,
> Yongzhong
>
> From: Pierre Jolivet <[email protected]>
> Date: Friday, June 21, 2024 at 1:36 AM
> To: Junchao Zhang <[email protected]>
> Cc: Yongzhong Li <[email protected]>, [email protected]
> Subject: Re: [petsc-users] [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
> On 21 Jun 2024, at 6:42 AM, Junchao Zhang <[email protected]> wrote:
>
> I remember there are some MKL env vars to print the MKL routines called.
>
> The environment variable is MKL_VERBOSE
>
> Thanks,
> Pierre
>
> Maybe we can try it to see what MKL routines are really used, and then we can understand why some PETSc functions did not speed up.
>
> --Junchao Zhang
>
> On Thu, Jun 20, 2024 at 10:39 PM Yongzhong Li <[email protected]> wrote:
>
> Hi Barry, sorry about my last results. I didn't fully understand stage profiling and logging in PETSc; now I record only the KSPSolve() stage of my program. Some sample code is as follows:
>
> // Static variable to keep track of the stage counter
> static int stageCounter = 1;
>
> // Generate a unique stage name
> std::ostringstream oss;
> oss << "Stage " << stageCounter << " of Code";
> std::string stageName = oss.str();
>
> // Register the stage
> PetscLogStage stagenum;
> PetscLogStageRegister(stageName.c_str(), &stagenum);
> PetscLogStagePush(stagenum);
>
> KSPSolve(*ksp_ptr, b, x);
>
> PetscLogStagePop();
> stageCounter++;
>
> I have attached my new logging results; there is 1 main stage and 4 other stages, where each one is a KSPSolve() call.
>
> To provide some additional background, if you recall, I have been trying to get an efficient iterative solution using multithreading. I found that by compiling PETSc with the Intel MKL library instead of OpenBLAS, I am able to perform sparse matrix-vector multiplication faster; I am using MATSEQAIJMKL. This makes the shell matrix-vector product in each iteration scale well with the number of threads. However, I found that the total GMRES solve time (~KSPSolve() time) is not scaling well with the number of threads.
>
> From the logging results I learned that when performing KSPSolve(), there are some CPU overheads in PCApply() and KSPGMRESOrthog(). I ran my program using different numbers of threads and plotted the time consumption of PCApply() and KSPGMRESOrthog() against the number of threads. I found out these two operations are not scaling with the threads at all! My results are attached as a PDF to give you a clear view.
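One way to narrow down where the non-scaling time goes is to give PCApply() a logging stage of its own, in the same style as the snippet quoted above, and rerun with different thread counts. A minimal sketch, assuming an already configured KSP named ksp and compatible work vectors r and z exist (those names, and the repeat count of 100, are illustrative, not taken from the code above):

PC pc;
PetscLogStage pcStage;
PetscCall(KSPGetPC(ksp, &pc));                        // the (shell) preconditioner used by the solver
PetscCall(PetscLogStageRegister("PCApply only", &pcStage));
PetscCall(PetscLogStagePush(pcStage));
for (PetscInt i = 0; i < 100; ++i)
  PetscCall(PCApply(pc, r, z));                       // repeated applies so the -log_view timing is meaningful
PetscCall(PetscLogStagePop());

If the time in this stage does not shrink as the MKL thread count grows, the work inside the preconditioner (e.g., its MatSolve()) is likely running sequentially regardless of the BLAS threading; the same experiment can be repeated for the orthogonalization-heavy vector operations.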
> My question is: from my understanding, MatSolve() is involved in PCApply(), and KSPGMRESOrthog() involves many vector operations, so why can't these two parts scale well with the number of threads when the Intel MKL library is linked?
>
> Thank you,
> Yongzhong
>
> From: Barry Smith <[email protected]>
> Date: Friday, June 14, 2024 at 11:36 AM
> To: Yongzhong Li <[email protected]>
> Cc: [email protected], [email protected], Piero Triverio <[email protected]>
> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
> I am a bit confused. Without the initial guess computation, there are still a bunch of events I don't understand:
>
> MatTranspose          79 1.0 4.0598e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatMultSym        110 1.0 1.7419e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatMultNum         90 1.0 1.2640e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatMatMultSym      20 1.0 1.3049e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatRARtSym            25 1.0 1.2492e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> MatMatTrnMultSym      25 1.0 8.8265e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMatTrnMultNum      25 1.0 2.4820e+02 1.0 6.83e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0   275
> MatTrnMatMultSym      10 1.0 7.2984e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatTrnMatMultNum      10 1.0 9.3128e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
>
> In addition, there are many more VecMAXPY than VecMDot (in GMRES they are each done the same number of times):
>
> VecMDot             5588 1.0 1.7183e+03 1.0 2.06e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 10  0  0  0   8 10  0  0  0 12016
> VecMAXPY           22412 1.0 8.4898e+03 1.0 4.17e+13 1.0 0.0e+00 0.0e+00 0.0e+00 39 20  0  0  0  39 20  0  0  0  4913
>
> Finally, there are a huge number of
>
> MatMultAdd        258048 1.0 1.4178e+03 1.0 6.10e+13 1.0 0.0e+00 0.0e+00 0.0e+00  7 29  0  0  0   7 29  0  0  0 43025
>
> Are you making calls to all these routines? Are you doing this inside your MatMult() or before you call KSPSolve()?
>
> The reason I wanted you to make a simpler run without the initial guess code is that your events are far more complicated than would be produced by GMRES alone, so it is not possible to understand the behavior you are seeing without fully understanding all the events happening in the code.
>
> Barry
>
> On Jun 14, 2024, at 1:19 AM, Yongzhong Li <[email protected]> wrote:
>
> Thanks, I have attached the results without using any KSPGuess. At low frequency, the iteration steps are quite close to those with KSPGuess, specifically
>
> KSPGuess Object: 1 MPI process
>   type: fischer
>   Model 1, size 200
>
> However, I found that at higher frequency the number of iteration steps is significantly higher than with KSPGuess; I have attached both results for your reference.
>
> Moreover, could I ask why the run without the KSPGuess options can be used for a baseline comparison? What are we comparing here? How does it relate to the performance issue/bottleneck I found?
> “I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for matrix-vector product multiplied by the number of iterations”
>
> Thank you!
> Yongzhong
>
> From: Barry Smith <[email protected]>
> Date: Thursday, June 13, 2024 at 2:14 PM
> To: Yongzhong Li <[email protected]>
> Cc: [email protected], [email protected], Piero Triverio <[email protected]>
> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
> Can you please run the same thing without the KSPGuess option(s) for a baseline comparison?
>
> Thanks
>
> Barry
>
> On Jun 13, 2024, at 1:27 PM, Yongzhong Li <[email protected]> wrote:
>
> Hi Matt,
>
> I have rerun the program with the keys you provided. The system output from the KSP solve and the final PETSc log output are stored in the attached .txt file for your reference.
>
> Thanks!
> Yongzhong
>
> From: Matthew Knepley <[email protected]>
> Date: Wednesday, June 12, 2024 at 6:46 PM
> To: Yongzhong Li <[email protected]>
> Cc: [email protected], [email protected], Piero Triverio <[email protected]>
> Subject: Re: [petsc-maint] Assistance Needed with PETSc KSPSolve Performance Issue
>
> On Wed, Jun 12, 2024 at 6:36 PM Yongzhong Li <[email protected]> wrote:
>
> Dear PETSc developers,
>
> I hope this email finds you well.
>
> I am currently working on a project using PETSc and have encountered a performance issue with the KSPSolve function. Specifically, I have noticed that the time taken by KSPSolve is almost two times greater than the CPU time for the matrix-vector product multiplied by the number of iteration steps. I use C++ chrono to record CPU time.
>
> For context, I am using a shell system matrix A. Despite my efforts to parallelize the matrix-vector product (Ax), the overall solve time remains higher than the matrix-vector product per iteration indicates when multiple threads are used.
> Here are a few details of my setup:
> Matrix Type: Shell system matrix
> Preconditioner: Shell PC
> Parallel Environment: Using Intel MKL as PETSc's BLAS/LAPACK library; multithreading is enabled
>
> I have considered several potential reasons, such as preconditioner setup, additional solver operations, and the inherent overhead of using a shell system matrix. However, since KSPSolve is a high-level API, I have been unable to pinpoint the exact cause of the increased solve time.
>
> Have you observed the same issue? Could you please share some experience on how to diagnose and address this performance discrepancy? Any insights or recommendations you could offer would be greatly appreciated.
>
> For any performance question like this, we need to see the output of your code run with
>
>   -ksp_view -ksp_monitor_true_residual -ksp_converged_reason -log_view
>
> Thanks,
>
>    Matt
>
> Thank you for your time and assistance.
>
> Best regards,
> Yongzhong
>
> -----------------------------------------------------------
> Yongzhong Li
> PhD student | Electromagnetics Group
> Department of Electrical & Computer Engineering
> University of Toronto
> http://www.modelics.org
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
>
> <ksp_petsc_log.txt>
> <ksp_petsc_log.txt><ksp_petsc_log_noguess.txt>
