> On Jan 8, 2023, at 2:44 PM, Matthew Knepley <knep...@gmail.com> wrote:
> 
> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>   Mark,
>> 
>>   Looks like the error checking in PetscCommDuplicate() is doing its job. It 
>> is reporting an attempt to use a PETSc object constructor on a subset of 
>> ranks of an MPI_Comm (which is, of course, fundamentally impossible in the 
>> PETSc/MPI model).
>> 
>> Note that nroots can be negative on a particular rank (based on the comment 
>> in the code below), but DMPlexLabelComplete_Internal() is collective on sf.
>> 
>> 
>> struct _p_PetscSF {
>> ....
>>   PetscInt     nroots;  /* Number of root vertices on current process 
>> (candidates for incoming edges) */
>> 
>> But the next routine calls a collective only when nroots >= 0:
>> 
>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, 
>> PetscBool completeCells){
>> ...
>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>   if (nroots >= 0) {
>>     DMLabel         lblRoots, lblLeaves;
>>     IS              valueIS, pointIS;
>>     const PetscInt *values;
>>     PetscInt        numValues, v;
>> 
>>     /* Pull point contributions from remote leaves into local roots */
>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>> 
>> 
>> The code is four years old? How come this problem of calling the 
>> constructor on a subset of ranks hasn't come up since day 1?
> 
> The contract here is that it should be impossible to have nroots < 0 (meaning 
> the SF is not set up) on a subset of processes. Do we know that this is 
> happening?

   Then there should definitely be a PetscCheck() that nroots >= 0 here, and 
there is absolutely no need for the if (nroots >= 0) branch in the code.


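   And for the record, here is a toy example (hypothetical, not taken from 
Mark's code) of exactly the usage that checking is meant to catch: a collective 
constructor reached on only a subset of the ranks of its communicator.

     #include <petscsf.h>

     int main(int argc, char **argv)
     {
       PetscSF     sf;
       PetscMPIInt rank;

       PetscFunctionBeginUser;
       PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
       PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
       if (rank == 0) { /* rank-dependent branch around a collective constructor */
         PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
         PetscCall(PetscSFDestroy(&sf));
       }
       PetscCall(PetscFinalize());
       return 0;
     }

   In a debug build this either hangs in the MPI call inside 
PetscCommDuplicate() until the batch system kills it (which is what the 
signal 15 in the trace below looks like) or fails its consistency check.
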
> 
>   Thanks,
> 
>     Matt
>  
>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>> 
>>> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE. 
>>> In going up to 64 nodes, something really catastrophic is happening. 
>>> I understand I am not using the machine the way it was intended, but I just 
>>> want to see if there are any options that I could try for a quick fix/help.
>>> 
>>> In a debug build I get a stack trace on many but not all of the 4K 
>>> processes. 
>>> Alas, I am not sure why this job was terminated, but every process I 
>>> checked that had an "ERROR" had this stack:
>>> 
>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>> [3160]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>> [3160]PETSC ERROR: #1 MPI function
>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>> 
>>> Maybe the MPI is just getting overwhelmed. 
>>> 
>>> And I was able to get one run to work (one TS with beuler), but the solver 
>>> performance was horrendous, and I see this (attached):
>>> 
>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>> etc,
>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>> 
>>> Any ideas would be welcome,
>>> Thanks,
>>> Mark
>>> <cushersolve.txt>
>> 
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/
