There is a bug in the routine DMPlexLabelComplete_Internal()! The code 
should definitely not route around the body based on if (nroots >= 0), because 
branching on the nroots value is simply nonsense: if one "knows" "by contract" 
that nroots is >= 0, then the if () test is not needed.

   The first thing to do is fix the bug: replace the nonsensical 
if (nroots >= 0) check with a PetscCheck() and rerun your code to see what happens.
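
A minimal sketch of the change I mean (the error code and message below are 
placeholders; whoever commits the fix should pick appropriate ones):

```c
/* In DMPlexLabelComplete_Internal(): instead of silently skipping the
   collective work when nroots < 0, enforce the contract explicitly. */
PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
PetscCheck(nroots >= 0, PETSC_COMM_SELF, PETSC_ERR_ARG_WRONGSTATE,
           "PetscSF graph has not been set up on this rank");
/* ... then proceed unconditionally with the collective DMLabelGather() etc. */
```

With the check in place, a rank whose SF is not set up errors out immediately 
with a useful message instead of silently skipping a collective call.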

  Barry

Yes, it is possible that in your run nroots is always >= 0 and some MPI bug 
is causing the problem, but that doesn't change the fact that the current code 
is buggy and needs to be fixed before blaming some other bug for the problem.
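
For the record, the hazard in the current code is the classic subset-collective 
pattern. Schematically (this is an illustration, not the actual PETSc source):

```c
/* If nroots differs in sign across ranks, only a subset of ranks enters
   the branch and calls the collective.  The ranks that skip it never make
   the matching call, so the ranks that do either hang or, with PETSc's
   error checking, report an attempt to use a PETSc object constructor on
   a subset of ranks of the MPI_Comm. */
if (nroots >= 0) {                                       /* may be true on only some ranks */
  PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));  /* collective on the SF's comm */
}
```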



> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> 
> 
> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com 
> <mailto:knep...@gmail.com>> wrote:
>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev 
>> <mailto:bsm...@petsc.dev>> wrote:
>>> 
>>>   Mark,
>>> 
>>>   Looks like the error checking in PetscCommDuplicate() is doing its job. 
>>> It is reporting an attempt to use a PETSc object constructor on a subset 
>>> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in 
>>> the PETSc/MPI model)
>>> 
>>> Note that nroots can be negative on a particular rank but 
>>> DMPlexLabelComplete_Internal() is collective on sf based on the comment in 
>>> the code below
>>> 
>>> 
>>> struct _p_PetscSF {
>>> ....
>>>   PetscInt     nroots;  /* Number of root vertices on current process 
>>> (candidates for incoming edges) */
>>> 
>>> But the next routine calls a collective only when nroots >= 0 
>>> 
>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, 
>>> PetscBool completeCells){
>>> ...
>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>   if (nroots >= 0) {
>>>     DMLabel         lblRoots, lblLeaves;
>>>     IS              valueIS, pointIS;
>>>     const PetscInt *values;
>>>     PetscInt        numValues, v;
>>> 
>>>     /* Pull point contributions from remote leaves into local roots */
>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>> 
>>> 
>>> The code is four years old? How come this problem of calling the 
>>> constructor on a subset of ranks hasn't come up since day 1? 
>> 
>> The contract here is that it should be impossible to have nroots < 0 
>> (meaning the SF is not setup) on a subset of processes. Do we know that this 
>> is happening?
> 
> Can't imagine a code bug here. Very simple code.
> 
> This code does use GAMG as the coarse grid solver in a pretty extreme way.
> GAMG is fairly complicated and not used on such small problems with high 
> parallelism.
> It is conceivable that it's a GAMG bug, but that is not what was going on in 
> my initial email here.
> 
> Here is a run that timed out, but it should not have so I think this is the 
> same issue. I always have perfectly distributed grids like this.
> 
> DM Object: box 2048 MPI processes
>   type: plex
> box in 2 dimensions:
>   Min/Max of 0-cells per rank: 8385/8580
>   Min/Max of 1-cells per rank: 24768/24960
>   Min/Max of 2-cells per rank: 16384/16384
> Labels:
>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>   marker: 1 strata with value/size (1 (385))
>   Face Sets: 1 strata with value/size (1 (381))
>   Defined by transform from:
>   DM_0x84000002_1 in 2 dimensions:
>     Min/Max of 0-cells per rank:   2145/2244
>     Min/Max of 1-cells per rank:   6240/6336
>     Min/Max of 2-cells per rank:   4096/4096
>   Labels:
>     celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>     depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>     marker: 1 strata with value/size (1 (193))
>     Face Sets: 1 strata with value/size (1 (189))
>     Defined by transform from:
>     DM_0x84000002_2 in 2 dimensions:
>       Min/Max of 0-cells per rank:     561/612
>       Min/Max of 1-cells per rank:     1584/1632
>       Min/Max of 2-cells per rank:     1024/1024
>     Labels:
>       celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>       depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>       marker: 1 strata with value/size (1 (97))
>       Face Sets: 1 strata with value/size (1 (93))
>       Defined by transform from:
>       DM_0x84000002_3 in 2 dimensions:
>         Min/Max of 0-cells per rank:       153/180
>         Min/Max of 1-cells per rank:       408/432
>         Min/Max of 2-cells per rank:       256/256
>       Labels:
>         celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>         depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>         marker: 1 strata with value/size (1 (49))
>         Face Sets: 1 strata with value/size (1 (45))
>         Defined by transform from:
>         DM_0x84000002_4 in 2 dimensions:
>           Min/Max of 0-cells per rank:         45/60
>           Min/Max of 1-cells per rank:         108/120
>           Min/Max of 2-cells per rank:         64/64
>         Labels:
>           celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>           depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>           marker: 1 strata with value/size (1 (25))
>           Face Sets: 1 strata with value/size (1 (21))
>           Defined by transform from:
>           DM_0x84000002_5 in 2 dimensions:
>             Min/Max of 0-cells per rank:           15/24
>             Min/Max of 1-cells per rank:           30/36
>             Min/Max of 2-cells per rank:           16/16
>           Labels:
>             celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>             depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>             marker: 1 strata with value/size (1 (13))
>             Face Sets: 1 strata with value/size (1 (9))
>             Defined by transform from:
>             DM_0x84000002_6 in 2 dimensions:
>               Min/Max of 0-cells per rank:             6/12
>               Min/Max of 1-cells per rank:             9/12
>               Min/Max of 2-cells per rank:             4/4
>             Labels:
>               depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>               celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>               marker: 1 strata with value/size (1 (7))
>               Face Sets: 1 strata with value/size (1 (3))
> 0 TS dt 0.001 time 0.
> MHD    0) time =         0, Eergy=  2.3259668003585e+00 (plot ID 0)
>     0 SNES Function norm 5.415286407365e-03
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 
> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
> 
>  
>> 
>>   Thanks,
>> 
>>     Matt
>>  
>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov 
>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>> 
>>>> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE. 
>>>> In going up to 64 nodes, something really catastrophic is happening. 
>>>> I understand I am not using the machine the way it was intended, but I 
>>>> just want to see if there are any options that I could try for a quick 
>>>> fix/help.
>>>> 
>>>> In a debug build I get a stack trace on many but not all of the 4K 
>>>> processes. 
>>>> Alas, I am not sure why this job was terminated but every process that I 
>>>> checked, that had an "ERROR", had this stack:
>>>> 
>>>> 11:57 main *+= 
>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR 
>>>> slurm-245063.out |g 3160
>>>> [3160]PETSC ERROR: 
>>>> ------------------------------------------------------------------------
>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the 
>>>> batch system) has told this process to end
>>>> [3160]PETSC ERROR: Try option -start_in_debugger or 
>>>> -on_error_attach_debugger
>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
>>>> https://petsc.org/release/faq/
>>>> [3160]PETSC ERROR: ---------------------  Stack Frames 
>>>> ------------------------------------
>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always 
>>>> exact.
>>>> [3160]PETSC ERROR: #1 MPI function
>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>> [3160]PETSC ERROR: #5 DMLabelGather() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>> [3160]PETSC ERROR: #9 DMCopyDS() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at 
>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at 
>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>> 
>>>> Maybe the MPI is just getting overwhelmed. 
>>>> 
>>>> And I was able to get one run to to work (one TS with beuler), and the 
>>>> solver performance was horrendous and I see this (attached):
>>>> 
>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 
>>>> 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 
>>>> 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 
>>>> 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>> etc,
>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 
>>>> 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>> 
>>>> Any ideas would be welcome,
>>>> Thanks,
>>>> Mark
>>>> <cushersolve.txt>
>>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
>> 
>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>