Mark,

    You could write a simple pure MPI program to see if you can get a hang due 
to a messed-up MPI implementation.

    for (i = 0; i < something_big; i++) {
      MPI_Allreduce(...);
      MPI_Bcast(...);   /* broadcast from rank zero */
    }
If this hangs, you have something to give to NERSC.
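
    Here is a minimal sketch of what I have in mind (the iteration count, datatype, 
and progress printing are arbitrary placeholders):

    /* mpi_stress.c: pure MPI reproducer sketch */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int    rank, i;
      double val, sum;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      val = (double)rank;
      for (i = 0; i < 100000; i++) {                                  /* something big */
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);            /* from rank zero */
        if (rank == 0 && i % 10000 == 0) printf("iteration %d\n", i); /* progress marker */
      }
      MPI_Finalize();
      return 0;
    }

    Build it with the system MPI compiler wrapper (cc on Crusher, or mpicc) and run it 
with srun at the same node/rank count where you see the hangs; if it stalls, the progress 
prints tell you roughly where.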

> On Jan 9, 2023, at 11:21 PM, Barry Smith <bsm...@petsc.dev> wrote:
> 
> 
>    Ok, this is a completely different location of the problem.  
> 
>    The code is hanging in two different locations (for different ranks)
> 
> MPI_Allreduce()
> [2180]PETSC ERROR: #1 MPI function
> [2180]PETSC ERROR: #2 DMPlexView_Ascii() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1439
> 
> MPI_Bcast()
> [72]PETSC ERROR: #1 MPI function
> [72]PETSC ERROR: #2 PetscObjectName() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pname.c:123
> [72]PETSC ERROR: #3 PetscObjectGetName() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pgname.c:27
> [72]PETSC ERROR: #4 DMPlexView_Ascii() at 
> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1434
> 
> 
> It appears that the object has a name on some ranks (most of them; normally the 
> name is set with PetscObjectSetName()) but no name on other ranks.
> 
> -----
> There are some "bugs" in the current code (that should be fixed but are not 
> the root of this problem)
> 
> 1)
> 
>     PetscCall(PetscObjectGetName((PetscObject)dm, &name));
>     if (name) PetscCall(PetscViewerASCIIPrintf(viewer, "%s in %" PetscInt_FMT " dimension%s:\n", name, dim, dim == 1 ? "" : "s"));
>     else PetscCall(PetscViewerASCIIPrintf(viewer, "Mesh in %" PetscInt_FMT " dimension%s:\n", dim, dim == 1 ? "" : "s"));
> 
>  PetscObjectGetName() ALWAYS returns a non-null name, so the if (name) check 
> is nonsense and should be removed.
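> 
>  A minimal sketch of what removing that dead branch might look like (note that an 
> unnamed DM would then print its automatically generated name instead of the 
> literal "Mesh"):
> 
>     PetscCall(PetscObjectGetName((PetscObject)dm, &name));
>     PetscCall(PetscViewerASCIIPrintf(viewer, "%s in %" PetscInt_FMT " dimension%s:\n", name, dim, dim == 1 ? "" : "s"));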
> 
> 2) 
> 
> /*@C
>    PetscObjectSetName - Sets a string name associated with a PETSc object.
> 
>    Not Collective
> 
>    This is dangerous phrasing; while it is true that it does not need to be called 
> on all ranks, if it is called on only a subset of ranks then PetscObjectGetName() 
> will trigger a PetscObjectName(), which triggers an MPI_Bcast() on that subset of 
> ranks and causes a hang. 
> 
>     It should say Logically Collective, not Not Collective.
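> 
> For example, a hypothetical fragment like the following (not Mark's code, just an 
> illustration of the failure mode described above) would hang:
> 
>   if (rank == 0) PetscCall(PetscObjectSetName((PetscObject)dm, "box")); /* name set on a subset of ranks */
>   /* A later collective view calls PetscObjectGetName() on every rank; the ranks that
>      still have no name go through PetscObjectName() and its MPI_Bcast() on the DM's
>      communicator, while rank 0 skips it, so the unnamed ranks wait forever. */
>   PetscCall(DMView(dm, PETSC_VIEWER_STDOUT_WORLD));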
> 
> ----
> The problem occurs when viewing the coarsest of the six DMs; all the finer DMs are 
> viewed correctly. The coarser DMs are all named automatically with 
> PetscObjectName(), so they have names like DM_0x84000002_1.
> 
> 
> I cannot speculate on how it got into this state (since my last conclusion 
> was so wrong :-)). 
> 
> Mark, 
> 
>   What happens if you run the same problem (same number of ranks, etc.) but do 
> not use -dm_refine_hierarchy 6 (so the coarse DM is the only DM)?
> 
> Barry
> 
> 
> 
>> On Jan 9, 2023, at 7:56 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> OK, here is a timeout one with 4K processors.
>> 
>> On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <bsm...@petsc.dev 
>> <mailto:bsm...@petsc.dev>> wrote:
>>> 
>>> 
>>>> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfad...@lbl.gov 
>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>> 
>>>> Sorry, I deleted it. I ran a 128-node job with a debug version to try to 
>>>> reproduce it, but it finished without this error, with the same bad 
>>>> performance as my original log.
>>>> I think this message comes from a (time-out) signal.
>>> 
>>>   Sure, but we would need to know if there are multiple different places in 
>>> the code where the time-out is happening. 
>>> 
>>> 
>>> 
>>>> 
>>>> What I am seeing, though, is just really bad performance on large jobs. It 
>>>> is sudden: good scaling up to about 32 nodes and then a 100x slowdown in 
>>>> the solver (the vec norms) on 128 nodes (and 64).
>>>> 
>>>> Thanks,
>>>> Mark
>>>> 
>>>> 
>>>> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsm...@petsc.dev 
>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>> 
>>>>>   Rather cryptic use of that integer variable for anyone who did not 
>>>>> write the code, but I guess one was short on bytes when it was written 
>>>>> :-).
>>>>> 
>>>>>   Ok, based on what Jed said, if we are certain that either all or none 
>>>>> of the nroots are -1, then since you are getting hangs in 
>>>>> PetscCommDuplicate(), this might indicate that some ranks have called a 
>>>>> constructor on some OTHER object that not all ranks are creating. Of 
>>>>> course, it could be an MPI bug.
>>>>> 
>>>>>   Mark, can you send the file containing all the output from the Signal 
>>>>> Terminate run? Maybe there will be a hint in there of a different 
>>>>> constructor being called on some ranks.
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>> 
>>>>>>>>>> DMLabelGather
>>>>> 
>>>>>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfad...@lbl.gov 
>>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsm...@petsc.dev 
>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>> 
>>>>>>>    There is a bug in the routine DMPlexLabelComplete_Internal()! The 
>>>>>>> code should definitely not branch on if (nroots >= 0), because checking 
>>>>>>> the nroots value to decide on the code path is simply nonsense (if one 
>>>>>>> "knows" "by contract" that nroots is >= 0, then the if () test is not 
>>>>>>> needed).
>>>>>>> 
>>>>>>>    The first thing to do is to fix the bug: replace the nonsensical 
>>>>>>> if (nroots >= 0) check with a PetscCheck() and rerun your code to see 
>>>>>>> what happens. A sketch is below.
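>>>>>>> 
>>>>>>>    Something along these lines (just a sketch, untested, error code 
>>>>>>> chosen arbitrarily) in DMPlexLabelComplete_Internal():
>>>>>>> 
>>>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>   /* fail loudly instead of silently skipping the collective calls below */
>>>>>>>   PetscCheck(nroots >= 0, PetscObjectComm((PetscObject)dm), PETSC_ERR_ARG_WRONGSTATE, "Point SF is not set up (nroots = %" PetscInt_FMT ")", nroots);
>>>>>>>   /* ... then run the formerly guarded block unconditionally ... */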
>>>>>> 
>>>>>> This does not fix the bug, right? It just fails cleanly, right?
>>>>>> 
>>>>>> I do have lots of empty processors in the first GAMG coarse grid. I just 
>>>>>> saw that the first GAMG coarse grid reduces the processor count to 4, 
>>>>>> from 4K.
>>>>>> This is one case where repartitioning the coarse grids could, for once, 
>>>>>> actually be useful.
>>>>>> 
>>>>>> Do you have a bug fix suggestion for me to try?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>>> 
>>>>>>>   Barry
>>>>>>> 
>>>>>>> Yes, it is possible that in your run nroots is always >= 0 and some 
>>>>>>> MPI bug is causing the problem, but this doesn't change the fact that 
>>>>>>> the current code is buggy and needs to be fixed before blaming some 
>>>>>>> other bug for the problem.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov 
>>>>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com 
>>>>>>>> <mailto:knep...@gmail.com>> wrote:
>>>>>>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev 
>>>>>>>>> <mailto:bsm...@petsc.dev>> wrote:
>>>>>>>>>> 
>>>>>>>>>>   Mark,
>>>>>>>>>> 
>>>>>>>>>>   Looks like the error checking in PetscCommDuplicate() is doing its 
>>>>>>>>>> job. It is reporting an attempt to use a PETSc object constructor 
>>>>>>>>>> on a subset of ranks of an MPI_Comm (which is, of course, 
>>>>>>>>>> fundamentally impossible in the PETSc/MPI model).
>>>>>>>>>> 
>>>>>>>>>> Note that nroots can be negative on a particular rank, but 
>>>>>>>>>> DMPlexLabelComplete_Internal() is collective on the SF, based on the 
>>>>>>>>>> comment in the code below.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> struct _p_PetscSF {
>>>>>>>>>> ....
>>>>>>>>>>   PetscInt     nroots;  /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>>>>>> 
>>>>>>>>>> But the next routine calls a collective only when nroots >= 0 
>>>>>>>>>> 
>>>>>>>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>>>>>>>> {
>>>>>>>>>> ...
>>>>>>>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>>>>   if (nroots >= 0) {
>>>>>>>>>>     DMLabel         lblRoots, lblLeaves;
>>>>>>>>>>     IS              valueIS, pointIS;
>>>>>>>>>>     const PetscInt *values;
>>>>>>>>>>     PetscInt        numValues, v;
>>>>>>>>>> 
>>>>>>>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>>>>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The code is four years old? How come this problem of calling the 
>>>>>>>>>> constructor on a subset of ranks hasn't come up since day 1? 
>>>>>>>>> 
>>>>>>>>> The contract here is that it should be impossible to have nroots < 0 
>>>>>>>>> (meaning the SF is not set up) on a subset of processes. Do we know 
>>>>>>>>> that this is happening?
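>>>>>>>>> 
>>>>>>>>> One quick (hypothetical) way to check would be to compare the flag 
>>>>>>>>> across the communicator right before the branch, e.g.:
>>>>>>>>> 
>>>>>>>>>   PetscMPIInt setup = (nroots >= 0) ? 1 : 0, minSetup, maxSetup;
>>>>>>>>>   PetscCallMPI(MPI_Allreduce(&setup, &minSetup, 1, MPI_INT, MPI_MIN, PetscObjectComm((PetscObject)dm)));
>>>>>>>>>   PetscCallMPI(MPI_Allreduce(&setup, &maxSetup, 1, MPI_INT, MPI_MAX, PetscObjectComm((PetscObject)dm)));
>>>>>>>>>   /* min != max means the point SF is set up on some ranks but not others */
>>>>>>>>>   PetscCheck(minSetup == maxSetup, PetscObjectComm((PetscObject)dm), PETSC_ERR_PLIB, "nroots < 0 on only a subset of ranks");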
>>>>>>>> 
>>>>>>>> Can't imagine a code bug here. Very simple code.
>>>>>>>> 
>>>>>>>> This code does use GAMG as the coarse grid solver in a pretty extreme 
>>>>>>>> way.
>>>>>>>> GAMG is fairly complicated and not used on such small problems with 
>>>>>>>> high parallelism.
>>>>>>>> It is conceivable that its a GAMG bug, but that is not what was going 
>>>>>>>> on in my initial emal here.
>>>>>>>> 
>>>>>>>> Here is a run that timed out, but it should not have, so I think this 
>>>>>>>> is the same issue. I always have perfectly distributed grids like this.
>>>>>>>> 
>>>>>>>> DM Object: box 2048 MPI processes
>>>>>>>>   type: plex
>>>>>>>> box in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 8385/8580
>>>>>>>>   Min/Max of 1-cells per rank: 24768/24960
>>>>>>>>   Min/Max of 2-cells per rank: 16384/16384
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>>>>>>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>>>>>>>   marker: 1 strata with value/size (1 (385))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (381))
>>>>>>>>   Defined by transform from:
>>>>>>>>   DM_0x84000002_1 in 2 dimensions:
>>>>>>>>     Min/Max of 0-cells per rank:   2145/2244
>>>>>>>>     Min/Max of 1-cells per rank:   6240/6336
>>>>>>>>     Min/Max of 2-cells per rank:   4096/4096
>>>>>>>>   Labels:
>>>>>>>>     celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>>>>>>>     depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>>>>>>>     marker: 1 strata with value/size (1 (193))
>>>>>>>>     Face Sets: 1 strata with value/size (1 (189))
>>>>>>>>     Defined by transform from:
>>>>>>>>     DM_0x84000002_2 in 2 dimensions:
>>>>>>>>       Min/Max of 0-cells per rank:     561/612
>>>>>>>>       Min/Max of 1-cells per rank:     1584/1632
>>>>>>>>       Min/Max of 2-cells per rank:     1024/1024
>>>>>>>>     Labels:
>>>>>>>>       celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>>>>>>>       depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>>>>>>>       marker: 1 strata with value/size (1 (97))
>>>>>>>>       Face Sets: 1 strata with value/size (1 (93))
>>>>>>>>       Defined by transform from:
>>>>>>>>       DM_0x84000002_3 in 2 dimensions:
>>>>>>>>         Min/Max of 0-cells per rank:       153/180
>>>>>>>>         Min/Max of 1-cells per rank:       408/432
>>>>>>>>         Min/Max of 2-cells per rank:       256/256
>>>>>>>>       Labels:
>>>>>>>>         celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>>>>>>>         depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>>>>>>>         marker: 1 strata with value/size (1 (49))
>>>>>>>>         Face Sets: 1 strata with value/size (1 (45))
>>>>>>>>         Defined by transform from:
>>>>>>>>         DM_0x84000002_4 in 2 dimensions:
>>>>>>>>           Min/Max of 0-cells per rank:         45/60
>>>>>>>>           Min/Max of 1-cells per rank:         108/120
>>>>>>>>           Min/Max of 2-cells per rank:         64/64
>>>>>>>>         Labels:
>>>>>>>>           celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>>>>>>>           depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>>>>>>>           marker: 1 strata with value/size (1 (25))
>>>>>>>>           Face Sets: 1 strata with value/size (1 (21))
>>>>>>>>           Defined by transform from:
>>>>>>>>           DM_0x84000002_5 in 2 dimensions:
>>>>>>>>             Min/Max of 0-cells per rank:           15/24
>>>>>>>>             Min/Max of 1-cells per rank:           30/36
>>>>>>>>             Min/Max of 2-cells per rank:           16/16
>>>>>>>>           Labels:
>>>>>>>>             celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>>>>>>>             depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>>>>>>>             marker: 1 strata with value/size (1 (13))
>>>>>>>>             Face Sets: 1 strata with value/size (1 (9))
>>>>>>>>             Defined by transform from:
>>>>>>>>             DM_0x84000002_6 in 2 dimensions:
>>>>>>>>               Min/Max of 0-cells per rank:             6/12
>>>>>>>>               Min/Max of 1-cells per rank:             9/12
>>>>>>>>               Min/Max of 2-cells per rank:             4/4
>>>>>>>>             Labels:
>>>>>>>>               depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>>>>>>>               celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>>>>>>>               marker: 1 strata with value/size (1 (7))
>>>>>>>>               Face Sets: 1 strata with value/size (1 (3))
>>>>>>>> 0 TS dt 0.001 time 0.
>>>>>>>> MHD    0) time =         0, Eergy=  2.3259668003585e+00 (plot ID 0)
>>>>>>>>     0 SNES Function norm 5.415286407365e-03
>>>>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to 
>>>>>>>> finish.
>>>>>>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 
>>>>>>>> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>>>>>> 
>>>>>>>>  
>>>>>>>>> 
>>>>>>>>>   Thanks,
>>>>>>>>> 
>>>>>>>>>     Matt
>>>>>>>>>  
>>>>>>>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov 
>>>>>>>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I am running on Crusher, CPU only, 64 cores per node with 
>>>>>>>>>>> Plex/PetscFE. 
>>>>>>>>>>> In going up to 64 nodes, something really catastrophic is 
>>>>>>>>>>> happening. 
>>>>>>>>>>> I understand I am not using the machine the way it was intended, 
>>>>>>>>>>> but I just want to see if there are any options that I could try 
>>>>>>>>>>> for a quick fix/help.
>>>>>>>>>>> 
>>>>>>>>>>> In a debug build I get a stack trace on many but not all of the 4K 
>>>>>>>>>>> processes. 
>>>>>>>>>>> Alas, I am not sure why this job was terminated, but every process 
>>>>>>>>>>> that I checked that had an "ERROR" had this stack:
>>>>>>>>>>> 
>>>>>>>>>>> 11:57 main *+= 
>>>>>>>>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep 
>>>>>>>>>>> ERROR slurm-245063.out |g 3160
>>>>>>>>>>> [3160]PETSC ERROR: 
>>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process 
>>>>>>>>>>> (or the batch system) has told this process to end
>>>>>>>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or 
>>>>>>>>>>> -on_error_attach_debugger
>>>>>>>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind 
>>>>>>>>>>> and https://petsc.org/release/faq/
>>>>>>>>>>> [3160]PETSC ERROR: ---------------------  Stack Frames 
>>>>>>>>>>> ------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not 
>>>>>>>>>>> always exact.
>>>>>>>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>>>>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>>>>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>>>>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>>>>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>>>>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>>>>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>>>>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>>>>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>>>>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>>>>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at 
>>>>>>>>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>>>>>>> 
>>>>>>>>>>> Maybe the MPI is just getting overwhelmed. 
>>>>>>>>>>> 
>>>>>>>>>>> And I was able to get one run to work (one TS with beuler), but the 
>>>>>>>>>>> solver performance was horrendous and I see this (attached):
>>>>>>>>>>> 
>>>>>>>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>>>>>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>>>>>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>>>>>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>>>>>>>> etc.
>>>>>>>>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas would be welcome,
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mark
>>>>>>>>>>> <cushersolve.txt>
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>>>> experiments is infinitely more interesting than any results to which 
>>>>>>>>> their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>> 
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/ 
>>>>>>>>> <http://www.cse.buffalo.edu/~knepley/>
>>>>>>> 
>>>>> 
>>> 
>> <crushererror.txt.gz>
> 
