Mark,

   You could write a simple pure MPI program to see if you can get a hang due to a messed-up MPI implementation:

   for (i = 0; i < something big; i++) { MPI_Allreduce(); MPI_Bcast() from rank zero; }

   If this hangs you have something to give to NERSC.
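A minimal compilable version of that test might look like the sketch below (not from the original message; the iteration count, buffer contents, and occasional progress print are arbitrary choices):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int    rank, i, msg = 0;
      double val = 1.0, sum;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 1000000; i++) {                       /* "something big" */
        MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Bcast(&msg, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* broadcast from rank zero */
        if (rank == 0 && i % 100000 == 0) printf("iteration %d\n", i);
      }
      if (rank == 0) printf("completed without hanging\n");
      MPI_Finalize();
      return 0;
    }

Running it at the same node count and ranks per node as the failing job makes the comparison most meaningful.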
> On Jan 9, 2023, at 11:21 PM, Barry Smith <bsm...@petsc.dev> wrote:
> 
>    Ok, this is a completely different location of the problem.
> 
>    The code is hanging in two different locations (for different ranks):
> 
>    MPI_Allreduce()
> [2180]PETSC ERROR: #1 MPI function
> [2180]PETSC ERROR: #2 DMPlexView_Ascii() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1439
> 
>    MPI_Bcast()
> [72]PETSC ERROR: #1 MPI function
> [72]PETSC ERROR: #2 PetscObjectName() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pname.c:123
> [72]PETSC ERROR: #3 PetscObjectGetName() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/pgname.c:27
> [72]PETSC ERROR: #4 DMPlexView_Ascii() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plex.c:1434
> 
>    It appears that the object has a name on some ranks (most of them; normally this is set with PetscObjectSetName()) but no name on other ranks.
> 
> -----
>    There are some "bugs" in the current code (they should be fixed but are not the root of this problem):
> 
>    1)
> 
>    PetscCall(PetscObjectGetName((PetscObject)dm, &name));
>    if (name) PetscCall(PetscViewerASCIIPrintf(viewer, "%s in %" PetscInt_FMT " dimension%s:\n", name, dim, dim == 1 ? "" : "s"));
>    else PetscCall(PetscViewerASCIIPrintf(viewer, "Mesh in %" PetscInt_FMT " dimension%s:\n", dim, dim == 1 ? "" : "s"));
> 
>    PetscObjectGetName() ALWAYS returns a non-null name, so the if (name) check is nonsense and should be removed.
> 
>    2)
> 
>    /*@C
>      PetscObjectSetName - Sets a string name associated with a PETSc object.
> 
>      Not Collective
> 
>    This is dangerous phrasing; while it is true that it does not need to be called on all ranks, if it is called on a subset of ranks then PetscObjectGetName() will trigger a PetscObjectName(), which triggers an MPI_Bcast() on a subset of ranks, which will cause a hang.
> 
>    It should say "Logically Collective", not "Not Collective".
> 
> ----
>    The problem arises when viewing the coarsest of the six DMs; all the finer DMs are viewed correctly. The coarser DMs are all named automatically with PetscObjectName(), so they have names like DM_0x84000002_1.
> 
>    I cannot speculate on how it got into this state (since my last conclusion was so wrong :-)).
> 
>    Mark,
> 
>    What happens if you run the same problem (same number of ranks etc.) but do not use -dm_refine_hierarchy 6 (so the coarse DM is the only DM)?
> 
>    Barry
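The failure mode described in item 2 of the message above can be demonstrated with a few lines; the following is only a sketch (not from the thread; the Vec and its name are invented), and it is expected to deadlock when run on more than one rank:

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
      Vec         x;
      PetscMPIInt rank;
      const char *name;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
      PetscCall(VecCreateMPI(PETSC_COMM_WORLD, 1, PETSC_DETERMINE, &x));

      /* "Not Collective", so nothing stops us from naming the object on one rank only */
      if (rank == 0) PetscCall(PetscObjectSetName((PetscObject)x, "only_named_on_rank_0"));

      /* Every rank asks for the name, as DMPlexView_Ascii() does.  Only the unnamed
         ranks fall into PetscObjectName(), whose MPI_Bcast() then involves just a
         subset of the communicator, so the program hangs here. */
      PetscCall(PetscObjectGetName((PetscObject)x, &name));

      PetscCall(VecDestroy(&x));
      PetscCall(PetscFinalize());
      return 0;
    }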
>> On Jan 9, 2023, at 7:56 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> OK, here is a timeout one with 4K processors.
>> 
>> On Mon, Jan 9, 2023 at 3:13 PM Barry Smith <bsm...@petsc.dev> wrote:
>>> 
>>>> On Jan 9, 2023, at 12:13 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>> 
>>>> Sorry, I deleted it. I submitted a 128-node job with a debug version to try to reproduce it, but it finished without this error, and with the same bad performance as my original log. I think this message comes from a (time-out) signal.
>>> 
>>>    Sure, but we would need to know if there are multiple different places in the code where the time-out is happening.
>>> 
>>>> What I am seeing, though, is just really bad performance on large jobs. It is sudden: good scaling up to about 32 nodes and then a 100x slowdown in the solver (the vec norms) on 128 nodes (and 64).
>>>> 
>>>> Thanks,
>>>> Mark
>>>> 
>>>> On Sun, Jan 8, 2023 at 8:07 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>> 
>>>>>    Rather cryptic use of that integer variable for anyone who did not write the code, but I guess one was short on bytes when it was written :-).
>>>>> 
>>>>>    Ok, based on what Jed said, if we are certain that either all or none of the nroots are -1, then since you are getting hangs in PetscCommDuplicate() this might indicate that some ranks have called a creation routine on some OTHER object that not all ranks are calling. Of course, it could be an MPI bug.
>>>>> 
>>>>>    Mark, can you send the file containing all the output from the Signal Terminate run? Maybe there will be a hint in there of a different constructor being called on some ranks.
>>>>> 
>>>>>    Barry
>>>>> 
>>>>>>>>>> DMLabelGather
>>>>> 
>>>>>> On Jan 8, 2023, at 4:32 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>> 
>>>>>> On Sun, Jan 8, 2023 at 4:13 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>> 
>>>>>>>    There is a bug in the routine DMPlexLabelComplete_Internal()! The code should definitely not choose a code route based on if (nroots >= 0): checking the nroots value to decide on the code route is simply nonsense (if one "knows" "by contract" that nroots is >= 0, then the if () test is not needed).
>>>>>>> 
>>>>>>>    The first thing to do is to fix the bug with a PetscCheck(): remove the nonsensical if (nroots >= 0) check and rerun your code to see what happens.
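The style of fix being suggested, failing loudly when the SF graph is not set up instead of silently taking a different code route, can be illustrated with a small stand-alone program; this sketch is not the actual patch, and the error message wording is invented:

    #include <petscsf.h>

    int main(int argc, char **argv)
    {
      PetscSF  sf;
      PetscInt nroots;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
      /* No PetscSFSetGraph() call, so the graph is not set up and nroots comes back negative */
      PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
      /* Instead of "if (nroots >= 0) { ...collective work... }", error out immediately */
      PetscCheck(nroots >= 0, PETSC_COMM_SELF, PETSC_ERR_ARG_WRONGSTATE, "Point SF graph is not set up on this rank (nroots = %" PetscInt_FMT ")", nroots);
      PetscCall(PetscSFDestroy(&sf));
      PetscCall(PetscFinalize());
      return 0;
    }

With a check like this, a rank reaching the routine with an unset graph aborts with a clear message rather than silently skipping the collective part.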
>>>>>> This does not fix the bug right? It just fails cleanly, right?
>>>>>> 
>>>>>> I do have lots of empty processors in the first GAMG coarse grid. I just saw that the first GAMG coarse grid reduces the processor count to 4, from 4K. This is one case where repartitioning the coarse grids could, for once, be useful.
>>>>>> 
>>>>>> Do you have a bug fix suggestion for me to try?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>>> 
>>>>>>>    Barry
>>>>>>> 
>>>>>>>    Yes, it is possible that in your run nroots is always >= 0 and some MPI bug is causing the problem, but this doesn't change the fact that the current code is buggy and needs to be fixed before blaming some other bug for the problem.
>>>>>>> 
>>>>>>>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>> 
>>>>>>>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com> wrote:
>>>>>>>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>> 
>>>>>>>>>>    Mark,
>>>>>>>>>> 
>>>>>>>>>>    Looks like the error checking in PetscCommDuplicate() is doing its job. It is reporting an attempt to use a PETSc object constructor on a subset of ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model).
>>>>>>>>>> 
>>>>>>>>>>    Note that nroots can be negative on a particular rank, but DMPlexLabelComplete_Internal() is collective on sf based on the comment in the code below:
>>>>>>>>>> 
>>>>>>>>>>    struct _p_PetscSF {
>>>>>>>>>>      ....
>>>>>>>>>>      PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */
>>>>>>>>>> 
>>>>>>>>>>    But the next routine calls a collective only when nroots >= 0:
>>>>>>>>>> 
>>>>>>>>>>    static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells){
>>>>>>>>>>      ...
>>>>>>>>>>      PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>>>>>>>      if (nroots >= 0) {
>>>>>>>>>>        DMLabel lblRoots, lblLeaves;
>>>>>>>>>>        IS valueIS, pointIS;
>>>>>>>>>>        const PetscInt *values;
>>>>>>>>>>        PetscInt numValues, v;
>>>>>>>>>> 
>>>>>>>>>>        /* Pull point contributions from remote leaves into local roots */
>>>>>>>>>>        PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>>>>>>> 
>>>>>>>>>>    The code is four years old? How come this problem of calling the constructor on a subset of ranks hasn't come up since day 1?
>>>>>>>>> 
>>>>>>>>> The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not set up) on a subset of processes. Do we know that this is happening?
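One direct way to answer that question is a small collective check; the sketch below is purely illustrative (not from the thread): it builds a generic box mesh just to have a point SF to query, and in practice the two reductions would go into the application right before the call that hangs.

    #include <petscdmplex.h>

    int main(int argc, char **argv)
    {
      DM          dm;
      PetscSF     sfPoint;
      PetscInt    nroots;
      PetscMPIInt have, haveMin, haveMax;

      PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
      PetscCall(DMCreate(PETSC_COMM_WORLD, &dm));
      PetscCall(DMSetType(dm, DMPLEX));
      PetscCall(DMSetFromOptions(dm));        /* default box mesh; -dm_plex_* options apply */

      PetscCall(DMGetPointSF(dm, &sfPoint));
      PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));

      /* 1 if the point SF graph is set up on this rank, 0 if nroots is negative */
      have = (PetscMPIInt)(nroots >= 0);
      PetscCallMPI(MPI_Allreduce(&have, &haveMin, 1, MPI_INT, MPI_MIN, PETSC_COMM_WORLD));
      PetscCallMPI(MPI_Allreduce(&have, &haveMax, 1, MPI_INT, MPI_MAX, PETSC_COMM_WORLD));
      PetscCheck(haveMin == haveMax, PETSC_COMM_WORLD, PETSC_ERR_PLIB, "nroots is >= 0 on some ranks and < 0 on others");
      PetscCall(PetscPrintf(PETSC_COMM_WORLD, "nroots sign is consistent across ranks\n"));

      PetscCall(DMDestroy(&dm));
      PetscCall(PetscFinalize());
      return 0;
    }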
>>>>>>>> Can't imagine a code bug here. Very simple code.
>>>>>>>> 
>>>>>>>> This code does use GAMG as the coarse grid solver in a pretty extreme way. GAMG is fairly complicated and not used on such small problems with this much parallelism. It is conceivable that it's a GAMG bug, but that is not what was going on in my initial email here.
>>>>>>>> 
>>>>>>>> Here is a run that timed out, but it should not have, so I think this is the same issue. I always have perfectly distributed grids like this.
>>>>>>>> 
>>>>>>>> DM Object: box 2048 MPI processes
>>>>>>>>   type: plex
>>>>>>>> box in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 8385/8580
>>>>>>>>   Min/Max of 1-cells per rank: 24768/24960
>>>>>>>>   Min/Max of 2-cells per rank: 16384/16384
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>>>>>>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>>>>>>>   marker: 1 strata with value/size (1 (385))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (381))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_1 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 2145/2244
>>>>>>>>   Min/Max of 1-cells per rank: 6240/6336
>>>>>>>>   Min/Max of 2-cells per rank: 4096/4096
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>>>>>>>   depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>>>>>>>   marker: 1 strata with value/size (1 (193))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (189))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_2 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 561/612
>>>>>>>>   Min/Max of 1-cells per rank: 1584/1632
>>>>>>>>   Min/Max of 2-cells per rank: 1024/1024
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>>>>>>>   depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>>>>>>>   marker: 1 strata with value/size (1 (97))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (93))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_3 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 153/180
>>>>>>>>   Min/Max of 1-cells per rank: 408/432
>>>>>>>>   Min/Max of 2-cells per rank: 256/256
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>>>>>>>   depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>>>>>>>   marker: 1 strata with value/size (1 (49))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (45))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_4 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 45/60
>>>>>>>>   Min/Max of 1-cells per rank: 108/120
>>>>>>>>   Min/Max of 2-cells per rank: 64/64
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>>>>>>>   depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>>>>>>>   marker: 1 strata with value/size (1 (25))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (21))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_5 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 15/24
>>>>>>>>   Min/Max of 1-cells per rank: 30/36
>>>>>>>>   Min/Max of 2-cells per rank: 16/16
>>>>>>>> Labels:
>>>>>>>>   celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>>>>>>>   depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>>>>>>>   marker: 1 strata with value/size (1 (13))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (9))
>>>>>>>> Defined by transform from:
>>>>>>>> DM_0x84000002_6 in 2 dimensions:
>>>>>>>>   Min/Max of 0-cells per rank: 6/12
>>>>>>>>   Min/Max of 1-cells per rank: 9/12
>>>>>>>>   Min/Max of 2-cells per rank: 4/4
>>>>>>>> Labels:
>>>>>>>>   depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>>>>>>>   celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>>>>>>>   marker: 1 strata with value/size (1 (7))
>>>>>>>>   Face Sets: 1 strata with value/size (1 (3))
>>>>>>>> 0 TS dt 0.001 time 0.
>>>>>>>>     MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
>>>>>>>>     0 SNES Function norm 5.415286407365e-03
>>>>>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>>>>>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>>>>>>>> 
>>>>>>>>>   Thanks,
>>>>>>>>> 
>>>>>>>>>      Matt
>>>>>>>>> 
>>>>>>>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>>>>>>>>>>> In going up to 64 nodes, something really catastrophic is happening.
>>>>>>>>>>> I understand I am not using the machine the way it was intended, but I just want to see if there are any options that I could try for a quick fix/help.
>>>>>>>>>>> 
>>>>>>>>>>> In a debug build I get a stack trace on many, but not all, of the 4K processes.
>>>>>>>>>>> Alas, I am not sure why this job was terminated, but every process that I checked that had an "ERROR" had this stack:
>>>>>>>>>>> 
>>>>>>>>>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>>>>>>>>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>>>>>>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>>>>>>>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>>>>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>>>>>>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>>>>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>>>>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>>>>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>>>>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>>>>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>>>>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>>>>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>>>>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>>>>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>>>>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>>>>>>> 
>>>>>>>>>>> Maybe the MPI is just getting overwhelmed.
>>>>>>>>>>> 
>>>>>>>>>>> And I was able to get one run to work (one TS with beuler), and the solver performance was horrendous; I see this (attached):
>>>>>>>>>>> 
>>>>>>>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>>>>>>>> VecMDot          111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>>>>>>>> VecNorm          163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>>>>>>>> VecNormalize     154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>>>>>>>> etc.
>>>>>>>>>>> KSPSolve              3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>>>>>>> 
>>>>>>>>>>> Any ideas would be welcome,
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Mark
>>>>>>>>>>> <cushersolve.txt>
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>>>>>>> -- Norbert Wiener
>>>>>>>>> 
>>>>>>>>> https://www.cse.buffalo.edu/~knepley/
>>>>>>> 
>>>>> 
>>> 
>> <crushererror.txt.gz>
>