> On Jan 8, 2023, at 2:44 PM, Matthew Knepley <knep...@gmail.com> wrote:
>
> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>    Mark,
>>
>>    Looks like the error checking in PetscCommDuplicate() is doing its job. It is reporting an attempt to use a PETSc object constructor on a subset of the ranks of an MPI_Comm (which is, of course, fundamentally impossible in the PETSc/MPI model).
>>
>>    Note that nroots can be negative on a particular rank, but DMPlexLabelComplete_Internal() is collective on the sf, based on the comment in the code below:
>>
>> struct _p_PetscSF {
>>   ....
>>   PetscInt nroots; /* Number of root vertices on current process (candidates for incoming edges) */
>>
>>    But the next routine calls a collective only when nroots >= 0:
>>
>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>> {
>>   ...
>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>   if (nroots >= 0) {
>>     DMLabel lblRoots, lblLeaves;
>>     IS valueIS, pointIS;
>>     const PetscInt *values;
>>     PetscInt numValues, v;
>>
>>     /* Pull point contributions from remote leaves into local roots */
>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>
>>    The code is four years old; how come this problem of calling the constructor on a subset of ranks hasn't come up since day 1?
>
> The contract here is that it should be impossible to have nroots < 0 (meaning the SF is not set up) on a subset of processes. Do we know that this is happening?
   Then there should definitely be a PetscCheck() that nroots >= 0 here in the code, and then there is absolutely no need for the if (nroots >= 0) special case in the code. (Hedged sketches of the failure mode and of such a check are appended after the thread below.)

>
>   Thanks,
>
>      Matt
>
>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>>> In going up to 64 nodes, something really catastrophic is happening.
>>> I understand I am not using the machine the way it was intended, but I just want to see if there are any options that I could try for a quick fix/help.
>>>
>>> In a debug build I get a stack trace on many, but not all, of the 4K processes.
>>> Alas, I am not sure why this job was terminated, but every process that I checked that had an "ERROR" had this stack:
>>>
>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>> [3160]PETSC ERROR: #1 MPI function
>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>
>>> Maybe the MPI is just getting overwhelmed.
>>>
>>> And I was able to get one run to work (one TS with beuler), and the solver performance was horrendous, and I see this (attached):
>>>
>>> Time (sec):          1.601e+02    1.001  1.600e+02
>>> VecMDot       111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>> VecNorm       163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>> VecNormalize  154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>> etc.
>>> KSPSolve           3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>
>>> Any ideas would be welcome,
>>> Thanks,
>>> Mark
>>> <cushersolve.txt>
>>
>
>
> --
> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>    -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
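Appended sketches referenced above. First, a minimal illustration, not code from the thread, of the failure mode the PetscCommDuplicate() error checking is reporting: a collective PETSc constructor reached on only a subset of the ranks of a communicator, which is what the if (nroots >= 0) guard permits when nroots is negative on only some ranks. The function name SubsetConstructorSketch is hypothetical; PetscSFCreate()/PetscSFDestroy() are the real PETSc calls.

#include <petscsf.h>

/* Sketch: a collective constructor entered on only some ranks of comm.
   The ranks where nroots < 0 skip the call, so the collective can never
   complete consistently; in the run above the processes were stuck inside
   PetscCommDuplicate() when the batch system sent SIGTERM (signal 15). */
static PetscErrorCode SubsetConstructorSketch(MPI_Comm comm, PetscInt nroots)
{
  PetscFunctionBegin;
  if (nroots >= 0) { /* true on some ranks, false on others */
    PetscSF sf;

    PetscCall(PetscSFCreate(comm, &sf)); /* collective on comm: erroneous when only a subset enters */
    PetscCall(PetscSFDestroy(&sf));
  }
  PetscFunctionReturn(0);
}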
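Second, a rough sketch of the PetscCheck() Barry proposes in place of the if (nroots >= 0) guard in DMPlexLabelComplete_Internal(). This is not the change that was merged into PETSc; the error class, the message text, and the surrounding fragment (here named DMPlexLabelCompleteSketch) are assumptions, and only the part that would change is shown.

#include <petscsf.h>
#include <petscdmplex.h>

/* Sketch of the proposal: fail loudly when the point SF is not set up,
   instead of silently skipping the collective DMLabelGather() on those ranks.
   Fragment only; the rest of the label completion would be unchanged. */
static PetscErrorCode DMPlexLabelCompleteSketch(DM dm, DMLabel label)
{
  PetscSF  sfPoint;
  DMLabel  lblLeaves;
  PetscInt nroots;

  PetscFunctionBegin;
  PetscCall(DMGetPointSF(dm, &sfPoint));
  PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
  /* Proposed check (message and error class are assumptions) */
  PetscCheck(nroots >= 0, PetscObjectComm((PetscObject)dm), PETSC_ERR_ARG_WRONGSTATE,
             "Point SF has not been set up (nroots = %" PetscInt_FMT ")", nroots);
  /* Pull point contributions from remote leaves into local roots (collective) */
  PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
  /* ... remainder of the completion as in plexsubmesh.c ... */
  PetscCall(DMLabelDestroy(&lblLeaves));
  PetscFunctionReturn(0);
}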