Dear all,

First of all, a bit of context: I am trying to debug an error in my application where NaNs start appearing at random. The probability of this increases with the number of MPI processes I use, so it looks like a data race of some sort. Any advice on the best way to find the error?
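To narrow down where the NaNs first appear, one option is to scan the locally owned entries of a vector on each rank before and after each major step. Below is a minimal sketch in plain C++, assuming the values have been copied into a std::vector<double>. The helper name first_non_finite is hypothetical and is not part of deal.II or PETSc.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not deal.II/PETSc API): returns the index of the
// first entry that is NaN or +/-Inf, or -1 if every entry is finite.
long first_non_finite(const std::vector<double> &values)
{
  for (std::size_t i = 0; i < values.size(); ++i)
    if (!std::isfinite(values[i]))
      return static_cast<long>(i);
  return -1;
}
```

Calling such a check on the right-hand side before the solve and on the solution afterwards should indicate whether the NaNs are already present when the solver starts or only show up in its output.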
My current approach is to use the MUST project [1] to help me find the issue. When I ran MUST with the debug version of my code on the local cluster, it returned errors related to the MPI internals of deal.II/PETSc(/MUMPS?). An example of the output can be seen in errors.txt. The output stops at "Solving... ", which suggests that the error occurs between the following lines of my code:

    PetscPrintf(mpi_communicator, "Solving... \n");
    computing_timer.enter_section("solve");

    SolverControl cn;
    PETScWrappers::SparseDirectMUMPS solver(cn, mpi_communicator);
    solver.set_symmetric_mode(false);
    solver.solve(system_matrix, distributed_dU, system_rhs);

    computing_timer.exit_section("solve");
    PetscPrintf(mpi_communicator, "Solved! \n");

Indeed, when I comment out the line "solver.solve(system_matrix, distributed_dU, system_rhs);", it runs with no errors at all. Could this be the source of my issues? Also, how can I solve this specific issue?
errors.txt:

[MUST] MUST configuration ... centralized checks without application crash handling
[MUST] Information: overwritting old intermediate data in directory "/homeb/inm1/lcampos/JuFold/build/must_temp"!
[MUST] Using prebuilt infrastructure at /usr/local/software/jureca/Stages/2017b/software/MUST/1.5.0-gpsmpi-2017b-Python-2.7.14/modules//mode3-layer2
[MUST] Search for linked P^nMPI ... not found ... using LD_PRELOAD to load P^nMPI ... success
[MUST] Executing application:
Number of active cells: 512 (by partition: 21+21+21+22+22+21+21+21+21+21+22+22+21+21+22+21+21+22+22+22+21+21+21+21)
Number of degrees of freedom: 2187 (by partition: 83+114+86+88+73+93+69+113+90+90+86+84+72+133+79+91+100+91+72+93+84+106+95+102)
==============================APPLYING EXTERNAL FORCE==============================
Saving snapshot
Assembling system
Finished assembling
Inc: 1 (time:0.0000e+00, dt:1.0000e-02, rel:000%, growth:000%), Iter: 0. Residual norm: 5.99e+01. Relative norm: 1.00e+00
Solving...
rank 4 (of 24), pid 29208 catched MPI error nr 284282377
rank 7 (of 24), pid 29215 catched MPI error nr 284282377
rank 3 (of 24), pid 29205 catched MPI error nr 888262153
rank 1 (of 24), pid 29202 catched MPI error nr 552717833
rank 19 (of 24), pid 29255 catched MPI error nr 821153289
rank 12 (of 24), pid 29231 catched MPI error nr 82955785
rank 22 (of 24), pid 29260 catched MPI error nr 888262153
rank 8 (of 24), pid 29219 catched MPI error nr 1022479881
rank 5 (of 24), pid 29214 catched MPI error nr 686935561
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffcf6298dac) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff49aa3b2c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffe0f21b53c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffc0535c4cc) failed MPI_Op_free(75).: Null Op pointer
rank 11 (of 24), pid 29227 catched MPI error nr 1022479881
rank 14 (of 24), pid 29239 catched MPI error nr 150064649
rank 16 (of 24), pid 29245 catched MPI error nr 418500105
rank 13 (of 24), pid 29235 catched MPI error nr 82955785
rank 10 (of 24), pid 29226 catched MPI error nr 1022479881
rank 17 (of 24), pid 29248 catched MPI error nr 552717833
rank 6 (of 24), pid 29211 catched MPI error nr 351391241
rank 20 (of 24), pid 29258 catched MPI error nr 418500105
rank 9 (of 24), pid 29221 catched MPI error nr 888262153
rank 21 (of 24), pid 29259 catched MPI error nr 15846921
rank 18 (of 24), pid 29251 catched MPI error nr 955371017
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffec3a67f0c) failed MPI_Op_free(75).: Null Op pointer
rank 2 (of 24), pid 29201 catched MPI error nr 619826697
rank 0 (of 24), pid 29196 catched MPI error nr 754044425
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffece191b0c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff8945c82c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffe3e9c0ecc) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffd3e28123c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffc6a35c14c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffd863a742c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffe19eaa64c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffc0c7890cc) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffedc04b5ac) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffc2ba7f24c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fffbac1921c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff7513d8bc) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffea3e1ba4c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff0f79540c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffe41ef721c) failed MPI_Op_free(75).: Null Op pointer
rank 15 (of 24), pid 29242 catched MPI error nr 15846921
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffd8b3a56fc) failed MPI_Op_free(75).: Null Op pointer
rank 23 (of 24), pid 29261 catched MPI error nr 351391241
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff5ca4c91c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7ffd1e63440c) failed MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack: MPI_Op_free(111): MPI_Op_free(op=0x7fff87951d0c) failed MPI_Op_free(75).: Null Op pointer
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
Waiting up to 30 seconds for analyses to be finished.
[MUST-ERROR] Execution finished, but no output found!