Dear all,

First of all, a bit of context:
I am trying to debug an error in my application where NaNs start 
appearing at random. The probability of this increases with the number of MPI 
processes I use, so it looks like a data race of some sort. Any 
advice on the best way to find the error?
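
One thing that might help narrow this down is a check for non-finite values after each stage (assembly, solve, update), so that the first stage and rank producing a NaN is reported. Below is a minimal sketch, assuming the locally owned values can be copied into a std::vector<double>. The names first_non_finite and report_non_finite are hypothetical helpers, not part of deal.II or PETSc.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper (not a deal.II or PETSc function): returns the index of
// the first NaN or infinite entry, or values.size() if every entry is finite.
std::size_t first_non_finite(const std::vector<double> &values)
{
  for (std::size_t i = 0; i < values.size(); ++i)
    if (!std::isfinite(values[i]))
      return i;
  return values.size();
}

// Hypothetical helper: prints a message naming the stage and rank if a
// non-finite entry is found. Returns true if all entries are finite.
bool report_non_finite(const std::string         &stage,
                       const int                  rank,
                       const std::vector<double> &values,
                       std::ostream              &out)
{
  const std::size_t i = first_non_finite(values);
  if (i == values.size())
    return true;

  out << "rank " << rank << ": first non-finite value after '" << stage
      << "' at local index " << i << '\n';
  return false;
}
```

Calling report_non_finite on the right-hand side before the solve and on the solution after it would at least show whether the NaNs are already present in the solver input.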

My current approach is to use the MUST project [1] to help me find the issues. 
When I ran MUST with the debug version of my code on the local cluster, it 
returned errors related to the MPI internals of 
dealii/petsc(/MUMPS?). An example of the output is in errors.txt. The 
output stopping at "Solving... " suggests that the error occurs between 
the following lines of my code:

    PetscPrintf(mpi_communicator, "Solving... \n");
    computing_timer.enter_section("solve");

    SolverControl cn;
    PETScWrappers::SparseDirectMUMPS solver(cn, mpi_communicator);
    solver.set_symmetric_mode(false);
    solver.solve(system_matrix, distributed_dU, system_rhs);

    computing_timer.exit_section("solve");
    PetscPrintf(mpi_communicator, "Solved! \n");

Indeed, when I comment out the "solver.solve(system_matrix, 
distributed_dU, system_rhs);" line, the run completes with no errors at all.

Could this be the source of my issues? Also, how can I solve this specific 
issue?

[MUST] MUST configuration ... centralized checks without application crash handling
[MUST] Information: overwritting old intermediate data in directory "/homeb/inm1/lcampos/JuFold/build/must_temp"!
[MUST] Using prebuilt infrastructure at /usr/local/software/jureca/Stages/2017b/software/MUST/1.5.0-gpsmpi-2017b-Python-2.7.14/modules//mode3-layer2
[MUST] Search for linked P^nMPI ... not found ... using LD_PRELOAD to load P^nMPI ... success
[MUST] Executing application:
    Number of active cells:       512 (by partition: 21+21+21+22+22+21+21+21+21+21+22+22+21+21+22+21+21+22+22+22+21+21+21+21)
    Number of degrees of freedom: 2187 (by partition: 83+114+86+88+73+93+69+113+90+90+86+84+72+133+79+91+100+91+72+93+84+106+95+102)
==============================APPLYING EXTERNAL FORCE==============================
Saving snapshot

Assembling system
Finished assembling
Inc:  1 (time:0.0000e+00, dt:1.0000e-02, rel:000%, growth:000%), Iter: 0. Residual norm:   5.99e+01. Relative norm:   1.00e+00 
Solving... 
rank 4 (of 24), pid 29208 catched MPI error nr 284282377
rank 7 (of 24), pid 29215 catched MPI error nr 284282377
rank 3 (of 24), pid 29205 catched MPI error nr 888262153
rank 1 (of 24), pid 29202 catched MPI error nr 552717833
rank 19 (of 24), pid 29255 catched MPI error nr 821153289
rank 12 (of 24), pid 29231 catched MPI error nr 82955785
rank 22 (of 24), pid 29260 catched MPI error nr 888262153
rank 8 (of 24), pid 29219 catched MPI error nr 1022479881
rank 5 (of 24), pid 29214 catched MPI error nr 686935561
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffcf6298dac) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff49aa3b2c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffe0f21b53c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffc0535c4cc) failed
MPI_Op_free(75).: Null Op pointer
rank 11 (of 24), pid 29227 catched MPI error nr 1022479881
rank 14 (of 24), pid 29239 catched MPI error nr 150064649
rank 16 (of 24), pid 29245 catched MPI error nr 418500105
rank 13 (of 24), pid 29235 catched MPI error nr 82955785
rank 10 (of 24), pid 29226 catched MPI error nr 1022479881
rank 17 (of 24), pid 29248 catched MPI error nr 552717833
rank 6 (of 24), pid 29211 catched MPI error nr 351391241
rank 20 (of 24), pid 29258 catched MPI error nr 418500105
rank 9 (of 24), pid 29221 catched MPI error nr 888262153
rank 21 (of 24), pid 29259 catched MPI error nr 15846921
rank 18 (of 24), pid 29251 catched MPI error nr 955371017
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffec3a67f0c) failed
MPI_Op_free(75).: Null Op pointer
rank 2 (of 24), pid 29201 catched MPI error nr 619826697
rank 0 (of 24), pid 29196 catched MPI error nr 754044425
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffece191b0c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff8945c82c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffe3e9c0ecc) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffd3e28123c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffc6a35c14c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffd863a742c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffe19eaa64c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffc0c7890cc) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffedc04b5ac) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffc2ba7f24c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fffbac1921c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff7513d8bc) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffea3e1ba4c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff0f79540c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffe41ef721c) failed
MPI_Op_free(75).: Null Op pointer
rank 15 (of 24), pid 29242 catched MPI error nr 15846921
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffd8b3a56fc) failed
MPI_Op_free(75).: Null Op pointer
rank 23 (of 24), pid 29261 catched MPI error nr 351391241
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff5ca4c91c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffd1e63440c) failed
MPI_Op_free(75).: Null Op pointer
Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7fff87951d0c) failed
MPI_Op_free(75).: Null Op pointer
Waiting up to 30 seconds for analyses to be finished.
[previous line repeated 23 more times]
[MUST-ERROR] Execution finished, but no output found!
