Hi Mohammadali

"Signal number 11 SEGV" is the Unix/Linux signal for a memory
violation (a.k.a. segmentation violation or segmentation fault).
This normally happens when the program tries to read
or write in a memory area that it did not allocate, that it
already freed, or that belongs to another process.
That is most likely a programming error in the FEM code,
probably not an MPI error, and probably not a PETSc error either.
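
As an illustration only (this is not part of the FEM code), here is a
small Python sketch of how the signal number maps to its name, and how
a process killed by that signal looks to its parent. The helper names
describe_signal and run_and_report, and the child command that sends
the signal to itself, are made up for this example.

```python
import signal
import subprocess
import sys


def describe_signal(signum):
    """Return the symbolic name of a signal number (11 is SIGSEGV on Linux)."""
    return signal.Signals(signum).name


def run_and_report(cmd):
    """Run cmd and return the name of the signal that killed it, or None.

    On POSIX systems, subprocess reports a child killed by signal N
    with the return code -N.
    """
    proc = subprocess.run(cmd, stderr=subprocess.DEVNULL)
    if proc.returncode < 0:
        return describe_signal(-proc.returncode)
    return None


# Hypothetical stand-in for a crashing program: a child Python process
# that sends SIGSEGV to itself instead of touching invalid memory.
CRASHING_CHILD = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]

if __name__ == "__main__":
    print(describe_signal(signal.SIGSEGV))
    print(run_and_report(CRASHING_CHILD))
```

A real segmentation fault is raised by the operating system when the
program touches invalid memory; the child here only imitates that by
signalling itself.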

The "errorcode 59" seems to be the PETSc error code
issued when it receives a signal (in this case a
segmentation fault signal, I guess) from the operating
system (Linux, probably).
Apparently it simply prints the error message and
calls MPI_Abort, and the program stops.
This is what the petscerror.h include file has about error code 59:

#define PETSC_ERR_SIG              59   /* signal received */
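
If it helps when going through many job output files, the abort message
can be picked apart with a few lines of Python. This is only a sketch:
the function name parse_abort and the lookup table are made up here, the
message wording is the one quoted in your email, and the table holds
only the single code shown above.

```python
import re

# Only the code quoted from petscerror.h above; other codes are left out.
KNOWN_PETSC_CODES = {59: "PETSC_ERR_SIG (signal received)"}

# Wording taken from the message quoted in the original email.
ABORT_RE = re.compile(
    r"MPI_ABORT was invoked on rank (\d+) in communicator (\S+) "
    r"with errorcode (\d+)"
)


def parse_abort(text):
    """Extract rank, communicator and error code from an MPI_ABORT message.

    Returns a dict, or None if the message is not found in text.
    """
    # The message may be wrapped over several lines, so collapse whitespace.
    flat = " ".join(text.split())
    match = ABORT_RE.search(flat)
    if match is None:
        return None
    code = int(match.group(3))
    return {
        "rank": int(match.group(1)),
        "communicator": match.group(2),
        "errorcode": code,
        "meaning": KNOWN_PETSC_CODES.get(code, "unknown"),
    }
```

For the message in your email this gives rank 0, communicator "compute",
and error code 59, i.e. PETSC_ERR_SIG.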

**

One suggestion is to compile the code with debugging flags (-g),
and attach a debugger to it. This is not an easy task if your program
has many processes/ranks and your debugger is the default Linux gdb,
but it is not impossible either.
Depending on the computer you have, you may have a parallel debugger,
such as TotalView or DDT, which is more user friendly.

You could also compile it with the flag -traceback (Intel compilers)
or -fbacktrace (gfortran). The syntax depends on the compiler, so
check the compiler man page. This will at least tell you the location
in the program where the segmentation fault happened (in the STDERR
file of your job).

I hope this helps.
Gus Correa

PS - The zip attachment with your "myjob.sh" script
was removed from the email.
Many email server programs remove zip attachments for safety.
Files with the ".sh" suffix are generally removed as well.
You could compress it with gzip or bzip2 instead.

On 11/15/2016 02:40 PM, Beheshti, Mohammadali wrote:
Hi,

I am running simulations in a software package which uses Open MPI
(ompi) to solve an FEM problem.  From time to time I receive the error
"MPI_ABORT was invoked on rank 0 in communicator compute with errorcode
59" in the output file, while for the larger simulations (with a larger
FEM mesh) I almost always get this error. I have no idea what the cause
of this error is. The error file contains a PETSc error: "caught
signal number 11 SEGV". I am running my jobs on an HPC system which has
Open MPI version 2.0.0.  I am also using a bash file (myjob.sh), which
is attached. The outputs of the ompi_info --all command and the
ifconfig command are also attached. I appreciate any help in this
regard.

Thanks

Ali

**************************
Mohammadali Beheshti
Post-Doctoral Fellow
Department of Medicine (Cardiology)
Toronto General Research Institute
University Health Network
Tel: 416-340-4800 ext. 6837
**************************

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users