Hi all,
I am running an Open MPI (1.3.4) program with 200 parallel processes.
However, the program is terminated with:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
After searching, I found that signal 9 (SIGKILL) means:

    The process is currently in an unworkable state and should be
    terminated with extreme prejudice.

    If a process does not respond to any other termination signals,
    sending it a SIGKILL signal will almost always cause it to go away.

    The system will generate SIGKILL for a process itself under some
    unusual conditions where the program cannot possibly continue to run
    (even to run a signal handler).
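
To illustrate the point about signal handlers in the description above, here is a minimal sketch (not part of the original program; `can_catch` and `noop_handler` are hypothetical names). It checks whether a handler can be installed for a given signal. On POSIX systems the attempt fails for SIGKILL, which is why the killed process cannot report anything itself.

```c
#include <signal.h>

/* Handler that does nothing; it is only used to probe installability. */
static void noop_handler(int signo)
{
    (void)signo;
}

/* Returns 1 if a handler can be installed for signo, 0 otherwise.
 * The previous disposition is restored before returning. */
int can_catch(int signo)
{
    void (*previous)(int) = signal(signo, noop_handler);
    if (previous == SIG_ERR)
        return 0;              /* e.g. SIGKILL: cannot be caught or ignored */
    signal(signo, previous);   /* put the old disposition back */
    return 1;
}
```

Calling `can_catch(SIGTERM)` should succeed while `can_catch(SIGKILL)` should not, so any diagnostics have to be written before the kill happens rather than from a handler.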
However, the error message does not indicate any possible reason for the
termination.
There is a FOR loop in main(). If the loop count is small (< 200), the
program works well, but as the count becomes larger and larger, the program
gets SIGKILL.
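
Because the failure depends on the loop count, one low-tech way to gather evidence without a debugger is to print resource usage once per iteration. The sketch below is only that: it assumes Linux (where `ru_maxrss` is reported in kilobytes) and assumes that memory growth is worth checking, which nothing in the error message confirms. The names `peak_rss_kb` and `log_loop_memory` are hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process.
 * Assumption: Linux, where ru_maxrss is in kilobytes.
 * Returns -1 if getrusage() fails. */
long peak_rss_kb(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

/* Hypothetical helper: write one line per loop iteration to stderr and
 * flush immediately, so the last line survives if the process is killed. */
void log_loop_memory(int rank, int iteration)
{
    fprintf(stderr, "rank %d iteration %d peak RSS %ld kB\n",
            rank, iteration, peak_rss_kb());
    fflush(stderr);
}
```

Inside the FOR loop in main(), this could be called as `log_loop_memory(rank, i)`, with `rank` obtained from `MPI_Comm_rank`. If the reported value climbs steadily until the kill, that would point toward memory exhaustion; if it stays flat, the cause lies elsewhere.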
The cluster where I am running the MPI program does not allow running debug
tools.
If I run it on a workstation, it takes a very long time (for > 200 loops)
before the error occurs again.
What can I do to find the possible bugs?
Any help is really appreciated.
Thanks,
Jack