Hi all,
I am running an Open MPI (1.3.4) program with 200 parallel processes.
However, the program is terminated with:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
After searching, I found this description of signal 9 (SIGKILL):
"the process is currently in an unworkable state and should be terminated
with extreme prejudice. If a process does not respond to any other
termination signals, sending it a SIGKILL signal will almost always cause it
to go away. The system will generate SIGKILL for a process itself under some
unusual conditions where the program cannot possibly continue to run (even
to run a signal handler)."
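The "even to run a signal handler" part can be checked directly: POSIX does not allow a handler to be installed for SIGKILL, so the program itself cannot catch the signal and print a diagnostic. Below is a minimal sketch in C; the names try_install_handler and on_signal are made up for illustration and are not part of Open MPI.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Hypothetical do-nothing handler, used only to attempt the installation. */
static void on_signal(int signo) { (void)signo; }

/* Returns 0 if a handler could be installed for signo,
 * otherwise the errno value reported by sigaction(). */
int try_install_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) == -1)
        return errno;
    return 0;
}
```

Calling try_install_handler(SIGTERM) should succeed, while try_install_handler(SIGKILL) is expected to fail with EINVAL, which is why the only trace of the kill is the message from mpirun.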
However, the error message does not indicate any possible reason for the
termination.
There is a for loop in main(). If the number of iterations is small (< 200),
the program works well, but as the number grows larger, the program gets
SIGKILL.
The cluster where I am running the MPI program does not allow running debug
tools.
If I run it on a workstation, it takes a very long time (for > 200
iterations) before the error occurs again.
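Since debug tools are not available on the cluster, one low-tech option is to have each process log its own resource usage once per iteration. The sketch below assumes the kill is related to memory growing with the iteration count, which nothing above establishes; it only collects evidence for or against that guess. The helper names peak_rss_kb and log_iteration are hypothetical, and ru_maxrss is reported in kilobytes on Linux (other systems may use different units).

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process so far.
 * Kilobytes on Linux (assumption: the cluster runs Linux).
 * Returns -1 if getrusage() fails. */
long peak_rss_kb(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Write one line per iteration to stderr and flush immediately,
 * so the last line survives even if the process is killed right after. */
void log_iteration(int rank, int iter)
{
    fprintf(stderr, "rank %d iter %d peak_rss_kb %ld\n",
            rank, iter, peak_rss_kb());
    fflush(stderr);
}
```

Calling log_iteration(rank, i) at the top of the for loop in main(), with rank obtained from MPI_Comm_rank, would show whether the peak value keeps climbing as the iteration count approaches the point where rank 0 is killed.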
What can I do to find the possible bugs?
Any help is really appreciated.
Thanks,
Jack




                                          
