Try adding some print statements so you can see where the error occurs.
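
For example, something along these lines might do. This is only a rough sketch: the names trace() and format_trace() are made up for illustration, and I am assuming your program is in C and that you already have each process's rank from MPI_Comm_rank(). It writes to stderr and flushes right away, so the last line printed should still show up even if the process is killed immediately afterwards.

```c
#include <stdio.h>

/* Hypothetical helper: build one progress line such as
   "[rank 3] iter 250: after solve" in buf.
   Returns whatever snprintf returns (the length of the full line). */
int format_trace(char *buf, size_t size, int rank, long iter, const char *where)
{
    return snprintf(buf, size, "[rank %d] iter %ld: %s", rank, iter, where);
}

/* Hypothetical helper: print the progress line to stderr and flush it,
   so it is not lost in a buffer if the process dies right after. */
void trace(int rank, long iter, const char *where)
{
    char line[256];

    format_trace(line, sizeof line, rank, iter, where);
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
}
```

Call trace(rank, i, "top of loop") at the start of each iteration of the for loop in main(), and again after each major step inside it. The last line you see from rank 0 before the "signal 9" message should narrow down where it stops.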

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

> Hi all,
> 
> I am running an Open MPI (1.3.4) program with 200 parallel processes. 
> 
> But the program is terminated with:
> 
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on 
> signal 9 (Killed).
> --------------------------------------------------------------------------
> 
> After searching, I found that signal 9 means: 
> 
> the process is currently in an unworkable state and should be terminated with 
> extreme prejudice
> 
>  If a process does not respond to any other termination signals, sending it a 
> SIGKILL signal will almost always cause it to go away.
> 
>  The system will generate SIGKILL for a process itself under some unusual 
> conditions where the program cannot possibly continue to run (even to run a 
> signal handler).
>  
> But the error message does not indicate any possible reason for the 
> termination. 
> 
> There is a for loop in main(). If the number of iterations is small (< 
> 200), the program works well, 
> but as it gets larger and larger, the program gets SIGKILL. 
> 
> The cluster where I am running the MPI program does not allow running 
> debugging tools. 
> 
> If I run it on a workstation, it takes a very long time (for > 200 
> iterations) to 
> make the error occur again. 
> 
> What can I do to find the possible bugs? 
> 
> Any help is really appreciated. 
> 
> Thanks,
> 
> Jack