On Feb 3, 2009, at 10:15 PM, Hana Milani wrote:

sorry if I didn't answer:

Have you checked to ensure that the job manager is not killing your job?

I am not quite sure what you mean by job manager, but this is my personal computer. Much to my surprise, I also have openSUSE on my laptop; I followed the same procedure there and the same message appeared!

Ok.

Is there a local system administrator that you can talk to about this?

Not a very good one, but I asked someone who had seen this message in his own work and this was his answer:

It means that the program corresponding to the process identifier 2407 (the PID you can find in the second column of the "ps aux" output) running on one of your cluster's nodes (named linux-4pel) has stopped because it has received the signal SIGTERM (termination signal 15). Sorry if this is a long explanation of things you already know :-). Let's say that you have a program running on your system; you can figure out its process ID number nnnnn by doing a "ps aux". Now if you want to stop it, e.g. because it is out of control, a convenient way is to send a termination request to the process by issuing "kill -s SIGTERM nnnnn". Here, Open MPI notified you that one of the spawned processes has been terminated because it received the SIGTERM signal and, as a consequence, it has stopped all the other distributed processes running on the nodes: once process 2407 had received SIGTERM, Open MPI sent SIGTERM to all the processes associated with your parallel run.

This is exactly correct.
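
To make the mechanism concrete, here is a minimal sketch (not taken from your application) that starts a child process, sends it SIGTERM the same way "kill -s SIGTERM nnnnn" would, and then looks at how it ended. The sleeping child is only a stand-in for one of your MPI processes, and the negative-return-code convention assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys

# Start a child that would otherwise sleep for a minute. It is only an
# illustrative stand-in for a long-running MPI process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Equivalent to running "kill -s SIGTERM <pid>" from a shell.
child.send_signal(signal.SIGTERM)

# On POSIX systems, a negative return code -N means "terminated by signal N".
returncode = child.wait()
print(f"child PID {child.pid} ended with return code {returncode}")
```

A return code equal to minus the SIGTERM number tells you the child did not exit on its own; something asked it to terminate.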

Now ... how to avoid this? I am afraid there is no easy answer. The 2407 process has probably received a SIGTERM from another application; I mean it has not died by accident (a hanging or faulting process exits without invoking MPI_Finalize and produces a different error message). The difficulty is that you have to investigate which application issued the SIGTERM, that is, which application told your 2407 process to terminate.

Also exactly correct.
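
One way to find the sender, sketched here in Python purely as an illustration, is to have the receiving process ask the kernel who sent the signal. The sketch assumes Linux (signal.sigwaitinfo is not available on every platform) and, for the demo, the process sends SIGTERM to itself. In a C application the analogous approach would be a sigaction handler installed with SA_SIGINFO, which exposes the same si_pid field.

```python
import os
import signal

# Block SIGTERM in this thread so it stays pending instead of
# terminating the process.
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})

# For the demo, this process sends SIGTERM to itself. In a real
# investigation the signal would arrive from the unknown sender.
os.kill(os.getpid(), signal.SIGTERM)

# sigwaitinfo returns details of the pending signal, including the PID
# and UID of the process that sent it.
info = signal.sigwaitinfo({signal.SIGTERM})
print(f"got signal {info.si_signo} from PID {info.si_pid} (UID {info.si_uid})")
```

Once you have the sender's PID, looking it up in the "ps aux" output should tell you which program issued the SIGTERM, provided that program is still running.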

If you are working on a cluster where a resource manager (like SLURM, PBS or Torque) distributes the MPI processes to the nodes, I would check whether the manager is limiting the memory footprint or the CPU time of the jobs accepted by the linux-4pel computer.

This is what I was asking you; you're telling me that you have no resource manager, and therefore this probably isn't the cause. But *something* is killing your app with a SIGTERM.

It is tricky for me to figure out what could have asked your program to stop ... does it stop immediately or during a long run (CPU time?), with small jobs or large ones (memory?); is MPI running on a personal computer or a huge cluster (resource manager?); do you have sufficient privileges to look at /var/log/messages on linux-4pel?

1. The code stops running immediately. 2. The computers are my personal ones and no administrator has limited the 7.9 GiB of memory I have. 3. Run sequentially, the code takes 500-700 MiB of memory.

Is this a Fortran program, perchance?

Do you have access to the source code? I wonder if the program is internally raising an error and effectively aborting itself. Do you know that the application runs correctly? Do you have any test data sets that you can try that give known outputs?
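
If it helps to tell these cases apart, the following is a rough sketch of how you could launch a program outside of mpirun and check whether it exited by itself with an error status or was killed by a signal. The describe_exit helper and the tiny stand-in program that exits with status 3 are hypothetical; you would substitute your own executable and a known test input. The negative-return-code convention again assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable verdict."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # POSIX: a negative return code means killed by that signal number.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited on its own with error status {returncode}"


# Hypothetical stand-in for the real application: a program that detects
# an internal error and aborts itself with exit status 3.
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
verdict = describe_exit(result.returncode)
print(verdict)
```

A positive status points at the program aborting itself, while a "killed by SIGTERM" verdict points back at some external sender.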

--
Jeff Squyres
Cisco Systems
