Re: [OMPI users] signal 15 (terminated)

Jeff Squyres Wed, 4 Feb 2009 08:54:01 -0500

On Feb 3, 2009, at 10:15 PM, Hana Milani wrote:

sorry if I didn't answer:
Have you checked to ensure that the job manager is not killing your job?
I am not quite sure what you mean by job manager, but this is my personal computer. Much to my surprise, I also have openSUSE on my laptop, followed a similar procedure there, and the same message appeared!

Ok.

Is there a local system administrator that you can talk to about this?
Not a very good one, but I asked someone who had seen this message in his own work, and this was his answer:
It means that the program corresponding to the process identifier 2407 (the PID you can find in the second column of the "ps aux" output) running on one of your cluster's nodes (named linux-4pel) has stopped because it has received the signal SIGTERM (termination signal 15). Sorry if this is a long explanation of things you already know :-). Let's say that you have a program running on your system; you can figure out its process ID number nnnnn by doing a "ps aux". Now if you want to stop it, e.g. because it is out of control, a convenient way is to send a termination request to the process by issuing "kill -s SIGTERM nnnnn". Here, Open MPI notified you that one of the spawned processes has been terminated because it received the SIGTERM signal and, as a consequence, has stopped all the other distributed processes running on the nodes: as the PID 2407 process has acknowledged SIGTERM, Open MPI has sent SIGTERM to all the processes associated with your parallel run.


This is exactly correct.
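
To make the mechanism described above concrete, here is a minimal sketch in Python (not part of Open MPI; the sleeping child is only a hypothetical stand-in for one of the processes in a parallel run). It starts a child, sends it SIGTERM the same way "kill -s SIGTERM nnnnn" would, and reports how the child ended. It assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys


def run_and_terminate():
    """Start a sleeping child, send it SIGTERM, and return its exit status."""
    # Hypothetical stand-in for one MPI process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Equivalent to running `kill -s SIGTERM <pid>` from a shell.
    child.send_signal(signal.SIGTERM)
    # On POSIX, a negative status -N means "terminated by signal N".
    return child.wait()


if __name__ == "__main__":
    status = run_and_terminate()
    print(f"child exit status: {status}")
```

A status equal to the negated SIGTERM number tells the parent that the child did not exit on its own but was terminated by that signal, which is the same kind of information mpirun is reporting in the message.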

Now ... how to avoid this? I am afraid there is no easy answer. The 2407 process has probably received a SIGTERM from another application. I mean it has not died by accident (a hanging or faulting process exits without invoking MPI_FINALIZE and produces a different error message). The difficulty is that you have to investigate which application issued the SIGTERM, that is, which application told your 2407 process to terminate.


Also exactly correct.
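
One possible way to find out who sent the signal, sketched here in Python under the assumption of a Linux system: block SIGTERM and collect it synchronously, which exposes the sender's PID and UID. The function name is hypothetical, and the process signals itself only so that the sketch is self-contained; in a real investigation the signal would arrive from the unknown sender, and the equivalent logic would have to live inside the application being killed.

```python
import os
import signal


def identify_sigterm_sender(timeout=5.0):
    """Block SIGTERM, wait for one, and return its siginfo (or None)."""
    sigset = {signal.SIGTERM}
    # Block SIGTERM so it stays pending instead of killing the process.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, sigset)
    try:
        # Stand-in for the unknown sender: signal ourselves.
        os.kill(os.getpid(), signal.SIGTERM)
        # Collect the pending signal; returns None if the timeout expires.
        info = signal.sigtimedwait(sigset, timeout)
    finally:
        # Restore the original signal mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return info


if __name__ == "__main__":
    info = identify_sigterm_sender()
    if info is not None:
        print(f"signal {info.si_signo} sent by PID {info.si_pid}, UID {info.si_uid}")
```

The reported PID can then be looked up in the "ps aux" output to see which program issued the termination request.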

If you are working on a cluster that distributes the MPI processes to the nodes with a resource manager (like SLURM, PBS or Torque), I would check whether the manager is limiting the memory footprint or the CPU time of the jobs accepted by the linux-4pel computer.

This is what I was asking you; you're telling me that you have no resource manager, and therefore this probably isn't the cause. But *something* is killing your app with a SIGTERM.

It is tricky for me to figure out what could have asked your program to stop ... does it stop immediately or during a long run (CPU time?), with small jobs or large ones (memory?); is MPI running on a personal computer or a huge cluster (resource manager?); do you have sufficient privileges to have a look at /var/log/messages on linux-4pel?
1. The code stops running immediately. 2. The computers are my personal ones and no administrator has limited the 7.9 GiB memory I have. 3. Sequentially the run takes 500-700 MiB memory.
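
Even without an administrator or a resource manager, the shell that launches the job can carry per-process limits. As a rough sketch (the helper names are hypothetical), the following Python snippet prints the address-space and CPU-time limits the current process inherited, using the standard `resource` module on a Unix-like system. Note that exceeding the soft CPU-time limit raises SIGXCPU rather than SIGTERM, so this only helps rule limits in or out; it does not by itself explain the message.

```python
import resource


def current_limits():
    """Return the (soft, hard) limits for address space and CPU time."""
    names = {
        "address space (bytes)": resource.RLIMIT_AS,
        "CPU time (seconds)": resource.RLIMIT_CPU,
    }
    return {label: resource.getrlimit(res) for label, res in names.items()}


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={describe(soft)}, hard={describe(hard)}")
```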


Is this a Fortran program, perchance?

Do you have access to the source code? I wonder if the program is internally raising an error and effectively aborting itself. Do you know that the application runs correctly? Do you have any test data sets that you can try that give known outputs?


--
Jeff Squyres
Cisco Systems
