On Feb 3, 2009, at 10:15 PM, Hana Milani wrote:
sorry if I didn't answer:
Have you checked to ensure that the job manager is not killing your
job?
I am not quite sure what you mean by job manager, but this is my
personal computer. Much to my surprise, I also have openSUSE on my
laptop, followed the same procedure there, and the same message
appeared!
Ok.
Is there a local system administrator that you can talk to about this?
Not a very good one, but I asked someone who had seen this message
in his own work and this was his answer:
It means that the program with process identifier 2407 (the PID shown
in the second column of the "ps aux" output), running on one of your
cluster's nodes (named linux-4pel), has stopped because it received
the signal SIGTERM (termination signal 15). Sorry if this is a long
explanation of things you already know :-). Let's say that you have a
program running on your system; you can find its process ID number
nnnnn by running "ps aux". Now if you want to stop it, for example
because it is out of control, a convenient way is to send a
termination request to the process by issuing "kill -s SIGTERM nnnnn".
Here, Open MPI notified you that one of the spawned processes
terminated because it received the SIGTERM signal and that, as a
consequence, it has stopped all the other distributed processes
running on the nodes: once the PID 2407 process had acted on SIGTERM,
Open MPI sent SIGTERM to all the processes associated with your
parallel run.
This is exactly correct.
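
To make the mechanism above concrete, here is a minimal sketch (Python, POSIX only) that starts a child process, sends it SIGTERM the same way "kill -s SIGTERM nnnnn" would, and reads back which signal ended it. The sleeping child is only a hypothetical stand-in for one MPI rank; nothing here is Open MPI code.

```python
import signal
import subprocess
import sys


def spawn_sleeper():
    """Start a child that just sleeps (hypothetical stand-in for one MPI rank)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def terminate_and_report(proc):
    """Send SIGTERM to proc; return the signal number that ended it, or None."""
    proc.terminate()  # equivalent to: kill -s SIGTERM <pid>
    status = proc.wait(timeout=10)
    # On POSIX a negative return code -N means "killed by signal N".
    return -status if status < 0 else None


if __name__ == "__main__":
    child = spawn_sleeper()
    print("child PID:", child.pid)
    print("ended by signal:", terminate_and_report(child))
```

A launcher that sees a child end this way can tell that the child was terminated by a signal rather than exiting on its own, which is the kind of information the message in question reports.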
Now ... how to avoid this? I am afraid there is no easy answer. The
2407 process has probably received a SIGTERM from another
application. I mean it has not died by accident (a hanging or
faulting process exits without calling MPI_Finalize and
produces a different error message). The difficulty is that you have
to investigate which application issued the SIGTERM, that is, which
application told your 2407 process to terminate.
Also exactly correct.
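
One way to investigate who issued the SIGTERM is to ask the kernel for the sender's PID when the signal arrives. The sketch below is a standalone Python illustration for Linux, not something Open MPI provides: it blocks SIGTERM and then waits for it with sigwaitinfo/sigtimedwait, which report the sender in si_pid and si_uid. The function names are made up for this example. To get the same information inside a real MPI application you would need the C equivalent (a sigaction handler installed with SA_SIGINFO), and that handler could be overridden by the MPI runtime's own signal handling, so treat this as a diagnostic idea only.

```python
import os
import signal


def block_sigterm():
    """Block SIGTERM in the calling thread so it stays pending instead of killing us."""
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})


def wait_for_sigterm_sender(timeout=None):
    """Wait for SIGTERM; return (sender_pid, sender_uid), or None on timeout."""
    if timeout is None:
        info = signal.sigwaitinfo({signal.SIGTERM})
    else:
        # sigtimedwait returns None if no signal arrives within the timeout.
        info = signal.sigtimedwait({signal.SIGTERM}, timeout)
    if info is None:
        return None
    return info.si_pid, info.si_uid


if __name__ == "__main__":
    block_sigterm()
    print("my PID is", os.getpid(), "- send me a SIGTERM from another shell")
    sender_pid, sender_uid = wait_for_sigterm_sender()
    print("SIGTERM came from PID", sender_pid, "running as UID", sender_uid)
```

With the sender's PID in hand, "ps aux" shows which program it belongs to, provided that program is still running.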
If you are working on a cluster where a resource manager (like SLURM,
PBS or Torque) distributes the MPI processes to the nodes, I would
check whether the manager is limiting the memory footprint or the CPU
time of the jobs accepted by the linux-4pel computer.
This is what I was asking you; you're telling me that you have no
resource manager, and therefore this probably isn't the cause. But
*something* is killing your app with a SIGTERM.
It is tricky for me to figure out what could have asked your program
to stop ... does it stop immediately or during a long run (CPU
time?), with small jobs or large ones (memory?); is MPI running on
a personal computer or a huge cluster (resource manager?); do you
have sufficient privileges to look at /var/log/messages on
linux-4pel?
1. The code stops running immediately.
2. The computers are my personal ones and no administrator has
   limited the 7.9 GiB memory I have.
3. Run sequentially, the job takes 500-700 MiB of memory.
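
Even without a resource manager, a shell can carry per-process limits, so it may be worth confirming that none are set. This is a small sketch using Python's standard resource module (the same values "ulimit -a" shows); the labels in the dictionary are just names chosen for this example. Note that exceeding these kernel limits normally shows up as SIGXCPU, SIGKILL, or a failed allocation rather than SIGTERM, so this check mainly helps rule limits out.

```python
import resource


def current_limits():
    """Return {label: (soft, hard)} for a few limits relevant to large jobs."""
    wanted = {
        "cpu_seconds": resource.RLIMIT_CPU,
        "address_space_bytes": resource.RLIMIT_AS,
        "data_segment_bytes": resource.RLIMIT_DATA,
        "stack_bytes": resource.RLIMIT_STACK,
    }
    return {label: resource.getrlimit(res) for label, res in wanted.items()}


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```

Run it from the same shell that launches mpirun, since limits are inherited from the launching shell.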
Is this a Fortran program, perchance?
Do you have access to the source code? I wonder if the program is
internally raising an error and effectively aborting itself. Do you
know that the application runs correctly? Do you have any test data
sets that you can try that give known outputs?
--
Jeff Squyres
Cisco Systems