Re: [OMPI users] signal 15 (terminated)

Hana Milani Tue, 3 Feb 2009 22:15:52 -0500

Sorry if I didn't answer:

Have you checked to ensure that the job manager is not killing your job?


I am not quite sure what you mean by job manager, but this is my personal 
computer. Much to my surprise, I also have openSUSE on my laptop; I followed 
the same procedure there and the same message appeared!

Is there a local system administrator that you can talk to about this?

Not a very good one, but I asked someone who had seen this message in his own 
work, and this was his answer:

It means that the program corresponding to process identifier 2407 (the PID 
you can find in the second column of the "ps aux" output) running on one of 
your cluster's nodes (named linux-4pel) has stopped because it received the 
signal SIGTERM (termination signal 15). Sorry if this is a long explanation of 
things you already know :-). Let's say that you have a program running on your 
system; you can find its process ID number nnnnn by running "ps aux". 
Now if you want to stop it, e.g. because it is out of control, a convenient 
way is to send a termination request to the process by issuing "kill -s 
SIGTERM nnnnn". Here, Open MPI notified you that one of the spawned 
processes was terminated because it received the SIGTERM signal and that, 
as a consequence, it stopped all the other distributed processes running on 
the nodes: once the PID 2407 process acknowledged SIGTERM, Open MPI sent 
SIGTERM to all the processes associated with your parallel run.
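
To make the "kill -s SIGTERM nnnnn" step concrete, here is a minimal sketch in 
Python (not part of the original answer). It starts a sleeping child process 
as a stand-in for a runaway program, sends it SIGTERM, and reads back the exit 
status. The function name terminate_child is hypothetical, and the sketch 
assumes a POSIX system such as Linux.

```python
import signal
import subprocess
import sys
import time


def terminate_child():
    """Start a sleeping child, send it SIGTERM, and return its exit status."""
    # The child only sleeps; it stands in for a program that is out of control.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    time.sleep(0.5)  # give the child a moment to start
    # Same effect as running `kill -s SIGTERM <pid>` from a shell.
    child.send_signal(signal.SIGTERM)
    child.wait(timeout=10)
    return child.returncode


if __name__ == "__main__":
    code = terminate_child()
    # On POSIX, subprocess reports "killed by signal N" as return code -N.
    print("return code:", code, "signal:", signal.Signals(-code).name)
```

The negative return code is how the parent learns that the child was ended by 
a signal rather than exiting on its own, which is presumably the kind of 
information the launcher reports as "signal 15 (terminated)".
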
Now ... how to avoid this? I am afraid there is no easy answer. The 2407 
process has probably received a SIGTERM from another application; I mean it 
has not died by accident (a hanging or faulting process exits without calling 
MPI_Finalize and produces a different error message). The difficulty is 
that you have to investigate which application issued the SIGTERM, i.e. which 
application told your 2407 process to terminate. If you are working on a 
cluster that distributes the MPI processes to the nodes through a resource 
manager (like SLURM, PBS or Torque), I would check whether the manager is 
limiting the memory footprint or the CPU time of the jobs accepted by the 
linux-4pel computer. It is tricky for me to figure out what could have asked 
your program to stop ... does it stop immediately or during a long run (CPU 
time?), with small jobs or large ones (memory?); is MPI running on a personal 
computer or a huge cluster (resource manager?); do you have sufficient 
privileges to look at /var/log/messages on linux-4pel? 
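
As one possible way to investigate which process sent the SIGTERM, a process 
can block the signal and read the sender's PID from the signal information. 
The sketch below (again Python, not from the original answer) sends SIGTERM to 
itself as a stand-in for the unknown sender. The function name 
sigterm_sender_pid is hypothetical, and the sketch assumes Linux and a 
single-threaded process; a real MPI program would need the equivalent in its 
own language.

```python
import os
import signal


def sigterm_sender_pid():
    """Block SIGTERM, receive one instance, and return the sender's PID."""
    sigset = {signal.SIGTERM}
    # Block SIGTERM so that it stays pending instead of ending the process.
    signal.pthread_sigmask(signal.SIG_BLOCK, sigset)
    # Stand-in for the unknown sender: this process signals itself.
    os.kill(os.getpid(), signal.SIGTERM)
    # sigwaitinfo consumes the pending signal and reports who sent it.
    info = signal.sigwaitinfo(sigset)
    # Restore normal delivery now that the pending signal has been consumed.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, sigset)
    return info.si_pid


if __name__ == "__main__":
    print("SIGTERM came from PID", sigterm_sender_pid())
```

The reported PID can then be looked up in the "ps aux" output to see which 
program it belongs to, provided that program is still running.
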

1. The code stops running immediately. 2. The computers are my personal ones 
and no administrator has limited the 7.9 GiB memory I have. 3. Run sequentially, 
the job takes 500-700 MiB of memory.

4. Looking at /var/log/messages after I executed the run, this is what I 
found:

Jan 23 16:24:32 linux-jzqs gdm[2566]: GLib-CRITICAL: g_key_file_get_string: 
assertion `key_file != NULL' failed
Jan 23 16:24:32 linux-jzqs gdm[2566]: GLib-CRITICAL: g_key_file_get_string: 
assertion `key_file != NULL' failed
Jan 23 16:24:32 linux-jzqs gdm[2566]: GLib-CRITICAL: g_key_file_free: assertion 
`key_file != NULL' failed
Jan 23 16:24:33 linux-jzqs seahorse-agent[24718]: Failed to send buffer
Jan 23 16:24:33 linux-jzqs seahorse-agent[24718]: Failed to send buffer
Jan 23 16:24:35 linux-jzqs pulseaudio[24742]: main.c: This program is not 
intended to be run as root (unless --system is specified).
Jan 23 16:24:35 linux-jzqs pulseaudio[24742]: pid.c: Stale PID file, 
overwriting.
Jan 23 16:24:35 linux-jzqs pulseaudio[24743]: main.c: This program is not 
intended to be run as root (unless --system is specified).
Jan 23 16:24:35 linux-jzqs pulseaudio[24743]: pid.c: Daemon already running.
Jan 23 16:24:35 linux-jzqs pulseaudio[24743]: main.c: pa_pid_file_create() 
failed.
Jan 23 16:24:35 linux-jzqs pulseaudio[24745]: main.c: This program is not 
intended to be run as root (unless --system is specified).
Jan 23 16:24:35 linux-jzqs pulseaudio[24745]: pid.c: Daemon already running.
Jan 23 16:24:35 linux-jzqs pulseaudio[24745]: main.c: pa_pid_file_create() 
failed.
Jan 23 16:24:37 linux-jzqs gconfd (root-24630): Resolved address 
"xml:readwrite:/root/.gconf" to a writable configuration source at position 0
Jan 23 16:24:39 linux-jzqs kernel: CPU0 attaching NULL sched-domain.
Jan 23 16:24:39 linux-jzqs kernel: CPU1 attaching NULL sched-domain.
Jan 23 16:24:39 linux-jzqs kernel: CPU0 attaching sched-domain:
Jan 23 16:24:39 linux-jzqs kernel:  domain 0: span 
00000000,00000000,00000000,00000003
Jan 23 16:24:39 linux-jzqs kernel:   groups: 
00000000,00000000,00000000,00000001 00000000,00000000,00000000,00000002
Jan 23 16:24:39 linux-jzqs kernel: CPU1 attaching sched-domain:
Jan 23 16:24:39 linux-jzqs kernel:  domain 0: span 
00000000,00000000,00000000,00000003
Jan 23 16:24:39 linux-jzqs kernel:   groups: 
00000000,00000000,00000000,00000002 00000000,00000000,00000000,00000001
