Re: [Pacemaker] Timeout value of STONITH resource is too large

Kazunori INOUE Sun, 05 Aug 2012 20:02:07 -0700

Hi David,

I reported it in Bugzilla with config.
* http://bugs.clusterlabs.org/show_bug.cgi?id=5092


Best Regards,
Kazunori INOUE

(12.08.03 12:19), David Vossel wrote:

----- Original Message -----

From: "Kazunori INOUE" <inouek...@intellilink.co.jp>
To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
Sent: Thursday, August 2, 2012 4:37:45 AM
Subject: Re: [Pacemaker] Timeout value of STONITH resource is too large

Hi David,

The resource (process forked from stonithd) stays until sleep is
completed.
stonithd does not call st_child_term().

I set 61 seconds in timeout of start.
Therefore, the value of track->timeout in run_stonith_agent() is
61000(ms).
However, it is after '1000*track->timeout' milliseconds that
st_child_term()
is called. That is, it is approximately _17 hours_ later.


Whoa, that isn't good at all!  Please create a bug report for this.  Include 
your config.

-- Vossel

- lib/fencing/st_client.c
   476 run_stonith_agent(const char *agent, const char *action, const
   char *victim,
   477               GHashTable * device_args, GHashTable * port_map,
   int *agent_result, char **output,
   478               async_command_t * track)

   541         if (track->timeout) {
   542             track->pid = pid;
   543             track->timer_sigterm =
   g_timeout_add(1000*track->timeout, st_child_term, track);
   544             track->timer_sigkill =
   g_timeout_add(1000*(track->timeout+5), st_child_kill, track);

Best Regards,
Kazunori INOUE

(12.08.01 00:01), David Vossel wrote:

----- Original Message -----

From: "Kazunori INOUE" <inouek...@intellilink.co.jp>
To: "pacemaker@oss" <pacemaker@oss.clusterlabs.org>
Cc: shimaza...@intellilink.co.jp
Sent: Monday, July 30, 2012 5:13:40 AM
Subject: [Pacemaker] Timeout value of STONITH resource is too
large

Hi,

I am using Pacemaker-1.1.
- glue       (2012 Jul 16) 2719:18489f275f75
- libqb      (2012 Jul 19)
11b20e19beff7f1b6003be0b4c73da8ecf936442
- corosync   (2012 Jul 12)
908ed7dcb390c0eade3dddb0cdfe181eb26b2ce2
- pacemaker  (2012 Jul 29)
33119da31c235710195c783e5c9a32c6e95b3efc

The timeout value of the _start_ operation of STONITH resource is
large.
Therefore, even after the start operation is timed out, the
process
of
plugin remains.


How long after the timeout occurs does the process stay around?
  Does it terminate a few seconds after the timeout, or does the
resource wait until the entire duration of the sleep 3600
finishes?

-- Vossel


The following is gdb at the time of STONITH resource starting.
----
     [root@dev1 ~]# gdb /usr/libexec/pacemaker/stonithd `pgrep
     stonithd`
     (gdb) b run_stonith_agent
     Breakpoint 1 at 0x7f03f1e00d69: file st_client.c, line 479.
     (gdb) c
     Continuing.

     Breakpoint 1, run_stonith_agent (agent=0xe0f820
     "fence_legacy",
     action=0xe11fb0 "monitor",
      <snip>
     479     {
     (gdb) bt
     #0  run_stonith_agent (agent=0xe0f820 "fence_legacy",
     action=0xe11fb0 "monitor",
         victim=0x0, device_args=Traceback (most recent call
         last):0xcffe30, port_map=
         Traceback (most recent call last):0xcffe80,
         agent_result=0x7fff70214ef4,
         output=0x0, track=0xe11d20) at st_client.c:479
     #1  0x0000000000406230 in stonith_device_execute
     (device=0xe10ff0)
     at commands.c:140
     #2  0x0000000000406404 in stonith_device_dispatch
     (user_data=0xe10ff0) at commands.c:160
     #3  0x00007f03f224ad00 in crm_trigger_dispatch
     (source=0xe11160,
     callback=
         0x4063dd <stonith_device_dispatch>, userdata=0xe11160) at
         mainloop.c:105
     #4  0x0000003642638f0e in g_main_context_dispatch () from
     /lib64/libglib-2.0.so.0
     #5  0x000000364263c938 in ?? () from /lib64/libglib-2.0.so.0
     #6  0x000000364263cd55 in g_main_loop_run () from
     /lib64/libglib-2.0.so.0
     #7  0x00000000004056dc in main (argc=1, argv=0x7fff70215278)
     at
     main.c:853
     (gdb) n 15
     Detaching after fork from child process 28915.
     510       if (pid) {
     (gdb) n 15
     542                 track->pid = pid;
     (gdb) list
     537             track->stdout = p_read_fd;
     538             g_child_watch_add(pid, track->done, track);
     539             crm_trace("Op: %s on %s, pid: %d, timeout:
     %ds",
     action, agent, pid, track->timeout);
     540
     541             if (track->timeout) {
     542                 track->pid = pid;
     543                 track->timer_sigterm =
     g_timeout_add(1000*track->timeout, st_child_term, track);
     544                 track->timer_sigkill =
     g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
     545
     546             } else {
     (gdb) n
     543                 track->timer_sigterm =
     g_timeout_add(1000*track->timeout, st_child_term, track);
     (gdb) n
     544                 track->timer_sigkill =
     g_timeout_add(1000*(track->timeout+5), st_child_kill, track);
     (gdb) p agent
     $1 = 0xe0f820 "fence_legacy"
     (gdb) p action
     $2 = 0xe11fb0 "monitor"
     (gdb) p args
     $3 = 0xe11500
     
"plugin=external/libvirt\nhostlist=dev2\nhypervisor_uri=qemu+ssh://n8/system\noption=monitor\n"
   * (gdb) p track->timeout
     $4 = 61000
   * (gdb) p 1000*track->timeout
     $5 = 61000000
----
1. I added "sleep 3600" to status() of
     /usr/lib64/stonith/plugins/external/libvirt.

     [root@dev1 external]# diff -u libvirt.ORG libvirt
     --- libvirt.ORG 2012-07-17 13:10:01.000000000 +0900
     +++ libvirt     2012-07-30 13:36:19.661431208 +0900
     @@ -221,6 +221,7 @@
          ;;

          status)
     +    sleep 3600
          libvirt_check_config
          libvirt_status
          exit $?

2. service corosync start ; service pacemaker start
3. cibadmin -U -x test.xml
4. When I wait for 61 seconds (timeout value of start),

     [root@dev1 ~]# crm_mon -rf1
     ============
     Last updated: Mon Jul 30 13:18:48 2012
     Last change: Mon Jul 30 13:15:08 2012 via cibadmin on dev1
     Stack: corosync
     Current DC: dev1 (-1788499776) - partition with quorum
     Version: 1.1.7-33119da
     2 Nodes configured, unknown expected votes
     1 Resources configured.
     ============

     Online: [ dev1 dev2 ]

     Full list of resources:

      f-2    (stonith:external/libvirt):     Started dev1 FAILED

     Migration summary:
     * Node dev2:
     * Node dev1:
        f-2: migration-threshold=1 fail-count=1000000

     Failed actions:
   *     f-2_start_0 (node=dev1, call=-1, rc=1, status=Timed Out):
   unknown error

     [root@dev1 ~]# ps -ef|egrep
     "UID|corosync|pacemaker|stonith|fence|sleep"
     UID    PID  PPID  C STIME TTY     TIME CMD
     root 28840     1  0 13:13 ?   00:00:01 corosync
     root 28858     1  0 13:13 ?   00:00:00 pacemakerd
     496  28860 28858  0 13:13 ?   00:00:00
     /usr/libexec/pacemaker/cib
     root 28861 28858  0 13:13 ?   00:00:00
     /usr/libexec/pacemaker/stonithd
     root 28862 28858 73 13:13 ?   00:04:16
     /usr/libexec/pacemaker/lrmd
     496  28863 28858  0 13:13 ?   00:00:00
     /usr/libexec/pacemaker/attrd
     496  28864 28858  0 13:13 ?   00:00:00
     /usr/libexec/pacemaker/pengine
     496  28865 28858 51 13:13 ?   00:02:58
     /usr/libexec/pacemaker/crmd
   * root 28915 28861  0 13:15 ?   00:00:00 /usr/bin/perl
   /usr/sbin/fence_legacy
   * root 28916 28915  0 13:15 ?   00:00:00 stonith -t
   external/libvirt
   -E -S
   * root 28921 28916  0 13:15 ?   00:00:00 /bin/sh
   /usr/lib64/stonith/plugins/external/libvirt status
     root 28925 28921  0 13:15 ?   00:00:00 sleep 3600

     [root@dev1 ~]# top -bn1
     top - 13:21:26 up 5 days,  3:23,  5 users,  load average:
     1.99,
     1.42, 0.72
     Tasks: 198 total,   3 running, 195 sleeping,   0 stopped,   0
     zombie
     Cpu(s):  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,
      0.0%si,  0.0%st
     Mem:   5089052k total,  2423104k used,  2665948k free,
       265756k
     buffers
     Swap:  1048568k total,        0k used,  1048568k free,
      1717712k
     cached

       PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
        COMMAND
   * 28862 root      20   0 83816 3412 2572 R 98.2  0.1   6:17.18
   lrmd
   * 28865 hacluste  20   0  166m 6380 3428 R 98.2  0.1   4:59.84
   crmd
     28860 hacluste  20   0 93888 7192 4472 S  2.0  0.1   0:00.23
     cib
     29052 root      20   0 15024 1136  792 R  2.0  0.0   0:00.01
     top
         1 root      20   0 19348 1520 1212 S  0.0  0.0   0:00.77
         init
         2 root      20   0     0    0    0 S  0.0  0.0   0:00.00
         kthreadd
         3 root      RT   0     0    0    0 S  0.0  0.0   0:06.85
         migration/0
         4 root      20   0     0    0    0 S  0.0  0.0  14:25.15
         ksoftirqd/0
         5 root      RT   0     0    0    0 S  0.0  0.0   0:00.10
         migration/0

Best Regards,
Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:
http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] Timeout value of STONITH resource is too large

Reply via email to