On 20 Feb 2014, at 5:33 pm, Andrey Groshev <gre...@yandex.ru> wrote:
> 20.02.2014, 01:22, "Andrew Beekhof" <and...@beekhof.net>:
>> On 20 Feb 2014, at 4:18 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>
>>> 19.02.2014, 06:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> Hi, ALL and Andrew!
>>>>>
>>>>> Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>> In general - I am happy (almost like an elephant) :)
>>>>> Apart from the resources, eight processes on the node matter to me:
>>>>> corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>> The behavior does not depend on the signal number - that's good.
>>>>> If STONITH sends a reboot to the node - it reboots and rejoins the cluster - that's good too.
>>>>> But the behavior differs depending on which daemon is killed.
>>>>>
>>>>> They fall into four groups:
>>>>> 1. corosync, cib - STONITH works 100%.
>>>>>    Killed with any signal - STONITH is called and the node reboots.
>>>> excellent
>>>>> 3. stonithd, attrd, pengine - no STONITH needed.
>>>>>    These daemons simply restart; the resources stay running.
>>>> right
>>>>> 2. lrmd, crmd - strange STONITH behavior.
>>>>>    Sometimes STONITH is called - with the corresponding reaction.
>>>>>    Sometimes the daemon restarts
>>>> The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>> If the failure happens just before they're due to receive a totem token, the failure will be very quickly detected and the node fenced.
>>>> If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>
>>>> So fence/not fence is normal and to be expected.
>>>>> and the resources restart, with a large delay for MS:pgsql.
>>>>> One time after restarting crmd - pgsql did not restart.
>>>> I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>> In the case the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>
>>>> If the agent reports "all good", then there is nothing more to do.
>>>> If the agent is not reporting "all good", you should really be asking why.
>>>>> 4. pacemakerd - nothing happens.
>>>> On non-systemd based machines, correct.
>>>>
>>>> On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>> Any subsequent daemon failure will be detected and the daemon respawned.
>>> And! I almost forgot about IT!
>>> Is there another (NORMAL) variant, method, or idea?
>>> Without this ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>> Otherwise - it's a complete epic fail ;)
>>
>> -ENOPARSE
>
> OK, I will leave my personal attitude to "systemd" out of it.
> Let me explain.
>
> Somewhere at the beginning of this topic, I wrote:
> A.G.: Who knows who runs lrmd?
> A.B.: Pacemakerd.
> That's one!
>
> Let's look at the list of processes:
> # ps -axf
> .....
>  6067 ?  Ssl    7:24 corosync
>  6092 ?  S      0:25 pacemakerd
>  6094 ?  Ss   116:13  \_ /usr/libexec/pacemaker/cib
>  6095 ?  Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>  6096 ?  Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>  6097 ?  Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>  6098 ?  Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>  6099 ?  Ss     0:29  \_ /usr/libexec/pacemaker/crmd
> .....
> That's two!

What's two? I don't follow.

> And more, more...
> Now you must understand why I want this process to be running at all times.
> I don't think anyone here needs that explained!
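For reference, the respawn described above comes from the systemd unit rather than from pacemakerd itself. A rough sketch of what is involved - purely illustrative, since the packaged pacemaker.service on a systemd distro may already carry an equivalent setting, and there is no direct counterpart on CentOS 6:

  # ask systemd to restart pacemakerd if it dies; the surviving child daemons
  # are then re-adopted by the new pacemakerd, as described above
  mkdir -p /etc/systemd/system/pacemaker.service.d
  printf '[Service]\nRestart=on-failure\n' > /etc/systemd/system/pacemaker.service.d/respawn.conf
  systemctl daemon-reload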
> And now you say "pacemakerd works nicely, but only on systemd distros" !!!

No, I'm saying it works _better_ on systemd distros.
On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and ran "service pacemaker restart" before the additional failures).

> What should I do now?
> * Integrate systemd into CentOS?
> * Migrate to Fedora?
> * Buy RHEL7 !?

Option 3 is particularly good :)

> Each variant is great, but none of them fits me.
>
> P.S. And I'm not talking about distros which haven't migrated to systemd (and will not).

Are there any? Even debian and ubuntu have raised the white flag.

> Don't be offended! We do the same.
> We build a secret military factory,
> put a large concrete fence around it,
> top the wall with barbed wire, but forget to install the gates. :)
>
>>>>> And then I can kill any process of the third group. They do not restart.
>>>> Until they become needed.
>>>> Eg. if the DC goes to invoke the policy engine, that will fail causing the crmd to fail and the node to be fenced.
>>>>> Generally don't touch corosync, cib and maybe lrmd, crmd.
>>>>>
>>>>> What do you think about this?
>>>>> The main question of this topic we have resolved.
>>>>> But this varied behavior is another big problem.
>>>>>
>>>>> 17.02.2014, 08:52, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>> 17.02.2014, 02:27, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> With no quick follow-up, dare one hope that means the patch worked? :-)
>>>>>> Hi,
>>>>>> No, unfortunately the chief changed my plans on Friday and all day I was engaged in a parallel project.
>>>>>> I hope that today I will have time to carry out the necessary tests.
>>>>>>> On 14 Feb 2014, at 3:37 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> Yes, of course. Now beginning to build the world and test )
>>>>>>>>
>>>>>>>> 14.02.2014, 04:41, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>> The previous patch wasn't quite right.
>>>>>>>>> Could you try this new one?
>>>>>>>>>
>>>>>>>>> http://paste.fedoraproject.org/77123/13923376/
>>>>>>>>>
>>>>>>>>> [11:23 AM] beekhof@f19 ~/Development/sources/pacemaker/devel ☺ # git diff
>>>>>>>>> diff --git a/crmd/callbacks.c b/crmd/callbacks.c
>>>>>>>>> index ac4b905..d49525b 100644
>>>>>>>>> --- a/crmd/callbacks.c
>>>>>>>>> +++ b/crmd/callbacks.c
>>>>>>>>> @@ -199,8 +199,7 @@ peer_update_callback(enum crm_status_type type, crm_node_t * node, const void *d
>>>>>>>>>              stop_te_timer(down->timer);
>>>>>>>>>
>>>>>>>>>              flags |= node_update_join | node_update_expected;
>>>>>>>>> -            crm_update_peer_join(__FUNCTION__, node, crm_join_none);
>>>>>>>>> -            crm_update_peer_expected(__FUNCTION__, node, CRMD_JOINSTATE_DOWN);
>>>>>>>>> +            crmd_peer_down(node, FALSE);
>>>>>>>>>              check_join_state(fsa_state, __FUNCTION__);
>>>>>>>>>
>>>>>>>>>              update_graph(transition_graph, down);
>>>>>>>>> diff --git a/crmd/crmd_utils.h b/crmd/crmd_utils.h
>>>>>>>>> index bc472c2..1a2577a 100644
>>>>>>>>> --- a/crmd/crmd_utils.h
>>>>>>>>> +++ b/crmd/crmd_utils.h
>>>>>>>>> @@ -100,6 +100,7 @@ void crmd_join_phase_log(int level);
>>>>>>>>>  const char *get_timer_desc(fsa_timer_t * timer);
>>>>>>>>>  gboolean too_many_st_failures(void);
>>>>>>>>>  void st_fail_count_reset(const char * target);
>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full);
>>>>>>>>>
>>>>>>>>>  #  define fsa_register_cib_callback(id, flag, data, fn) do {       \
>>>>>>>>>          fsa_cib_conn->cmds->register_callback(                     \
>>>>>>>>> diff --git a/crmd/te_actions.c b/crmd/te_actions.c
>>>>>>>>> index f31d4ec..3bfce59 100644
>>>>>>>>> --- a/crmd/te_actions.c
>>>>>>>>> +++ b/crmd/te_actions.c
>>>>>>>>> @@ -80,11 +80,8 @@ send_stonith_update(crm_action_t * action, const char *target, const char *uuid)
>>>>>>>>>          crm_info("Recording uuid '%s' for node '%s'", uuid, target);
>>>>>>>>>          peer->uuid = strdup(uuid);
>>>>>>>>>      }
>>>>>>>>> -    crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>> -    crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>> -    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>> -    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>>
>>>>>>>>> +    crmd_peer_down(peer, TRUE);
>>>>>>>>>      node_state =
>>>>>>>>>          do_update_node_cib(peer,
>>>>>>>>>                             node_update_cluster | node_update_peer | node_update_join |
>>>>>>>>> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
>>>>>>>>> index ad7e573..0c92e95 100644
>>>>>>>>> --- a/crmd/te_utils.c
>>>>>>>>> +++ b/crmd/te_utils.c
>>>>>>>>> @@ -247,10 +247,7 @@ tengine_stonith_notify(stonith_t * st, stonith_event_t * st_event)
>>>>>>>>>
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>> -        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>> -        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>> -        crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>> -        crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>> +        crmd_peer_down(peer, TRUE);
>>>>>>>>>      }
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> diff --git a/crmd/utils.c b/crmd/utils.c
>>>>>>>>> index 3988cfe..2df53ab 100644
>>>>>>>>> --- a/crmd/utils.c
>>>>>>>>> +++ b/crmd/utils.c
>>>>>>>>> @@ -1077,3 +1077,13 @@ update_attrd_remote_node_removed(const char *host, const char *user_name)
>>>>>>>>>      crm_trace("telling attrd to clear attributes for remote host %s", host);
>>>>>>>>>      update_attrd_helper(host, NULL, NULL, user_name, TRUE, 'C');
>>>>>>>>>  }
>>>>>>>>> +
>>>>>>>>> +void crmd_peer_down(crm_node_t *peer, bool full)
>>>>>>>>> +{
>>>>>>>>> +    if(full && peer->state == NULL) {
>>>>>>>>> +        crm_update_peer_state(__FUNCTION__, peer, CRM_NODE_LOST, 0);
>>>>>>>>> +        crm_update_peer_proc(__FUNCTION__, peer, crm_proc_none, NULL);
>>>>>>>>> +    }
>>>>>>>>> +    crm_update_peer_join(__FUNCTION__, peer, crm_join_none);
>>>>>>>>> +    crm_update_peer_expected(__FUNCTION__, peer, CRMD_JOINSTATE_DOWN);
>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> On 16 Jan 2014, at 7:24 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>> Ok, here's what happens:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. node2 is lost
>>>>>>>>>>>>>>>> 2. fencing of node2 starts
>>>>>>>>>>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>>>>>>>>>>> 4. node2 returns to the membership
>>>>>>>>>>>>>>>> 5. node2 is marked as a cluster member
>>>>>>>>>>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>>>>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>>>>>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>>>>>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     Thats in theory, except we automatically try again. Which isn't appropriate.
>>>>>>>>>>>>>>>>     This should be relatively easy to fix.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     This I have no idea how to fix yet.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>>>>>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>>>>>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>>>>>>>>>>> Damn. Looks like you're right. At some point I broke my agent and had not noticed it. I will look into it.
>>>>>>>>>>>>>> I repaired my agent - after sending the reboot it was waiting on STDIN.
>>>>>>>>>>>>>> The "normal" behavior has returned - it hangs "pending" until I manually send a reboot. :)
>>>>>>>>>>>>> Right. Now you're in case 8b.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you try this patch:
>>>>>>>>>>>>>    http://paste.fedoraproject.org/68450/38973966
>>>>>>>>>>>> I spent the whole day on experiments.
>>>>>>>>>>>> It turns out like this:
>>>>>>>>>>>> 1. Built the cluster.
>>>>>>>>>>>> 2. On node-2 sent signal (-4) - killed corosync
>>>>>>>>>>>> 3. From node-1 (the DC) - stonith sent a reboot
>>>>>>>>>>>> 4. The node rebooted and the resources started.
>>>>>>>>>>>> 5. Again. On node-2 sent signal (-4) - killed corosync
>>>>>>>>>>>> 6. Again. From node-1 (the DC) - stonith sent a reboot
>>>>>>>>>>>> 7. Node-2 rebooted and hangs in "pending"
>>>>>>>>>>>> 8. Waiting, waiting..... manually reboot.
>>>>>>>>>>>> 9. Node-2 reboots and the resources start again.
>>>>>>>>>>>> 10. GOTO p.2
>>>>>>>>>>> Logs?
>>>>>>>>>> Yesterday I wrote a separate letter about why I have not posted the logs.
>>>>>>>>>> Please read it, it contains a few more questions.
>>>>>>>>>> Today it began to hang again and continues along the same cycle.
>>>>>>>>>> Logs here http://send2me.ru/crmrep2.tar.bz2
>>>>>>>>>>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>>>>>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>>>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the node hangs in "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>>>>>>>>>>>> On one of them kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>>>>>>>>>>>>>>>>>>> After that, the remaining nodes constantly reboot it, under various pretexts, "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>>>>>>>>>>> Depending on the wind direction it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>>>>>>>>>>> I tried to locate the problem. Came here with this.
>>>>>>>>>>>>>>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>>>>>>>>>>> 2. from the node with the DC my fence agent "sshbykey" is called
>>>>>>>>>>>>>>>>>>>>>>>>>> 3. It sends a reboot to the victim and waits until it comes back to life.
>>>>>>>>>>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>>>>>>>>>>> Now trying a full update and retest.
>>>>>>>>>>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>>>>>>>>>>> Of course yes. A little delay.... :)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>>>>>>>>>>> g_variant_lookup_value ()   Since 2.28
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>>>>>>>>>>> glib2.x86_64        2.26.1-3.el6        installed
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>>>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>>>>>>>>>>  static char *
>>>>>>>>>>>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +    if(err) {
>>>>>>>>>>>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>>>>>>>>>>> +        err = FALSE;
>>>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>>>> +#else
>>>>>>>>>>>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>>>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>>>>>>>>>>>> +#endif
>>>>>>>>>>>>>>>>>>>>      return output;
>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>> OK :) I patched the source.
>>>>>>>>>>>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>>>>>>>>>>>> Because it's not building your local changes
>>>>>>>>>>>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>>>>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, then it is downloaded.
>>>>>>>>>>>>>>>>>>> Otherwise the existing archive is used.
>>>>>>>>>>>>>>>>>>> Trimmed log .......
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> # make rc
>>>>>>>>>>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then  \
>>>>>>>>>>>>>>>>>>>     rm -f pacemaker.tar.*;  \
>>>>>>>>>>>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then  \
>>>>>>>>>>>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a;  \
>>>>>>>>>>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;  \
>>>>>>>>>>>>>>>>>>>         git reset --mixed HEAD^;  \
>>>>>>>>>>>>>>>>>>>     else  \
>>>>>>>>>>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;  \
>>>>>>>>>>>>>>>>>>>     fi;  \
>>>>>>>>>>>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;  \
>>>>>>>>>>>>>>>>>>> else  \
>>>>>>>>>>>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz;  \
>>>>>>>>>>>>>>>>>>> fi
>>>>>>>>>>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>>>>>>>>>>> .......
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Well, "make rpm" built the rpms and I created the cluster.
>>>>>>>>>>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>>>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>>>>>>>>>>> Thanks!
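Coming back to the "make rc" question quoted above: judging from the Makefile snippet in that log, the rule reuses an existing ClusterLabs-pacemaker-*.tar.gz and, when it does rebuild, archives the release tag rather than the working tree - which is why local edits were not being picked up. Something along these lines presumably forces local changes into the build (the exact target and TAG handling may differ between versions, so treat this as a sketch rather than the documented procedure):

  # remove the cached tarball so the archive step runs again instead of being skipped
  rm -f ClusterLabs-pacemaker-*.tar.gz
  # the "dirty" branch of the same rule archives HEAD, i.e. it includes local edits
  make TAG=dirty rpm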
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org