Hi, Andrew and all! Andrew, we have not buried this topic, have we?
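Before the quoted history below, a quick recap of what the "sshbykey" agent is supposed to do, since it keeps coming up: it sends a reboot to the victim over ssh and only exits 0 once the node answers ssh again. A rough sketch of that logic follows (this is NOT the real agent; the default hostname, the root ssh user, the 600s limit and the "name=value" stdin parsing are illustrative assumptions only):

#!/usr/bin/env python
# Rough sketch only - NOT the real "sshbykey" agent. It illustrates the
# reboot-and-wait-for-ssh behaviour discussed in this thread. The default
# hostname, the root ssh user and the 600s limit are assumptions; a real
# fence agent also has to implement status/monitor/metadata actions.

import socket
import subprocess
import sys
import time


def read_stdin_options():
    # Fence agents are commonly handed their options as "name=value" lines
    # on stdin; read them only if something was actually piped in, so the
    # script does not block forever waiting on STDIN.
    options = {}
    if sys.stdin.isatty():
        return options
    for line in sys.stdin:
        line = line.strip()
        if "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def ssh_alive(host, port=22, timeout=5):
    # The victim counts as "back" once its sshd accepts TCP connections again.
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except (socket.error, socket.timeout):
        return False


def reboot_and_wait(host, wait_limit=600):
    # Fire the reboot over ssh; the connection dying mid-reboot is expected,
    # so the exit status of this command is deliberately ignored.
    subprocess.call(["ssh", "-o", "BatchMode=yes", "root@" + host, "reboot"])

    deadline = time.time() + wait_limit
    # First wait for sshd to disappear, then wait for it to come back.
    while ssh_alive(host) and time.time() < deadline:
        time.sleep(5)
    while not ssh_alive(host) and time.time() < deadline:
        time.sleep(5)

    return 0 if ssh_alive(host) else 1


if __name__ == "__main__":
    opts = read_stdin_options()
    # "hostname" as the option name is an assumption for this sketch.
    victim = opts.get("hostname", "dev-cluster2-node2.unix.tensor.ru")
    sys.exit(reboot_and_wait(victim))

The point is that the agent must return 0 only after the victim is reachable again; if it blocks on stdin it never gets that far, which is what kept the node in "pending" earlier in the thread.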
16.01.2014, 12:32, "Andrey Groshev" <gre...@yandex.ru>:
> 16.01.2014, 01:30, "Andrew Beekhof" <and...@beekhof.net>:
>> On 16 Jan 2014, at 12:41 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>> 15.01.2014, 02:53, "Andrew Beekhof" <and...@beekhof.net>:
>>>> On 15 Jan 2014, at 12:15 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>> 14.01.2014, 10:00, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>> 14.01.2014, 07:47, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> Ok, here's what happens:
>>>>>>>
>>>>>>> 1. node2 is lost
>>>>>>> 2. fencing of node2 starts
>>>>>>> 3. node2 reboots (and cluster starts)
>>>>>>> 4. node2 returns to the membership
>>>>>>> 5. node2 is marked as a cluster member
>>>>>>> 6. DC tries to bring it into the cluster, but needs to cancel the active transition first.
>>>>>>>    Which is a problem since the node2 fencing operation is part of that
>>>>>>> 7. node2 is in a transition (pending) state until fencing passes or fails
>>>>>>> 8a. fencing fails: transition completes and the node joins the cluster
>>>>>>>
>>>>>>> That's in theory, except we automatically try again. Which isn't appropriate.
>>>>>>> This should be relatively easy to fix.
>>>>>>>
>>>>>>> 8b. fencing passes: the node is incorrectly marked as offline
>>>>>>>
>>>>>>> This I have no idea how to fix yet.
>>>>>>>
>>>>>>> On another note, it doesn't look like this agent works at all.
>>>>>>> The node has been back online for a long time and the agent is still timing out after 10 minutes.
>>>>>>> So "Once the script makes sure that the victim will rebooted and again available via ssh - it exit with 0." does not seem true.
>>>>>> Damn. Looks like you're right. At some point I broke my agent and did not notice. I will figure it out.
>>>>> I fixed my agent - after sending the reboot it was waiting on STDIN.
>>>>> The "normal" behavior is back - it hangs in "pending" until I manually send a reboot. :)
>>>> Right. Now you're in case 8b.
>>>>
>>>> Can you try this patch: http://paste.fedoraproject.org/68450/38973966
>>> I spent the whole day on experiments.
>>> It turns out like this:
>>> 1. Built the cluster.
>>> 2. On node-2, sent signal (-4) - killed corosync.
>>> 3. From node-1 (the DC) - stonith sent reboot.
>>> 4. Node-2 rebooted and resources started.
>>> 5. Again: on node-2, sent signal (-4) - killed corosync.
>>> 6. Again: from node-1 (the DC) - stonith sent reboot.
>>> 7. Node-2 rebooted and hangs in "pending".
>>> 8. Waiting, waiting..... then rebooted it manually.
>>> 9. Node-2 rebooted and resources started.
>>> 10. GOTO step 2.
>> Logs?
>
> Yesterday I wrote another mail explaining why I did not attach the logs.
> Please read it; it contains a few more questions.
> Today it started hanging again and keeps going through the same cycle.
> Logs here: http://send2me.ru/crmrep2.tar.bz2
>
>>>>> New logs: http://send2me.ru/crmrep1.tar.bz2
>>>>>>> On 14 Jan 2014, at 1:19 pm, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>> Apart from anything else, your timeout needs to be bigger:
>>>>>>>>
>>>>>>>> Jan 13 12:21:36 [17223] dev-cluster2-node1.unix.tensor.ru stonith-ng: ( commands.c:1321 ) error: log_operation: Operation 'reboot' [11331] (call 2 from crmd.17227) for host 'dev-cluster2-node2.unix.tensor.ru' with device 'st1' returned: -62 (Timer expired)
>>>>>>>>
>>>>>>>> On 14 Jan 2014, at 7:18 am, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>> On 13 Jan 2014, at 8:31 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>> 13.01.2014, 02:51, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>> On 10 Jan 2014, at 9:55 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>> 10.01.2014, 14:31, "Andrey Groshev" <gre...@yandex.ru>:
>>>>>>>>>>>>> 10.01.2014, 14:01, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>> 10.01.2014, 05:29, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>>>>>>>>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>>>>>>>>>>>>> Hi, ALL.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm still trying to cope with the fact that after the fence the node hangs in "pending".
>>>>>>>>>>>>>>>>>> Please define "pending". Where did you see this?
>>>>>>>>>>>>>>>>> In crm_mon:
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>>>>>>>>>>>>>> ......
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The experiment was like this:
>>>>>>>>>>>>>>>>> Four nodes in the cluster.
>>>>>>>>>>>>>>>>> On one of them, kill corosync or pacemakerd (signal 4 or 6 or 11).
>>>>>>>>>>>>>>>>> After that, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "fly low", "not a cluster member!" ...
>>>>>>>>>>>>>>>>> Then "Too many failures ...." fell out in the log.
>>>>>>>>>>>>>>>>> All this time the status in crm_mon is "pending".
>>>>>>>>>>>>>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>>>>>>>>>>>>>> Much time has passed and I cannot accurately describe the behavior...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now I am in the following state:
>>>>>>>>>>>>>>>>> I tried to locate the problem and ended up here with this.
>>>>>>>>>>>>>>>>> I set a big value in the property stonith-timeout="600s".
>>>>>>>>>>>>>>>>> And got the following behavior:
>>>>>>>>>>>>>>>>> 1. pkill -4 corosync
>>>>>>>>>>>>>>>>> 2. The node with the DC calls my fence agent "sshbykey".
>>>>>>>>>>>>>>>>> 3. It sends reboot to the victim and waits until it comes to life again.
>>>>>>>>>>>>>>>> Hmmm.... what version of pacemaker?
>>>>>>>>>>>>>>>> This sounds like a timing issue that we fixed a while back
>>>>>>>>>>>>>>> It was version 1.1.11 from December 3.
>>>>>>>>>>>>>>> Now doing a full update and retest.
>>>>>>>>>>>>>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>>>>>>>>>>>>> Of course, yes. A little delay.... :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ......
>>>>>>>>>>>>> cc1: warnings being treated as errors
>>>>>>>>>>>>> upstart.c: In function ‘upstart_job_property’:
>>>>>>>>>>>>> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
>>>>>>>>>>>>> upstart.c:264: error: assignment makes pointer from integer without a cast
>>>>>>>>>>>>> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
>>>>>>>>>>>>> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
>>>>>>>>>>>>> make[1]: *** [all-recursive] Error 1
>>>>>>>>>>>>> make[1]: Leaving directory `/root/ha/pacemaker/lib'
>>>>>>>>>>>>> make: *** [core] Error 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm trying to solve this problem.
>>>>>>>>>>>> It is not getting solved quickly...
>>>>>>>>>>>>
>>>>>>>>>>>> https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
>>>>>>>>>>>> g_variant_lookup_value () Since 2.28
>>>>>>>>>>>>
>>>>>>>>>>>> # yum list installed glib2
>>>>>>>>>>>> Loaded plugins: fastestmirror, rhnplugin, security
>>>>>>>>>>>> This system is receiving updates from RHN Classic or Red Hat Satellite.
>>>>>>>>>>>> Loading mirror speeds from cached hostfile
>>>>>>>>>>>> Installed Packages
>>>>>>>>>>>> glib2.x86_64    2.26.1-3.el6    installed
>>>>>>>>>>>>
>>>>>>>>>>>> # cat /etc/issue
>>>>>>>>>>>> CentOS release 6.5 (Final)
>>>>>>>>>>>> Kernel \r on an \m
>>>>>>>>>>> Can you try this patch?
>>>>>>>>>>> Upstart jobs won't work, but the code will compile
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/lib/services/upstart.c b/lib/services/upstart.c
>>>>>>>>>>> index 831e7cf..195c3a4 100644
>>>>>>>>>>> --- a/lib/services/upstart.c
>>>>>>>>>>> +++ b/lib/services/upstart.c
>>>>>>>>>>> @@ -231,12 +231,21 @@ upstart_job_exists(const char *name)
>>>>>>>>>>>  static char *
>>>>>>>>>>>  upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>  {
>>>>>>>>>>> +    char *output = NULL;
>>>>>>>>>>> +
>>>>>>>>>>> +#if !GLIB_CHECK_VERSION(2,28,0)
>>>>>>>>>>> +    static bool err = TRUE;
>>>>>>>>>>> +
>>>>>>>>>>> +    if(err) {
>>>>>>>>>>> +        crm_err("This version of glib is too old to support upstart jobs");
>>>>>>>>>>> +        err = FALSE;
>>>>>>>>>>> +    }
>>>>>>>>>>> +#else
>>>>>>>>>>>      GError *error = NULL;
>>>>>>>>>>>      GDBusProxy *proxy;
>>>>>>>>>>>      GVariant *asv = NULL;
>>>>>>>>>>>      GVariant *value = NULL;
>>>>>>>>>>>      GVariant *_ret = NULL;
>>>>>>>>>>> -    char *output = NULL;
>>>>>>>>>>>
>>>>>>>>>>>      crm_info("Calling GetAll on %s", obj);
>>>>>>>>>>>      proxy = get_proxy(obj, BUS_PROPERTY_IFACE);
>>>>>>>>>>> @@ -272,6 +281,7 @@ upstart_job_property(const char *obj, const gchar * iface, const char *name)
>>>>>>>>>>>
>>>>>>>>>>>      g_object_unref(proxy);
>>>>>>>>>>>      g_variant_unref(_ret);
>>>>>>>>>>> +#endif
>>>>>>>>>>>      return output;
>>>>>>>>>>>  }
>>>>>>>>>> Ok :) I patched the source.
>>>>>>>>>> Typed "make rc" - the same error.
>>>>>>>>> Because it's not building your local changes
>>>>>>>>>> Made a new copy via "fetch" - the same error.
>>>>>>>>>> It seems that if ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz does not exist, it gets downloaded; otherwise the existing archive is used.
>>>>>>>>>> Cut log .......
>>>>>>>>>>
>>>>>>>>>> # make rc
>>>>>>>>>> make TAG=Pacemaker-1.1.11-rc3 rpm
>>>>>>>>>> make[1]: Entering directory `/root/ha/pacemaker'
>>>>>>>>>> rm -f pacemaker-dirty.tar.* pacemaker-tip.tar.* pacemaker-HEAD.tar.*
>>>>>>>>>> if [ ! -f ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz ]; then \
>>>>>>>>>>     rm -f pacemaker.tar.*; \
>>>>>>>>>>     if [ Pacemaker-1.1.11-rc3 = dirty ]; then \
>>>>>>>>>>         git commit -m "DO-NOT-PUSH" -a; \
>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ HEAD | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>         git reset --mixed HEAD^; \
>>>>>>>>>>     else \
>>>>>>>>>>         git archive --prefix=ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3/ Pacemaker-1.1.11-rc3 | gzip > ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>>     fi; \
>>>>>>>>>>     echo `date`: Rebuilt ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> else \
>>>>>>>>>>     echo `date`: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz; \
>>>>>>>>>> fi
>>>>>>>>>> Mon Jan 13 13:23:21 MSK 2014: Using existing tarball: ClusterLabs-pacemaker-Pacemaker-1.1.11-rc3.tar.gz
>>>>>>>>>> .......
>>>>>>>>>>
>>>>>>>>>> Well, "make rpm" - built the rpms and I created the cluster.
>>>>>>>>>> I ran the same tests and confirmed the behavior.
>>>>>>>>>> crm_report log here - http://send2me.ru/crmrep.tar.bz2
>>>>>>>>> Thanks!
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org