On Fri, Feb 24, 2012 at 10:37 AM, mark - pacemaker list <m+pacema...@nerdish.us> wrote:
> Hello,
>
> I have a pretty simple cluster running with three nodes, xen1, xen2, and qnode (which runs in standby at all times and only exists for quorum). This afternoon xen1 reset out of the blue. There is nothing in its logs; in fact there's a gap from 15:37 to 15:47:
>
> Feb 23 15:36:18 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x1d96080)
> Feb 23 15:36:18 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at 2571780644 should have started at 2571780544
> Feb 23 15:37:39 xen1 lrmd: [5328]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x1d96080)
> Feb 23 15:37:39 xen1 lrmd: [5328]: info: G_SIG_dispatch: started at 2571788778 should have started at 2571788678
> Feb 23 15:47:53 xen1 xenstored: Checking store ...
> Feb 23 15:47:54 xen1 acpid: starting up with netlink and the input layer
> Feb 23 15:47:54 xen1 xenstored: Checking store complete.
> ....
>
> I'm kicking myself for the lack of a serial console, so I really don't know what happened prior to the reboot. The two surviving nodes logged these very similar events, so it appears they were both seeing each other fine and neither could see anything from xen1:
>
> Feb 23 15:32:26 qnode pengine: [2096]: info: process_pe_message: Transition 10271: PEngine Input stored in: /var/lib/pengine/pe-input-4479.bz2
> Feb 23 15:36:51 qnode cib: [2093]: info: cib_stats: Processed 1 operations (10000.00us average, 0% utilization) in the last 10min
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 1772: memb=2, new=0, lost=1
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update: memb: xen2 1241622720
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update: memb: qnode 1308731584
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update: lost: xen1 1224845504
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 1772: memb=2, new=0, lost=0
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update: MEMB: xen2 1241622720
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: pcmk_peer_update: MEMB: qnode 1308731584
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: ais_mark_unseen_peer_dead: Node xen1 was not seen in the previous transition
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: update_member: Node 1224845504/xen1 is now: lost
> Feb 23 15:42:02 qnode corosync[2081]: [pcmk ] info: send_member_notification: Sending membership update 1772 to 2 children
> Feb 23 15:42:02 qnode corosync[2081]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 23 15:42:02 qnode crmd: [2097]: info: ais_dispatch: Membership 1772: quorum retained
> Feb 23 15:42:02 qnode cib: [2093]: info: ais_dispatch: Membership 1772: quorum retained
> Feb 23 15:42:02 qnode crmd: [2097]: info: ais_status_callback: status: xen1 is now lost (was member)
>
> Those are followed by very verbose logs covering fencing attempts (via SBD) by both qnode and xen2, which never succeeded, leading me to believe that at this point xen1 had fenced itself (either due to SBD or softdog) and was in the process of POST/bootup. When xen1 finished booting it rejoined the cluster just fine.
> There are no failure- or alert-type messages in the logs on any of the three hosts prior to the 15:42:02 corosync entries above, where xen1 has disappeared.
>
> Based on these entries from corosync.conf:
>
>         # How long before declaring a token lost (ms)
>         token: 3000
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 10
>
> ... am I correct in thinking that the two healthy nodes went at least 30 seconds without seeing anything from xen1, then started their fencing attempts?
Certainly sounds reasonable. Not 100% sure if the timeout is a simple "token * retransmits" calculation, but it wouldn't be far off (rough sketch of that arithmetic at the bottom of this mail).

> In the complete absence of any logs with clues as to what happened, I'm just trying to build a likely scenario for what happened. My guess is that xen1 had some kind of a panic or loss of network on all four interfaces (two iSCSI interfaces to the SBD and storage device, two bonded interfaces facing the service side of things, where corosync talks). The xen servers are PowerEdge R815s and OpenManage is running, but there are no events/errors logged with it either, no hardware issues logged at all.
>
> In any case, this isn't a complaint or issue report, simply an "am I thinking clearly about what may have happened?" message. If this ever happens in the future, hopefully a serial console will provide answers. (Side note: for that, I'm thinking minicom connected to both and writing logs; is there something nicer that I just don't know about?)
>
> Regards,
> Mark
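For what it's worth, the two values you quoted live in the totem section of corosync.conf. A minimal sketch of that stanza using your numbers (the surrounding layout is only illustrative, not a recommendation for your cluster):

        totem {
                version: 2
                # How long before declaring a token lost (ms)
                token: 3000
                # How many token retransmits before forming a new configuration
                token_retransmits_before_loss_const: 10
        }

And if the detection window really is the simple product we're both assuming, the back-of-the-envelope arithmetic is:

        token (3000 ms) * token_retransmits_before_loss_const (10) = 30000 ms ~= 30 seconds

i.e. roughly half a minute of silence from xen1 before the survivors declare it lost at 15:42:02 and move on to fencing, which fits the scenario you describe.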