On 31 May 2014, at 12:43 am, Robert Dahlem <robert.dah...@gmx.net> wrote:
> Hi, > > On 30.05.2014 13:20, Robert Dahlem wrote: > >>> run crm_report for the period covered by these commands and attach the >>> result: >>> >>> # crm node standby korfwf01 ; sleep 10 >>> # crm node standby korfwf02 ; sleep 10 >>> # crm node online korfwf02 ; sleep 10 >>> # crm node online korfwf01 ; sleep 10 >>> # crm status >> >> I filed a bug >> http://bugs.clusterlabs.org/show_bug.cgi?id=5217 >> and attached the crm_report. > > This seems to be some kind of a race condition: I added > sleep 3 > to a central point in /usr/lib/ocf/resource.d/linbit/drbd. Define central? > As soon as I > did that I could not reproduce the split brain. > > Then I added some logging and repeated > crm resource move ALL-ffm korfwf01 > crm node standby korfwf01 > crm node standby korfwf02 > crm node online korfwf02 > crm node online korfwf01 > > The following is what happens after > crm node online korfwf01 > > Timestamp korfwf01 korfwf02 > ======================================================== > Fri May 30 15:46:35 CEST 2014 notify > Fri May 30 15:46:38 CEST 2014 notify > Fri May 30 15:46:41 CEST 2014 demote > Fri May 30 15:46:44 CEST 2014 notify > Fri May 30 15:46:47 CEST 2014 notify > Fri May 30 15:46:50 CEST 2014 stop > Fri May 30 15:46:53 CEST 2014 start start > Fri May 30 15:46:58 CEST 2014 notify notify > Fri May 30 15:47:01 CEST 2014 notify notify > Fri May 30 15:47:04 CEST 2014 promote > Fri May 30 15:47:07 CEST 2014 notify notify > Fri May 30 15:47:10 CEST 2014 monitor monitor > > 1.) Note the parallel "start" at 15:46:53. This could very well end up > in a race condition without "sleep 3". > > 2.) Why is pacemaker doing "stop/start" at all on korfwf02? This seems to correspond to May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Move stonith-korfwf02 (Started korfwm01 -> korfwf01) May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Move ALL-ffm (Started korfwf02 -> korfwf01) May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Demote DRBD-ffm:0 (Master -> Slave korfwf02) May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Restart DRBD-ffm:0 (Slave korfwf02) May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Start DRBD-ffm:1 (korfwf01) May 23 13:29:31 korfwm01 pengine[5140]: notice: LogActions: Promote DRBD-ffm:1 (Stopped -> Master korfwf01) May 23 13:29:31 korfwm01 pengine[5140]: notice: process_pe_message: Calculated Transition 843: /var/lib/pacemaker/pengine/pe-input-728.bz2 from your original tarball. In that case, the cause is: <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm"/> Which requires that ms-DRBD-ffm be completely stopped if ALL-ffm is stopped (which it is because its being moved to 01). Perhaps you meant this? <rsc_order id="ord-ALL-ffm-before-DRBD-ffm" score="INFINITY" first="ALL-ffm" then="ms-DRBD-ffm" then-action="promote"/> > > Now again, but without "sleep 3": > > Timestamp korfwf01 korfwf02 > ======================================================== > Fri May 30 16:23:24 CEST 2014 notify > Fri May 30 16:23:26 CEST 2014 notify > Fri May 30 16:23:26 CEST 2014 start demote > Fri May 30 16:23:26 CEST 2014 notify > Fri May 30 16:23:26 CEST 2014 notify > Fri May 30 16:23:26 CEST 2014 notify > Fri May 30 16:23:26 CEST 2014 stop > Fri May 30 16:23:26 CEST 2014 start > Fri May 30 16:23:27 CEST 2014 notify notify > Fri May 30 16:23:28 CEST 2014 notify notify > Fri May 30 16:23:28 CEST 2014 promote > Fri May 30 16:23:28 CEST 2014 notify notify > Fri May 30 16:23:28 CEST 2014 monitor monitor > > Look at this excerpt from /var/log/messages: > > 16:23:27 korfwf01 block drbd7: disk( Diskless -> Attaching ) > 16:23:27 korfwf01 block drbd7: disk( Attaching -> UpToDate ) > 16:23:27 korfwf01 drbd ffm: conn( StandAlone -> Unconnected ) > 16:23:27 korfwf01 drbd ffm: conn( Unconnected -> WFConnection ) > 16:23:28 korfwf01 block drbd7: role( Secondary -> Primary ) > 16:23:28 korfwf01 drbd ffm: conn( WFConnection -> WFReportParams ) > 16:23:28 korfwf01 drbd ffm: conn( WFReportParams -> NetworkFailure ) > 16:23:28 korfwf01 drbd ffm: conn( NetworkFailure -> Unconnected ) > 16:23:28 korfwf01 drbd ffm: conn( Unconnected -> WFConnection ) > > korfwf01 was not waiting for a connection before 16:23:27. At this time > korfwf02 was stopped, but had the latest data. So korfwf01 came up > before korfwf02 was up again -> split-brain! > > Again: why is korfwf02 doing "stop/start" in this situation? Without > this, korfwf02 would just do "demote" while korfwf01 would do > "start/promote" and everything would be fine. > > Kind regards, > Robert > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: http://bugs.clusterlabs.org
signature.asc
Description: Message signed with OpenPGP using GPGMail
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org