On Tue, Jan 26, 2010 at 9:47 PM, <subscripti...@ludmata.info> wrote:
> Hi,
>
> I have a two-node cluster running master/slave DRBD and filesystem
> resources (there will be more resources later). Here are the details of
> the software I'm using: Debian 5.03 stable, DRBD 8.3.7 compiled from
> source, corosync 1.1.2 and pacemaker 1.0.6 installed from the madkiss
> repository.
>
> I have a problem that I haven't been able to solve after two days and a
> lot of digging on the internet. The cluster works correctly in all cases
> but one: node1 runs primary DRBD with the filesystem resource mounted,
> then I simulate power loss (pull the power cord of node1) and node2 takes
> over all resources, promotes DRBD and mounts the filesystem (so far so
> good). Then I simulate power loss again by unplugging the power cord of
> node2. Then I power on node1; it boots, loads its stuff, starts corosync,
> and the cluster resource manager promotes DRBD to primary on node1 (it
> should not!).
But you told it to:

  no-quorum-policy="ignore"

And you prevented the cluster from being able to verify that the other side
didn't already have drbd promoted:

  stonith-enabled="false"

Basically you created a split-brain condition and turned off the options that
might have prevented data corruption :-)
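For what it's worth, here is a minimal sketch of turning those protections
back on with the crm shell - treat it as a direction to investigate, not a
drop-in fix for your setup:

  # stop resources instead of carrying on when quorum is lost
  crm configure property no-quorum-policy="stop"

  # allow the cluster to fence a node whose state it cannot determine
  crm configure property stonith-enabled="true"

With only two nodes you may well find you still need no-quorum-policy="ignore",
since a lone survivor can never have quorum - which is exactly why stonith is
the option that matters here. There is a stonith resource sketch at the bottom
of this mail, because stonith-enabled="true" only helps once the cluster
actually has a fencing device it can use.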
> That is a disaster, because I intend to run an SQL database on that
> cluster, and that way I might lose a huge amount of data. I also have an
> ancient two-node cluster running heartbeat 1 and drbd 6.x with a drbddisk
> resource, and its behaviour in that case is to stop and ask "My data may
> be outdated, are you sure you want to continue?". I tried the same
> scenario without the cluster engine (that is the old way, isn't it?) -
> enabled the DRBD init scripts and repeated the same steps. In that case
> DRBD stopped, waited for the other node, and asked if I wanted to continue
> (good boy, that is exactly what I want!). So my problem must be somewhere
> in the configuration of the resources, but I can't understand what I'm
> doing wrong. So let me ask straight: how do I do this in pacemaker? I just
> want node1 to stop and wait for my confirmation of what to do, or
> something of that sort, but never ever promote drbd to master!
>
> If somebody wonders why I test this scenario, let me explain: my company
> owns an APC Smart UPS which, in case of power loss, shuts down one node of
> each cluster (we have two pairs of clusters in separate vlans, so I can't
> create a 4-node cluster, which would solve this problem, at least
> partially) after the battery falls below a certain level. If the battery
> runs below the critical level, the UPS kills all servers except two: our
> logging server and one of the DB nodes. If the power doesn't come back,
> the UPS then kills that last node too. The only machine that waits for its
> death is our logging server. When the power comes back, the UPS starts all
> servers that are down. If that happens when all nodes are down, we end up
> with the following situation: the first node that comes up becomes
> SyncSource, and that node may not be the last one that survived the UPS
> rage.
>
> One of the possible solutions is to use the old heartbeat resource agent
> drbddisk, which uses the drbd init script.

But I don't like it :)

> Here are my configs:
>
> corosync.conf
>
> totem {
>         version: 2
>         token: 3000
>         token_retransmits_before_loss_const: 10
>         join: 60
>         consensus: 1500
>         vsftype: none
>         max_messages: 20
>         clear_node_high_bit: yes
>         secauth: off
>         threads: 0
>         rrp_mode: passive
>         # external interface
>         interface {
>                 ringnumber: 0
>                 bindnetaddr: 10.0.30.0
>                 mcastaddr: 226.94.1.1
>                 mcastport: 5405
>         }
>         # internal interface
>         interface {
>                 ringnumber: 1
>                 bindnetaddr: 10.2.2.0
>                 mcastaddr: 226.94.2.1
>                 mcastport: 5405
>         }
> }
> amf {
>         mode: disabled
> }
> service {
>         ver: 0
>         name: pacemaker
> }
> aisexec {
>         user: root
>         group: root
> }
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: no
>         to_syslog: yes
>         syslog_facility: daemon
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
> -------------
> drbd.conf
>
> common {
>         syncer { rate 100M; }
> }
> resource drbd0 {
>         protocol C;
>         handlers {
>                 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>                 after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>                 pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
>                 pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
>                 local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
>                 outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>                 pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root";
>         }
>         startup { wfc-timeout 0; degr-wfc-timeout 0; }
>
>         disk { on-io-error detach; fencing resource-only; }
>
>         net {
>                 sndbuf-size 1024k;
>                 timeout 20;       # 2 seconds (unit = 0.1 seconds)
>                 connect-int 10;   # 10 seconds (unit = 1 second)
>                 ping-int 3;       # 3 seconds (unit = 1 second)
>                 ping-timeout 5;   # 500 ms (unit = 0.1 seconds)
>                 ko-count 4;
>                 cram-hmac-alg "sha1";
>                 shared-secret "password";
>                 after-sb-0pri disconnect;
>                 after-sb-1pri disconnect;
>                 after-sb-2pri disconnect;
>                 rr-conflict disconnect;
>         }
>
>         syncer { rate 100M; }
>
>         on db1 {
>                 device /dev/drbd0;
>                 disk /dev/db/db;
>                 address 10.2.2.1:7788;
>                 flexible-meta-disk internal;
>         }
>
>         on db2 {
>                 device /dev/drbd0;
>                 disk /dev/db/db;
>                 address 10.2.2.2:7788;
>                 meta-disk internal;
>         }
> }
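A side note on the drbd.conf above, offered as a sketch to verify against the
DRBD 8.3 user's guide rather than a tested change: once the cluster is allowed
to fence, the documented companion setting on the DRBD side is
`fencing resource-and-stonith;` instead of `resource-only`, so that DRBD
suspends I/O while the fence-peer handler does its work:

  disk {
          on-io-error detach;
          fencing resource-and-stonith;   # instead of resource-only
  }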
> -----------------------
> crm:
> crm(live)# configure show
> node db1 \
>         attributes standby="off"
> node db2 \
>         attributes standby="off"
> primitive drbd-db ocf:linbit:drbd \
>         params drbd_resource="drbd0" \
>         op monitor interval="15s" role="Slave" timeout="30" \
>         op monitor interval="16s" role="Master" timeout="30"
> primitive fs-db ocf:heartbeat:Filesystem \
>         params fstype="ext3" directory="/db" device="/dev/drbd0"
> primitive ip-dbclust.v52 ocf:heartbeat:IPaddr2 \
>         params ip="10.0.30.211" broadcast="10.0.30.255" nic="eth1" cidr_netmask="24" \
>         op monitor interval="21s" timeout="5s"
> ms ms-db drbd-db \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true" target-role="Started"
> location drbd-fence-by-handler-ms-db ms-db \
>         rule $id="drbd-fence-by-handler-rule-ms-db" $role="Master" -inf: #uname ne db1
> location lo-ms-db ms-db \
>         rule $id="ms-db-loc-rule" -inf: #uname ne db1 and #uname ne db2
> colocation fs-on-drbd0 inf: fs-db ms-db:Master
> colocation ip-on-drbd0 inf: ip-dbclust.v52 ms-db:Master
> order or-drbd-bf-fs inf: ms-db:promote fs-db:start
> order or-drbd-bf-ip inf: ms-db:promote ip-dbclust.v52:start
> property $id="cib-bootstrap-options" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         expected-quorum-votes="2" \
>         last-lrm-refresh="1264523323" \
>         dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \
>         cluster-infrastructure="openais"
>
> I hope somebody can help me, I am completely lost :(
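Since you asked for node1 to stop and wait for your confirmation: once
stonith-enabled is turned back on, the cluster needs at least one stonith
resource, and the low-tech meatware plugin from cluster-glue behaves pretty
much the way you describe - it refuses to complete a fence until an operator
confirms it by hand. A rough sketch only (the resource names are made up, and
you should check the plugin's metadata on your systems before using it):

  primitive st-db stonith:meatware \
          params hostlist="db1 db2" \
          op monitor interval="60s"
  # running a copy on each node is one common arrangement; a clone is the
  # simplest way to express that
  clone cl-st-db st-db

When a fence is required, meatware logs a request and waits; you acknowledge
it manually with something like `meatclient -c db2`, and only then does the
survivor proceed (e.g. promote drbd). The trade-off is that every failover,
including an ordinary node crash, then waits for a human. If you want
automatic failover in the normal case, a real fencing device (IPMI card,
switched PDU, etc.) is the better answer, with the same crm syntax but a
different stonith plugin.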