----- Original Message -----
> From: "David Parker" <dpar...@utica.edu>
> To: pacemaker@oss.clusterlabs.org
> Sent: Thursday, August 23, 2012 12:56:33 PM
> Subject: Re: [Pacemaker] Issues with HA cluster for mysqld
>
> On 08/23/2012 10:17 AM, David Parker wrote:
> > On 08/23/2012 09:01 AM, Jake Smith wrote:
> >> ----- Original Message -----
> >>> From: "David Parker" <dpar...@utica.edu>
> >>> To: pacemaker@oss.clusterlabs.org
> >>> Sent: Wednesday, August 22, 2012 2:49:32 PM
> >>> Subject: [Pacemaker] Issues with HA cluster for mysqld
> >>>
> >>> Hello,
> >>>
> >>> I'm trying to set up a 2-node, active-passive HA cluster for
> >>> MySQL using heartbeat and Pacemaker.  The operating system is
> >>> Debian Linux 6.0.5 64-bit, and I am using the heartbeat packages
> >>> installed via apt-get.  The servers involved are the SQL nodes of
> >>> a running MySQL cluster, so the only service I need HA for is the
> >>> MySQL daemon (mysqld).
> >>>
> >>> What I would like to do is have a single virtual IP address which
> >>> clients use to query MySQL, and have the IP and mysqld fail over
> >>> to the passive node in the event of a failure on the active node.
> >>> I have read through a lot of the heartbeat and Pacemaker
> >>> documentation, and here are the resources I have configured for
> >>> the cluster:
> >>>
> >>> * A custom LSB script for mysqld (compliant with Pacemaker's
> >>>   requirements as outlined in the documentation)
> >>> * An iLO2-based STONITH device using riloe (both servers are HP
> >>>   ProLiant DL380 G5)
> >>> * A virtual IP address for mysqld using IPaddr2
> >>>
> >>> I believe I have configured everything correctly, but I'm not
> >>> positive.  Anyway, when I start heartbeat and Pacemaker
> >>> (/etc/init.d/heartbeat start), everything seems to be OK.
> >>> However, the virtual IP never comes up, and the output of
> >>> "crm_resource -LV" indicates that something is wrong:
> >>>
> >>> root@ha1:~# crm_resource -LV
> >>> crm_resource[28988]: 2012/08/22_14:41:23 WARN: unpack_rsc_op:
> >>> Processing failed op stonith_start_0 on ha1: unknown error (1)
> >>>  stonith  (stonith:external/riloe) Started
> >>>  MysqlIP  (ocf::heartbeat:IPaddr2) Stopped
> >>>  mysqld   (lsb:mysqld) Started
> >>
> >> It looks like you only have one STONITH resource defined... you
> >> need one per server (or to clone the one, but that usually applies
> >> to blades, not standalone servers).  Then you would add location
> >> constraints so that ha1's STONITH resource is not allowed to run
> >> on ha1, and ha2's is not allowed to run on ha2 (a node can't shoot
> >> itself).  That way each server has the ability to STONITH the
> >> other.  Nothing *should* run if your STONITH resource fails and
> >> you have STONITH enabled.
> >>
> >> HTH
> >>
> >> Jake
> >
> > Thanks!  Can you clarify how I would go about putting those
> > constraints in place?  I've been following Andrew's "Configuration
> > Explained" document, and I think I have a grasp on most of these
> > things, but it's not clear to me how I can constrain a STONITH
> > device to only one node.
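In crm shell syntax this comes down to one negative location
constraint per STONITH device - a minimal sketch, where the resource
names (stonith-ha1, stonith-ha2) anticipate the ones defined later in
this thread and the constraint IDs are arbitrary:

    location l-stonith-ha1 stonith-ha1 -inf: ha1
    location l-stonith-ha2 stonith-ha2 -inf: ha2

A -inf (-INFINITY) score forbids the resource from ever running on the
named node, while leaving it free to run anywhere else.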
> > Also, following the example in the documentation, I added these
> > location constraints for the other resources:
> >
> > <constraints>
> >  <rsc_location id="loc-1" rsc="MysqlIP" node="ha1" score="200"/>
> >  <rsc_location id="loc-2" rsc="MysqlIP" node="ha2" score="0"/>
> >  <rsc_location id="loc-3" rsc="mysqld" node="ha1" score="200"/>
> >  <rsc_location id="loc-4" rsc="mysqld" node="ha2" score="0"/>
> > </constraints>
> >
> > I'm trying to make ha1 the preferred node for both mysqld and the
> > virtual IP.  Do these look correct for that?
> >
> >>> When I attempt to stop heartbeat and Pacemaker
> >>> (/etc/init.d/heartbeat stop), it says "Stopping
> >>> High-Availability services:" and then hangs for about 5 minutes
> >>> before finally stopping the services.
> >>>
> >>> So, I'm left with a couple of questions.  Is there something
> >>> wrong with my configuration?  Is there a reason why the HA
> >>> services can't shut down in a timely manner?  Is there something
> >>> else I need to do to get the virtual IP working?  Thanks in
> >>> advance for any help!
> >
> > Would the misconfigured STONITH resources be causing the long
> > shutdown delays?
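For reference, the four location constraints quoted above map
one-for-one onto crm shell syntax (same IDs and scores); with both
nodes online the higher score wins, so ha1 is preferred and ha2 is the
fallback:

    location loc-1 MysqlIP 200: ha1
    location loc-2 MysqlIP 0: ha2
    location loc-3 mysqld 200: ha1
    location loc-4 mysqld 0: ha2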
>
> Okay, I think I've almost got this.  I updated my Pacemaker config
> and made a few changes.  I put the MysqlIP and mysqld primitives
> into a resource group called "mysqld-resources", ordered them such
> that mysqld will always wait for MysqlIP to be ready first, and
> added constraints to make ha1 the preferred host for the
> mysqld-resources group and ha2 the failover host.  I also created
> STONITH devices for both ha1 and ha2, and added constraints to fix
> the STONITH location issues.  My new constraints section looks like
> this:
>
> <constraints>
>  <rsc_location id="loc-1" rsc="stonith-ha1" node="ha2" score="INFINITY"/>
>  <rsc_location id="loc-2" rsc="stonith-ha2" node="ha1" score="INFINITY"/>

You don't need the two constraints above as long as you have the two
negative locations below for the STONITH resources.  I prefer the
negative ones because, if you ever expand beyond 2 nodes, the STONITH
resource for any node can then run on any node but itself.

>  <rsc_location id="loc-3" rsc="stonith-ha1" node="ha1" score="-INFINITY"/>
>  <rsc_location id="loc-4" rsc="stonith-ha2" node="ha2" score="-INFINITY"/>
>  <rsc_location id="loc-5" rsc="mysql-resources" node="ha1" score="200"/>

You don't need the 0-score constraint below either - the 200 above
will take care of it.  Pretty sure having no location constraint is
the same as a 0-score location constraint.

>  <rsc_location id="loc-6" rsc="mysql-resources" node="ha2" score="0"/>
> </constraints>
>
> Everything seems to work.  I had the virtual IP and mysqld running
> on ha1, and not on ha2.  I shut down ha1 using "poweroff -n", and
> both the virtual IP and mysqld came up on ha2 almost instantly.
> When I powered ha1 on again, ha2 shut down the virtual IP and
> mysqld.  The virtual IP moved over instantly; a continuous ping of
> the IP produced one "Time to live exceeded" message and one lost
> packet, but that's to be expected.  However, mysqld took almost 30
> seconds to start up on ha1 after being stopped on ha2, and I'm not
> exactly sure why.
>
> Here's the relevant log output from ha2:
>
> Aug 23 11:42:48 ha2 crmd: [1166]: info: te_rsc_command: Initiating
> action 16: stop mysqld_stop_0 on ha2 (local)
> Aug 23 11:42:48 ha2 crmd: [1166]: info: do_lrm_rsc_op: Performing
> key=16:1:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_stop_0 )
> Aug 23 11:42:48 ha2 lrmd: [1163]: info: rsc:mysqld:10: stop
> Aug 23 11:42:50 ha2 lrmd: [1163]: info: RA output:
> (mysqld:stop:stdout) Stopping MySQL daemon: mysqld_safe.
> Aug 23 11:42:50 ha2 crmd: [1166]: info: process_lrm_event: LRM
> operation mysqld_stop_0 (call=10, rc=0, cib-update=57,
> confirmed=true) ok
> Aug 23 11:42:50 ha2 crmd: [1166]: info: match_graph_event: Action
> mysqld_stop_0 (16) confirmed on ha2 (rc=0)
>
> And here's the relevant log output from ha1:
>
> Aug 23 11:42:47 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> key=8:1:7:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_monitor_0 )
> Aug 23 11:42:47 ha1 lrmd: [1240]: info: rsc:mysqld:5: probe
> Aug 23 11:42:47 ha1 crmd: [1243]: info: process_lrm_event: LRM
> operation mysqld_monitor_0 (call=5, rc=7, cib-update=10,
> confirmed=true) not running
> Aug 23 11:43:36 ha1 crmd: [1243]: info: do_lrm_rsc_op: Performing
> key=11:3:0:ec1989a8-ff84-4fc5-9f48-88e9b285797c op=mysqld_start_0 )
> Aug 23 11:43:36 ha1 lrmd: [1240]: info: rsc:mysqld:11: start
> Aug 23 11:43:36 ha1 lrmd: [1240]: info: RA output:
> (mysqld:start:stdout) Starting MySQL daemon: mysqld_safe.#012(See
> /usr/local/mysql/data/mysql.messages for messages).
> Aug 23 11:43:36 ha1 crmd: [1243]: info: process_lrm_event: LRM
> operation mysqld_start_0 (call=11, rc=0, cib-update=18,
> confirmed=true) ok
>
> So, ha2 stopped mysqld at 11:42:50, but ha1 didn't start mysqld
> until 11:43:36, a full 46 seconds after it was stopped on ha2.  Any
> ideas why the delay for mysqld was so long, when the MysqlIP
> resource moved almost instantly?

A couple of thoughts: are you sure both servers have the same time
(in sync)?  And on ha2, did you verify that mysqld was actually done
stopping at the 11:42:50 mark?  I don't use MySQL, so I can't say
from experience.

Just curious, but do you really want it to fail back if it's actively
running on ha2?

Could you include the output of "crm configure show" next time?  I
read that much better/quicker than the XML Pacemaker config :-)

Jake
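P.S. Pulling the suggestions above together, a trimmed-down version of
this configuration might look roughly like the following in "crm
configure show" form.  This is only a sketch: the VIP address is a
placeholder (the thread never gives it), the riloe parameters are
abbreviated, and the resource names are taken from earlier messages:

    primitive MysqlIP ocf:heartbeat:IPaddr2 \
            params ip="192.168.1.100" cidr_netmask="24"
    primitive mysqld lsb:mysqld
    # iLO addresses/credentials omitted; check the external/riloe
    # plugin metadata for its exact parameter names
    primitive stonith-ha1 stonith:external/riloe params hostlist="ha1"
    primitive stonith-ha2 stonith:external/riloe params hostlist="ha2"
    # group members start in listed order and stop in reverse, so
    # mysqld always waits for MysqlIP (no separate order constraint)
    group mysql-resources MysqlIP mysqld
    # each node's STONITH device may run anywhere except on the node
    # it fences
    location l-stonith-ha1 stonith-ha1 -inf: ha1
    location l-stonith-ha2 stonith-ha2 -inf: ha2
    # a single positive score is enough to prefer ha1; no 0-score
    # entry is needed for ha2
    location l-mysql-pref mysql-resources 200: ha1
    property stonith-enabled="true"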
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org