On 25/11/13 06:40, Michał Margula wrote:
> Hello!
>
> I wanted to ask for your help because we are having a lot of trouble
> with a cluster based on Pacemaker.
>
> We have two identical nodes - PowerEdge R510 with 2x Xeon X5650, 64 GB
> of RAM, MegaRAID SAS 2108 RAID (PERC H700) - the system disk is RAID 1
> on SSDs (SSDSC2CW060A3), plus two data volumes - one RAID 1 with
> WD3000FYYZ and one RAID 1 with WD1002FBYS -- both Western Digital
> disks. The nodes are linked with two gigabit direct fiber links (no
> switch in between).
>
> We have two DRBD volumes - /dev/drbd1 (1 TB on the WD1002FBYS disks)
> and /dev/drbd2 (3 TB on the WD3000FYYZ disks). On top of DRBD (used as
> PVs) we have LVM with LVs for the virtual machines, which run under
> Xen.
>
> Here is our CRM configuration - http://pastebin.com/raqsvRTA
>
> We previously used fast USB drives instead of SSDs for the root
> filesystem, and that caused some trouble - they lagged on I/O, so one
> node "thought" the other was in trouble and performed STONITH on it.
> After replacing them with SSDs we had no more trouble with that issue.
>
> But now, from time to time, one of the nodes gets STONITHed, and the
> reason is unclear to us. For example, last time we found this in the
> logs:
>
> Nov 23 15:14:24 rivendell-B crmd: [9529]: info: process_lrm_event: LRM
> operation primitive-LVM:1_monitor_120000 (call=54, rc=7, cib-update=124,
> confirmed=false) not running
>
> And after that, node rivendell-B got STONITHed. Previously we had
> trouble with DRBD - a node stopped DRBD for no apparent reason and,
> again, STONITH. Unfortunately we did not check the logs that time.
>
> Running some tasks on one of the nodes (for example "crm resource
> migrate" of a few Xen virtual machines) can also trigger a STONITH.
>
> Could you give us some hints? Maybe our configuration is wrong? To be
> honest, we had no previous experience with HA clusters, so we built
> this one from example configurations.
>
> It has been working for over a year now, but it keeps giving us
> headaches, and we are wondering if we should drop Pacemaker and use
> something else (even manual stopping and starting of the virtual
> machines comes to mind).
>
> Thank you in advance!
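A note on that log line: rc=7 from a monitor operation is
OCF_NOT_RUNNING - the ocf:heartbeat:LVM agent reported that the volume
group was not active, Pacemaker treated that as a failure, and recovery
escalated to fencing. You can invoke the agent by hand to see what the
monitor actually checks. A minimal sketch, assuming the agent lives in
the usual place and a volume group called vg_xen (use the volgrpname
from your actual pastebin config):

    # Run as root on the node that was fenced. OCF agents take their
    # parameters from OCF_RESKEY_* environment variables.
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_volgrpname=vg_xen   # assumed; use your real VG

    # This is the same monitor action Pacemaker runs every 120 seconds.
    /usr/lib/ocf/resource.d/heartbeat/LVM monitor
    echo "monitor rc=$?"                  # 0 = running, 7 = not running

    # Alternatively, ocf-tester (ships with resource-agents) exercises
    # the whole agent:
    ocf-tester -n LVM-test -o volgrpname=vg_xen \
        /usr/lib/ocf/resource.d/heartbeat/LVM

If the VG checks out fine when the system is idle, then that failed
monitor was probably a symptom of load at that moment, not a real LVM
problem.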
My first thought is that the network is congested. That is a lot of
virtual servers for two links to carry, on top of the replication
traffic. Do you, or can you, isolate the corosync traffic from the DRBD
traffic? Personally, I always set up a dedicated network for corosync,
another for DRBD and a third for all traffic to and from the servers.
With that layout, I have never had a congestion-based problem.

If possible, please paste all logs from both nodes, starting just
before the STONITH occurred until recovery completed.
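To make the isolation concrete, here is a minimal sketch of a second
totem ring in corosync.conf plus a DRBD resource pinned to its own
subnet. The subnets, backing disk and the rivendell-A hostname (only
rivendell-B appears in your log) are assumptions - adapt them to your
configuration:

    # /etc/corosync/corosync.conf - cluster membership traffic on its
    # own subnet, with a second ring as a backup path.
    totem {
        version: 2
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 10.10.0.0      # assumed corosync-only subnet
            mcastaddr:   226.94.1.1
            mcastport:   5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 10.30.0.0      # assumed backup subnet
            mcastaddr:   226.94.1.2
            mcastport:   5407
        }
    }

    # /etc/drbd.d/r1.res - replication on a separate subnet.
    resource r1 {
        on rivendell-A {                # assumed hostname
            device    /dev/drbd1;
            disk      /dev/sdb1;        # assumed backing device
            address   10.20.0.1:7789;   # DRBD-only subnet
            meta-disk internal;
        }
        on rivendell-B {
            device    /dev/drbd1;
            disk      /dev/sdb1;
            address   10.20.0.2:7789;
            meta-disk internal;
        }
    }

With that separation, a DRBD resync or a burst of VM migration traffic
cannot starve the totem heartbeats - which is exactly how a healthy
node ends up looking dead and getting fenced.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?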