>>> "EXTERNAL Konold Martin (erfrakon, RtP2/TEF72)" <[email protected]> wrote on 16.08.2012 at 17:54 in message <[email protected]>:
[...]
> From my experience with SLES11 SP2 (with all current updates) I conclude
> that actually nobody is seriously running SP2 without local bugfixes.

Unfortunately that's true for SP1 as well: we had to use a newer corosync
(among other things).

> E.g. even the most simple examples from the official SuSE documentation
> don't work as expected.
>
> A trivial example: ocf:heartbeat:exportfs as distributed by SuSE with SP2
> causes unlimited growth of .rmtab files (which quickly reach gigabytes on
> serious NFS servers). I could work around this issue using some shell
> scripting.

Yes, we had that for SP1, too. It was fixed in
"resource-agents-3.9.2-0.4.2.1.4061.0.PTF.754067" (just for reference).
Unfortunately the problem only shows up when the NFS server is seriously used.

> There are other issues which are more than annoying and actually make the
> SLES SP2 HA Extension unusable for production systems. E.g. clvmd cannot be
> made less verbose from the cluster configuration. (No, daemon_options="-d0"
> does not help!)

I haven't tried that, but it's on the agenda.

> Not funny is also the fact that the official SLES 11 SP2 kernels crash
> seriously (when a node rejoins the cluster) when using SCTP as recommended
> in the SLES HA documentation and offered via the wizards. It took me a
> while to find out what was going on.

No, we did not have these bugs, but we had a crashing crmd, and a two-node
cluster that could not agree on who was DC for several minutes.

> When setting up a system with many (rather simple) resources, funny things
> happen due to race conditions all over the place. (These can mostly be
> worked around using arbitrary start-delay options.)
>
> Oh, did I mention that situations which are actually forbidden by
> constraints (e.g. using a score of INFINITY) actually do happen... Depending
> on the environment this can lead to not-so-funny effects.
>
> E.g.
> I defined the following constraints:
>
> colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> order o34 inf: p_fs_daten p_lsb_ccslogserver:start
>
> I can prove from the logs that ccslogserver (an application) got migrated
> from node A to node B while p_fs_daten (a filesystem on top of drbd) was
> definitely still running on node A.

I'm absolutely no expert on that, but I think your constraints will allow
p_fs_daten to be active on one node while p_lsb_ccslogserver is going down
(being migrated). Only before starting p_lsb_ccslogserver must p_fs_daten
be up; until then the colocation is probably ignored. I'm also unsure
whether transitive ordering and colocation work.

What also disappointed me: when you add stickiness to primitives, a group
gets more or less the sum of its primitives' stickiness, but when you add
stickiness to a group, EVERY primitive gets that stickiness, and the group
STILL gets the sum of all of these. This is especially bad because adding
one more primitive to a group changes the total stickiness. Likewise, if
you use resource utilization on primitives in a group, the group begins to
start on one node, then stalls when the next primitive's utilization cannot
be fulfilled. That's especially bad when there are enough resources for the
whole group on another node. (Here utilizations are not summed.)

Some concepts seem to have been implemented very "ad hoc". And one of the
popular cluster books describes the XML configuration. That's like
describing how to start the engine of your car as: open the hood, locate
the battery and the starter motor, take a pair of wires, connect one end to
the battery and the other end to the starter motor, watching for correct
polarity, then... (you get it).

The best tool around is the crm shell (IMHO), while the GUI has
extraordinarily poor performance once your cluster has a reasonable number
of resources. There is an access control concept (ACLs) based on XPath.
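For illustration, such an XPath-based ACL looks roughly like this in the CIB. This is only a sketch using the pacemaker 1.1 "legacy" ACL element names (acl_role, acl_user, role_ref); the exact schema may differ in your version, so check the RNG schema shipped with your pacemaker packages:

```xml
<!-- Sketch only: element names follow the pacemaker 1.1 legacy ACL schema.
     ACLs also have to be switched on via the enable-acl cluster property. -->
<acls>
  <acl_role id="monitor-role">
    <!-- read-only access to the whole CIB, addressed by XPath -->
    <read id="monitor-read-cib" xpath="/cib"/>
  </acl_role>
  <acl_user id="monitoruser">
    <!-- the user inherits the permissions of the role -->
    <role_ref id="monitor-role"/>
  </acl_user>
</acls>
```

To restrict a user to parts of the CIB instead, the xpath attribute would have to match exactly the right elements, which is where the problem below comes in.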
Unfortunately, to really implement proven access restrictions, you would
have to describe the data model of the CIB exactly. It's a bit
complicated...

> Reporting bugs is not possible without a direct support contract. (You
> must enter into a support contract with SuSE before you can even report a
> bug or provide a patch...)

Yes: I found out that there is no mechanism to repair non-clustered MD
RAIDs, so I wrote a RAID monitor and proposed it to support. I still
haven't heard any feedback about it...

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
