On 2013-04-02T17:02:01, David Vossel <dvos...@redhat.com> wrote:

> > Seriously, folks, the LRM rewrite may turn out not to be the best
> > example of pacemaker's attention to detail ;-)
>
> such is any re-write of poorly designed code ;-) <--- I included the smiley
> so my jab is acceptable and not in poor taste just like yours! :D <--- I
> included this smiley because I think it looks funny.

Heh. Well, I admit that the above is the toned-down remainder of a rant.

I was used to every pacemaker release being an almost boring
improvement over the previous one; that set my expectations for 1.1.8,
and for the effort we thought we could get away with before shipping it
again. When I saw 1.1.8 shaping up, I already knew we couldn't ship that
as a maintenance update, but I was (and still am) taken by surprise by
just how much effort it took to get it back into shape.

From where I stand (speaking as the guy who probably has to deal with
the largest production subset of the pacemaker community), it was the
worst pacemaker release ever.

I realize that the goal of most of the rewrites (libqb, lrmd, handling
of anonymous clones, fencing, lots of logging message changes, ...)
that went into 1.1.8 was to clean up the code to make it more
maintainable for the future. And that's a good thing.

But in the short term, the fallout wasn't nice. If you're on the side of
the rewrite equation that doesn't seem to feel any of the benefits but
mostly pain, it does create a certain tension ;-)

It also showed a couple of areas that apparently *aren't* well protected
by regression tests in pacemaker / the cluster stack, I guess.

I also realize that one of the problems is that, as soon as we realized
that we couldn't ship 1.1.8 as-is, we were forced to shift our effort to
selective backports (since we had to deal with issues at customers in
production, whom we couldn't upgrade). That meant that instead of
feeding back to 1.1.8 immediately, we came late to the party with
testing.

But the only way I can see to avoid that is keeping the changes in
pacemaker flowing at a more constant and lower rate, giving us time to
integrate and test them. 1.1.8 blew our capacity, and is probably one of
the few pacemaker releases we skipped shipping, and the first we skipped
intentionally.

And yes, I did feel frustration; that didn't seem to be a nice thing to
do to your production deployments.
(I know RHT as a company doesn't care much, because RHT doesn't support
pacemaker officially yet.)

So, basically, my frustration stems from the fact that (1.1.8 excepted,
from my PoV) pacemaker has an excellent, continuously improving release
quality, and that was what the plans and expectations were based on ;-)

> I'll add PCMK_MAX_CHILDREN to the sysconfig documentation. To be backwards
> compatible I'll have the lrmd internally interpret your LRMD_MAX_CHILDREN
> environment variable as well.
>
> sound reasonable?

That makes perfect sense, thank you.

> We should open this discussion at some point. As long as it is constructive
> criticism I doubt it will be perceived as a rant.

Well, emotions are likely to creep into it in one or two paragraphs.
Hopefully no swear words in public. ;-)

> I've mentioned to Andrew that we might need to consider doing release
> candidates. This would at least put some of the responsibility back on
> the community to verify the release with us before we officially tag
> it. We definitely test our code, but it is impossible for us to test
> everyone's possible deployment use-case.

See above. We usually can do that, but 1.1.8 was too much for us to
stomach, and too much for a "smooth upgrade" from 1.1.7 in production.

And, frankly, it took us several months of testing to get where we are
now (and yes, I am *very* grateful that once we reported the problems,
we received a lot of help from you and Andrew et al); we never before
needed as much time and effort to test a pacemaker release.

(We seriously considered not moving to 1.1.8 at all and instead
continuing SLE HA 11 as 1.1.7+backports, but by then it was already too
late for us to pull back.)

And previously, pacemaker got away with making such changes because the
PE has *excellent* regression tests, and that was where the majority of
changes happened.

On the plus side, 1.1.8 was a great learning opportunity.
;-)


Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org