Paul Rubin <http://[EMAIL PROTECTED]> wrote:
   ...
> Hi Alex, I'm a little confused: does Production Systems mean stuff
> like the Google search engine, which (as you described further up in
> your message) achieves its reliability at least partly by massive
> redundancy and failover when something breaks?

The infrastructure supporting that engine (and other things), yes.

> In that case why is it so important that the software be highly
> reliable?  Is a software

Think "common-mode failures": if a program has a bug, so do all
identical copies of that program.  Redundancy works for cheap hardware
because the latter's "unreliability" is essentially free of common-mode
failures (when properly deployed): it wouldn't work against a design
mistake in the hardware units.  Think of the famous Pentium division
bug: no matter how many redundant but identical such buggy CPUs you
place in parallel to compute divisions, in the error cases they'll all
produce the same wrong results.  Software bugs generally work (or,
rather, fail to work;-) similarly to hardware design bugs.

There are (for both HW and SW) also classes of mistakes that don't
quite behave that way -- "occasional glitches" that are not necessarily
repeatable and are heavily state-dependent ("race conditions" in buggy
multitasking SW, for example; and many more examples for HW, where
flaky behavior may be triggered by, say, temperature conditions).
Here, from a systems viewpoint, you might have a system that _usually_
says that 10/2 is 5, but once in a while says it's 4 instead (as
opposed to the "Pentium division bug" case, where it would always say
4) -- this is much more likely to be caused by flaky HW, but might
possibly be caused by the SW running on it (or by the microcode in
between -- not that it makes much of a difference one way or another
from a systems viewpoint).

Catching such issues can, again, benefit from redundancy (and from
monitoring, "watchdog" systems, health and sanity checks running in
the background, &c).  "Quis custodiet custodes" is an interesting
problem here, since bugs or flakiness in the monitoring/watchdog
infrastructure have the potential to do substantial global harm; one
approach is to invest in giving that infrastructure an order of
magnitude more reliability than the systems it oversees (for example
by using more massive and *simple* redundancy, and extremely
straightforward architectures).  There's ample literature on the
matter, but it absolutely needs a *systems* approach: focusing just on
the HW, just on the SW, or just on the microcode in between;-), just
can't help much.
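(To put the common-mode point in concrete, if toy, terms -- this is
just a little illustrative snippet I'm making up on the spot, nothing
to do with any real infrastructure: five identical copies of a routine
that all share one design bug will happily out-vote the truth, while
five copies that only glitch independently of each other do get
straightened out by the vote.)

    import random
    from collections import Counter

    def buggy_divide(a, b):
        # stand-in for a design bug: *every* copy gets 10/2 wrong in
        # exactly the same way (a common-mode failure)
        return 4 if (a, b) == (10, 2) else a // b

    def flaky_divide(a, b):
        # stand-in for an occasional glitch: each copy is sometimes
        # wrong, but the copies fail independently of one another
        return a // b - 1 if random.random() < 0.1 else a // b

    def vote(replicas, a, b):
        # run every redundant replica and keep the majority answer
        tally = Counter(f(a, b) for f in replicas)
        return tally.most_common(1)[0][0]

    print(vote([buggy_divide] * 5, 10, 2))  # always 4: voting can't help
    print(vote([flaky_divide] * 5, 10, 2))  # almost always 5: here it does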
> some good hits they should display) but the server is never actually
> down, can you still claim 100% uptime?

I've claimed nothing (since all such measurements and methodologies
would no doubt be considered confidential unless and until cleared for
publication -- this has been done for a few whitepapers about some
aspects of Google's systems, but never, to the best of my knowledge,
for the "metasystem" as a whole), but rather pointed to
<http://uptime.pingdom.com/site/month_summary/site_name/www.google.com>,
a publicly available site which does publish its methodology (at
<http://uptime.pingdom.com/general/methodology>); summarizing: since
they have no way to check that the results are "right" for the many
sites they keep an eye on, they rely on the HTTP result codes (as well
as the validity of the HTTP headers returned, and of course whether
the site returns a response at all).
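(For concreteness' sake, that methodology boils down to roughly the
following kind of probe -- a few lines of illustrative Python,
obviously not pingdom's actual code, which runs from many locations,
records response times, &c; the URL and timeout here are just
placeholders:)

    import urllib.error
    import urllib.request

    def site_is_up(url, timeout=10):
        # The probe can't know whether the *results* are right, so it
        # settles for: did the server answer at all, and with a
        # non-error HTTP status code?
        try:
            resp = urllib.request.urlopen(url, timeout=timeout)
        except urllib.error.HTTPError:
            return False        # answered, but with a 4xx/5xx status
        except (urllib.error.URLError, OSError):
            return False        # no (valid) response at all
        return 200 <= resp.getcode() < 400

    print(site_is_up("http://www.google.com/"))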
> problem.  Of course then there's a second level system to manage the
> restarts that has to be very reliable, but it doesn't have to deal
> with much weird concocted input the way that a public-facing internet
> application has to.

Indeed, Production Systems' software does *not* "have to deal" with
input from the general public -- it's infrastructure, not user-facing
applications (except in as much as the "users" are Google engineers or
operators, say).  IOW, it's *exactly* the code that "has to be very
reliable" (nice to see that we agree on this;-), and therefore, if, as
you then said, "Russ's point stands", it would NOT be in Python -- but
it is.  So, I disagree about the "standing" status of his so-called
"point".

> Therefore I think Russ's point stands, that we're talking about a
> different sort of reliability in these highly redundant systems, than
> in the systems Russ is describing.

Russ specifically mentioned *mission-critical applications* as being
outside of Python's possibilities; yet search IS mission critical to
Google.  Yes, reliability is obtained via a "systems approach",
considering HW, microcode, SW, and yet other issues such as power
supplies, cooling units, network cables, etc -- not as a single opaque
big box, but as an articulated, extremely complex and large system
that needs testing, monitoring, watchdogging, etc, at many levels;
there is no other real way to make systems reliable (you can't do it
by just looking at components in isolation).

Note that this does have costs, and therefore it needs to be deployed
selectively: the whole panoply may be chosen to be arrayed only for
what an organization considers to be its truly mission-critical
applications -- Google's "pillars" used to be Search and Ads, but they
have more recently been officially declared to be "Search, Ads and
Apps" (cfr <http://searchengineland.com/070511-092730.php> and links
therefrom for some discussion about this), which may have
implications... but Python remains at the core of many aspects of our
reliability strategy.


Alex
--
http://mail.python.org/mailman/listinfo/python-list