On Sat, Apr 23, 2011 at 11:35 PM, Mike Meyer <m...@mired.org> wrote:
> On Sat, 23 Apr 2011 23:19:53 -0400
> Ken Wesson <kwess...@gmail.com> wrote:
>
>> On Sat, Apr 23, 2011 at 8:13 PM, Mike Meyer <m...@mired.org> wrote:
>> > On Sat, 23 Apr 2011 19:41:28 -0400
>> > Ken Wesson <kwess...@gmail.com> wrote:
>> > or you live in a universe where cosmic rays can flip bits and other
>> > sources of hardware hiccups exist.
>> Software crashes caused by non-software-bug-triggered memory
>> corruption seem to me to be exceedingly rare, and they could as easily
>> strike critical parts of the operating system as a multithreaded
>> server program (and a large batch of independent C jobs will occupy
>> more memory and have a correspondingly larger cross section as a
>> target for such things).
>> The best recourse if the server gets hit by something like that is
>> going to be to reboot it.
>
> While it might be exceedingly rare on a per-cpu-second basis, if your
> application runs 7x24 on enough cpus, you can expect to see them at
> regular intervals. In which case the best recourse - if you want a
> stable, robust application - is to restart the smallest set of
> processes that might have been affected by the problem.
In other words, all of them: the operating system might have been
affected by such a problem, and if it was, everything else is probably
affected too.

> In still others, it's the set of servers participating in a distributed
> application.

With distributed applications that share data, you need some sort of
verification. That might be a master DB that they all talk to, one that
is ACID and has various consistency constraints set. Or it might mean
farming out jobs to multiple machines and using a majority-rules
mechanism to decide which one(s) returned the "correct" answer (in the
case of using distributed computing to speed up a pure-functional
computation).

For many applications (e.g. web servers), occasional hiccups are
non-fatal so long as crucial disk files don't get trashed. In that case,
having the hard drives backed up regularly is the main defense against
disasters that can't be fixed by rebooting one machine, whether the
server is a single machine or a load-balanced (i.e. distributed) server
farm.
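To make the majority-rules idea concrete, here is a minimal sketch in Python. It assumes the same pure-functional job has already been run on several machines and that the results are hashable values; the name `majority_result` is hypothetical, and a real system would also have to handle workers that time out or never answer.

```python
from collections import Counter

def majority_result(results):
    """Return the value reported by a strict majority of workers.

    `results` is a list of hashable values, one per machine that ran
    the same pure-functional job. Returns None when no single value
    was reported by more than half of the workers.
    """
    if not results:
        return None
    # most_common(1) yields the (value, count) pair with the highest count.
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Three machines ran the same job; one was hit by a flipped bit.
print(majority_result([42, 42, 41]))
# No agreement at all, so the job should be rerun.
print(majority_result([1, 2, 3]))
```

Requiring a strict majority, rather than just taking the most common answer, means that a split such as two machines against two is treated as "no answer" instead of being decided arbitrarily.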