On Sat, Apr 23, 2011 at 11:35 PM, Mike Meyer <m...@mired.org> wrote:
> On Sat, 23 Apr 2011 23:19:53 -0400
> Ken Wesson <kwess...@gmail.com> wrote:
>
>> On Sat, Apr 23, 2011 at 8:13 PM, Mike Meyer <m...@mired.org> wrote:
>> > On Sat, 23 Apr 2011 19:41:28 -0400
>> > Ken Wesson <kwess...@gmail.com> wrote:
>> > or you live in a universe where cosmic rays can flip bits and other
>> > sources of hardware hiccups exist.
>> Software crashes caused by non-software-bug-triggered memory
>> corruption seem to me to be exceedingly rare, and they could as easily
>> strike critical parts of the operating system as a multithreaded
>> server program (and a large batch of independent C jobs will occupy
>> more memory and have a correspondingly larger cross section as a
>> target for such things).
>> The best recourse if the server gets hit by something like that is
>> going to be to reboot it.
>
> While it might be exceedingly rare on a per-cpu-second basis, if your
> application runs 7x24 on enough cpus, you can expect to see them at
> regular intervals. In which case the best recourse - if you want a
> stable, robust application - is to restart the smallest set of
> processes that might have been affected by the problem.

In other words, all of them, since the operating system might have
been affected by such a problem, and if it was, everything else is
probably affected too.

> In still others, it's the set of servers participating in a distributed 
> application.

With distributed applications that share data, you need some sort of
verification. That might be a master DB they all talk to that is ACID
and has various consistency constraints set; or it might be farming
each job out to multiple machines and using a majority-rules mechanism
to decide which one(s) returned the "correct" answer (in the case of
using distributed computing to speed up a pure-functional
computation). For many applications (e.g. web servers), occasional
hiccups are non-fatal so long as crucial disk files don't get trashed;
then, having the hard drives backed up regularly is the main defense
against disasters that can't be fixed by rebooting one machine,
whether the server is a single machine or a load-balanced (i.e.
distributed) server farm.
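
To make the majority-rules idea concrete, here is a rough sketch in
Python (chosen only for illustration). Everything in it is an
assumption of mine rather than an existing system: the function names
are made up, the "machines" are simulated by plain functions, and a
flipped bit is faked with an XOR. It only works for results that can
be compared and hashed, which is the case for a pure-functional
computation returning plain values.

```python
from collections import Counter


def majority_result(results):
    """Return the value reported by a strict majority, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Require more than half of the replies to agree; a tie is not a majority.
    return value if count > len(results) / 2 else None


def run_redundantly(job, workers):
    """Send the same job to every worker and keep the majority answer."""
    return majority_result([worker(job) for worker in workers])


# Hypothetical stand-ins for remote machines running the same pure function.
def healthy_worker(n):
    return sum(range(n))


def flipped_bit_worker(n):
    # Simulates a hardware hiccup: one bit of the result is flipped.
    return sum(range(n)) ^ (1 << 3)


answer = run_redundantly(10, [healthy_worker, flipped_bit_worker, healthy_worker])
```

With three workers this tolerates one bad answer; if no strict
majority exists, the caller gets None and would presumably rerun the
job.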

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
