On Wed, May 16, 2001, guy keren wrote about "2 years? can't be linux (was: Re: web 
server)":
> 
> On Tue, 15 May 2001, Ariel Biener wrote:
> 
> > least (it's current version had almost a 2 years uptime).
> 
> then i assume that the current machine isn't a 32-bit Linux machine, since
> its kernel's jiffies variable would have recycled after about 1.3 years,
> and as far as i understood, no one before tested what would happen in such
> a case.
>...
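
(For reference, the "about 1.3 years" figure quoted above is what you get from
a 32-bit tick counter that is incremented 100 times a second. The tick rate of
100 Hz is my assumption here - the quoted message doesn't state it. A quick
sketch of the arithmetic:)

```python
# Assumption: the counter is an unsigned 32-bit value incremented HZ times
# per second, with HZ = 100 (not stated in the quoted message).
HZ = 100
COUNTER_BITS = 32

def wrap_seconds(bits=COUNTER_BITS, hz=HZ):
    """Seconds until a `bits`-wide counter ticking at `hz` Hz wraps to zero."""
    return (2 ** bits) / hz

def wrap_days(bits=COUNTER_BITS, hz=HZ):
    """Same interval expressed in days."""
    return wrap_seconds(bits, hz) / 86400

def wrap_years(bits=COUNTER_BITS, hz=HZ):
    """Same interval expressed in years of 365.25 days."""
    return wrap_days(bits, hz) / 365.25

print(f"wraps after {wrap_days():.1f} days ({wrap_years():.2f} years)")
```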

I don't think anybody (except sysadmins who love breaking records ;)) really
cares about this "uptime" figure. Having an uptime of 5 years also implies
(usually) that you haven't upgraded the kernel in 5 years... Also, there are
good reasons to recommend rebooting the machine every few months: not for
the Windows reason ("maybe if I reboot the machine all the memory leaks will
go away and suddenly everything will work better") but for an administrative
reason - some things are done only after a reboot (e.g., reading the /etc/rc.d
files) and if you accidentally ruined one of them, you wouldn't find out until
the next reboot - which could be accidental in the middle of the night, when
you're sleeping and not able to fix the problem immediately.

Anyway, I think what you really care about more than contiguous uptime is
the fraction of uptime. For example, for a system to have 4-nine reliability
(up 99.99% of the time), it needs to be down less than about 53 minutes a year.
This is in fact doable with a Linux machine (with good hardware), but not easy.
You are most likely not going to achieve 4-nine reliability anyway because
your ISP's electrical system is likely to be less reliable than that, and
its Internet connection is even worse (probably not better than 2-nines, 99%
uptime). If you really want to achieve 4 nines on a web-server, you'll
need redundant machines, redundant Internet suppliers, a private power
supply, and a whole lot of stuff you probably can't afford.
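
To make the arithmetic above concrete, here is a rough sketch. The function
names are made up for illustration, and the series/redundant formulas assume
that failures are independent, which real ISPs and power feeds often are not:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)

def downtime_budget_minutes(availability):
    """Minutes per year a system may be down and still meet `availability`."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def series(*availabilities):
    """Availability of a chain in which every link must be up.

    Assumes the links fail independently of each other.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant(availability, copies):
    """Availability when any one of `copies` independent replicas is enough."""
    return 1.0 - (1.0 - availability) ** copies

# A 4-nine machine behind a 2-nine Internet link is roughly a 2-nine service.
print(round(downtime_budget_minutes(0.9999), 2), "minutes/year for 4 nines")
print(round(series(0.9999, 0.99), 4), "machine and link combined")
# Two truly independent 2-nine links would, on paper, get back to 4 nines.
print(round(redundant(0.99, 2), 4), "two redundant links")
```

The weakest link dominates the chain, which is why the redundant suppliers
and power come before anything you do to the machine itself.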

Anyway, from my experience running a Linux web (and other Internet services)
server for the last 4 years (the one running iguide.co.il, for example),
the most significant reasons for downtime are, in decreasing order:

 1. Overload. When you have limited hardware (in our case, an old Pentium
    with little memory) and no deep-pocketed company to upgrade it for you,
    your machine can be brought to its knees by a combination of too many
    clients and a poorly designed server setup. For example, a couple of
    months ago we had a problem with several search engines indexing a heavily
    modperl-ed part of our site concurrently, causing the amount of memory
    needed to be 3 times what we actually had, and bringing the entire machine
    to a grinding halt, swapping in and out for several hours (ivrix-discuss
    members saw this slowdown in action). Redesigning this part of the server
    (to use a separate perl process, for example) let us serve the search
    engines properly again, and now our puny little Pentium is handling the
    load properly, and we didn't need to upgrade the hardware.
    Overload can also come from malicious attacks - we've had to contend with
    everything from SYN floods to mail floods. Apparently some crackers just
    love to attack Israeli machines, and "iguide.co.il" looks like a good
    candidate ;)

 2. Break-ins. Making your home PC into an impenetrable fortress is relatively
    easy - just deactivate every service and put up a firewall blocking almost
    everything, just in case. But on an Internet server you simply cannot do
    that - you are deliberately running stuff like httpd, ftpd, named, sendmail,
    sshd, etc., and cannot deactivate them. So you have to be very professional,
    and very timely in installing all the latest bug-fix patches, and be
    aware of everything going on in the cracker community (subscribing to the
    bugtraq mailing list is a good start). Also, if you can, strictly limit
    the number of people who have shell accounts on the machine, and disable
    any cleartext-password service (ftp, pop3, etc.) for them. If you have
    physical access to the machine's network, you can enable logins only from
    the local network, but this is not possible when you are colocating your
    machine.
    Every time your machine is broken into (and it takes a very professional
    and knowledgeable sysadmin to find out that it even happened!), you may
    be facing at least 10-20 hours of downtime (unless you have a team of
    very experienced people doing the recovery, or are not intending to
    collect any evidence).

 3. Electrical failure - even in Netvision's very reliable server room, with
    its UPSs, etc., we've had one multi-hour blackout in the last year (caused,
    if I remember correctly, by some backhoe accident in the Haifa MATAM area,
    out of Netvision's control).

 4. Distribution upgrade - this usually means 2-5 hours of downtime. To
    avoid this downtime, we are still running Red Hat 5.2 on our server,
    upgrading the different packages individually and incrementally (especially
    packages with security fixes).

 5. Hardware upgrade - usually can be done in less than an hour, if you
    know what you're doing.

 6. Kernel upgrade - if you know what you're doing, this can be as little as
    5 minutes of downtime.
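
To put these durations next to the 4-nine budget from earlier, here is a small
sketch that asks what yearly availability you'd get if each event happened
exactly once and nothing else went wrong. The durations are the rough
estimates from the list above; the blackout is left out because I only
described it as "multi-hour". The function names are made up for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (leap years ignored)

# (event, low estimate, high estimate) in minutes, from the list above.
EVENTS = [
    ("break-in recovery", 10 * 60, 20 * 60),
    ("distribution upgrade", 2 * 60, 5 * 60),
    ("hardware upgrade", 60, 60),  # "less than an hour", used as upper bound
    ("kernel upgrade", 5, 5),
]

def availability(downtime_minutes):
    """Fraction of a year the machine is up, given total minutes of downtime."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR

def fits_budget(downtime_minutes, target):
    """True if this much downtime in a year still meets `target` availability."""
    return downtime_minutes <= (1.0 - target) * MINUTES_PER_YEAR

for name, low, high in EVENTS:
    verdict = "fits" if fits_budget(high, 0.9999) else "blows"
    print(f"{name:22s} {availability(high):.4%} to {availability(low):.4%}"
          f"  ({verdict} a 4-nine year)")
```

Only the kernel upgrade fits inside a 4-nine year by itself; a single break-in
recovery is enough to miss even 3 nines.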

-- 
Nadav Har'El                        |    Wednesday, May 16 2001, 23 Iyyar 5761
[EMAIL PROTECTED]             |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |A man is incomplete until he is married.
http://nadav.harel.org.il           |After that, he is finished.
