On Thu, 4 Nov 2010, Paul Graydon wrote:

> On 11/03/2010 10:24 PM, da...@lang.hm wrote:
>> On Wed, 3 Nov 2010, Paul Graydon wrote:
>>
>>> On 11/3/2010 6:32 PM, da...@lang.hm wrote:
>>>> On Wed, 3 Nov 2010, Paul Graydon wrote:
>>>>
>>>>> I'm facing an interesting challenge at the moment. Our Apache httpd-based
>>>>> load balancers are starting to show signs of strain. Nothing too bad, but a
>>>>> good indicator that as the amount of traffic to our sites increases there
>>>>> will come a point when they can't cope. I've been expecting this, but at the
>>>>> moment, as a "Standalone" sysadmin, I've got too much on my plate to even
>>>>> get on to anything proactive that requires more than a few hours' work...
>>>>> with inevitable consequences, though I'm making favourable progress. Load is
>>>>> now reaching a stage where it's spawning enough httpd sessions to be of some
>>>>> concern, and at a level that seems to be resulting in latency for requests.
>>>>
>>>> check what you have set for your ssl session cache; if it's not in shared
>>>> memory, move it there (the overhead of filesystem operations for a
>>>> disk-backed cache, even if you almost always operate out of ram disk
>>>> buffers, can be noticeable at high traffic levels).
>>>
>>> Hmm... /var/cache/mod_ssl. That's definitely something that can be easily
>>> moved. Thanks.
>>>>
>>>> Definitely measure where your latency is happening. It could be that apache
>>>> is the problem, but it could also be that you are running into something
>>>> else.
>>
>> take a look at ab (apache benchmark, part of apache) and httperf as tools to
>> throw load at the system, and create a custom log format that logs the
>> performance stats (%D among others). you need more info to see what's going
>> on.
>
> A (weird, but effective) anti-scraper script that runs on the box collects
> certain data and throws it off to a back-end database (extremely lightweight,
> not tied into apache directly, and run at low priority). While it was grabbing
> response times, it wasn't actually pushing the data into the database. A quick
> modification this morning and I'm now logging the relevant data. The left
> column is the average response time in seconds, the right the hostname (names
> altered to protect the innocent). Web1 and 2 are both older boxes that handle
> the bulk of the traffic currently; web5 is a new box that is slowly getting
> stuff migrated to it once various bits of testing have been carried out
> (different JVM and a few other bits). Web4 doesn't actually host live sites;
> it's got a CMS on there that generates static content which web1 & 2 host.
>
> | 0.03180483 | web5 |
> | 0.11877206 | web2 |
> | 0.12424236 | web3 |
> | 0.14441832 | web1 |
> | 0.21145667 | web4 |
>
> Those look like pretty reasonable response times to me, and I'm happy with
> that in general. A quick look at css and jpg files specifically shows < 0.09s
> responses, mostly even < 0.009s.
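on the ssl session cache point above, so it doesn't get lost: the change is
normally just a couple of lines in the mod_ssl config. a rough sketch, assuming
a 2.2-era mod_ssl with the shmcb provider available (the path and cache size
below are placeholders, check what your distro's ssl.conf actually ships with):

  # dbm/disk-backed cache (roughly what /var/cache/mod_ssl suggests is in use now)
  #SSLSessionCache        dbm:/var/cache/mod_ssl/scache
  # shared-memory cache, no filesystem traffic on each handshake
  SSLSessionCache         shmcb:/var/cache/mod_ssl/scache(512000)
  SSLSessionCacheTimeout  300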
do you have anything logging the apache response time? I believe that %D in an
apache log is how long apache thinks it spent on the hit. what I have asked our
production folks to do is to log %D, %k, %X, and %B somewhere (I'd have to look
up exactly what these are, but when I asked, I thought they would all be
useful).

>> also look at autobench; it uses httperf to run a series of tests against the
>> box, increasing the number of simultaneous connections so you can map where
>> the box (or in your case, system of boxes) starts to fail. then you can try
>> changing things and see that number shift.
>>
>>>> how many processes are you seeing that is making you concerned?
>>>
>>> I couldn't give you a solid figure, but based on memory usage compared to
>>> current I'd guesstimate at 120+, and I swear we're not doing that much
>>> traffic. I've added that to zabbix so I'll have a better idea tomorrow. Even
>>> now, during what is a quiet time for us, I'm bouncing between 50 and 80,
>>> tuned:
>>>
>>> StartServers 8
>>> MinSpareServers 5
>>> MaxSpareServers 20
>>> ServerLimit 256
>>> MaxClients 512
>>> MaxRequestsPerChild 20000
>>
>> set maxspare servers _much_ higher. if this box is dedicated to the task, set
>> the maxspare to the same as your serverlimit, and set startservers pretty
>> high as well.
>>
>> even with the phenomenal forking speed that linux has, there is still a lot
>> of overhead in starting and stopping an apache process (less in the fork
>> itself than in all the other setup for starting, but it hurts). just avoiding
>> the thrashing can gain you quite a bit.
>>
>> also install the sysstat package and run iostat during heavy load to see what
>> your disk I/O looks like; you may be surprised at how much you are hitting
>> it.
>>
>> depending on what you are using it for and how you have it configured, apache
>> can handle anywhere from hundreds of connections/sec to 10s of thousands of
>> connections/sec on the same hardware (admittedly, at 10s of thousands all it
>> can do is serve static or cached content, but sometimes that's what you
>> need).
>>
>> David Lang
>
> I've followed your advice and bumped maxspare to the same as ServerLimit (and
> fixed that MaxClients entry). CPU usage dropped a good 10% on each core, and
> we seem to have peaked at 216.

you can probably set the limits higher than 256. back in the apache 1.3 days
that was the limit; nowadays I think the compile-time limit is 2048. on my
production servers with 8G of ram we have this bumped up to 4096, and we have
some fairly heavy CGIs running there, so while you should check the available
ram at peak load, I expect that you can probably bump these up to 1024 with the
2G of ram you have on these boxes (there's a sketch of the resulting prefork
block in a P.S. at the bottom of this mail).

if you are hitting 216 under peak load and your limit is 256, I would be
getting pretty concerned. if you find that you can go to 1024 and are at 216, I
wouldn't be (unless you find that under peak load your overall response time is
getting worse).

try to get iostat and vmstat data for your peak times. something that's very
handy to do is to have a script do

  iostat -x 60 | logger -t iostat &
  vmstat 60 | logger -t vmstat &

this is pretty low overhead (one vmstat line and five iostat lines, plus one
line per disk/partition, per minute) and gives you data that can be invaluable
in tracking things down after the fact. because it goes to syslog, if you have
your syslog being sent off the box and a box dies due to performance issues
(running out of memory and paging, for example), you still have info that can
tell you what was happening leading up to the crash.
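as a concrete example of digging that data back out later (assuming a stock
syslog setup writing to /var/log/messages; adjust the path for your distro),
the -t tags above make it easy to grep for:

  # every vmstat sample this box has logged
  grep ' vmstat: ' /var/log/messages
  # and the per-disk stats
  grep ' iostat: ' /var/log/messages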
This isn't a slick central monitoring tool like nagios or many others, but in
many cases I've found this simple data to be easier to search and spot patterns
in.

another similar thing I have is the following script:

  ps ax | wc -l | while read junk; do echo -n "$junk="; done
  ps ax | cut -c 28- | sort | uniq -c | sort -rn | head | while read junk; do echo -n "$junk="; done
  echo

sending this to logger puts a line in the log that tells me how many processes
are running, and what the 10 most common processes are.

> I'll be keeping an eye on it over the next few days and tweaking it some more.
> Tests of MPM Worker in the dev environment are fairly encouraging; it doesn't
> look like anything is broken so far, but it'll need at least another couple of
> weeks before I'll be happy with it. Is it generally considered worth pursuing,
> or am I wasting my time with it?

it depends on your workload and operating system. Linux handles processes much
more efficiently than just about any other OS, and as a result there is less of
a benefit in moving to threads.

personally, I've never found the performance difference to be significant
enough to consider it worth the fragility that going to threads brings (with
multiple processes, if one gets corrupted it dies and apache starts a
replacement; with multiple threads, if one gets corrupted it can take down
everything). I'm probably overly sensitive to this, but I've never been pushed
to the point of really needing it.

David Lang
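P.S. to pull the prefork numbers from earlier into one place, the shape of what
I'm suggesting looks something like the block below. treat the values as
placeholders to sanity-check against your free ram at peak rather than a
recommendation; the main point is that MaxSpareServers matches ServerLimit so
apache isn't constantly killing and re-forking children:

  <IfModule prefork.c>
      StartServers        128
      MinSpareServers      64
      MaxSpareServers     512
      ServerLimit         512
      MaxClients          512
      MaxRequestsPerChild 20000
  </IfModule>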