FSF Nagios sent an alert on internal0 high load average earlier this week. Ruben notified me. I jumped in but had already missed the high load and things were in the after-event settle down return to normal behavior. But it had load spiked to 53+ within the previous few minutes.
I looked through the logs everywhere and determined that external influences were hitting the web frontend on frontend0. This was flowing through to hitting internal0 for database queries. I missed being able to figure out exactly what was hitting the web page. But if it was hitting there once it will hit there again. All we need to do is wait. The event did not cause a high load on the web frontend0 which handled it okay. The slow queries caused the MySQL server to have a high load as it forked to handle the task. Afterward I turned on the slow query log on internal0. This is available moving forward. Initially I had the time limit set at 2 seconds and it was streaming out logs of 2-second slow queries. (Which is not a good sign.) I reduced it to 4 seconds and then finally back to the 10 second default value until the stream slowed to a trickle. At 10 seconds there are always at least one per day but at least not streaming. At the point that we have time to look and improve things it needs to be reduced in steps lower and the problematic queries found and fixed. There appear to be many queries that search every row of the database. This may be unavoidable due to the web bug tracker query possibilities. But if a human is running the web interface the rate of them will be slow enough that this isn't a problem. It is only a robot that would strike the web interface fast enough for this to be a problem. * Improve the robots.txt file to keep them out of wherever they were hitting that was trouble. But I think this was a probe attack and not a well behaved robot. * In other cases it appears that some columns should have an index but do not. Another opportunity for improvement. * And of course many tables (most?) are of the old type and should be migrated to an ACID3 compliant engine that also has better locking behavior. Time is what keeps everything from happening all at once. :-) Bob