* All the servers are made by the same manufacturer (Dell)
* They are all the same model (R410)
* They have the same engine (24 cores, 24G RAM, SAS Drives)
The R410 is a two socket Xeon box with a maximum of 2 x 6 core CPUs. The
24 CPUs you see are the result of HyperThreading being enabled. I'd
disable HT if I were you, or if those boxen were mine.
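To check how many of the reported CPUs are real cores rather than HyperThreading siblings, something like the following could be used. It is only a sketch: the function name `count_cpus` is made up for illustration, and it assumes an x86 Linux `/proc/cpuinfo` layout in which each CPU block carries `processor`, `physical id` and `core id` fields. Virtual machines and other architectures may not expose the last two.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Assumption: blocks are separated by blank lines and contain the
    'processor', 'physical id' and 'core id' fields (x86 Linux layout).
    """
    logical = 0
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            logical += 1
            # HT siblings share the same (socket, core) pair, so the set
            # collapses them into a single physical core.
            cores.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(cores)


# Example use on a Linux host:
#   with open("/proc/cpuinfo") as fh:
#       logical, physical = count_cpus(fh.read())
#   print(logical, "logical CPUs on", physical, "physical cores")
```

On an R410 with two 6 core CPUs and HT enabled, this would be expected to report 24 logical CPUs on 12 physical cores.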
OK, I'll take a look at this, thanks.
* The motorway is exactly the same for all servers (NFS to a NetApp 6080
and a RAMSAN)
* The weather is almost exactly the same (Same Datacentre, different
rooms/racks)
* The Driver is exactly the same (Dovecot 1.0.15)
What operating system? Linux or *BSD? If Linux, what kernel version?
Given that you're running Dovecot 1.0.15 I'm guessing you're using
CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years
old now and not appropriate for a modern 2 socket, 6 core
HyperThreading box. You need a much newer kernel, preferably in the
2.6.3x series. 2.6.18 could be reporting incorrect load numbers on
these machines.
Linux, CentOS 5.6 and (yup, you've guessed it...) 2.6.18 again. I'll
take a look at this, thanks.
1) Load Average
On Linux, load average strictly shows total system CPU usage in
intervals, nothing else. Neither memory, disk, network, nor anything
else affects load average. Thus, with a 12 core system, until you see a
load average above 12 you have absolutely nothing to worry about. With
HT enabled load averages pretty much go out the window as half the
"CPUs" are merely glorified duplicate register file phantoms.
Given that all mail apps are 100% IO bound, never CPU or memory bound,
I'd guess you'll never see a load average over 4.00 on any of these
machines with fewer than 1000 concurrent connections. This assumes you
run a newer kernel with HT disabled. In other words, no more than 4
cores worth of CPU time will ever be eaten by your workload. What
number do your Munin graphs show for load average for each set of boxes?
Do they even come close to 4?
They're showing as between 20 and 24 for the POP3 servers and 1.4 for
the IMAP servers.
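The rule of thumb above (compare the load average against the number of physical cores) can be written down as a quick check. This is only a sketch that applies the rule as stated: the helper name `load_per_core` is made up for illustration, the core count of 12 comes from the 2 x 6 core figure earlier in the thread, and the load values are the Munin numbers just quoted.

```python
def load_per_core(load_average, physical_cores):
    """Load average divided by the number of physical cores.

    By the rule of thumb in this thread, a result below 1.0 means the
    load average is still under the core count.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return load_average / physical_cores


PHYSICAL_CORES = 12  # 2 sockets x 6 cores, HT siblings not counted

# Figures reported from the Munin graphs in this thread.
for name, load in (("POP3 (upper figure)", 24.0), ("IMAP", 1.4)):
    ratio = load_per_core(load, PHYSICAL_CORES)
    print(f"{name}: load {load} is {ratio:.2f} per physical core")

# On a live host the 1, 5 and 15 minute averages can be read with
# os.getloadavg() from the standard library (Unix only).
```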
Also note that TCP stack processing on the pop nodes will be greater
than that of the imap boxes, eating more CPU cycles. More data sent
over the wire means more packets, more packets means more CPU time in
both code/data processing and interrupts. If you're running iptables
rules on each host, that bumps up network processing cycles a bit more yet.
OK, I'll take a look at that as well
2) RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes
running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache. Thus
you don't have a memory issue on either host. The kernel will free and
immediately reassign pages from cache to application processes as
needed. I don't see evidence of the pop machine using more memory; in
fact, the imap processes are using more. Both boxes are just under 24GB
total usage and both are using right at 20GB of cache. Judging by the
ultra aggressive caching that eats up nearly all memory, this looks like
a default config Linux kernel.
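To separate buffer cache from memory actually held by processes, the figures in `/proc/meminfo` can be split up roughly as follows. This is a sketch under assumptions: the function name `memory_breakdown` is made up for illustration, and "application memory" is approximated as MemTotal minus MemFree, Buffers and Cached, which ignores finer points such as shared memory and slab.

```python
def memory_breakdown(meminfo_text):
    """Split /proc/meminfo-style text into cache, free and application kB.

    Assumption: application memory is approximated as
    MemTotal - MemFree - (Buffers + Cached).
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])  # first token is the kB count
    total = values["MemTotal"]
    free = values["MemFree"]
    cache = values.get("Buffers", 0) + values.get("Cached", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "cache_kb": cache,
        "apps_kb": total - free - cache,
    }


# Example use on a Linux host:
#   with open("/proc/meminfo") as fh:
#       print(memory_breakdown(fh.read()))
```

If `cache_kb` accounts for most of `total_kb` while `apps_kb` stays small, the box is in the situation described above: nearly everything is cache that the kernel can hand back to processes on demand.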
So a kernel update is more than sensible...
It may have been. I'll know when you post your load numbers from those
top secret graphs. ;)
LOL, see above.
Thanks again,
Matt