On 2/25/2011 4:38 PM, Robert Goodyear wrote:
> I'm going to run some analytics on my last 12 months' worth of outbound messages to get more scientific with my gut instincts here. It's about 270 million messages, and my observation is that when we have a spike of 4 or 5 million that need to deliver at a certain point in time (surrounding a critical/time-sensitive product launch) that my deferred queues saturate too quickly. Again, rather than just brute-force it with more edge MTAs, I was hoping to devise a more deterministic way to control the internal relaying to my geographically-separated points of presence and shave off the few ms of conversation that are consumed in finding out if relay X will accept more messages yet.
The "standard advice" for this problem is to designate an internal fallback relay (the fallback_relay parameter, named smtp_fallback_relay since Postfix 2.3), which can be a whole second MX farm, to handle mail that isn't delivered quickly. That way the primary outbound machines aren't bogged down by a clogged deferred queue. I think this is discussed in TUNING_README.
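As a rough sketch of what that looks like on the primary outbound machines, assuming Postfix 2.3 or later (where the parameter is called smtp_fallback_relay): the hostname below is a placeholder, not anything from your setup.

```
# /etc/postfix/main.cf on each primary outbound machine
# fallback.example.com is a hypothetical name; substitute your own.
# A bare domain is resolved via MX records, so it can point at a
# whole farm; the [host] form would skip the MX lookup instead.
smtp_fallback_relay = fallback.example.com
```

Mail whose destination can't be found or reached is then handed to the fallback machines, which carry the retries in their own deferred queues.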
But you're wise to analyze prior data and determine exactly where the bottleneck is before wholesale restructuring.
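For the log analysis, one possible starting point is the delays=a/b/c/d field that Postfix 2.3 and later write on each delivery line, where c is the connection setup time (DNS, HELO and TLS included). Below is a minimal Python sketch that groups that figure by relay host; the function name and the summary layout are my own invention, and it assumes the stock smtp(8) log line format.

```python
import re
from collections import defaultdict

# Matches the relay, the four-part delays field, and the delivery status
# on a Postfix smtp(8) delivery log line.
LOG_RE = re.compile(
    r"relay=(?P<relay>[^,\s]+),.*?"
    r"delays=(?P<a>[\d.]+)/(?P<b>[\d.]+)/(?P<c>[\d.]+)/(?P<d>[\d.]+),.*?"
    r"status=(?P<status>\w+)"
)


def summarize(lines):
    """Return per-relay counts, deferral counts and mean connection setup time."""
    stats = defaultdict(lambda: {"count": 0, "deferred": 0, "conn_total": 0.0})
    for line in lines:
        m = LOG_RE.search(line)
        if m is None:
            continue  # not a delivery line
        # Drop the "[address]:port" suffix so attempts group by relay host name.
        host = m.group("relay").split("[")[0]
        s = stats[host]
        s["count"] += 1
        s["conn_total"] += float(m.group("c"))  # c = connection setup time
        if m.group("status") == "deferred":
            s["deferred"] += 1
    return {
        host: {
            "count": s["count"],
            "deferred": s["deferred"],
            "mean_conn": s["conn_total"] / s["count"],
        }
        for host, s in stats.items()
    }
```

Feeding it an open log file, e.g. summarize(open(path)) with path pointing at wherever your syslog puts mail logs, gives one row per relay. A relay with a high deferred count or a high mean_conn is a candidate for the bottleneck.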
-- Noel Jones