Hi List,
I have a little story that all of you might find quite
interesting. It has a happy ending, but I'd also like a little feedback from
the rest of you on other possible solutions to this.
Six months ago I built and pushed into production a mail system running on
five FreeBSD 4.1 servers with qmail, vpopmail, sqwebmail, courier-imap, and
all the trimmings. The original design intent was to develop a server that
would support roughly a million email users. Scalability was, of course, of
paramount importance in such a solution.
The architecture is pretty standard for large shared environments. One
machine is a file server. It's got 300GB of RAID storage hanging off a SCSI
card and is connected to the other 4 machines via a gigabit ethernet
controller. That should last for quite some time I'm thinking. :-) Once I
exceed that file server's ability I can slide up to 25 more file servers into
the equation for nearly limitless storage and several T3s' worth of mail
bandwidth. That should be enough for a while. ;-)
Anyway, since that time, the main problem I've been having has been the
implementation of the pop before smtp authentication for relaying. The way
it's implemented, by default, is pretty simple. A user authenticates over
POP, and upon success we append their IP address to the file
~vpopmail/etc/open-smtp and compile that into the tcp.smtp.cdb database,
which tcpserver consults to determine if the IP is allowed to relay. Pretty
simple stuff really.
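
For anyone who hasn't looked inside that step, here is a rough Python sketch
of turning a list of authenticated IPs into rules text that tcprules can
compile. The function name and the static 127. rule are my own assumptions
for illustration; this is not vpopmail's actual code, only the general shape
of the rules it has to produce.

```python
def render_tcprules(relay_ips, static_rules=('127.:allow,RELAYCLIENT=""',)):
    """Build tcprules input text that lets the given IPs relay.

    relay_ips    -- iterable of dotted-quad strings collected from POP auths
    static_rules -- hypothetical fixed rules that should always be present
    """
    lines = list(static_rules)
    # De-duplicate and sort so repeated auths from one IP yield one rule.
    for ip in sorted(set(relay_ips)):
        # An empty RELAYCLIENT is what tells qmail-smtpd to allow relaying.
        lines.append(f'{ip}:allow,RELAYCLIENT=""')
    # Everyone else may connect, but without RELAYCLIENT set.
    lines.append(":allow")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_tcprules(["10.0.0.2", "10.0.0.1", "10.0.0.2"]), end="")
```

The resulting text is what would be fed to tcprules to produce the .cdb file.
Every successful POP auth means regenerating and recompiling the whole thing.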
That all worked fine and dandy until somewhere around 1300 domains. I'm not
sure how many users that equated to but I'll guess around 3,000. So, I had 4
mail servers, all configured identically, all sharing the same file system
for local user mail spools (via NFS), and all sharing a common
~vpopmail/etc/tcp.smtp.cdb file to determine if a user is allowed to relay.
At around 1300 domains we started seeing the ~vpopmail/etc/open-smtp file
getting munged. At that time, each machine was seeing nearly one POP auth
per second at peak times and, consequently, trying to update that file. As a
result, the file got munged quite often during the middle of the day, users
couldn't relay, and the phones in support started to ring.
Since I already had 1300 vpasswd files strewn around the file system, the
idea of converting entirely to MySQL wasn't really an appealing option. The
solution then was to hack up vpopmail's pop-auth code so that it stuffed
the IPs into a MySQL table. So, I quickly hacked up the code, recompiled
vpopmail and shoved the new programs into production. Wahoo, the table got
populated quite rapidly with hundreds of IPs and life was happy again, for
a while.
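
To give an idea of what "stuffing the IPs into a table" amounts to, here is a
minimal Python sketch. It uses the standard sqlite3 module as a stand-in for
MySQL so it runs anywhere. The table name, column names and function names
are hypothetical, not the ones in my patch.

```python
import sqlite3
import time


def open_relay_db(path=":memory:"):
    """Open the stand-in database and make sure the relay table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relay ("
        " ip_addr TEXT PRIMARY KEY,"
        " last_auth INTEGER NOT NULL)"
    )
    return conn


def record_pop_auth(conn, ip_addr, now=None):
    """Record (or refresh) the IP of a client that just passed POP auth."""
    ts = int(time.time()) if now is None else int(now)
    with conn:  # commits on success, rolls back on error
        # SQLite spelling; on MySQL the equivalent would be REPLACE INTO.
        conn.execute(
            "INSERT OR REPLACE INTO relay (ip_addr, last_auth) VALUES (?, ?)",
            (ip_addr, ts),
        )


if __name__ == "__main__":
    db = open_relay_db()
    record_pop_auth(db, "192.0.2.10")
    print(db.execute("SELECT ip_addr FROM relay").fetchall())
```

The point is that each POP auth becomes a single-row write handled by the
database server, rather than several machines rewriting one shared file.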
Two weeks ago I left work for France to spend a while with friends, drinking
wine, eating well, and skiing in the Pyrenees. While I was gone, a new
problem surfaced. Although the IP table is stored in MySQL, it still gets
recompiled into ~vpopmail/etc/tcp.smtp.cdb every time a POP session
authenticates successfully. At this time I have some 2600 domains and over
10,000 users on the system (I wrote a perl script to figure that out by
finding all the vpasswd files and adding up all the lines in the files :-)).
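
The counting itself is trivial. My script was Perl, but the same idea looks
roughly like this in Python. It assumes one line per user in each vpasswd
file, and the root directory you pass in is whatever holds your domains (the
path in the usage line is only an example).

```python
import os


def count_vpopmail_users(root):
    """Walk root, returning (number of vpasswd files, total non-blank lines)."""
    domains = 0
    users = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if "vpasswd" in filenames:
            domains += 1
            # Read as bytes so odd characters in a gecos field can't break us.
            with open(os.path.join(dirpath, "vpasswd"), "rb") as fh:
                users += sum(1 for line in fh if line.strip())
    return domains, users


if __name__ == "__main__":
    import sys

    # Example invocation: python count_users.py /home/vpopmail/domains
    d, u = count_vpopmail_users(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{d} domains, {u} users")
```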
With all four servers seeing in excess of one POP auth per second, that
file was getting rewritten four or more times per second. tcpserver
would try to open tcp.smtp.cdb, get a stale NFS file handle, and drop the
connection. So, the phones started ringing because the SMTP server was
intermittently dropping connections. What to do? Well, we chose the most
obvious solution: hack up tcpserver to check our MySQL table directly
instead of the .cdb file. I had one of our senior programmers
tackle this and the results are great. The new enhanced tcpserver, when run
with the -S flag, checks for /var/qmail/control/sql and, upon finding it,
follows its instructions for connecting to the SQL server. Then, for every
incoming SMTP connection, it checks the database for the IP and, if found,
sets the RELAYCLIENT environment variable. It's pretty darned cool and works
like a charm.
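
The real patch is C inside tcpserver, but the per-connection logic is small
enough to sketch in Python. Again sqlite3 stands in for MySQL, and the table,
column and function names are made up for the example. Only the RELAYCLIENT
variable is the real thing.

```python
import sqlite3


def make_demo_db(ips):
    """Create an in-memory stand-in for the relay table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE relay (ip_addr TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO relay (ip_addr) VALUES (?)", [(ip,) for ip in ips])
    conn.commit()
    return conn


def relay_env_for(conn, remote_ip, base_env=None):
    """Return the environment to hand to the SMTP daemon for this connection.

    If remote_ip is in the relay table, RELAYCLIENT is set to the empty
    string; otherwise the environment is passed through unchanged.
    """
    env = dict(base_env or {})
    row = conn.execute(
        "SELECT 1 FROM relay WHERE ip_addr = ? LIMIT 1", (remote_ip,)
    ).fetchone()
    if row is not None:
        env["RELAYCLIENT"] = ""
    return env


if __name__ == "__main__":
    db = make_demo_db(["192.0.2.10"])
    print(relay_env_for(db, "192.0.2.10"))    # includes RELAYCLIENT
    print(relay_env_for(db, "198.51.100.7"))  # does not
```

One indexed lookup per incoming connection replaces the shared file, so there
is nothing left on NFS for tcpserver to trip over.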
Consequences?
So far, so good. I've removed the -x tcp.smtp.cdb flag from tcpserver and
only have it consult the database. The -x stuff still works, but now
I have to go back and hack up my hacked vpopmail so that it stops
rebuilding the tcp.smtp.cdb file. Shouldn't be a big deal. Then life should
be good for a while.
So, has anyone else run into a problem of this sort? How did you solve it?
I've emailed Dan to see if he might (not likely) like to include the SQL
stuff in a released version of tcpserver but the odds of even getting a
response are pretty slim. So, failing that I guess I'll release a custom
version of tcpserver with SQL support. Other ideas?
Matt