Hello,

I'm stuck with a problem where postfix sometimes hangs when issued a "postfix stop" command.

In my configuration, I have two domains I'm relaying mail for with postfix. The lists of email addresses are in two virtual files, defined with this line in main.cf:

virtual_alias_maps = hash:$config_directory/usermanaged/virtual.domain1.com,hash:$config_directory/usermanaged/virtual.domain2.com

I have a script called update-postfix.pl that runs every half hour. It shuts down postfix, runs postmap on the virtual alias files, and then restarts postfix. Most of the time, the script runs without any problems. In fact, this configuration previously ran on a different server with no problems at all. Because the old server was too slow, I set up a new postfix installation on a new server, and it's under the new setup that I'm running into this intermittent hanging problem.

Usually, when the update-postfix.pl script runs, it tells Postfix to shut down and we get a logged message that says "postfix/postfix-script: stopping the Postfix mail system". Right after that, postfix responds with something like "postfix/master[11211]: terminating on signal 15".

However, sometimes (once every day or so), the script runs and we get the first message, "postfix/postfix-script: stopping the Postfix mail system", but postfix does not respond to it and keeps running for a while. When it sees that virtual.domain2.com.db was updated, it logs "postfix/trivial-rewrite[16529]: table hash:/etc/postfix/usermanaged/virtual.domain2.com(0,lock|fold_fix) has changed -- restarting", and after that it appears to be hung.

My first thought was that maybe postfix didn't have enough of an opportunity to shut itself down before the .db was updated by the postmap command. So I put in a "sleep 60" right after the postfix stop command, even though, when the stop command works, we see the "terminating on signal" response almost instantly. Since the 60 seconds didn't work, I increased it to 2 minutes, but that also didn't help.

I found a forum post (http://www.howtoforge.com/forums/showthread.php?t=15898) where someone had a somewhat similar problem. Someone there suggested running "newaliases". So I tried deleting both virtual.domain2.com.db and virtual.domain1.com.db, then ran "newaliases", manually ran postmap for domain2 and domain1, and restarted postfix.

But even after those steps, the problem kept happening. The most recent time it hung, I tried issuing another "service postfix stop", as well as a plain "postfix stop", but neither of those caused postfix to respond with "terminating on signal". I checked the ps aux process list and tried killing the postfix processes I saw. Then I tried restarting, and got the "already running" error. So I checked ps aux again and noticed there were a bunch of processes owned by the postfix user that were tagged with <defunct>. I tried killing those processes but couldn't kill them.

Does anyone have any ideas on what could be wrong? Thanks very much in advance for any suggestions!

Scott

Below is a snippet from maillog showing what happens when it hangs. You can see that at 23:30:02, the update-postfix.pl script kicks in and tries to stop postfix. It doesn't succeed, and one minute later, Postfix sees the .db was updated and tries to restart. Shortly after that, the update script tries to start postfix again but fails because it's already running. Then postfix stays in a hung state, not accepting any incoming connections. Half an hour later, the cron job runs again but fails to do anything because postfix is hung.

Oct 9 23:30:02 myserver postfix/postfix-script: stopping the Postfix mail system
Oct 9 23:31:04 myserver postfix/trivial-rewrite[16529]: table hash:/etc/postfix/usermanaged/virtual.domain2.com(0,lock|fold_fix) has changed -- restarting
Oct 9 23:31:14 myserver postfix/postfix-script: fatal: the Postfix mail system is already running
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max connection rate 1/60s for (smtp:110.36.0.252) at Oct 9 23:29:03
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max connection count 1 for (smtp:110.36.0.252) at Oct 9 23:29:03
Oct 9 23:32:44 myserver postfix/anvil[16528]: statistics: max cache size 2 at Oct 9 23:29:23
Oct 10 00:00:02 myserver postfix/postfix-script: stopping the Postfix mail system
Oct 10 00:01:13 myserver postfix/postfix-script: fatal: the Postfix mail system is already running

update-postfix.pl:
-------------------------
#!/usr/bin/perl
$|=1;
my $dir= "/var";
my @fulldfinfo = `df $dir`;
#/dev/da0s1f 33851580 28087462 3055992 90% /var
my ($dfdevice,$total,$used,$avail,$pct,$dfmount)=split /[\b\t ]+/,$fulldfinfo[1];
print "space available=$avail\nspace used=$used";
if ($avail > 0) {
    open(SH, "|/bin/sh");
    print SH <<"EOM";
umask 022
cd /etc/postfix
/sbin/service postfix stop
sleep 120
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/access
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/transport
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/usermanaged/virtual.domain1.com
/usr/sbin/postmap -c /etc/postfix hash:/etc/postfix/usermanaged/virtual.domain2.com
/usr/sbin/postalias -c /etc/postfix hash:aliases
sleep 10
/sbin/service postfix start
#/usr/sbin/postfix -c /etc/postfix reload
EOM
    #mv /etc/postfix/usermanaged/virtual.tmp.db /etc/postfix/usermanaged/virtual.db
} else {
    print "Not enough space to generate new db files!";
}
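
One idea for replacing the fixed "sleep 120" in the script above would be to poll until the master process is actually gone, and skip the postmap steps if it never exits. The following is only a minimal sketch of that idea, not something that has been run on this server. The function name wait_for_exit and its two arguments (a PID and a timeout in seconds) are made up for illustration.

```shell
#!/bin/sh
# wait_for_exit PID TIMEOUT  (hypothetical helper, not part of Postfix)
# Polls once per second until PID no longer exists.
# Returns 0 if the process went away, 1 if it is still there after TIMEOUT seconds.
# Note: "kill -0" only tests whether the PID exists; a <defunct> (zombie)
# process still counts as existing until its parent reaps it.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

In the update script, this could be called right after the stop command with the PID read from the master PID file. On a default install that file would be /var/spool/postfix/pid/master.pid, but the location depends on queue_directory, so treat that path as an assumption.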