Wietse,
Awesome response - thank you very much!

You have really demystified a lot of what's going on for me.

I see what you're saying about there being better ways to update the Berkeley 
.db files.  It does seem like an upgrade to a less disruptive approach would be 
in order.

But at the same time, whatever is causing this hanging is probably an
underlying issue that I should resolve, as it could pop up elsewhere.  I will
take your advice and run the GNU debugger next time I get a hang, and report
the results.  I'd like to get to the bottom of this problem, but then I agree
it would make sense to come up with a better .db update scheme.

Thanks again
Scott


----- Original Message ----
From: Wietse Venema <wie...@porcupine.org>
To: Postfix users <postfix-users@postfix.org>
Sent: Fri, October 15, 2010 10:10:20 AM
Subject: Re: intermittent hang on "postfix stop"; doesn't return "terminating on signal"

Scott Brown:
> Usually, when the update-postfix.pl script runs, it tells Postfix to shut
> down and we get a logged message that says "postfix/postfix-script:
> stopping the Postfix mail system".  Right after that, postfix responds with
> something like "postfix/master[11211]: terminating on signal 15"
>
> However, sometimes (once every day or so), the script runs and we get the
> first message "postfix/postfix-script: stopping the Postfix mail system",
> but then postfix does not respond to it and keeps running for a while,
> until it sees the virtual.domain2.com.db was updated, at which point it
> logs "postfix/trivial-rewrite[16529]: table
> hash:/etc/postfix/usermanaged/virtual.domain2.com(0,lock|fold_fix) has
> changed -- restarting", and then after that, it appears to be hung.

You mean, the master did not terminate trivial-rewrite and
then itself, as it normally does.

When you say "postfix stop", the postfix-script file sends a SIGTERM
signal to the process ID in $queue_directory/pid/master.pid.

There are a number of possible explanations:

1) The master.pid file is bad - it contains a different PID than
that of the running master daemon, and therefore "postfix stop"
does not signal the running master. In that case, Postfix keeps
handling mail, but that is not what you observe, because
trivial-rewrite is still running.

2) The master.pid file is good, and the master is stuck 
after it receives the SIGTERM signal.
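
As a rough way to check explanation 1 by hand, the C sketch below reads
a PID file and probes that PID with signal 0. It is not Postfix code:
pidfile_process_exists() is a hypothetical helper, and the path you pass
is whatever $queue_directory/pid/master.pid expands to on your system.
It only tells you that some process with that PID exists, not that the
process is actually the master.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of Postfix: read a PID from pid_path and
 * probe it with signal 0. Returns 1 if a process with that PID exists,
 * 0 if it does not, and -1 if the file cannot be read or parsed.
 */
static int pidfile_process_exists(const char *pid_path)
{
    FILE   *fp;
    long    pid;

    if ((fp = fopen(pid_path, "r")) == 0)
        return (-1);
    /* %ld skips leading whitespace, so a padded PID parses fine. */
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0) {
        fclose(fp);
        return (-1);
    }
    fclose(fp);

    /* Signal 0 does the existence and permission checks only. */
    if (kill((pid_t) pid, 0) == 0)
        return (1);
    /* EPERM: the process exists, but we are not allowed to signal it. */
    return (errno == EPERM ? 1 : 0);
}
```

With the default queue_directory this would be called as
pidfile_process_exists("/var/spool/postfix/pid/master.pid"); comparing
the PID in the file against "ps" output is still needed to confirm that
it belongs to the master.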

The code in the master's sigterm handler is trivially simple. When
I omit the comments (which contain more text than the code itself):

     1  static void master_sigdeath(int sig)
     2  {
     3      const char *myname = "master_sigdeath";
     4      struct sigaction action;
     5      pid_t   pid = getpid();
     6  
     7      killme_after(5);
     8  
     9      sigemptyset(&action.sa_mask);
    10      action.sa_flags = 0;
    11      action.sa_handler = SIG_IGN;
    12      if (sigaction(SIGTERM, &action, (struct sigaction *) 0) < 0)
    13          msg_fatal("%s: sigaction: %m", myname);
    14      if (kill(-pid, SIGTERM) < 0)
    15          msg_fatal("%s: kill process group: %m", myname);
    16      msg_info("terminating on signal %d", sig);
    17      ...

(The killme_after() function is a workaround for Linux that 
sets up a time bomb to terminate the master in 5 seconds).

Line 14 terminates all of the master's child processes, including
trivial-rewrite.  Apparently this line did not execute, so I
wonder where the execution has stopped.

Next time, can you do:

    gdb -p pid-of-master-daemon
    ...garbage...
    (gdb) bt

And report the result.

I would also like to mention that "postfix stop" is a very disruptive
way to update a table. Instead, consider doing a "safe update" of
the Berkeley DB map as described in:

http://www.postfix.org/DATABASE_README.html#safe_db

This updates the file without overwriting it.

    Wietse



      
