Santiago Romero:
>  Stracing qmgr process for a while (before restarting postfix), showed 
> lots of lines like:
> 
> time(NULL)                              = 1236156322
> epoll_ctl(8, EPOLL_CTL_DEL, 128, {EPOLLIN, {u32=128, 
> u64=13252642876283682944}}) = 0
> fcntl64(128, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(128, F_SETFL, O_RDWR)           = 0
> ioctl(128, FIONREAD, [10])              = 0
> poll([{fd=128, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(128, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 508869}, NULL) = 0
> close(128)                              = 0

Some time ago the queue manager made a connection to a delivery
agent, but that part is missing.

The queue manager receives a status of 0, from a delivery agent,
indicating that the delivery agent is ready to accept a delivery
request.  

The code follows this path:

       ... if (attr_scan(stream, ATTR_FLAG_STRICT,
                         ATTR_TYPE_INT, MAIL_ATTR_STATUS, &stat,
                         ATTR_TYPE_END) != 1) {
        /* not applicable... */
    } else {
        return (stat ? DELIVER_STAT_DEFER : 0);
    }

We know that the status is 0, so the result value is 0.

The result value is used in the following statement:

    if (stream == 0 || qmgr_deliver_initial_reply(stream) != 0) {
        /* not applicable */
    }

Then, if you use the new queue manager, this happens:

    if ((entry = qmgr_job_entry_select(transport)) == 0) {
        (void) vstream_fclose(stream);
        return;
    }

If you use the old queue manager, this happens instead:

    if ((queue = qmgr_queue_select(transport)) == 0
        || (entry = qmgr_entry_select(queue)) == 0) {
        (void) vstream_fclose(stream);
        return;
    }

Either way, there queue manager finds that there is no work, and
disconnects from the delivery agent. This part is responsible for
the gettimeofday() and close() calls at the end of the trace above.

You might want to repeat your precise Postfix version at this point,
and which queue manager version is configured in your master.cf.
Current Postfix versions have (qmgr=new, oqmgr=old) in master.cf.
Older Postfix versions have (nqmgr=new, qmgr=old) instead. The
programs are the same except for the job selection algorithm.

If you are using the new queue manager, it is worthwhile to see if
the problem persists when you switch to the old queue manager.

If the problem is with "new" queue manager, the question is why
does qmgr_transport_select() find work that qmgr_job_entry_select()
can't find? The answer is who knows, there is only one person who
understands this code, and it isn't me. qmgr_job_entry_select()
has a number of ways in which it can decide that there is no
suitable work. I could log a warning at this point, because this
is where the process could go into a tight loop.

> epoll_ctl(8, EPOLL_CTL_DEL, 129, {EPOLLIN, {u32=129, 
> u64=13252642876283682945}}) = 0
> fcntl64(129, F_GETFL)                   = 0x802 (flags O_RDWR|O_NONBLOCK)
> fcntl64(129, F_SETFL, O_RDWR)           = 0
> ioctl(129, FIONREAD, [10])              = 0
> poll([{fd=129, events=POLLIN, revents=POLLIN}], 1, 3600000) = 1
> read(129, "status\0000\0\0", 4096)      = 10
> gettimeofday({1236156322, 510488}, NULL) = 0
> close(129)                              = 0

Same story. We miss the beginning of the story, and witness the
end.

> alarm(333)                              = 333

That't the built-in watchdog timer (1000/3 seconds).

> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 513893}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 128
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 128, {EPOLLIN, {u32=128, 
> u64=13834671851822907520}}) = 0
> time(NULL)                              = 1236156322
> socket(PF_FILE, SOCK_STREAM, 0)         = 13
> fcntl64(13, F_GETFL)                    = 0x2 (flags O_RDWR)
> fcntl64(13, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> connect(13, {sa_family=AF_FILE, path="private/slow"}, 110) = 0
> gettimeofday({1236156322, 515731}, NULL) = 0
> fcntl64(13, F_DUPFD, 128)               = 129
> close(13)                               = 0
> epoll_ctl(8, EPOLL_CTL_ADD, 129, {EPOLLIN, {u32=129, 
> u64=13834671851822907521}}) = 0
> time(NULL)                              = 1236156322
> ioctl(3, FIONREAD, [100])               = 0
> time(NULL)                              = 1236156322

In qmgr_active_drain(), qmgr_transport_select() has found work for
the slow transport, and qmgr_transport_alloc() has made a conection
to a slow delivery agent.

The queue manager is now waiting for the delivery agent to report
that it is ready (with status = 0).

        Wietse

Reply via email to