Solved ...
On January 13, 2010 1:06:10 PM -0500 Wietse Venema <wie...@porcupine.org>
wrote:
Well, if you can provide unmodified evidence, then people
can look into this.
Yeah unfortunately as I said, I couldn't do that. And anyway I wasn't
looking for folks to fix my problem so much as have the discussion so
I could get help letting me know where to look. In the first place
I got sidetracked because the only discussion I found on multiple
PTR records (and it was a recent discussion) was that postfix did
not handle them and had no intentions of doing so. I learned that
was wrong and hopefully the next person to search for that problem
will find these threads and see postfix does handle them in the only
way that makes sense.
I won't bore you with all the details, and they are numerous, except
to say that I have learned that I have to trust my skills in debugging.
I let clues go right by because I just assumed I must not be running
the tools correctly (tcpdump, sotruss, etc). But I have vast experience
and really I should have had confidence that I do know what I'm doing.
It turns out that the problem was a firewall rule. My MX host is in
a DMZ (natch) but my resolvers are inside. There is a udp/53 hole
in the firewall. But because this was such a large PTR record,
the nameserver indicated truncation and the resolver had to go to TCP.
Now most of those records would be thrown away so it doesn't matter,
but the resolver doesn't know that. Normally I configure (patch) named
to use a very large truncation threshold; I think the default is very
low, 500 bytes or so and in this case I would normally set it to
8900-ish since we are jumbo frames everywhere. Then this problem
wouldn't have come up, at least not at this time. But for reasons I
don't know anymore, this named is just a vanilla named.
I do have some suggestions for postfix; things that would have helped
me solve this much sooner and maybe would be useful in general.
1) It makes absolute sense that postfix not start logging a connection
until a name lookup has been done. But if that name lookup takes a
"very long" time, along with the connect postfix should log how long
ago the actual connect was. Syslog will of course only log the current
time, ie the time the log message is received, however the actual
connect may have been eons ago and the timestamp is now "wrong". Not
just for this specific case, but in general it helps if the connect
time is recorded for purposes of log correlation. I would say that
if the time since connect() and the return of getnameinfo() is more
than 1s, the delta should be logged. postfix already keeps timing info
so I would hope that this additional info isn't an extensive change.
2) There is something going on where postfix wasn't logging connects
from my test host, which sent FIN instead of the actual problem host
which was sending RST. My guess would be that postfix is blocking on
getnameinfo() so I don't see how the status of the client connection
even comes into play, but it is curious.
For (2) you can probably reproduce it by creating a host with many
PTR records and blocking tcp/53 from the test postfix server. I'm
sorry that I can't spend any more time on this to provide testing
help (if you even want to pursue it). I'm swamped right now with
active directory problems and this problem was just an emergency
that came up.
-frank