Jeremy Chadwick wrote:
On Tue, Sep 28, 2010 at 10:59:04PM +0200, Miroslav Lachman wrote:
Jeremy Chadwick wrote:
On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote:
Hi,

We use the fetch command from cron to run PHP scripts periodically,
and sometimes cron sends error e-mails like this:

fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable
resolver failure

[...]

Note: the target domains are hosted on the server itself, and so is named.

The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC

Can somebody help me to diagnose this random fetch+resolver issue?

[...]

PF is running with some basic rules, mostly blocking incoming
packets, allowing all outgoing traffic, and scrubbing:

scrub in on bge1 all fragment reassemble
scrub out on bge1 all no-df random-id min-ttl 24 max-mss 1492 fragment reassemble

pass out on bge1 inet proto udp all keep state
pass out on bge1 inet proto tcp from 1.2.3.40 to any flags S/SA modulate state
pass out on bge1 inet proto tcp from 1.2.3.41 to any flags S/SA modulate state
pass out on bge1 inet proto tcp from 1.2.3.42 to any flags S/SA modulate state

modified PF options:

set timeout { frag 15, interval 5 }
set limit { frags 2500, states 5000 }
set optimization aggressive
set block-policy drop
set loginterface bge1
# Let loopback and internal interface traffic flow without restrictions
set skip on lo0

Please also provide "pfctl -s info" output, in addition to uname -a
output (you can hide the hostname), since the pf stack differs depending
on what FreeBSD version you're using.

# pfctl -s info
No ALTQ support in kernel
ALTQ related functions disabled
Status: Enabled for 32 days 11:31:02          Debug: Urgent

Interface Stats for bge1              IPv4             IPv6
  Bytes In                     37064314787                0
  Bytes Out                   279633869976                0
  Packets In
    Passed                       214057477                0
    Blocked                        1180125                0
  Packets Out
    Passed                       272266744                0
    Blocked                         128777                0

State Table                          Total             Rate
  current entries                      181
  searches                       518860439          184.9/s
  inserts                         16608172            5.9/s
  removals                        16607991            5.9/s
Counters
  match                           17951131            6.4/s
  bad-offset                             0            0.0/s
  fragment                              23            0.0/s
  short                                  0            0.0/s
  normalize                              4            0.0/s
  memory                                 0            0.0/s
  bad-timestamp                          0            0.0/s
  congestion                             0            0.0/s
  ip-option                              0            0.0/s
  proto-cksum                         3095            0.0/s
  state-mismatch                     16707            0.0/s
  state-insert                           0            0.0/s
  state-limit                            0            0.0/s
  src-limit                              0            0.0/s
  synproxy                               0            0.0/s


uname:
7.3-RELEASE-p2 FreeBSD 7.3-RELEASE-p2 #0: Mon Jul 12 19:04:04 UTC 2010 r...@i386-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC i386

Things that catch my eye as potential problems -- I don't have a way to
confirm these are responsible for your issue (DNS resolver lookups are
UDP-based, not TCP), but I want to point them out anyway.

1) "modulate state" is broken on FreeBSD.  Taken from our pf.conf notes:

# Filtering (public interface only; see "set skip")
#
# NOTE: Do not use "modulate state", as it's known to be broken on FreeBSD.
# http://lists.freebsd.org/pipermail/freebsd-pf/2008-March/004227.html

2) "optimization aggressive" sounds dangerous given what pf.conf(5) says
about it.  I'd like to know what it considers "idle".

3) I would also remove many of the options you have set in your "scrub
out" rule.  Starting with a clean slate to see if things improve is
probably a good idea.  As you'll see below, sometimes pf does things
which may be correct per IP specification but don't work quite right
with other vendors' IP stacks.

4) Your "set timeout" values look to be extreme.  I would recommend
leaving these at their defaults given your situation.

5) This feature is not in use in your pf.conf, but I want to point it out
regardless.  "reassemble tcp" is also broken in some way.  Again taken
from our pf.conf notes:

# Normalization -- resolve/reduce traffic ambiguities.
#
# NOTE: Do NOT use 'reassemble tcp' as it definitely causes breakage.
# Issue may be related to other vendors' IP stacks, so let's leave it
# disabled.

Thank you for all your hints about PF! Maybe it's time to consider refactoring our standard pf.conf, which was written years ago...


The original problem seems to lie in how the resolver works on FreeBSD 7.3. This machine was upgraded from 7.2 a few weeks ago, and we did not have this problem before.

I added '|| dig hiden.example.com' to the crontab so I get dig output in the case of fetch failure:

*/5 * * * * fetch -qo /dev/null "https://hiden.example.com/cron/fiveminutes" || dig hiden.example.com
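
The same idea can be moved into a small wrapper so that the cron mail also carries the exit status and a timestamp. This is only a sketch: the function name run_with_diag and the way the two commands are passed as strings are assumptions for illustration, not part of the setup described here.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command and, only if it fails, print the exit
# status, a timestamp, and the output of a diagnostic command.
# run_with_diag is an illustrative name, not an existing tool.

run_with_diag() {
    # $1 = command line to run, $2 = diagnostic command line to run on failure
    sh -c "$1"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "command failed with exit status $status at $(date '+%Y-%m-%d %H:%M:%S')"
        sh -c "$2"
    fi
    # Hand the original exit status back to the caller (and so to cron).
    return "$status"
}

# When called with exactly two arguments, act as the cron wrapper.
if [ "$#" -eq 2 ]; then
    run_with_diag "$1" "$2"
fi
```

Saved under a hypothetical path such as /usr/local/bin/fetch-diag.sh, it could be called from the crontab with the fetch command line as the first argument and 'dig hiden.example.com' as the second.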

The domain has its TTL set to 360 seconds, and each fetch "Non-recoverable resolver failure" occurs exactly when the TTL has expired and a new query to the authoritative nameservers must be made:

; <<>> DiG 9.4.-ESV <<>> hiden.example.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30191
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;hiden.example.com.     IN      A

;; ANSWER SECTION:
hiden.example.com. 360 IN       CNAME   server.example.com.
server.example.com.     360     IN      A       1.2.3.49

;; AUTHORITY SECTION:
example.com.            224     IN      NS      ns1.ignum.com.
example.com.            224     IN      NS      ns2.ignum.cz.

;; Query time: 395 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 30 11:30:16 2010
;; MSG SIZE  rcvd: 135

Note: real domains and IPs were replaced with example.com / 1.2.3.49


I wrote a simple script that runs dig queries for the affected domains every 3 minutes from cron and logs the results to a file. The script has been in use for one day and has not logged any error response (resolving with the dig command works fine). In that time we got only one occurrence of the fetch "Non-recoverable resolver failure", at the moment the cached DNS entry expired (the one above). By coincidence, the dig query from the script ran at the same time as the fetch job: the same DNS answer was e-mailed from cron and logged to the file by the script.
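
A logging script of this kind could look roughly like the sketch below. It is not the script actually in use: the log path, the use of "dig +short", the OK/FAIL rule, and the names LOOKUP_CMD, LOGFILE and log_lookup are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical DNS check logger, meant to be run from cron with the domains
# to check as arguments. LOOKUP_CMD and LOGFILE are illustrative defaults.
LOOKUP_CMD="${LOOKUP_CMD:-dig +short}"
LOGFILE="${LOGFILE:-/var/log/dns-check.log}"

log_lookup() {
    # $1 = domain name; appends one line per lookup to $LOGFILE
    domain=$1
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    # Assumption: a lookup counts as OK only if the command exits 0 and
    # prints a non-empty answer.
    if answer=$($LOOKUP_CMD "$domain" 2>&1) && [ -n "$answer" ]; then
        result="OK"
    else
        result="FAIL"
    fi
    # Flatten a multi-line answer (e.g. CNAME plus A record) onto one line.
    printf '%s %s %s %s\n' "$ts" "$domain" "$result" \
        "$(echo "$answer" | tr '\n' ' ')" >> "$LOGFILE"
}

# Check every domain given on the command line.
for d in "$@"; do
    log_lookup "$d"
done
```

Each log line then holds the time, the domain, OK or FAIL, and the answer, which makes it easy to line up with the time stamps of the cron error mails.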

So my thought is that the DNS cache server (locally running BIND) is working fine, and the authoritative nameservers too, but resolving the domain for the first time and passing the reply to fetch fails for an unknown reason. I will try curl or wget instead of fetch to see whether the symptoms persist.
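
One way to check this idea would be to retry the failed command once after a short pause: if only the first lookup after the cache expiry fails, the second attempt should hit the refreshed cache and succeed. The helper below is only a sketch under that assumption, and the name retry and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical retry helper: run a command up to a maximum number of
# attempts, sleeping between attempts, and report each failure on stderr.

retry() {
    # $1 = maximum attempts, $2 = delay in seconds, remaining args = command
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        echo "attempt $attempt of $max failed with exit status $status" >&2
        # Give up after the last attempt, returning the last exit status.
        [ "$attempt" -ge "$max" ] && return "$status"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, retry 2 5 fetch -qo /dev/null "https://hiden.example.com/cron/fiveminutes" would try the fetch a second time after 5 seconds. A first attempt that fails followed by a second that succeeds would fit the picture of a failure limited to the first lookup after expiry.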

Miroslav Lachman
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
