Hi,

So I managed to isolate and reproduce the issue quite reliably.

Every day at exactly 06:10 UTC, my dnsmasq container stops responding.
During the event, I can successfully query my external DNS servers, but not
dnsmasq:

dig domain.tld @172.18.0.250

; <<>> DiG 9.16.23-RH <<>> domain.tld @172.18.0.250
;; global options: +cmd
;; connection timed out; no servers could be reached

I see hundreds of errors like this in the system log:

Sep 05 06:10:58 mm4.lax.icann.org dockerd[1150]:
time="2024-09-05T06:10:58.464185887Z" level=error msg="[resolver] failed to
query external DNS server" client-addr="udp:172.18.0.4:48552"
dns-server="udp:172.18.0.250:53" error="read udp 172.18.0.4:48552->
172.18.0.250:53: i/o timeout" question=";_dmarc.domain.tld.\tIN\t TXT"
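
To see exactly when the errors start and stop, the resolver errors can be tallied per minute. The sketch below assumes each log entry arrives as a single unwrapped line in the key=value format shown above; the helper names (parse_fields, timeouts_per_minute) are made up for this example.

```python
import re
from collections import Counter

# Matches key=value pairs where the value is bare or double-quoted
FIELD = re.compile(r'(\w[\w-]*)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line: str) -> dict:
    """Extract key=value fields from one dockerd log line."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def timeouts_per_minute(lines) -> Counter:
    """Count resolver i/o timeouts, keyed by (minute, dns-server)."""
    counts = Counter()
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") != "error":
            continue
        if "i/o timeout" not in fields.get("error", ""):
            continue
        # time looks like 2024-09-05T06:10:58.464185887Z; keep up to the minute
        minute = fields.get("time", "")[:16]
        counts[(minute, fields.get("dns-server", "?"))] += 1
    return counts
```

The lines could come from something like "journalctl -u docker" piped into a small script that passes sys.stdin to timeouts_per_minute(); the unit name is an assumption and may differ per system.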


However, there is nothing suspicious in /var/log/messages or
/var/log/cron that might explain what happened.


Before the container restarted at 06:15, I tried to collect stats via the
"kill --signal=USR1" command, but the stats never appeared in the logs;
apparently dnsmasq was so stuck that it couldn't even process the signal.
(However, I don't think the stats would be helpful, since the time of the
event doesn't change even if I restart dnsmasq between the 06:10 events.)


Resource-wise, dnsmasq's memory consumption increased when the issue
started and then spiked in the middle of it (the time shown is 3 hours
ahead of UTC):


[image: Screenshot 2024-09-05 at 09.42.38.png]

I'm using these params
<https://github.com/dockur/dnsmasq/blob/master/entry.sh#L14> plus
"fast-dns-retry". I also tried adding "no-negcache" and "all-servers", but
neither fixed the issue.

Any ideas on where to continue the investigation?

Sincerely,
Danil Smirnov


On Sun, Aug 25, 2024 at 7:45 PM Danil Smirnov <danil.smir...@gmail.com>
wrote:

>
> Hi Dimitry,
>
> On Sun, Aug 25, 2024 at 7:36 PM Dimitry Andric <
> dimi...@unified-streaming.com> wrote:
>
>> Is there any way to reproduce this issue reliably? That is, some recipe
>> that says: run this particular docker container, run some script that
>> queries it, observe hang after N minutes?
>>
>
> For now, I established a watchdog in my environment that will restart the
> container on freeze while collecting some stats. I'm going to monitor the
> issue for one more week (already spent a week debugging the issue). After
> seeing some useful data I'll try to reproduce it.
>
> Sincerely,
> Danil Smirnov
>
_______________________________________________
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss
