Hi,

I managed to isolate the issue and can now reproduce it quite reliably.
Every day at exactly 06:10 UTC my dnsmasq container stops responding.
During the event, I can successfully query my external DNS servers but
not dnsmasq:

dig domain.tld @172.18.0.250

; <<>> DiG 9.16.23-RH <<>> domain.tld @172.18.0.250
;; global options: +cmd
;; connection timed out; no servers could be reached

I see hundreds of errors like this in the system log:

Sep 05 06:10:58 mm4.lax.icann.org dockerd[1150]: time="2024-09-05T06:10:58.464185887Z" level=error msg="[resolver] failed to query external DNS server" client-addr="udp:172.18.0.4:48552" dns-server="udp:172.18.0.250:53" error="read udp 172.18.0.4:48552->172.18.0.250:53: i/o timeout" question=";_dmarc.domain.tld.\tIN\t TXT"

However, there is nothing suspicious in /var/log/messages or
/var/log/cron that might explain what happened.

Before the container restarted at 06:15, I tried to collect stats via
the "kill --signal=USR1" command, but the stats were not written to the
logs. Apparently dnsmasq was so stuck that it could not even process the
signal. (However, I don't think the stats would be helpful, since the
time of the event doesn't change even if I restart dnsmasq between
06:10 events.)

Resource-wise, there was an increase in memory consumption by dnsmasq
when the issue started, and then a spike in the middle of it (the times
shown are 3 hours ahead of UTC):

[image: Screenshot 2024-09-05 at 09.42.38.png]

I'm using these params
<https://github.com/dockur/dnsmasq/blob/master/entry.sh#L14> plus
"fast-dns-retry". I also tried adding "no-negcache" and "all-servers",
but that didn't fix the issue.

Any idea where to continue the investigation?

Sincerely,
Danil Smirnov

On Sun, Aug 25, 2024 at 7:45 PM Danil Smirnov <danil.smir...@gmail.com> wrote:

> Hi Dimitry,
>
> On Sun, Aug 25, 2024 at 7:36 PM Dimitry Andric
> <dimi...@unified-streaming.com> wrote:
>
>> Is there any way to reproduce this issue reliably? That is, some recipe
>> that says: run this particular docker container, run some script that
>> queries it, observe hang after N minutes?
>
> For now, I established a watchdog in my environment that will restart the
> container on freeze while collecting some stats. I'm going to monitor the
> issue for one more week (already spent a week debugging the issue). After
> seeing some useful data I'll try to reproduce it.
>
> Sincerely,
> Danil Smirnov
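
The watchdog mentioned in the quoted message is not shown in this thread.
As a rough illustration only, a probe of that kind could look like the
Python sketch below. It sends a hand-built A query over UDP and prints a
timestamped line whenever the server flips between answering and not
answering, which would help confirm whether the hang really starts at
06:10 UTC each day. The function names (build_query, probe, watch), the
2-second timeout and the 10-second polling interval are assumptions made
for this example; the address 172.18.0.250 and the name domain.tld are
taken from the dig command in the message.

```python
import socket
import struct
import time
from datetime import datetime, timezone


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (RFC 1035 layout); qtype 1 is A."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ended by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)  # class IN
    return header + question


def probe(server: str, name: str, port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if `server` replies with a matching query ID within `timeout`."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and network errors alike
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]


def watch(server: str, name: str, interval: float = 10.0) -> None:
    """Poll forever and print a UTC-timestamped line on every state change."""
    was_up = None
    while True:
        up = probe(server, name)
        if up != was_up:
            stamp = datetime.now(timezone.utc).isoformat()
            print(stamp, "UP" if up else "DOWN", flush=True)
            was_up = up
        time.sleep(interval)


# Hypothetical usage, with the values from the dig command in the message:
# watch("172.18.0.250", "domain.tld")
```

Because the probe only checks for a reply with a matching ID, it treats
any answer (including NXDOMAIN or SERVFAIL) as "up"; it detects a hang,
not a wrong answer.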
_______________________________________________
Dnsmasq-discuss mailing list
Dnsmasq-discuss@lists.thekelleys.org.uk
https://lists.thekelleys.org.uk/cgi-bin/mailman/listinfo/dnsmasq-discuss