Hi team,
DNS resolution expiry causes ovsdb-server startup to fail. The relay was started with:
```
ovsdb-server --relay-leader-only \
    --remote=db:OVN_Southbound,SB_Global,connections -vconsole:info \
    -vsyslog:off -vfile:off --log-file=/var/log/ovn/ovn-ovsdb-sb.log \
    --pidfile=/var/run/ovn/ovnsb_db.pid --unixctl=/var/run/ovn/ovnsb_db.ctl \
    relay:OVN_Southbound:tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642,tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642,tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642
```
The sb relay needs to connect to sb-0/1/2 at startup, but this may fail
because sb-0/1/2 cannot provide service yet. In that case, the sb relay
prints the following logs:
```
2025-06-25T21:25:46Z|00021|memory|INFO|atoms:13 cells:20 monitors:0 n-weak-refs:0
2025-06-25T21:26:39Z|00022|stream_tcp|WARN|Dropped 11 log messages in last 63 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:26:39Z|00023|stream_tcp|WARN|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:27:43Z|00024|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:27:43Z|00025|stream_tcp|WARN|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:28:47Z|00026|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:28:47Z|00027|stream_tcp|WARN|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:29:51Z|00028|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:29:51Z|00029|stream_tcp|WARN|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:30:55Z|00030|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:30:55Z|00031|stream_tcp|WARN|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:31:59Z|00032|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:31:59Z|00033|stream_tcp|WARN|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:33:03Z|00034|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:33:03Z|00035|stream_tcp|WARN|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-06-25T21:34:07Z|00036|stream_tcp|WARN|Dropped 7 log messages in last 56 seconds (most recently, 8 seconds ago) due to excessive rate
2025-06-25T21:34:07Z|00037|stream_tcp|WARN|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
```
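"Address family not supported by protocol" is EAFNOSUPPORT from socket(). It is consistent with the connect path ending up with a zeroed struct sockaddr_storage (family AF_UNSPEC, i.e. 0) because the DNS lookup handed back no address. A minimal standalone demonstration of that errno (hypothetical demo code, not OVS code):
```
/* Hypothetical demo: socket() with an unset (AF_UNSPEC) family fails
 * with EAFNOSUPPORT, the errno behind "Address family not supported
 * by protocol" in the relay logs above. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_storage ss;
    memset(&ss, 0, sizeof ss);  /* DNS gave us nothing: family stays 0. */

    int fd = socket(ss.ss_family, SOCK_STREAM, 0);
    if (fd < 0) {
        /* Prints: connect target unusable: Address family not
         * supported by protocol */
        printf("connect target unusable: %s\n", strerror(errno));
    }
    return 0;
}
```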
After that, the reconnect logic enters the CONNECTING and BACKOFF
phases repeatedly:
```
2025-06-27T07:37:25Z|01023|reconnect|DBG|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:37:25Z|01024|reconnect|DBG|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:37:27Z|01025|poll_loop|DBG|wakeup due to 1982-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:37:29Z|01026|poll_loop|DBG|wakeup due to 2501-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:37:32Z|01027|poll_loop|DBG|wakeup due to 2502-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:37:33Z|01028|poll_loop|DBG|wakeup due to 1015-ms timeout at lib/reconnect.c:677 (0% CPU usage)
2025-06-27T07:37:33Z|01029|reconnect|DBG|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:37:33Z|01030|reconnect|DBG|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:37:34Z|01031|poll_loop|DBG|wakeup due to 1487-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:37:37Z|01032|poll_loop|DBG|wakeup due to 2502-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:37:39Z|01033|poll_loop|DBG|wakeup due to 2502-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:37:41Z|01034|poll_loop|DBG|wakeup due to 1511-ms timeout at lib/reconnect.c:677 (0% CPU usage)
2025-06-27T07:37:41Z|01035|reconnect|DBG|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:37:41Z|01036|reconnect|DBG|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:37:49Z|01054|poll_loop|DBG|wakeup due to 2007-ms timeout at lib/reconnect.c:677 (0% CPU usage)
2025-06-27T07:37:49Z|01055|reconnect|DBG|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:37:49Z|01056|reconnect|DBG|tcp:ovn-ovsdb-sb-1.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:37:49Z|01057|poll_loop|DBG|wakeup due to 492-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:37:52Z|01058|poll_loop|DBG|wakeup due to 2503-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:37:54Z|01059|poll_loop|DBG|wakeup due to 2502-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:37:57Z|01077|reconnect|DBG|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:37:57Z|01078|reconnect|DBG|tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:37:59Z|01079|poll_loop|DBG|wakeup due to 2501-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:38:02Z|01080|poll_loop|DBG|wakeup due to 2503-ms timeout at ovsdb/ovsdb-server.c:401 (0% CPU usage)
2025-06-27T07:38:04Z|01081|poll_loop|DBG|wakeup due to 2502-ms timeout at ovsdb/ovsdb-server.c:401
2025-06-27T07:38:05Z|01082|poll_loop|DBG|wakeup due to 494-ms timeout at lib/reconnect.c:677
2025-06-27T07:38:05Z|01083|reconnect|DBG|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering CONNECTING
2025-06-27T07:38:05Z|01084|reconnect|DBG|tcp:ovn-ovsdb-sb-0.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642: entering BACKOFF
2025-06-27T07:38:05Z|01085|poll_loop|DBG|wakeup due to [POLLIN] on fd 19 (0.0.0.0:6642<->) at lib/stream-fd.c:274
```
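The timestamps above line up with the default maximum backoff (RECONNECT_DEFAULT_MAX_BACKOFF, 8000 ms): each target reappears about every 24 seconds because the single relay session cycles through all three. A simplified model of that schedule (illustration only, not the real reconnect FSM):
```
/* Simplified model of the retry schedule: one session cycling through
 * three targets at the 8000-ms max backoff retries each individual
 * target only about every 24 seconds. */
#include <stdio.h>

#define MAX_BACKOFF_MS 8000   /* RECONNECT_DEFAULT_MAX_BACKOFF */

int main(void)
{
    const char *targets[] = { "sb-0", "sb-1", "sb-2" };
    int n = sizeof targets / sizeof targets[0];

    for (int attempt = 0; attempt < 9; attempt++) {
        long t_ms = (long) attempt * MAX_BACKOFF_MS;
        printf("t=%2lds: try %s\n", t_ms / 1000, targets[attempt % n]);
    }
    /* Each target recurs every n * MAX_BACKOFF_MS = 24000 ms, while the
     * resolved address only stays valid for its 5-second DNS TTL. */
    return 0;
}
```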
We captured the stack trace:
```
bash-4.4# gdb attach 17
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-15.es8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
attach: No such file or directory.
Attaching to process 17
[New LWP 18]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f0520055a41 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install
glibc-2.28-151.es8_1.x86_64 libevent-2.1.8-5.es8.x86_64
libgcc-8.4.1-1.es8.x86_64 libibverbs-32.0-4.es8.x86_64
numactl-libs-2.0.12-11.es8.x86_64 openssl-libs-1.1.1k-12.es8.x86_64
python3-libs-3.6.8-39.es8.rocky.0.x86_64 zlib-1.2.11-17.es8.x86_64
(gdb) b resolve_check_valid__
Breakpoint 1 at 0x7f052074d899: file lib/dns-resolve.c, line 234.
(gdb) c
Continuing.
Thread 1 "ovsdb-server" hit Breakpoint 1, resolve_check_valid__
(req=0x5616829ce740) at lib/dns-resolve.c:234
234 lib/dns-resolve.c: No such file or directory.
(gdb) bt
#0 resolve_check_valid__ (req=0x5616829ce740) at lib/dns-resolve.c:234
#1 dns_resolve (name=<optimized out>, addr=addr@entry=0x7ffeaec04b20)
at lib/dns-resolve.c:173
#2 0x00007f05206ed73f in parse_sockaddr_components_dns
(s=0x56168269d1f0
"ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
default_port=65535, port_s=0x56168269c0c2 "6642",
host_s=<optimized out>, ss=0x7ffeaec04bf0) at lib/socket-util.c:510
#3 parse_sockaddr_components (ss=ss@entry=0x7ffeaec04bf0,
host_s=<optimized out>, port_s=0x56168269c0c2 "6642",
default_port=default_port@entry=65535,
s=s@entry=0x56168269d1f0
"ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
resolve_host=resolve_host@entry=true, dns_failure=<optimized out>) at
lib/socket-util.c:510
#4 0x00007f05206ede71 in inet_parse_active
(target_=target_@entry=0x56168269d1f0
"ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
default_port=default_port@entry=-1,
ss=ss@entry=0x7ffeaec04bf0, resolve_host=resolve_host@entry=true,
dns_failure=dns_failure@entry=0x0) at lib/socket-util.c:559
#5 0x00007f05206eea22 in inet_open_active (style=style@entry=1,
target=0x56168269d1f0
"ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
default_port=default_port@entry=-1,
ssp=ssp@entry=0x0, fdp=fdp@entry=0x7ffeaec04cc4, dscp=<optimized
out>) at lib/socket-util.c:597
#6 0x00007f05206f3e5e in tcp_open (name=0x5616829ce870
"tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
suffix=<optimized out>, streamp=0x7ffeaec04d10,
dscp=<optimized out>) at lib/stream-tcp.c:56
#7 0x00007f05206f47be in stream_open (name=name@entry=0x5616829ce870
"tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
streamp=streamp@entry=0x5616826c5c70,
dscp=dscp@entry=48 '0') at lib/stream.c:233
#8 0x00007f05206f5037 in stream_open_with_default_port
(name_=<optimized out>, name_@entry=0x5616826c6020
"tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
default_port=default_port@entry=6640,
streamp=streamp@entry=0x5616826c5c70, dscp=<optimized out>) at
lib/stream.c:749
#9 0x00007f052063f074 in jsonrpc_stream_open
(name=name@entry=0x5616826c6020
"tcp:ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local:6642",
streamp=streamp@entry=0x5616826c5c70,
dscp=<optimized out>) at lib/jsonrpc.c:72
#10 0x00007f0520640980 in jsonrpc_session_connect (s=0x5616826c5c40) at
lib/jsonrpc.c:994
#11 jsonrpc_session_run (s=0x5616826c5c40) at lib/jsonrpc.c:1088
#12 0x00007f05206c6a2e in ovsdb_cs_run (cs=0x5616826c5e00,
events=events@entry=0x7ffeaec04e70) at lib/ovsdb-cs.c:621
#13 0x00007f0520aacf21 in ovsdb_relay_run () at ovsdb/relay.c:387
#14 0x0000561657a07bdb in main_loop (exiting=0x7ffeaec04fad,
run_process=0x0, remotes=0x7ffeaec05070, unixctl=0x5616826e94a0,
all_dbs=0x7ffeaec05010, jsonrpc=0x5616826bca10, config=0x7ffeaec050d0)
at ovsdb/ovsdb-server.c:334
#15 main (argc=<optimized out>, argv=<optimized out>) at
ovsdb/ovsdb-server.c:895
(gdb) p req
$1 = (struct resolve_request *) 0x5616829ce740
(gdb) p req->name
$2 = 0x56168269bb30
"ovn-ovsdb-sb-2.ovn-ovsdb-sb-discovery.openstack.svc.cluster.local"
(gdb) p req->addr
$3 = 0x5616829cd360 "10.232.4.170"
(gdb) p req->state
$4 = RESOLVE_GOOD
(gdb) p req->time
$5 = 55695
(gdb) p req->ub_result->ttl
$6 = 5
(gdb)
```
Based on the above info and code analysis:
The sb relay retries a connection every 8 seconds (based on
RECONNECT_DEFAULT_MAX_BACKOFF). There are three sb services and the
relay cycles through them, so each individual target is retried roughly
every 24 seconds. DNS resolution is performed asynchronously, and the
cached answer carries a TTL of 5 seconds. For example:
1. On the first attempt, the sb relay kicks off DNS resolution for
sb-0. The lookup itself eventually succeeds, but no address is
available to the caller on that pass.
2. After 24 seconds, when the sb relay asks for sb-0's address again,
it finds the cached result has already expired (24 s elapsed vs. a 5 s
TTL). Resolution is kicked off again, and again no address is returned.
As a result, the sb relay repeats step 2 indefinitely and never
connects.
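In other words, the retry interval always exceeds the TTL, so the cached entry can never pass the validity check by the time it is consulted. A minimal sketch of that check, loosely modeled on what resolve_check_valid__ appears to do based on the gdb output above (field names are assumptions, not the actual OVS structure):
```
/* Sketch of the TTL check, with assumed field names mirroring the gdb
 * output (req->time, req->ub_result->ttl). With a 5 s TTL and a 24 s
 * per-target retry interval, this check fails on every retry. */
#include <stdbool.h>
#include <stdio.h>

struct resolve_request {           /* assumed shape, for illustration */
    long long time_msec;           /* when the answer was cached */
    int ttl_sec;                   /* TTL from the DNS answer */
};

static bool
cached_answer_still_valid(const struct resolve_request *req,
                          long long now_msec)
{
    return now_msec - req->time_msec <= (long long) req->ttl_sec * 1000;
}

int main(void)
{
    /* Values seen in gdb: req->time = 55695, ub_result->ttl = 5. */
    struct resolve_request req = { .time_msec = 55695, .ttl_sec = 5 };
    long long next_retry = req.time_msec + 24000;   /* ~24 s later */

    /* Expired: 24000 ms elapsed vs. a 5000 ms TTL, so the relay must
     * re-resolve and gets no usable address on this pass. */
    printf("valid at retry: %s\n",
           cached_answer_still_valid(&req, next_retry) ? "yes" : "no");
    return 0;
}
```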