>Synopsis: Inactive Wireguard connections never time out >Category: Networking >Environment: System : OpenBSD 7.6 Details : OpenBSD 7.6-current (CUSTOM) #4: Sat Dec 14 01:13:27 GMT 2024 lloyd@bsdtst01:/sys/arch/amd64/compile/CUSTOM
Architecture: OpenBSD.amd64 Machine : amd64 >Description: I've been running the Wireguard logging patch I submitted to -tech for a few days with little other issue. Recently I did some testing with a device over cellular (dynamic client behind CG-NAT). The testing was completed and the client powered off and disconnected from the remote network. In other words, it was no longer at the remote IP address (typical of a roaming client). The next morning I checked the syslog to find 17 hours worth of the following messages (until I rebooted the box), repeated every 5 seconds, with x.x.x.x is the remote IP address of the wireless provider's CG-NAT server: --snip-- 09:27:00 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:05 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:05 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:11 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:11 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:16 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:16 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:21 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:21 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:26 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:26 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:31 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:31 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:36 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:36 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:42 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:42 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:47 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 3) 09:27:47 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:52 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:27:52 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:27:57 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 3) 09:27:57 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:28:03 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:28:03 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:28:08 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 3) 09:28:08 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:28:13 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) 09:28:13 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:28:18 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 3) 09:28:18 wg0: Sending handshake initiation to peer 0 (x.x.x.x) 09:28:23 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, retrying (try 2) --snip-- As you can see, it appears the Wireguard service will continue to try and reach out to the remote endpoint forever, even after it's long gone. If I understand correctly, it's supposed to try MAX_TIMER_HANDSHAKES times (aka 18) then stop. The log entries most often have the value of "try 2" repeated, with occasional "try 3". I grep'd the logs and over 17 hours of data, I have exactly 4 log entries of "try 4" and 3 log entries of "try 5". I double checked my patch and it does not seem to alter any logic, simply presentation of debug logging information. The value of t below is passed as a pointer to wg_timers_run_retry_handshake(): if (t->t_handshake_retries <= MAX_TIMER_HANDSHAKES) My C is a bit rusty, but my suspicion is one of the instances of: t->t_handshake_retries = 0; scattered elsewhere in the code is being called, zeroing out that counter before it has a chance to complete and creating a race condition. >How-To-Repeat: Connect a remote endpoint to an OpenBSD Wireguard server. Shutdown or disconnect the remote endpoint. Observe any debug logs generated. >Fix: Unknown