>Synopsis:      Inactive Wireguard connections never time out
>Category:      Networking
>Environment:
        System      : OpenBSD 7.6
        Details     : OpenBSD 7.6-current (CUSTOM) #4: Sat Dec 14 01:13:27 GMT 
2024
                         lloyd@bsdtst01:/sys/arch/amd64/compile/CUSTOM

        Architecture: OpenBSD.amd64
        Machine     : amd64
>Description:
        I've been running the Wireguard logging patch I submitted to -tech for 
a few days
        with little other issue. Recently I did some testing with a device over 
cellular
        (dynamic client behind CG-NAT). The testing was completed and the 
client powered
        off and disconnected from the remote network. In other words, it was no 
longer
        at the remote IP address (typical of a roaming client).

        The next morning I checked the syslog to find 17 hours worth of the 
following
        messages (until I rebooted the box), repeated every 5 seconds, with 
x.x.x.x is
        the remote IP address of the wireless provider's CG-NAT server:

       --snip--
09:27:00 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:05 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:05 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:11 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:11 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:16 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:16 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:21 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:21 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:26 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:26 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:31 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:31 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:36 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:36 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:42 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:42 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:47 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 3)
09:27:47 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:52 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:27:52 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:27:57 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 3)
09:27:57 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:03 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:28:03 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:08 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 3)
09:28:08 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:13 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
09:28:13 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:18 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 3)
09:28:18 wg0: Sending handshake initiation to peer 0 (x.x.x.x)
09:28:23 wg0: Handshake for peer 0 (x.x.x.x) did not complete after 5 seconds, 
retrying (try 2)
       --snip--

        As you can see, it appears the Wireguard service will continue to try 
and reach out to
        the remote endpoint forever, even after it's long gone. If I understand 
correctly, it's
        supposed to try MAX_TIMER_HANDSHAKES times (aka 18) then stop. The log 
entries most often
        have the value of "try 2" repeated, with occasional "try 3". I grep'd 
the logs and over
        17 hours of data, I have exactly 4 log entries of "try 4" and 3 log 
entries of "try 5".

        I double checked my patch and it does not seem to alter any logic, 
simply presentation of
        debug logging information.

        The value of t below is passed as a pointer to 
wg_timers_run_retry_handshake():
            if (t->t_handshake_retries <= MAX_TIMER_HANDSHAKES)

        My C is a bit rusty, but my suspicion is one of the instances of:
            t->t_handshake_retries = 0;
        scattered elsewhere in the code is being called, zeroing out that 
counter before it has
        a chance to complete and creating a race condition.

>How-To-Repeat:
        Connect a remote endpoint to an OpenBSD Wireguard server. Shutdown or 
disconnect the remote
        endpoint. Observe any debug logs generated.
>Fix:
        Unknown

Reply via email to