On 9/26/19 9:57 AM, Eric Dumazet wrote:
>
>
> On 9/26/19 9:46 AM, Eric Dumazet wrote:
>>
>>
>> On 9/26/19 8:05 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/25/19 1:46 AM, Marek Majkowski wrote:
>>>> Hello my favorite mailing list!
>>>>
>>>> Recently I've been looking into TCP_USER_TIMEOUT and noticed some
>>>> strange behaviour on fresh sockets in SYN-SENT state. Full writeup:
>>>> https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
>>>>
>>>> Here's a reproducer. It does a simple thing: sets TCP_USER_TIMEOUT and
>>>> does connect() to a blackholed IP:
>>>>
>>>> $ wget https://gist.githubusercontent.com/majek/b4ad53c5795b226d62fad1fa4a87151a/raw/cbb928cb99cd6c5aa9f73ba2d3bc0aef22fbc2bf/user-timeout-and-syn.py
>>>>
>>>> $ sudo python3 user-timeout-and-syn.py
>>>> 00:00.000000 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:01.007053 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:03.023051 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:05.007096 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:05.015037 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:05.023020 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>> 00:05.034983 IP 192.1.1.1.52974 > 244.0.0.1.1234: Flags [S]
>>>>
>>>> The connect() times out with ETIMEDOUT after 5 seconds - as intended.
>>>> But Linux (5.3.0-rc3) does something weird on the network - it sends
>>>> remaining tcp_syn_retries packets aligned to the 5s mark.
>>>>
>>>> In other words: with TCP_USER_TIMEOUT we are sending spurious SYN
>>>> packets on a timeout.
>>>>
>>>> For the record, the man page doesn't define what TCP_USER_TIMEOUT does
>>>> on SYN-SENT state.
>>>>
>>>
>>> Exactly, so far this option has only been used on established flows.
>>>
>>> Feel free to send patches if you need to override the stack behavior
>>> for connection establishment (Same remark for passive side...)
>>
>> Also please take a look at TCP_SYNCNT, which predates TCP_USER_TIMEOUT
>>
>>
>
> I will test the following:
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index dbd9d2d0ee63aa46ad2dda417da6ec9409442b77..1182e51a6b794d75beb8c130354d7804fc83a307 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -220,7 +220,6 @@ static int tcp_write_timeout(struct sock *sk)
>                         sk_rethink_txhash(sk);
>                 }
>                 retry_until = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_syn_retries;
> -               expired = icsk->icsk_retransmits >= retry_until;
>         } else {
>                 if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1, 0)) {
>                         /* Black hole detection */
> @@ -242,9 +241,9 @@ static int tcp_write_timeout(struct sock *sk)
>                         if (tcp_out_of_resources(sk, do_reset))
>                                 return 1;
>                 }
> -               expired = retransmits_timed_out(sk, retry_until,
> -                                               icsk->icsk_user_timeout);
>         }
> +       expired = retransmits_timed_out(sk, retry_until,
> +                                       icsk->icsk_user_timeout);
>         tcp_fastopen_active_detect_blackhole(sk, expired);
>
>         if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
>
The patch works well, but reading the man page again, I see that the
existing behavior has been clearly documented.

If we change the behavior, we might break applications that set
TCP_USER_TIMEOUT on the listener, expecting the value to be inherited by
children at accept() time, but not expecting it to change SYNACK rtx
behavior.

On the other hand, Jon Maxwell's patch ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy") has added
this weird effect of sending the remaining SYNs every jiffy:

	remaining = icsk->icsk_user_timeout - elapsed;
	if (remaining <= 0)
		return 1; /* user timeout has passed; fire ASAP */

So we should probably just extend TCP_USER_TIMEOUT to the
SYN_SENT/SYN_RECV states and change the man page accordingly.

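To make the interaction above concrete, here is a rough userspace model of
the SYN timer, not kernel code. It assumes a 1 second initial RTO with
exponential backoff, 6 SYN retries, and a 4 ms jiffy (HZ=250); the function
name syn_send_times_ms and all of its parameters are made up for this
illustration. It only models when SYNs leave the host, nothing else.

```python
def syn_send_times_ms(user_timeout_ms, syn_retries=6, initial_rto_ms=1000,
                      jiffy_ms=4, honor_user_timeout=False):
    """Return the times (ms) at which SYNs are sent in this simplified model.

    honor_user_timeout=False mimics the reported behavior: the retransmit
    timer is clamped to the user timeout, but expiry in SYN-SENT only looks
    at the retry count. honor_user_timeout=True mimics the proposed change,
    where expiry also checks the user timeout.
    """
    max_rto_ms = 120_000          # assumed upper bound on the RTO
    times = [0]                   # the initial SYN goes out at t=0
    now = 0
    rto = initial_rto_ms
    retransmits = 0
    while True:
        # Arm the timer, clamping the RTO to what is left of the user timeout.
        if user_timeout_ms > 0:
            remaining = user_timeout_ms - now
            delay = jiffy_ms if remaining <= 0 else min(rto, remaining)
        else:
            delay = rto
        now += delay

        # Timer fires: decide whether to give up or retransmit.
        expired = retransmits >= syn_retries
        if honor_user_timeout and 0 < user_timeout_ms <= now:
            expired = True
        if expired:
            return times

        times.append(now)
        retransmits += 1
        rto = min(rto * 2, max_rto_ms)


if __name__ == "__main__":
    print("clamp only   :", syn_send_times_ms(5000))
    print("with expiry  :", syn_send_times_ms(5000, honor_user_timeout=True))
    print("no user tmo  :", syn_send_times_ms(0))
```

With a 5000 ms user timeout, the "clamp only" case sends SYNs at 0, 1 and 3
seconds and then bursts the remaining retries one jiffy apart from the 5
second mark, which has the same shape as the trace in the report. The exact
spacing on a real host depends on HZ and timer granularity.
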
TCP_USER_TIMEOUT (since Linux 2.6.37)
       This option takes an unsigned int as an argument.  When the
       value is greater than 0, it specifies the maximum amount of time
       in milliseconds that transmitted data may remain unacknowledged
       before TCP will forcibly close the corresponding connection and
       return ETIMEDOUT to the application.  If the option value is
       specified as 0, TCP will use the system default.

       Increasing user timeouts allows a TCP connection to survive
       extended periods without end-to-end connectivity.  Decreasing
       user timeouts allows applications to "fail fast", if so desired.
       Otherwise, failure may take up to 20 minutes with the current
       system defaults in a normal WAN environment.

       This option can be set during any state of a TCP connection, but
       is effective only during the synchronized states of a connection
       (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, and
       LAST-ACK).  Moreover, when used with the TCP keepalive
       (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive
       to determine when to close a connection due to keepalive failure.

       The option has no effect on when TCP retransmits a packet, nor
       when a keepalive probe is sent.

       This option, like many others, will be inherited by the socket
       returned by accept(2), if it was set on the listening socket.

       Further details on the user timeout feature can be found in RFC
       793 and RFC 5482 ("TCP User Timeout Option").
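
The inheritance described in the man page text above is the part that a
behavior change could affect, so here is a small sketch for checking it
from userspace. It assumes Linux and a working loopback interface; the
helper name inherited_user_timeout_ms is made up for this example, and the
fallback value 18 is the Linux value of TCP_USER_TIMEOUT for Python builds
that do not export the constant.

```python
import socket

# Linux-only option; fall back to the Linux numeric value if the constant
# is not exported by this Python build (assumption: we are on Linux).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def inherited_user_timeout_ms(timeout_ms):
    """Set TCP_USER_TIMEOUT on a listener and read it back from a child."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
        listener.bind(("127.0.0.1", 0))   # port 0: let the kernel pick one
        listener.listen(1)
        listener.settimeout(5)            # do not hang if accept() stalls
        with socket.create_connection(listener.getsockname(), timeout=5):
            child, _addr = listener.accept()
            with child:
                return child.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    print("child TCP_USER_TIMEOUT (ms):", inherited_user_timeout_ms(5000))
```

This only shows that the value reaches the accepted socket. It says nothing
about whether SYNACK retransmits on the listener honor it.
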