Hello,

it took a while to build a test system for bisecting the issue. Finally, I've 
identified the patch that causes my problems.
BTW, the fq packet scheduler is in use.

It's 
[PATCH net-next] tcp/fq: move back to CLOCK_MONOTONIC

In the recent TCP/EDT patch series, I switched TCP and sch_fq clocks from 
MONOTONIC to TAI, in order to match the choice made earlier for the sch_etf 
packet scheduler.

But sure enough, this broke some setups where the TAI clock jumps forward (by 
almost 50 years...), as reported by Leonard Crestez.

If we want to converge later, we'll probably need to add an skb field to 
differentiate the clock bases, or a socket option.

In the meantime, a UDP application will need to use the CLOCK_MONOTONIC base 
for its SCM_TXTIME timestamps if using the fq packet scheduler.

Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
Signed-off-by: Eric Dumazet <edumazet@xxxxxxxxxx>
Reported-by: Leonard Crestez <leonard.crestez@xxxxxxx>

----
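As an aside, if I understand the quoted commit message correctly, a UDP 
application that attaches transmit times has to use CLOCK_MONOTONIC both for 
the SO_TXTIME socket option and for the per-packet SCM_TXTIME value whenever 
fq is the qdisc. A rough userspace sketch of what I think that looks like 
(hypothetical helper, untested in my setup, error handling trimmed, socket 
and destination setup assumed to happen elsewhere):

#include <linux/net_tstamp.h>	/* struct sock_txtime */
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>

#ifndef SO_TXTIME
#define SO_TXTIME	61
#define SCM_TXTIME	SO_TXTIME
#endif

/* Send one UDP packet with a transmit time "delay_ns" nanoseconds in the
 * future, on the CLOCK_MONOTONIC base that fq expects. "fd" is a UDP socket
 * and "dst" the destination, both set up by the caller.
 */
static ssize_t send_with_txtime(int fd, const void *buf, size_t len,
				const struct sockaddr_in *dst, uint64_t delay_ns)
{
	struct sock_txtime cfg = {
		.clockid = CLOCK_MONOTONIC,	/* must match fq's clock base */
		.flags	 = 0,
	};
	struct timespec now;
	uint64_t txtime;

	if (setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg)) < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &now);
	txtime = (uint64_t)now.tv_sec * 1000000000ULL + now.tv_nsec + delay_ns;

	char control[CMSG_SPACE(sizeof(txtime))];
	memset(control, 0, sizeof(control));

	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_name	= (void *)dst,
		.msg_namelen	= sizeof(*dst),
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
		.msg_control	= control,
		.msg_controllen	= sizeof(control),
	};

	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type  = SCM_TXTIME;
	cm->cmsg_len   = CMSG_LEN(sizeof(txtime));
	memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));

	return sendmsg(fd, &msg, 0);
}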

After reverting it in a current 5.2.18 kernel, the problem disappears. There 
were some follow-up fixes for other issues caused by this patch. These fixed 
other, similar issues, but not mine. I've already tried setting the tstamp to 
zero in xfrm4_output.c, but with no luck so far. I'm pretty sure that 
reverting the clock patch isn't the proper solution for upstream. So in what 
other way can this be fixed?
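
For completeness, the change I tried looks roughly like this (the placement is 
my own guess and the function body is written down from memory from 
net/ipv4/xfrm4_output.c in 5.2, so treat it as an illustration rather than the 
actual diff):

int xfrm4_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	/* Attempt: drop any wall-clock stamp (e.g. set by __net_timestamp())
	 * before the packet enters the IPsec output path, mirroring the
	 * forwarding-path fixes quoted below.  Did not help in my setup.
	 */
	skb->tstamp = 0;

	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
			    net, sk, skb, NULL, skb_dst(skb)->dev,
			    __xfrm4_output,
			    !(IPCB(skb)->flags & XFRM_PROCESSED));
}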

---
[PATCH net] net: clear skb->tstamp in bridge forwarding path

Matteo reported forwarding issues inside the Linux bridge if the enslaved 
interfaces use the fq qdisc.

Similar to commit 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths"), 
we need to clear the tstamp field in
the bridge forwarding path.

Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
Reported-and-tested-by: Matteo Croce <mcr...@redhat.com>
Signed-off-by: Paolo Abeni <pab...@redhat.com>

and

net: clear skb->tstamp in forwarding paths

Sergey reported that forwarding was no longer working if the fq packet 
scheduler was used.

This is caused by the recent switch to the EDT model, since incoming packets 
might have been timestamped by __net_timestamp().

__net_timestamp() uses ktime_get_real(), while fq expects packets using the 
CLOCK_MONOTONIC base.

The fix is to clear skb->tstamp in forwarding paths.

Fixes: 80b14dee ("net: Add a new socket option for a future transmit time.")
Fixes: fb420d5d ("tcp/fq: move back to CLOCK_MONOTONIC")
Signed-off-by: Eric Dumazet <eduma...@google.com>
Reported-by: Sergey Matyukevich <geoma...@gmail.com>
Tested-by: Sergey Matyukevich <geoma...@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
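
If I read these two fixes right, the actual change in each case boils down to 
a single reset early in the respective forwarding function, along the lines of 
(placement approximated from the commit descriptions, not copied from the real 
diffs):

	skb->tstamp = 0;	/* don't let fq treat a wall-clock stamp as an EDT deadline */

once in the IPv4/IPv6 forwarding path and once in the bridge forwarding path. 
That is the same pattern I tried in xfrm4_output.c above, which makes me 
wonder whether the IPsec/xfrm path in my setup simply needs the equivalent 
clear in a different place.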

Best regards,
--
Thomas Bartschies
CVK IT Systeme


-----Original Message-----
From: Bartschies, Thomas 
Sent: Tuesday, 17 September 2019 09:28
To: 'David Ahern' <dsah...@gmail.com>; 'netdev@vger.kernel.org' 
<netdev@vger.kernel.org>
Subject: RE: big ICMP requests get disrupted on IPSec tunnel activation

Hello,

thanks for the suggestion. Running pmtu.sh with kernel versions 4.19, 4.20 and 
even 5.2.13 made no difference. All tests were successful every time.

My external ping tests still fail with the newer kernels, though. I ran the 
script after triggering my problem, to make sure all possible side effects 
were present.

Please keep in mind that even when the ICMP requests stall, other connections 
such as ssh or tracepath still go through. I would expect all connection types 
to be affected if this were an MTU problem. Am I wrong?

Any suggestions for more tests to isolate the cause? 

Best regards,
--
Thomas Bartschies
CVK IT Systeme

-----Original Message-----
From: David Ahern [mailto:dsah...@gmail.com]
Sent: Friday, 13 September 2019 19:13
To: Bartschies, Thomas <thomas.bartsch...@cvk.de>; 'netdev@vger.kernel.org' 
<netdev@vger.kernel.org>
Subject: Re: big ICMP requests get disrupted on IPSec tunnel activation

On 9/13/19 9:59 AM, Bartschies, Thomas wrote:
> Hello together,
> 
> since kernel 4.20 we're observing a strange behaviour when sending big ICMP 
> packets. An example is a packet size of 3000 bytes.
> The packets should be forwarded by a Linux gateway (firewall) with multiple 
> interfaces that also acts as a VPN gateway.
> 
> Test steps:
> 1. Disable all iptables rules.
> 2. Enable the VPN IPsec policies.
> 3. Start a ping with a large packet size (e.g. 3000 bytes) from a client in 
> the DMZ, passing through the machine and targeting another LAN machine.
> 4. Ping works.
> 5. Enable a VPN policy by sending pings from the gateway to a tunnel target. 
> The system tries to create the tunnel.
> 6. The ping from step 3 immediately stalls. No error messages. It just stops.
> 7. Stop the ping from step 3. Start another without the packet size 
> parameter. It stalls as well.
> 
> Result:
> Connections from the client to other services on the LAN machine still 
> work. Tracepath works. Only ICMP requests no longer pass the gateway. 
> tcpdump sees them on the incoming interface, but not on the outgoing LAN 
> interface. ICMP requests to any other target IP address in the LAN still 
> work, until one uses a bigger packet size; then those alternative 
> connections stall as well.
> 
> Flushing the policy table has no effect. Flushing the conntrack table has no 
> effect. Setting rp_filter to loose (2) has no effect.
> Flushing the route cache has no effect.
> 
> Only a reboot of the gateway restores normal behavior.
> 
> What can be the cause? Is this a networking bug?
> 

Some of these tests will most likely fail for other reasons, but can you run 
'tools/testing/selftests/net/pmtu.sh' [1] on 4.19 and then 4.20 and compare 
the results? Hopefully it will shed some light on the problem and can be used 
to bisect to the commit that caused the regression.


[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/pmtu.sh
