Credit to Weitse who noticed the strange TSval in the pcap.
It does look like an issue with Checkpoint. Good thing is this is reproducible.
John
-------- Messaggio originale --------
Oggetto: Re: AWS timeout
Da: Frank Hare
A: John Fawcett
CC: postfix-users@postfix.org
John,Wow, good one dude! Turning off tcp timestamps on the client instantly fixed it. So I guess I bring this to Checkpoint Support and see what they say, THAT should be fun.Thanks!On Tue, May 14, 2019 at 3:02 AM John Fawcett <j...@voipsupport.it> wrote:On 14/05/2019 01:27, Wietse Venema wrote:
> Wietse Venema:
>> If you look at the non-VPN captures, then you will see the following:
>>
>> - In one trace, we see a client ACK 138, followed by a client packet
>> with ".<CR><LF>" (data 443:446, ACK 138, and a timestamp field
>> tht is unlike those of al other packets in the stream).
>>
>> - In the other trace, we see that the ACK and ".<CR><LF>" packets
>> are sent as one packet, with a normal timestamp field.
>>
>> - After this, the TCP connection is messed up, the server keeps
>> transmitting "Queued as xxx", and the client keeps transmitting
>> QUIT.
>>
>> This looks like the VPN mucks with TCP and screws up the protocol.
> I verified this another time. On the SMTP client side, there is one packet,
> and on the SMTP server side there are two.
>
> Here is a diff -U output for client-side and server-side pcap files.
>
> -TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 1
> 38, win 211, options [nop,nop,TS val 2968348176 ecr 687501956], length 3: SMTP:
> .
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [.], ack 138, win 211, o
> ptions [nop,nop,TS val 2968348176 ecr 687501956], length 0
> +TIMESTAMP IP 10.110.2.9.33256 > 10.1.3.134.smtp: Flags [P.], seq 443:446, ack 138, win 211, options [nop,nop,TS val 1 ecr 0], length 3: SMTP: .
>
> On the client side we see that the client sends one TCP packet with
> three data bytes (offset 443:446, content ".<CR><LF>"), an ACK for
> server offset 138, and Timestamp Value 2968348176 and Timestamp
> Echo Reply 687501956.
>
> On the server side we see that the server receives two TCP packets.
>
> - One packet with the same ACK for server offset 138, and the same
> Timestamp Value 2968348176 and Timestamp Echo Reply 687501956, as
> the client sent, but no data bytes.
this looks like a duplicate ACK that was generated on the server side
VPN node.
> - One packet with the same ACK for server offset 138, and the same
> three data bytes as the client sent, but with a Timestamp Value of
> 1 and a Timestamp Echo Reply of 0.
The TSval of 1 was generated on the server side VPN node. Seems that the
packet is accepted anyway on the server because we can see that Postfix
responds to that. When the next packet containing the QUIT command comes
throught the TSVal has jumped back to the original clock sequence (as
generated on the client) 2968348359. I can't see why the Linux kernel
would reject that jump, but maybe it's happening on a lower level.
I'd suggest looking into potential bugs in the tcp/ip stack of the vpn
node when it generates a TSval of 1 on that packet.
Another possibility would be to see if turning off tcp timestamps on the
client solves it (systctl net.ipv4.tcp_timestamps=0). I'm not sure I'd
leave it that way, but could be useful to test it.
>
> And from here on TCP appears to be stuck. The server repeatedly
> sends "Ok: queued as 58CE8C0058<CR><LF>" and the client repeatedly
> sends "QUIT<CR><LF>".
>
> Anyway, something is breaking TCP and it isn't Postfix. Maybe the
> VPN is trying to be too smart, maybe something in your TCP stack
> is trying to shape traffic and messing up.
>
> Wietse