I'm looking for a solution to a problem with small amounts of packet
loss through an OpenVPN tunnel, and I think the root cause may be some
sort of subtle bug in OpenVPN. It appears to be dropping small packets
- such as those that result from fragmenting larger frames for
crypto/encapsulation - as well as the small frames used by ppp/pppoe
for link control. We aren't dealing with actual packet loss of the
encrypted UDP packets carrying the tunneled data, nor are we dealing
with firewalls or other common OpenVPN FAQ items. Please forgive my
verbosity; here's what I have so far:
I do bridging through an OpenVPN tap device, and I'm basically passing
a *lot* of different pppoe sessions through it. I initially became
aware of the problem back in February, and it was due to a very simple
observation - there appeared to be an occasional network disruption
which would cause just one (seemingly random) pppoe client to think
the server had gone away, but then immediately when the client tried
to re-establish pppoe, the server would 'be there' again, ppp would
successfully renegotiate a new session, and everything would be fine.
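For reference, the bridging is the usual tap-plus-brctl arrangement.
A rough sketch of it (the interface names here are placeholders, not
my actual ones):

# bridge the physical NIC facing the pppoe gear with the OpenVPN tap
brctl addbr br0
brctl addif br0 eth1      # NIC carrying the raw pppoe frames
brctl addif br0 tap0      # OpenVPN tap device
ifconfig eth1 0.0.0.0 promisc up
ifconfig tap0 0.0.0.0 promisc up
ifconfig br0 up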
The PPP protocol can use (and all of my clients do use) a sub-protocol
called 'lcp-echo', which is basically a protocol-layer ping between
the client and server to verify connectivity. In my configuration,
this check happens once every 10 seconds, and if there are 5 failures
in a row, the client decides that the server is dead, sends a 'padt'
protocol frame with a message embedded in it to the other side, and
recycles itself to begin a new session negotiation.
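In pppd terms the relevant knobs are roughly the following (the values
shown match my 10-second / 5-failure setup; the exact options file
location varies by distribution):

# /etc/ppp/options (or the per-peer options file)
lcp-echo-interval 10    # send an LCP echo-request every 10 seconds
lcp-echo-failure 5      # declare the peer dead after 5 unanswered echoes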
I experience issues where random clients will, for no network reason I
am aware of, renegotiate their sessions, and I have spent a lot of
time investigating the circumstances surrounding these events. The key
bit of information is recorded either on the pppoe client or the pppoe
server, where one of them will claim 'no response to 5 echo requests',
which implies that the network path between them has been unavailable
for at least 50 seconds continuously (or has seen enough packet loss
to amount to the same thing).
I have focused on this symptom - of there being enough (apparent)
packet loss to cause client disconnection - and what I found was that
this is NOT happening, and that at no time did the disconnecting
client actually suffer a real loss of connectivity. I have been able
to ping sample clients over the routed portion of the network (my
clients have pppoe through the VPN, and a statically routed IP address
as well) during these times and have observed NO PACKET LOSS, and yet
the client in question clearly states this (no echo replies) as the
reason for dropping and renegotiating the session.
This has been maddening, to say the least. If I can ping the client,
and the pings take exactly the same path through the network that the
encapsulated pppoe frames do, and the pings all succeed while at the
same time the pppoe client thinks its partner is gone, then logically
there has to be another explanation.
I have recently been able to observe this situation first-hand with
tcpdump running on both the OpenVPN server and client sides. I saw
frames go into the tunnel that _did not_ make it out the other side.
Worse yet, it appeared that the frames not making it out (the
encapsulated pppoe frames) were all destined for one MAC address in
particular. It was as if, for whatever reason, OpenVPN could not
decapsulate any frames destined for that one MAC address for some
period of time.
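The capture setup was essentially the following, run on both ends at
the same time (tap0, eth0, the suspect MAC address, and port 1194 are
stand-ins for my real values):

# inside the tunnel: watch the bridged pppoe frames for the one client
tcpdump -e -n -i tap0 ether host 00:11:22:33:44:55

# outside the tunnel: watch the encrypted UDP carrier at the same time
tcpdump -n -i eth0 udp port 1194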
Suspiciously, I have also been observing an excessive number of ICMP
"Frag reassembly time exceeded" messages coming from this OpenVPN
client directed at the server. Putting 2 and 2 together, these excess
ICMP messages appear to be generated because the client is not
receiving all of the fragments, and I think it's the encrypted payload
that's not being received properly.
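For anyone who wants to check for the same thing, this is roughly how
I am watching it (interface name again a placeholder; eth0 here is
whatever carries the tunnel's UDP traffic):

# running count of IP reassembly failures on the client box
netstat -s | grep -i reassembl

# or watch the "frag reassembly time exceeded" messages go by live
tcpdump -n -i eth0 'icmp[icmptype] == icmp-timxceed and icmp[icmpcode] == 1'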
I have a good test case right now. If I ping with 1393 bytes or more
of data, it doesn't work reliably, whereas if I ping with 1392 bytes,
it works reliably and without loss. Here is an example:
[root@overlord ~]# ping -s1393 172.16.18.18
PING 172.16.18.18 (172.16.18.18) 1393(1421) bytes of data.
1401 bytes from 172.16.18.18: icmp_seq=6 ttl=62 time=42.7 ms
1401 bytes from 172.16.18.18: icmp_seq=7 ttl=62 time=6.89 ms
1401 bytes from 172.16.18.18: icmp_seq=69 ttl=62 time=7.61 ms
1401 bytes from 172.16.18.18: icmp_seq=130 ttl=62 time=8.66 ms
1401 bytes from 172.16.18.18: icmp_seq=131 ttl=62 time=8.19 ms
^C
--- 172.16.18.18 ping statistics ---
137 packets transmitted, 5 received, 96% packet loss, time 136261ms
rtt min/avg/max/mdev = 6.893/14.813/42.703/13.957 ms, pipe 2
Now the 'good' size:
[root@overlord ~]# ping -s1392 -c 100 172.16.18.18
PING 172.16.18.18 (172.16.18.18) 1392(1420) bytes of data.
1400 bytes from 172.16.18.18: icmp_seq=0 ttl=62 time=10.2 ms
1400 bytes from 172.16.18.18: icmp_seq=1 ttl=62 time=9.89 ms
1400 bytes from 172.16.18.18: icmp_seq=2 ttl=62 time=9.96 ms
.
.
.
^C
--- 172.16.18.18 ping statistics ---
86 packets transmitted, 86 received, 0% packet loss, time 85247ms
rtt min/avg/max/mdev = 8.169/16.197/138.610/21.278 ms, pipe 2
So what does it mean that a packet just one byte larger is merely
unreliable, as opposed to not working at all?!
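For what it's worth, the sizes line up suspiciously well with an MTU
boundary: -s1393 produces a 1421-byte IP packet (1393 data + 8 ICMP
header + 20 IP header) while -s1392 produces a 1420-byte one, and 1420
plus OpenVPN's UDP/IP/crypto overhead is right around a 1500-byte
ethernet MTU (the exact overhead depends on cipher and HMAC settings,
so I'm only claiming it's in the ballpark). Repeating the test with DF
set should show exactly where the boundary sits:

# same two sizes, with the DF bit set so nothing along the way is
# allowed to fragment (ping -M do = "prohibit fragmentation")
ping -M do -s 1392 -c 5 172.16.18.18
ping -M do -s 1393 -c 5 172.16.18.18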
I'm not really sure where to go next with this other than to report
these damn strange observations and hope someone can suggest some next
steps to try in resolving the basic packet loss problem. Please bear
in mind that I have to be careful with the testing I do, because I am
running this in production and can't afford to take it completely down
for long periods of time. I do have a backup system which has
everything the production system does - except clients - which I can
test against, but I haven't been able to reproduce the problem on it
yet (and I fear that is because of the difference in traffic loading).
Mike