I'm looking for a solution to a problem with small amounts of packet loss through an OpenVPN tunnel, and I think the root cause may be some sort of subtle bug in OpenVPN. It appears to be dropping small packets - such as those that result from fragmenting larger frames for crypto/encapsulation - as well as the small frames used by PPP/PPPoE for link control. We aren't dealing with actual packet loss of the encrypted UDP packets carrying the tunneled data, nor are we dealing with firewalls or other common OpenVPN FAQ items. Please forgive my verbosity; here's what I have so far:

I do bridging through an OpenVPN tap device, and I'm basically passing a *lot* of different PPPoE sessions through it. I first became aware of the problem back in February through a very simple observation: there appeared to be an occasional network disruption that would cause just one (random) PPPoE client to think the server had gone away, but as soon as the client tried to re-establish PPPoE, the server would 'be there' again, PPP would successfully renegotiate a new session, and everything would be fine.

The PPP protocol can use (and all of my clients do) a sub-protocol called LCP echo, which is basically a protocol-layer ping between the client and server to verify connectivity. In my configuration, this check runs once every 10 seconds, and after 5 failures in a row the client decides the server is dead, sends a PADT frame with a message embedded in it to the other side, and recycles itself to begin a new session negotiation.
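For reference, that timing corresponds to the standard pppd LCP echo options; a sketch of the relevant fragment (values match the 10-second/5-failure behaviour described above; the file path assumes a stock pppd setup):

```
# /etc/ppp/options (illustrative fragment)
lcp-echo-interval 10   # send an LCP echo-request every 10 seconds
lcp-echo-failure 5     # declare the peer dead after 5 unanswered echoes
```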

I experience issues where random clients will, for no network reason I am aware of, renegotiate their sessions, and I have spent much time investigating the circumstances surrounding these events. The key piece of information is recorded on either the PPPoE client or the PPPoE server: one of them will claim 'no response to 5 echo requests', which implies that the network path between them has been unavailable for at least 50 continuous seconds (or has suffered enough packet loss to amount to the same thing).

I have focused on this symptom - apparent packet loss severe enough to cause client disconnection - and what I found is that it is NOT happening: at no time did the disconnecting client actually suffer a real loss of connectivity. I have been able to ping sample clients over the routed portion of the network (my clients have PPPoE through the VPN, and a statically routed IP address as well) during these episodes and have observed NO PACKET LOSS, and yet the client in question clearly states 'no echo replies' as the reason for dropping and renegotiating the session.

This has been maddening, to say the least. If I can ping the client, and the pings take exactly the same path through the network as the encapsulated PPPoE frames, and the pings all succeed at the same time the PPPoE client thinks its partner is gone, then logically there has to be another explanation.

I have recently been able to observe this situation first-hand, with tcpdump running on both the OpenVPN server and client sides. I saw frames go into the tunnel that _did not_ make it out the other side. Worse yet, the frames not making it out (the encapsulated PPPoE frames) were all destined for one MAC address in particular. For whatever reason, OpenVPN appeared unable to decapsulate any frames destined for that one MAC address for some period of time.
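In case anyone wants to reproduce the capture, this is roughly the filter I run on the tap bridge. The interface name and MAC address are placeholders for your own setup; 0x8863/0x8864 are the standard PPPoE discovery/session ethertypes:

```shell
# Watch one client's PPPoE frames crossing the tap bridge
# (tap0 and the MAC are placeholders for the affected client)
mac="00:11:22:33:44:55"
filter="ether host $mac and (ether proto 0x8863 or ether proto 0x8864)"
echo "tcpdump -e -n -i tap0 '$filter'"
```

Running the printed command on both the server and client tap interfaces is what let me see frames entering one side and never emerging from the other.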

Suspiciously, I have also been observing an excessive number of ICMP "fragment reassembly time exceeded" messages coming from this OpenVPN client, directed at the server. Putting two and two together, these excess ICMP messages appear to be generated because the client is not receiving all of the fragments. And I think it's the encrypted payload that isn't being received properly.
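Those ICMP messages should be mirrored in the kernel's IP reassembly-failure counter, which gives a way to watch the problem without a packet capture. A sketch, assuming a Linux client with /proc/net/snmp (the awk looks up the ReasmFails column by name from the header line rather than hard-coding a field index):

```shell
# Read the IP fragment-reassembly-failure counter on the OpenVPN client;
# if it climbs in step with the observed loss, fragments are going missing
reasm_fails=$(awk '/^Ip:/ { for (i=1;i<=NF;i++) col[i]=$i; getline;
    for (i=1;i<=NF;i++) if (col[i]=="ReasmFails") print $i }' /proc/net/snmp)
echo "ReasmFails: $reasm_fails"
```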

I have a good test case right now. If I ping with 1393 bytes or more of data, it doesn't work reliably, whereas if I ping with 1392 bytes, it works reliably and without loss. Here is an example:


[root@overlord ~]# ping -s1393 172.16.18.18
PING 172.16.18.18 (172.16.18.18) 1393(1421) bytes of data.
1401 bytes from 172.16.18.18: icmp_seq=6 ttl=62 time=42.7 ms
1401 bytes from 172.16.18.18: icmp_seq=7 ttl=62 time=6.89 ms
1401 bytes from 172.16.18.18: icmp_seq=69 ttl=62 time=7.61 ms
1401 bytes from 172.16.18.18: icmp_seq=130 ttl=62 time=8.66 ms
1401 bytes from 172.16.18.18: icmp_seq=131 ttl=62 time=8.19 ms
^C
--- 172.16.18.18 ping statistics ---
137 packets transmitted, 5 received, 96% packet loss, time 136261ms
rtt min/avg/max/mdev = 6.893/14.813/42.703/13.957 ms, pipe 2

Now the 'good' size:
[root@overlord ~]# ping -s1392  -c 100 172.16.18.18
PING 172.16.18.18 (172.16.18.18) 1392(1420) bytes of data.
1400 bytes from 172.16.18.18: icmp_seq=0 ttl=62 time=10.2 ms
1400 bytes from 172.16.18.18: icmp_seq=1 ttl=62 time=9.89 ms
1400 bytes from 172.16.18.18: icmp_seq=2 ttl=62 time=9.96 ms
. . . ^C
--- 172.16.18.18 ping statistics ---
86 packets transmitted, 86 received, 0% packet loss, time 85247ms
rtt min/avg/max/mdev = 8.169/16.197/138.610/21.278 ms, pipe 2


        So what does it mean that a packet just one byte larger is
unreliable, as opposed to simply not working at all?!?
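A back-of-the-envelope calculation suggests the 1392/1393 boundary is where the outer UDP datagram starts exceeding a 1500-byte path MTU and fragmenting. The header sizes below are standard; the inferred per-packet OpenVPN overhead is a deduction from the observed boundary, not a measured value:

```shell
# Largest ping payload that works reliably, per the test above
payload=1392
inner_ip=$((payload + 8 + 20))   # + ICMP header + IPv4 header
pppoe=$((inner_ip + 2 + 6))      # + PPP protocol field + PPPoE header
frame=$((pppoe + 14))            # + Ethernet header on the tap bridge
implied=$((1500 - 28 - frame))   # 1500 MTU - outer IPv4/UDP - frame
echo "largest non-fragmenting frame: $frame bytes"
echo "implied OpenVPN crypto/encapsulation overhead: $implied bytes"
```

If that arithmetic holds, anything over 1392 bytes of payload pushes the encrypted UDP packet past 1500 and it gets fragmented in transit, which would explain both the reassembly-timeout ICMPs and why larger pings are merely unreliable rather than impossible.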

I'm not really sure where to go next with this, other than to report these damned strange observations and hope someone can suggest next steps for resolving the basic packet loss problem. Please bear in mind that I have to be careful with my testing because this system is in production and I can't afford to take it completely down for long periods of time. I do have a backup system which has everything the production system does - except clients - which I can test against, but I haven't been able to reproduce the problem on it (and I fear that is because of traffic loading).
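One thing I plan to try on the backup system is having OpenVPN handle the fragmentation itself instead of leaving it to the IP layer. These are real OpenVPN options, but the values below are guesses to be tuned, not a verified fix:

```
# OpenVPN config fragment (both ends must agree on 'fragment')
tun-mtu 1500
fragment 1300   # internally fragment/reassemble datagrams over 1300 bytes
mssfix 1300     # clamp TCP MSS (won't help ICMP or PPP control frames)
```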

Mike


