Results from the Open vSwitch agent...

I highly recommend reading further, but here's the TL;DR: Using physical
network interfaces with MTUs larger than 1500 reveals problems in several
places on both the controller and compute nodes, but only in Linux
components, not in Open vSwitch components (such as br-int). Most of the
problems involve MTU disparities in security group bridge components on the
compute node.

First, review the OpenStack bits and resulting network components in the
environment [1] and see that a typical 'ping' works using IPv4 and IPv6 [2].

[1] https://gist.github.com/ionosphere80/23655bedd24730d22c89
[2] https://gist.github.com/ionosphere80/5f309e7021a830246b66

Note: The tcpdump output in each case references up to seven points:
neutron router gateway on the public network (qg), namespace end of the
neutron router interface on the private network (qr), controller node end
of the VXLAN network (underlying interface), compute node end of the VXLAN
network (underlying interface), Open vSwitch end of the veth pair for the
security group bridge (qvo), Linux bridge end of the veth pair for the
security group bridge (qvb), and the bridge end of the tap for the VM (tap).

I can use SSH to access the VM because every component between my host and
the VM supports at least a 1500 MTU. So, let's configure the VM network
interface to use the proper MTU of 9000 minus the VXLAN protocol overhead
of 50 bytes... 8950... and try SSH again.

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:ea:22:3a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:feea:223a/64 scope global dynamic
       valid_lft 86396sec preferred_lft 14396sec
    inet6 fe80::f816:3eff:feea:223a/64 scope link
       valid_lft forever preferred_lft forever

Contrary to the Linux bridge experiment, I can still use SSH to access the
VM. Why?

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
maximum for a VXLAN segment with 8950 MTU.
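
For reference, a minimal sketch of where those payload sizes come from. It
assumes the standard minimum header sizes (no IPv4 options or IPv6
extension headers) and an IPv4 underlay for VXLAN; the function and
constant names are only illustrative.

```python
# Header sizes in bytes (standard minimums, assumed for this sketch).
# VXLAN overhead: inner Ethernet (14) + VXLAN (8) + UDP (8) + outer IPv4 (20).
VXLAN_OVERHEAD = 50
IPV4_HEADER = 20
IPV6_HEADER = 40
ICMP_HEADER = 8  # ICMP and ICMPv6 echo headers are both 8 bytes


def overlay_mtu(physical_mtu, overhead=VXLAN_OVERHEAD):
    """MTU available to a VM on the overlay network."""
    return physical_mtu - overhead


def max_ping_payload(mtu, ipv6=False):
    """Largest 'ping -s' value that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER


mtu = overlay_mtu(9000)
print(mtu, max_ping_payload(mtu), max_ping_payload(mtu, ipv6=True))
# prints: 8950 8922 8902
```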

# ping -c 1 -s 8922 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8922(8950) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8902 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=1500

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

Look at the tcpdump output [3]. The router namespace, operating at layer-3,
sees the MTU discrepancy between the inbound packet and the neutron router
gateway on the public network and returns an ICMP "fragmentation needed" or
"packet too big" message to the sender. The sender uses the MTU value in
the ICMP message to recalculate the packet length and caches that MTU for
future packets.

[3] https://gist.github.com/ionosphere80/4e1389a34fd3a628b294
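
The PMTUD exchange described above can be sketched roughly as follows. This
is a simplified model, not how the kernel implements it: it assumes every
packet has DF set (as with 'ping -M do'), treats each hop as a single
egress MTU, and uses hypothetical function names.

```python
def route_packet(length, egress_mtu):
    """Decide what a layer-3 hop does with a packet that has DF set."""
    if length <= egress_mtu:
        return ("forward", None)
    # IPv4 "fragmentation needed" or IPv6 "packet too big". The message
    # carries the MTU of the interface the packet could not leave through.
    return ("too_big", egress_mtu)


def send_with_pmtud(length, hop_mtus):
    """Return (delivered_length, attempts) for a sender that honours PMTUD."""
    attempts = 0
    while True:
        attempts += 1
        for mtu in hop_mtus:
            action, reported_mtu = route_packet(length, mtu)
            if action == "too_big":
                length = reported_mtu  # sender caches this as the path MTU
                break
        else:
            # No hop objected, so the packet reached its destination.
            return length, attempts


# A 9000 MTU sender behind a router interface that assumes 1500.
print(send_with_pmtud(8950, [9000, 1500]))  # (1500, 2)
```

The point of the model is that a layer-3 hop reports the problem back to
the sender, which is why SSH keeps working at a reduced packet size.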

Although PMTUD enables communication between my host and the VM, it limits
the MTU to 1500 regardless of the MTU between the router namespace and the
VM and therefore could impact performance on 10 Gbps or faster networks.
Also, it does not address the MTU disparity between a VM and network
components on the compute node. If a VM uses a 1500 or smaller MTU, it
cannot send packets that exceed the MTU of the tap interface, veth pairs,
and bridge on the compute node. In this situation, which seems fairly
typical for operators trying to work around MTU problems, communication
between a host (outside of OpenStack) and a VM always works. However, what
if a VM uses an MTU larger than 1500 and attempts to send a large packet?
The bridge or veth pairs would drop it because of the MTU disparity.

Using observations from the Linux bridge experiment, let's configure the
MTU of the interfaces in the router namespace to match the interfaces
outside of the namespace. The public network (gateway) interface MTU
becomes 9000 and the private network router interfaces (IPv4 and IPv6)
become 8950.

31: qr-d744191c-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:34:67:40 brd ff:ff:ff:ff:ff:ff
32: qr-ae54b450-b4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:d4:f1:63 brd ff:ff:ff:ff:ff:ff
33: qg-e3303f07-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:70:09:54 brd ff:ff:ff:ff:ff:ff

Let's ping again with a payload size of 8922 for IPv4, the maximum for a
VXLAN segment with 8950 MTU, and look at the tcpdump output [4]. For
brevity, I'm only showing IPv4 because IPv6 provides similar results.

# ping -c 1 -s 8922 -M do 10.100.52.102

[4] https://gist.github.com/ionosphere80/703925fbe4ae53e78445

The packet traverses the Open vSwitch infrastructure including the overlay.
However, looking at the compute node, the integration bridge drops the
packet because the MTU changes from 8950 to 1500 over a layer-2 connection
without a router.

Let's increase the MTU on the OVS end of the veth pair to 8950, and ping
again using the same payload. For brevity, I'm only showing tcpdump output
for interfaces on the compute node [5].

# ping -c 1 -s 8922 -M do 10.100.52.102

[5] https://gist.github.com/ionosphere80/0f0d4cf346ee81e43cbb

The packet gets one step further. The veth pair between the Open vSwitch
integration bridge and security group bridge drops the packet because the
MTU changes from 8950 to 1500 over a layer-2 connection without a router.

Let's increase the MTU on the Linux bridge end of the veth pair to 8950 and
ping again using the same payload. For brevity, I'm only showing tcpdump
output for interfaces on the compute node [6].

[6] https://gist.github.com/ionosphere80/dd9270aae23ad286d9cd

The packet gets one step further. The VM tap interface drops the packet
because the MTU changes from 8950 to 1500 over a layer-2 connection without
a router.

Let's perform the final MTU increase on the VM tap interface and ping again
using the same payload. For brevity, I'm only showing tcpdump output for
interfaces on the compute node [7].

[7] https://gist.github.com/ionosphere80/05e02c7a753fad4b2964

Ping works.
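
The step-by-step progression above can be modelled with a small sketch. It
assumes a deliberately simple view of the compute node: an ordered set of
layer-2 devices, each of which silently drops any packet larger than its
MTU. The short device names stand in for the real qvo, qvb, and tap
interfaces.

```python
def first_drop(path, packet_length):
    """Return the first layer-2 device whose MTU is below the packet length."""
    for name, mtu in path.items():
        if packet_length > mtu:
            return name  # dropped silently: no router here to send ICMP
    return None


# Compute-node devices in the order an inbound packet crosses them.
path = {"qvo": 1500, "qvb": 1500, "tap": 1500}
drops = []
for device in list(path):
    drops.append(first_drop(path, 8950))
    path[device] = 8950  # raise the MTU of the device that dropped the packet
drops.append(first_drop(path, 8950))
print(drops)  # ['qvo', 'qvb', 'tap', None]
```

Each MTU increase only moves the drop one device further along until every
device on the layer-2 path carries the overlay MTU.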

Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte
larger than the maximum for a VXLAN segment with 8950 MTU. The router
namespace, operating at layer-3, sees the MTU discrepancy between the two
interfaces in the namespace and returns an ICMP "fragmentation needed" or
"packet too big" message to the sender. The sender uses the MTU value in
the ICMP message to recalculate the packet length and caches that MTU for
future packets.

# ping -c 1 -s 8923 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8923(8951) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 8950)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8903 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ip route get to 10.100.52.102
10.100.52.102 dev eth1  src 10.100.52.45
    cache  expires 499sec mtu 8950

# ip route get to fd00:100:52:1:f816:3eff:feea:223a
fd00:100:52:1:f816:3eff:feea:223a from :: via fd00:100:52::101 dev eth1  src fd00:100:52::45  metric 0
    cache  expires 544sec mtu 8950
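
If you want to check the cached path MTU from a script, a sketch along
these lines could work. It assumes the 'ip route get' output format shown
above, which may differ between iproute2 versions; the helper name is
hypothetical.

```python
import re


def cached_route_mtu(ip_route_output):
    """Extract the cached path MTU from 'ip route get' output, if present."""
    match = re.search(r"\bmtu (\d+)", ip_route_output)
    return int(match.group(1)) if match else None


# Sample copied from the IPv4 output above.
sample = """10.100.52.102 dev eth1  src 10.100.52.45
    cache  expires 499sec mtu 8950"""
print(cached_route_mtu(sample))  # 8950
```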

This experiment reveals a number of problems with the Open vSwitch agent,
none of which seem to involve Open vSwitch itself.

1) As with the Linux bridge agent, interfaces in namespaces assume a 1500
MTU, which prevents communication with VMs using larger packets. However,
the method OVS uses to manage interfaces in namespaces permits them to
generate ICMP messages for PMTUD that notify senders of the correct MTU.
2) Although interfaces in namespaces generate ICMP messages for PMTUD, they
assume a 1500 MTU and therefore limit performance on 10 Gbps or faster
networks regardless of the MTU between the router namespace and a VM.
3) The Open vSwitch agent creates Linux bridges on compute nodes to
implement security groups. These bridges do not contain ports on physical
network interfaces (using a larger MTU) and therefore assume a 1500 MTU.
The veth pairs and tap interfaces also assume a 1500 MTU. Unlike with the
Linux bridge agent, increasing only the MTU of the namespace end of the
veth pair for the neutron router interface on the private network simply
moves the problem to the security group bridge components. The latter
components (qvo, qvb, and tap) should all use the MTU of the physical
network minus the overlay protocol overhead, or 8950 for VXLAN in this
particular experiment.
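
As a rough illustration of the manual fix for item 3, the sketch below
builds the 'ip link set' commands for the security group bridge components
of one VM port. The device names are placeholders, not the ones from this
environment, and the helper only prints commands rather than running them.
Changes made this way would presumably not survive the agent recreating
the devices.

```python
def mtu_commands(devices, physical_mtu, overhead=50):
    """Build 'ip link set' commands that apply the overlay MTU to devices."""
    mtu = physical_mtu - overhead  # 50 bytes of VXLAN overhead by default
    return [f"ip link set dev {dev} mtu {mtu}" for dev in devices]


# Placeholder names for the qvo, qvb, and tap devices of a single VM port.
devices = ["qvoXXXXXXXX-XX", "qvbXXXXXXXX-XX", "tapXXXXXXXX-XX"]
for command in mtu_commands(devices, 9000):
    print(command)
```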

Matt

On Mon, Jan 25, 2016 at 12:10 PM, Rick Jones <rick.jon...@hpe.com> wrote:

> On 01/24/2016 07:43 PM, Ian Wells wrote:
>
>> Also, I say 9000, but why is 9000 even the right number?
>>
>
> While that may have been a rhetorical question...
>
> Because that is the value Alteon picked in the late 1990s when they
> created the de facto standard for "Jumbo Frames" by including it in their
> Gigabit Ethernet kit as a way to enable the systems of the day to have a
> hope of getting link-rate :)
>
> Perhaps they picked 9000 because it was twice the 4500 of FDDI, which
> itself was selected to allow space for 4096 bytes of data and then a good
> bit of headers.
>
>
>
> rick jones
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>