Hi all,

(for those of you who read openstack-dev@, you may notice some duplication in 
this email compared to the related thread: 
http://lists.openstack.org/pipermail/openstack-dev/2016-June/097189.html If 
that’s the case, sorry!)

tl;dr: many Open vSwitch based SDN controllers plug devices that are meant to 
have different MTUs into the same ‘integration’ bridge (usually called br-int), 
which sometimes makes the MTU settings for those devices ineffective. The 
Neutron team seeks guidance from Open vSwitch folks on how to proceed.

First, I’d like to note that when speaking about ‘Neutron’ below, I 
implicitly mean the ‘Neutron ML2/Open vSwitch reference implementation’. That 
said, I believe the same issues affect other SDN solutions (OVN? dragonflow?) 
built on top of Open vSwitch that use a single integration bridge.

Now, let’s try to scope the problem. Neutron consistently uses a single bridge 
to plug all devices it manages on a node. Those devices may belong to the same 
layer 2 domain ('network' in neutron-speak) or to different layer 2 domains. 
Those domains may be implemented using different encapsulation technologies, 
which in the Neutron ML2 plugin case results in a different MTU value being 
calculated for each network. All devices that belong to a single network are 
supposed to use the network MTU. That includes virtualized interfaces inside 
VMs, as well as devices on the data path from VMs to the integration bridge. 
So, for a typical Neutron Open vSwitch setup, the following devices are meant 
to carry the network MTU:

VM interface - tap device - ‘hybrid’ Linux bridge* - VETH pair => plugged into 
br-int.

(* used for iptables based firewall)

Now, Neutron (OpenStack Networking) and Nova (OpenStack Compute) components set 
the relevant MTUs on all of those devices (except the VM interface, which is 
usually configured by the guest OS itself, based on information provided 
through DHCP/RA responses, or other means).

It all works as long as all devices we plug into br-int belong to networks with 
identical MTUs. But since Neutron allows for different MTUs, the assumption 
does not hold.
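
To make the different MTUs concrete, here is a minimal sketch of how a 
per-network MTU falls out of the encapsulation type. It is not Neutron code: 
the function name and the overhead table are my own, and the overheads are 
simply inferred from the 1450/1458 figures used later in this email, assuming 
a 1500-byte underlay MTU.

```python
# Illustrative only, not Neutron code. Overheads are inferred from the
# VXLAN=1450 and GRE=1458 figures in this email, assuming a 1500-byte
# underlay MTU.
ENCAP_OVERHEAD = {
    "vxlan": 50,
    "gre": 42,
}


def network_mtu(underlay_mtu, encap):
    """Return the largest network MTU that still fits into the underlay."""
    return underlay_mtu - ENCAP_OVERHEAD[encap]
```

With a 1500-byte underlay this gives 1450 for a VXLAN network and 1458 for a 
GRE network, so two networks on the same node can legitimately expect 
different MTUs.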

While Neutron indeed plugs devices that belong to different broadcast domains 
into the same switch, it does not intend to allow traffic that belongs to 
different domains to be switched. (All inter-domain communication is handled by 
virtual routers that are implemented as network namespaces.) Isolation is 
achieved through local VLAN tagging. Quoting:

"All VM VIFs are plugged into the integration bridge. VM VIFs on a given 
virtual network share a common “local” VLAN (i.e. not propagated externally). 
The VLAN id of this local VLAN is mapped to the physical networking details 
realizing that virtual network."

http://docs.openstack.org/developer/neutron/devref/openvswitch_agent.html#bridge-management

What it means is that while devices are plugged into the same bridge, due to 
the additional layer of isolation, Neutron effectively uses a single bridge as 
a set of switches, one per network participating in the bridge setup.
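
The ‘one bridge acting as a set of switches’ idea can be sketched with a toy 
model. This is only an illustration under my own assumptions: the class and 
method names are hypothetical, the tag allocation is simplistic, and real 
isolation is of course done by flows and port tags in Open vSwitch, not by 
Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    network: str  # Neutron network the device belongs to


class IntegrationBridge:
    """Toy model: a single bridge with one local VLAN tag per network."""

    def __init__(self):
        self._local_vlan = {}  # network -> local VLAN id
        self._port_tag = {}    # port name -> local VLAN id

    def plug(self, port):
        # Allocate a new local VLAN the first time a network is seen.
        if port.network not in self._local_vlan:
            self._local_vlan[port.network] = len(self._local_vlan) + 1
        tag = self._local_vlan[port.network]
        self._port_tag[port.name] = tag
        return tag

    def can_switch(self, a, b):
        # Frames are only switched between ports sharing a local tag.
        return self._port_tag[a] == self._port_tag[b]
```

Two ports of the same network get the same tag and can talk at layer 2; a 
port of another network gets a different tag and is only reachable through a 
router.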

So back to MTU. When I boot a VM using a VXLAN backed network, a tap device 
with MTU=1450 is plugged into the br-int bridge, which lowers the bridge MTU to 
1450. Then when I plug a device that belongs to a GRE network (MTU = 1458) into 
that same bridge, the GRE network backed device also gets its MTU reduced to 
1450, and no ‘ip link’ command can raise it back to the intended MTU=1458.
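
The behaviour I observe can be modelled roughly as follows. This is a 
simplification of what I see from the outside, not a description of the 
actual vswitchd logic; the class and its methods are hypothetical.

```python
class Bridge:
    """Toy model of the 'least of all device MTUs' behaviour."""

    def __init__(self):
        self._requested = {}  # device name -> MTU asked for

    def plug(self, device, mtu):
        self._requested[device] = mtu

    def set_mtu(self, device, mtu):
        # Stand-in for retrying with 'ip link': the request is recorded,
        # but the effective value stays capped by the bridge MTU.
        self._requested[device] = mtu

    @property
    def mtu(self):
        # The bridge follows the smallest MTU among its devices.
        return min(self._requested.values())

    def effective_mtu(self, device):
        # In this model every plugged device ends up at the bridge MTU.
        if device not in self._requested:
            raise KeyError(device)
        return self.mtu
```

In this model, plugging a 1450 device and then a 1458 device leaves both at 
1450, and asking for 1458 again changes nothing.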

Curiously, when I move the latter device into a network namespace and then try 
to set the MTU on that same device, it works. (Jiri Benc told me that this is 
due to a missing validation in vswitchd code.) We actually relied on that magic 
in a Neutron fix that makes router devices (which live in a namespace) get 
their intended MTU values: https://review.openstack.org/#/c/327651/ where we 
now first move the device into a namespace, and only then set its MTU.
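
The ordering is the whole point of that fix, so here is a minimal sketch of 
it. It is not the code from the Neutron patch: the helper name, the device 
name and the namespace name are made up for illustration, and the function 
only builds the ‘ip’ command lines instead of running them.

```python
def move_then_set_mtu_commands(device, namespace, mtu):
    """Return 'ip' argv lists in the order: move to namespace, then set MTU."""
    return [
        # Step 1: move the device into the network namespace.
        ["ip", "link", "set", "dev", device, "netns", namespace],
        # Step 2: only now set the MTU, from inside that namespace.
        ["ip", "netns", "exec", namespace,
         "ip", "link", "set", "dev", device, "mtu", str(mtu)],
    ]


if __name__ == "__main__":
    # Hypothetical names, printed rather than executed.
    for argv in move_then_set_mtu_commands("qr-example", "qrouter-example", 1458):
        print(" ".join(argv))
```

Running the two steps in the opposite order is what fails for a device that 
stays on br-int outside of a namespace.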

There are several issues with the Neutron patch. First, it relies on a bug in 
Open vSwitch. Second, it does not solve the problem for other devices that are 
plugged into br-int and that don’t belong to separate namespaces (which is the 
case for all VM VIFs in OpenStack).

One idea that was mentioned to me by Jiri Benc is to reimplement the Neutron 
bridge setup to use multiple bridges, one per network. That way, there would be 
no need to have devices with different MTUs on the same integration bridge. 
Isolation between domains would also be simplified, because we would no longer 
need to maintain any local VLAN tagging rules to isolate domains from each 
other; isolation would happen naturally, since every path between two domains 
would have to cross an L3 hop (a router namespace).
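
As a rough sketch of that layout (my own illustration, not something Jiri or 
Neutron has specified; the ‘br-<network>’ naming and both helpers are 
hypothetical), grouping devices by network gives bridges whose ports all 
share one MTU:

```python
from collections import defaultdict


def bridges_per_network(devices):
    """Group (name, network, mtu) tuples into one bridge per network."""
    bridges = defaultdict(list)
    for name, network, mtu in devices:
        # Hypothetical naming scheme: one bridge named after the network.
        bridges["br-" + network].append((name, mtu))
    return dict(bridges)


def has_uniform_mtu(ports):
    """True if every (name, mtu) port on a bridge uses the same MTU."""
    return len({mtu for _name, mtu in ports}) <= 1
```

With a VXLAN network at 1450 and a GRE network at 1458, each resulting bridge 
is uniform, while the same devices on a single bridge are not.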

If we were starting from scratch, that would probably be the best idea, with 
few drawbacks. Sadly, we are looking at a huge number of setups that rely on a 
single bridge for multiple domains, and as I said before, it’s not just 
Neutron. Migrating those existing workloads to a new, better bridge setup would 
be a huge pain, and I am not even sure whether it’s possible to rewire them 
without fully migrating the workloads to other nodes. That’s a huge engineering 
effort, and something that would need to happen in all affected SDN solutions.

One alternative could be for the kernel/vSwitch layer to allow relaxing the 
‘least of all device MTUs’ rule for setups that explicitly ask for it. 
If such an option were available to SDN controllers, they could use it to 
keep their existing single bridge setup.

And that’s the end of the story. So, what do you think of the problem? Is 
the proposed alternative viable? If so, what’s the proper place for such 
configuration to exist - kernel or ovs?

I would be glad to find a solution that is acceptable to both the Neutron and 
Open vSwitch communities, and something that we both can support in the 
long run.

Cheers,
Ihar