Hi Robin and everyone,

now onto mail number two :)
I will try to just add some pros and cons for the ideas, with some
comments below them.
On Thu, Sep 28, 2023 at 06:28:46PM +0200, Robin Jarry via discuss wrote:
> Hello OVN community,
>
> This is a follow up on the message I have sent today [1]. That second
> part focuses on some ideas I have to remove the limitations that were
> mentioned in the previous email.
>
> [1]
> https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052695.html
>
> If you didn't read it, my goal is to start a discussion about how we
> could improve OVN on the following topics:
>
> - Reduce the memory and CPU footprint of ovn-controller, ovn-northd.
> - Support scaling of L2 connectivity across larger clusters.
> - Simplify CMS interoperability.
> - Allow support for alternative datapath implementations.
>
> Disclaimer:
>
> This message does not mention anything about L3/L4 features of OVN.
> I didn't have time to work on these, yet. I hope we can discuss how
> these fit with my ideas.

I tried to add some L3 implications below as well. But I am by no means
an expert in these and just added my current understanding.

>
> Distributed mac learning
> ========================
>
> Use one OVS bridge per logical switch with mac learning enabled. Only
> create the bridge if the logical switch has a port bound to the local
> chassis.
>
> Pros:
>
> - Minimal openflow rules required in each bridge (ACLs and NAT mostly).
> - No central mac binding table required.
> - Mac table aging comes for free.
> - Zero access to southbound DB for learned addresses nor for aging.

- Uses the native switching feature of OVS instead of reimplementing it
  in OVN.

>
> Cons:
>
> - How to manage seamless upgrades?
> - Requires ovn-controller to move/plug ports in the correct bridge.
> - Multiple openflow connections (one per managed bridge).
> - Requires ovn-trace to be reimplemented differently (maybe other tools
>   as well).

- No central information on mac bindings anymore. All nodes need to
  update their data individually.
- Each bridge also creates a Linux network interface.
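
To make the "only create the bridge if the logical switch has a port
bound to the local chassis" rule concrete, here is a minimal sketch of
how I read it. The tuple layout, the function name and the chassis names
are my own assumptions for illustration; this is not the southbound
schema and not how ovn-controller is implemented.

```python
from collections import defaultdict


def bridges_for_chassis(port_bindings, chassis):
    """Return {logical_switch: [ports]} for switches needing a local bridge.

    `port_bindings` is a hypothetical flattened view of the bindings as
    (logical_switch, port, chassis) tuples, not the real SB schema.
    """
    needed = defaultdict(list)
    for logical_switch, port, bound_chassis in port_bindings:
        # A bridge is only needed if at least one port is bound locally.
        if bound_chassis == chassis:
            needed[logical_switch].append(port)
    return dict(needed)


# Hypothetical bindings: hv1 hosts vm1 and vm2, hv2 hosts vm3.
bindings = [
    ("ls1", "vm1", "hv1"),
    ("ls3", "vm2", "hv1"),
    ("ls2", "vm3", "hv2"),
]
print(bridges_for_chassis(bindings, "hv1"))
```

With these made-up bindings, hv1 would get bridges for ls1 and ls3 only,
since ls2 has no port bound there.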
I do not know if there is a limit on the number of Linux interfaces or
OVS bridges somewhere.

Would you still preprovision static mac addresses on the bridge for all
port_bindings whose mac address we know, or would you rather leave that
up to learning as well?

I do not know if there is some kind of performance/optimization penalty
for moving packets between different bridges.

You also cannot use only the logical switches that have a local port
bound. Assume the following topology:

+---+ +---+ +---+ +---+ +---+ +---+ +---+
|vm1+-+ls1+-+lr1+-+ls2+-+lr2+-+ls3+-+vm2|
+---+ +---+ +---+ +---+ +---+ +---+ +---+

vm1 and vm2 are both running on the same hypervisor. Creating only local
logical switches would mean only ls1 and ls3 are available on that
hypervisor. This would break the connection between the two vms, which
in the current implementation would just traverse the two logical
routers.
I guess we would need to create bridges for each locally reachable
logical switch. I am concerned about the potentially significant
increase in bridges and openflow connections this brings.

>
> Use multicast for overlay networks
> ==================================
>
> Use a unique 24bit VNI per overlay network. Derive a multicast group
> address from that VNI. Use VXLAN address learning [2] to remove the need
> for ovn-controller to know the destination chassis for every mac address
> in advance.
>
> [2] https://datatracker.ietf.org/doc/html/rfc7348#section-4.2
>
> Pros:
>
> - Nodes do not need to know about others in advance. The control plane
>   load is distributed across the cluster.
> - 24bit VNI allows for more than 16 million logical switches. No need
>   for extended GENEVE tunnel options.

Note that using VXLAN at the moment significantly reduces the OVN
feature set. This is because the geneve header options are currently
used for data that would not fit into the VXLAN VNI.

From ovn-architecture.7.xml:
```
The maximum number of networks is reduced to 4096.
The maximum number of ports per network is reduced to 2048.
ACLs matching against logical ingress port identifiers are not supported.
OVN interconnection feature is not supported.
```

> - Limited and scoped "flooding" with IGMP/MLD snooping enabled in
>   top-of-rack switches. Multicast is only used for BUM traffic.
> - Only one VXLAN output port per implemented logical switch on a given
>   chassis.

Would this actually work with one VXLAN output port? Would you not need
one port per target node to send unicast traffic (as you would otherwise
flood all packets to all participating nodes)?

>
> Cons:
>
> - OVS does not support VXLAN address learning yet.
> - The number of usable multicast groups in a fabric network may be
>   limited?
> - How to manage seamless upgrades and interoperability with older OVN
>   versions?

- This pushes all logic related to chassis management to the underlying
  networking fabric. It thereby places additional requirements on the
  network fabric that have not been there before and that might not be
  available to all users.
- The BFD sessions between chassis are no longer possible, thereby
  preventing fast failover of gateway chassis.

As this idea requires VXLAN, and all current limitations would apply to
this solution as well, this is probably not a general solution but
rather a deployment option.

>
> Connect ovn-controller to the northbound DB
> ===========================================
>
> This idea extends on a previous proposal to migrate the logical flows
> creation in ovn-controller [3].
>
> [3]
> https://patchwork.ozlabs.org/project/ovn/patch/20210625233130.3347463-1-numans%40ovn.org/
>
> If the first two proposals are implemented, the southbound database can
> be removed from the picture. ovn-controller can directly translate the
> northbound schema into OVS configuration bridges, ports and flow rules.
>
> For other components that require access to the southbound DB (e.g.
> neutron metadata agent), ovn-controller should provide an interface to
> expose state and configuration data for local consumption.

Note that ovn-interconnect also uses access to the southbound DB to add
chassis of the interconnected site (and potentially some more magic).

>
> All state information present in the NB DB should be moved to a separate
> state database [4] for CMS consumption.
>
> [4] https://mail.openvswitch.org/pipermail/ovs-dev/2023-April/403675.html
>
> For those who like visuals, I have started working on basic use cases
> and how they would be implemented without a southbound database [5].
>
> [5] https://link.excalidraw.com/p/readonly/jwZgJlPe4zhGF8lE5yY3
>
> Pros:
>
> - The northbound DB is smaller by design: reduced network bandwidth and
>   memory usage in all chassis.
> - If we keep the northbound read-only for ovn-controller, it removes
>   scaling issues when one controller updates one row that needs to be
>   replicated everywhere.
> - The northbound schema knows nothing about flows. We could introduce
>   alternative dataplane backends configured by ovn-controller via
>   plugins. I have done a minimal PoC to check if it could work with the
>   linux network stack [6].
>
> [6] https://github.com/rjarry/ovn-nb-agent/blob/main/backend/linux/bridge.go

- One less codebase with northd gone.

>
> Cons:
>
> - This would be a serious API breakage for systems that depend on the
>   southbound DB.
> - Can all OVN constructs be implemented without a southbound DB?
> - Is the community interested in alternative datapaths?

- It requires each ovn-controller to do the translation of a given
  construct (e.g. a logical switch), thereby probably increasing the CPU
  load and recompute time.
- The complexity of ovn-controller grows as it gains nearly all logic
  of northd.

I now understand what you meant with the alternative datapaths in your
first mail. While I find the option interesting, I'm not sure how much
value would actually come out of that.
For me it feels like this would make OVN significantly harder to debug.

>
> Closing thoughts
> ================
>
> I mainly focused on OpenStack use cases for now, but I think these
> propositions could benefit Kubernetes as well.
>
> I hope I didn't bore everyone to death. Let me know what you think.
>
> Cheers!
>
> --
> Robin Jarry
> Red Hat, Telco/NFV
>

Best Regards
Felix Huettner
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss