On Thu, Jul 09, 2015 at 06:12:53PM -0700, Ethan Jackson wrote:
> High availability for gateways in network virtualization deployments
> is fairly difficult to get right. There are a ton of options, most of
> which are too complicated or perform badly. To help solve this
> problem, this patch proposes an HA design based on some of the lessons
> learned building similar systems. The hope is that it can be used as
> a starting point for design discussions and an eventual
> implementation.
>
> Signed-off-by: Ethan Jackson <et...@nicira.com>
Thank you for writing this up! This had encoding "y", which made it
challenging to apply ;-) Can we put it in the ovn directory?

When a logical network contains a gateway, then both sides are part of the
logical network, and thus "logical space". So while I agree with the diagram
at the very beginning that shows a gateway between an external network and an
OVN virtual network, I think it's a bit misleading to say:

    The OVN gateway is responsible for shuffling traffic between logical
    space (governed by ovn-northd), and the legacy physical network.

since both sides of the gateway are in logical space. I think it would be
more accurate to use some variant of "virtual" here, maybe:

    The OVN gateway is responsible for shuffling traffic between VMs
    (governed by ovn-northd), and the legacy physical network.

In the second paragraph, I am not sure why HA is critical to performance:

    An HA solution is both critical to the performance and manageability of
    the system, and extremely difficult to get right.

The second paragraph of "Basic Architecture" starts:

    Since the broader internet is managed outside of the OVN network domain,
    all traffic between logical space and the WAN must travel through this
    gateway.

Is that the reason? The reasons that come to mind for me are different (or
maybe just more specific?). First, the gateway is the machine that has a
connection to the external network of interest; it might be in a remote
location such as a branch office away from the bulk of the hypervisors in an
OVN deployment. Second, supposing that the gateway isn't in that kind of
remote location, we still want a central point of entry into the virtual part
of an OVN network, because otherwise we don't know which of N hypervisors
should bring the packet into the virtual network.

Under "Naive active-backup", do you mean OpenFlow echo requests here (a
"hello" message is only sent at the very beginning of an OpenFlow session, to
negotiate the OpenFlow version):

    ovn-northd monitors this gateway via OpenFlow hello messages (or some
    equivalent),
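To make that distinction concrete, here is the kind of periodic probe I have
in mind, as a rough Python sketch (not something from the patch; the gateway
address, the OpenFlow port 6653, and OpenFlow 1.3 are placeholder
assumptions). The OFPT_HELLO goes out exactly once to set up the session, and
it is the OFPT_ECHO_REQUEST/OFPT_ECHO_REPLY exchange that works as the
keepalive:

#!/usr/bin/env python3
# Rough liveness-probe sketch -- not part of the patch.  Gateway address,
# port 6653, and OpenFlow 1.3 (version 0x04) are assumptions.
import socket
import struct

OFP_VERSION = 0x04                     # OpenFlow 1.3
OFPT_HELLO = 0
OFPT_ECHO_REQUEST = 2
OFPT_ECHO_REPLY = 3
OFP_HEADER = struct.Struct("!BBHI")    # version, type, length, xid

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("switch closed the connection")
        data += chunk
    return data

def read_msg(sock):
    """Read one OpenFlow message, returning (type, xid) and discarding the body."""
    version, msg_type, length, xid = OFP_HEADER.unpack(recv_exact(sock, 8))
    if length > 8:
        recv_exact(sock, length - 8)
    return msg_type, xid

def gateway_is_alive(host, port=6653, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Sent exactly once, at session setup, to negotiate the version.
            sock.sendall(OFP_HEADER.pack(OFP_VERSION, OFPT_HELLO, 8, 1))
            # The actual keepalive: the echo request must be answered promptly.
            sock.sendall(OFP_HEADER.pack(OFP_VERSION, OFPT_ECHO_REQUEST, 8, 2))
            while True:
                msg_type, xid = read_msg(sock)
                if msg_type == OFPT_ECHO_REPLY and xid == 2:
                    return True
    except OSError:                    # timeout, refused connection, etc.
        return False

print(gateway_is_alive("192.0.2.10"))  # placeholder gateway address

Whatever mechanism ovn-northd ends up using, the point is that it is the
periodic echo-style probe, not the one-time hello, that detects a dead
gateway.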
Under "Controller Independent Active-backup", I am not sure that I buy the
argument here, because currently ovn-northd doesn't care about the layout of
the physical network. The other argument rings true for me of course:

    This can significantly increase downtime in the event of a failover as
    the (often already busy) ovn-northd controller has to recompute state
    for the new leader.

Here are some spelling fixes as a patch. This also replaces the fancy Unicode
U+2014 em dashes by the more common (in OVS, anyway) ASCII "--".

Thanks again for writing this!

diff --git a/OVN-GW-HA.md b/OVN-GW-HA.md
index ea598b2..e0d5c9f 100644
--- a/OVN-GW-HA.md
+++ b/OVN-GW-HA.md
@@ -30,8 +30,8 @@ The OVN gateway is responsible for shuffling traffic between logical space
 implementation, the gateway is a single x86 server, or hardware VTEP. For
 most deployments, a single system has enough forwarding capacity to service
 the entire virtualized network, however, it introduces a single point of failure.
-If this system dies, the entire OVN deployment becomes unavailable. To mitgate
-this risk, an HA solution is critical — by spreading responsibilty across
+If this system dies, the entire OVN deployment becomes unavailable. To mitigate
+this risk, an HA solution is critical -- by spreading responsibility across
 multiple systems, no single server failure can take down the network.
 
 An HA solution is both critical to the performance and manageability of the
@@ -51,7 +51,7 @@ OVN controlled tunnel traffic, to raw physical network traffic.
 
 Since the broader internet is managed outside of the OVN network domain, all
 traffic between logical space and the WAN must travel through this gateway.
-This makes it a critical single point of failure — if the gateway dies,
+This makes it a critical single point of failure -- if the gateway dies,
 communication with the WAN ceases for all systems in logical space.
 
 To mitigate this risk, multiple gateways should be run in a "High Availability
@@ -128,15 +128,15 @@ absolute simplest way to achive this is what we'll call "naive-active-backup".
 Naive Active Backup HA Implementation
 ```
 
-In a naive active-bakup, one of the Gateways is choosen (arbitrarily) as a
+In a naive active-backup, one of the Gateways is choosen (arbitrarily) as a
 leader. All logical routers (A, B, C in the figure), are scheduled on this
 leader gateway and all traffic flows through it. ovn-northd monitors this
 gateway via OpenFlow hello messages (or some equivalent), and if the gateway
 dies, it recreates the routers on one of the backups.
 
 This approach basically works in most cases and should likely be the starting
-point for OVN — it's strictly better than no HA solution and is a good
-foundation for more sophisticated solutions. That said, it's not without it's
+point for OVN -- it's strictly better than no HA solution and is a good
+foundation for more sophisticated solutions. That said, it's not without its
 limitations. Specifically, this approach doesn't coordinate with the physical
 network to minimize disruption during failures, and it tightly couples failover
 to ovn-northd (we'll discuss why this is bad in a bit), and wastes resources by
@@ -167,7 +167,7 @@ ethernet source address of the RARP is that of the logical router it
 corresponds to, and its destination is the broadcast address. This causes the
 RARP to travel to every L2 switch in the broadcast domain, updating forwarding
 tables accordingly. This strategy is recommended in all failover mechanisms
-discussed in this document — when a router newly boots on a new leader, it
+discussed in this document -- when a router newly boots on a new leader, it
 should RARP its MAC address.
 
 ### Controller Independent Active-backup
@@ -188,7 +188,7 @@ Controller Independent Active-Backup Implementation
 ```
 
 The fundamental problem with naive active-backup, is it tightly couples the
-failover solution to ovn-northd. This can signifcantly increase downtime in
+failover solution to ovn-northd. This can significantly increase downtime in
 the event of a failover as the (often already busy) ovn-northd controller has
 to recompute state for the new leader. Worse, if ovn-northd goes down, we
 can't perform gateway failover at all. This violates the principle that
@@ -207,7 +207,7 @@ priority to each node it controls. Nodes use the leadership priority to
 determine which gateway in the cluster is the active leader by using a simple
 metric: the leader is the gateway that is healthy, with the highest priority.
 If that gateway goes down, leadership falls to the next highest priority, and
-conversley, if a new gateway comes up with a higher priority, it takes over
+conversely, if a new gateway comes up with a higher priority, it takes over
 leadership.
 
 Thus, in this model, leadership of the HA cluster is determined simply by the
@@ -221,7 +221,7 @@ of member gateways, a key problem is how do we communicate this information to
 the relevant transport nodes. Luckily, we can do this fairly cheaply using
 tunnel monitoring protocols like BFD.
 
-The basic idea is pretty straight forward. Each transport node maintains a
+The basic idea is pretty straightforward. Each transport node maintains a
 tunnel to every gateway in the HA cluster (not just the leader). These tunnels
 are monitored using the BFD protocol to see which are alive. Given this
 information, hypervisors can trivially compute the highest priority live
@@ -277,7 +277,7 @@ even though its tunnels are still healthy.
 Router Specific Active-Backup
 ```
 Controller independent active-backup is a great advance over naive
-active-backup, but it still has one glaring problem — it under-utilizes the
+active-backup, but it still has one glaring problem -- it under-utilizes the
 backup gateways. In ideal scenario, all traffic would split evenly among the
 live set of gateways. Getting all the way there is somewhat tricky, but as a
 step in the direction, one could use the "Router Specific Active-Backup"
@@ -286,7 +286,7 @@ router basis, with one twist. It chooses a different active Gateway for each
 logical router. Thus, in situations where there are several logical routers,
 all with somewhat balanced load, this algorithm performs better.
 
-Implementation of this strategy is quite straight forward if built on top of
+Implementation of this strategy is quite straightforward if built on top of
 basic controller independent active-backup. On a per logical router basis, the
 algorithm is the same, leadership is determined by the liveness of the
 gateways. The key difference here is that the gateways must have a different
@@ -295,7 +295,7 @@ be computed by ovn-northd just as they had been in the controller independent
 active-backup model.
 
 Once we have these per logical router priorities, they simply need be
-comminucated to the members of the gateway cluster and the hypervisors. The
+communicated to the members of the gateway cluster and the hypervisors. The
 hypervisors in particular, need simply have an active-backup bundle action (or
 group action) per logical router listing the gateways in priority order for
 *that router*, rather than having a single bundle action shared for all the
@@ -327,7 +327,7 @@ undesirable.
 The controller can optionally avoid preemption by cleverly tweaking the
 leadership priorities. For each router, new gateways should be assigned
 priorities that put them second in line or later when they eventually come up.
-Furthermore, if a gateway goes down for a significant period of time, it's old
+Furthermore, if a gateway goes down for a significant period of time, its old
 leadership priorities should be revoked and new ones should be assigned as if
 it's a brand new gateway. Note that this should only happen if a gateway has
 been down for a while (several minutes), otherwise a flapping gateway could
@@ -368,7 +368,7 @@ gateways end up implementing an overly conservative "when in doubt drop all
 traffic" policy, or they implement something like MLAG.
 
 MLAG has multiple gateways work together to pretend to be a single L2 switch
-with a large LACP bond. In principle, it's the right right solution to the
+with a large LACP bond. In principle, it's the right solution to the
 problem as it solves the broadcast storm problem, and has been deployed
 successfully in other contexts.
 
 That said, it's difficult to get right and not recommended.
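One more illustrative aside on the leadership rule that shows up in the
context above ("the leader is the gateway that is healthy, with the highest
priority", applied per logical router in the router-specific variant). A
rough sketch of the computation each hypervisor would make, with made-up
gateway names, priorities, and BFD state (nothing here is from the patch),
could be as simple as:

# Not from the patch -- just a sketch of the selection rule, with invented
# gateway names, priorities, and BFD liveness state.

def choose_leader(priorities, bfd_up):
    """Return the healthy gateway with the highest priority, or None.

    priorities: gateway name -> leadership priority (higher wins).
    bfd_up: set of gateways whose tunnels BFD currently reports as alive.
    """
    live = [gw for gw in priorities if gw in bfd_up]
    return max(live, key=priorities.get) if live else None

# Router-specific active-backup: the same rule, but each logical router has
# its own priority ordering, so different routers can prefer different
# gateways and spread the load.
router_priorities = {
    "router-A": {"gw1": 3, "gw2": 2, "gw3": 1},
    "router-B": {"gw2": 3, "gw3": 2, "gw1": 1},
}
bfd_up = {"gw2", "gw3"}   # e.g. gw1's tunnels are reported down

for router, prios in sorted(router_priorities.items()):
    print(router, "->", choose_leader(prios, bfd_up))
# router-A -> gw2   (gw1 is down, so leadership falls to the next priority)
# router-B -> gw2

In practice, as the document says, the hypervisors would presumably express
that ordering directly as a per-logical-router active-backup bundle (or
group) action rather than computing it in code, but the rule itself is the
same.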