All,
A short-term solution to VR upgrade or network restart (with cleanup=true) has been implemented:

- The strategy for redundant VRs builds on top of Wei's original patch, where backup routers are removed and replaced on a rolling basis. The downtime I saw was usually 0-2 seconds, and theoretically the downtime is in the range [0, 3*advertisement interval + skew seconds], or 0-10 seconds (with CloudStack's default of 1s advertisement interval).

- For non-redundant routers, I've implemented a strategy where first a new VR is deployed, then the old VR is powered off/destroyed, and the new VR is re-programmed again. With this strategy, two identical VRs may be up for a brief moment (a few seconds) during which both can serve traffic; however, the new VR performs an arp-ping on its interfaces to update neighbours. After the old VR is removed, the new VR is re-programmed, which among many things performs another arp-ping. The theoretical downtime is therefore limited by the ARP cache refresh, which can take up to 30 seconds. In my experiments against various VMware, KVM and XenServer versions, I found that the downtime was indeed less than 30s, usually between 5 and 20 seconds. Compared to older ACS versions, especially in cases where VR deployment requires a full volume copy (as in VMware), a 10x-12x improvement was seen.

Please review and test the following PR, which has test details, benchmarks, and some screenshots:
https://github.com/apache/cloudstack/pull/2508

Future work can be driven towards making all VRs redundant-enabled by default, which would allow for a firewall+connections state transfer (conntrackd + VRRP2/3 based) during rolling reboots.

- Rohit

<https://cloudstack.apache.org>

________________________________
From: Daan Hoogland <daan.hoogl...@gmail.com>
Sent: Thursday, February 8, 2018 3:11:51 PM
To: dev
Subject: Re: [DISCUSS] VR upgrade downtime reduction

to stop the vote and continue the discussion.
I personally want unification of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the one we want to create for 'enterprise topology hand-off points'. And I think we have some level of consensus on that, but the path there is a concern for Wido and for some of my colleagues as well, and rightly so. One issue is upgrades from older versions.

I see the common scenario as follows:
+ redundancy is deprecated and only the number of instances remains.
+ an old VR is replicated in memory by a redundant-enabled version, which will be in a state of running but inactive.
  - the old one will be destroyed while a ping is running
  - as soon as the ping fails more than three times in a row (this might have to have a hypervisor-specific implementation or require a helper VM)
    + the new one is activated

After this upgrade, Wei's and/or Remi's code will do the work for any following upgrade.

flames, please

On Wed, Feb 7, 2018 at 12:17 PM, Nux! <n...@li.nux.ro> wrote:

> +1 too
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> ----- Original Message -----
> > From: "Rene Moser" <m...@renemoser.net>
> > To: "dev" <dev@cloudstack.apache.org>
> > Sent: Wednesday, 7 February, 2018 10:11:45
> > Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> >> Hi Daan,
> >>
> >> In my opinion the biggest issue is the fact that there are a lot of
> >> different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> >> That's why you cannot simply switch from a single VPC to a redundant VPC
> >> for example.
> >>
> >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> >> with a single tier and made sure all features are supported. Next we
> >> merged the single and redundant VPC code paths. The idea here is that
> >> redundancy or not should only be a difference in the number of routers.
> >> Code should be the same. A single router, is also "master" but there just
> >> is no "backup".
> >>
> >> That simplifies things A LOT, as keepalived is now the master of the whole
> >> thing. No more assigning ip addresses in Python, but leave that to
> >> keepalived instead. Lots of code deleted. Easier to maintain, way more
> >> stable. We just released Cosmic 6 that has this feature and are now
> >> rolling it out in production. Looking good so far. This change unlocks a
> >> lot of possibilities, like live upgrading from a single VPC to a redundant
> >> one (and back). In the end, if the redundant VPC is rock solid, you most
> >> likely don't even want single VPCs any more. But that will come.
> >>
> >> As I said, we're rolling this out as we speak. In a few weeks when
> >> everything is upgraded I can share what we learned and how well it works.
> >> CloudStack could use a similar approach.
> >
> > +1 Pretty much this.
> >
> > René
>

--
Daan
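
For reference, the "3*advertisement interval + skew seconds" term mentioned at the top of this thread is the VRRP master-down timer, i.e. how long a backup router waits without hearing advertisements before it takes over. Below is a minimal Python sketch of that arithmetic, assuming the VRRPv3-style skew formula (which gives the same result as VRRPv2 at a 1s interval). The function names and the priority values are illustrative assumptions and are not taken from CloudStack or keepalived code. It covers only the timer itself and does not try to reproduce the 0-10 second range quoted in the thread.

```python
def skew_time(priority: int, advert_interval: float = 1.0) -> float:
    """Skew added to the master-down timer, in seconds.

    Assumes the VRRPv3-style formula: ((256 - priority) * interval) / 256.
    Higher-priority backups get a smaller skew and so take over first.
    """
    if not 1 <= priority <= 254:
        # 0 and 255 are reserved priorities in VRRP, so reject them here.
        raise ValueError("backup priority must be between 1 and 254")
    return (256 - priority) * advert_interval / 256


def master_down_interval(priority: int, advert_interval: float = 1.0) -> float:
    """Seconds a backup waits without advertisements before becoming master."""
    return 3 * advert_interval + skew_time(priority, advert_interval)


if __name__ == "__main__":
    # Hypothetical priorities, with the 1s advertisement interval
    # mentioned in the thread.
    for prio in (254, 100, 1):
        print(prio, master_down_interval(prio, advert_interval=1.0))
```

With a 1s advertisement interval the timer stays between 3 and 4 seconds for any valid priority, so lowering the interval is the main lever for shortening VRRP failover.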