Yes, nice work!



________________________________
From: Daan Hoogland <daan.hoogl...@gmail.com>
Sent: Tuesday, May 1, 2018 5:28 AM
To: us...@cloudstack.apache.org
Cc: dev
Subject: Re: [DISCUSS] VR upgrade downtime reduction

good work Rohit,
I'll review 2508 https://github.com/apache/cloudstack/pull/2508

On Tue, May 1, 2018 at 12:08 PM, Rohit Yadav <rohit.ya...@shapeblue.com>
wrote:

> All,
>
>
> A short-term solution to VR upgrade or network restart (with cleanup=true)
> has been implemented:
>
>
> - The strategy for redundant VRs builds on top of Wei's original patch,
> where backup routers are removed and replaced on a rolling basis. The
> downtime I saw was usually 0-2 seconds; theoretically the downtime is in
> the range [0, 3*advertisement interval + skew] seconds, i.e. 0-10 seconds
> (with CloudStack's default 1s advertisement interval).
>
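> For reference, a minimal sketch (not from the PR; the default priority
> value here is an assumption) of the VRRP timers that bound that failover
> window:
>
> # Worst case for a backup VR to declare the master dead (VRRP v2,
> # RFC 3768): three advertisement intervals plus a priority-based skew.
> def vrrp_master_down_interval(advert_interval=1.0, priority=100):
>     skew = (256 - priority) / 256.0
>     return 3 * advert_interval + skew
>
> # With CloudStack's default 1s advertisement interval this gives the
> # upper bound on how long a backup waits before taking over.
> print(vrrp_master_down_interval())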
>
> - For non-redundant routers, I've implemented a strategy where first a new
> VR is deployed, then the old VR is powered off/destroyed, and the new VR is
> re-programmed. With this strategy, two identical VRs may be up for a brief
> moment (a few seconds) where both can serve traffic; however, the new VR
> performs an arp-ping on its interfaces to update neighbours. After the old
> VR is removed, the new VR is re-programmed, which among other things
> performs another arp-ping. The theoretical downtime is therefore limited by
> the arp-cache refresh, which can take up to 30 seconds. In my experiments
> against various VMware, KVM and XenServer versions, I found that the
> downtime was indeed less than 30s, usually between 5 and 20 seconds.
> Compared to older ACS versions, especially in cases where VR deployment
> requires a full volume copy (as on VMware), a 10x-12x improvement was seen.
>
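> As an illustration of the announce step (a sketch only, not the code in
> the PR; it assumes the iputils arping tool is present on the VR, and the
> interface/address pairs are made up):
>
> import subprocess
>
> # Send gratuitous ARP on each interface so that neighbours refresh their
> # ARP caches for the new VR immediately, instead of waiting for the cache
> # expiry that bounds the ~30s worst case above.
> def announce(interfaces):
>     for dev, ip in interfaces:
>         # -U: unsolicited (gratuitous) ARP, -c 3: send three announcements
>         subprocess.call(["arping", "-U", "-c", "3", "-I", dev, ip])
>
> announce([("eth0", "10.1.1.1"), ("eth2", "192.168.100.10")])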
>
> Please review and test the following PR, which has test details,
> benchmarks, and some screenshots:
>
> https://github.com/apache/cloudstack/pull/2508
>
>
> Future work can be driven towards making all VRs redundancy-enabled by
> default, which would allow for firewall and connection-state transfer
> (conntrackd + VRRP2/3 based) during rolling reboots.
>
>
> - Rohit
>
> <https://cloudstack.apache.org>
>
>
>
> ________________________________
> From: Daan Hoogland <daan.hoogl...@gmail.com>
> Sent: Thursday, February 8, 2018 3:11:51 PM
> To: dev
> Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> To stop the vote and continue the discussion: I personally want unification
> of all router VMs: VR, 'shared network', rVR, VPC, rVPC, and eventually the
> one we want to create for 'enterprise topology hand-off points'. And I
> think we have some level of consensus on that, but the path there is a
> concern for Wido and for some of my colleagues as well, and rightly so. One
> issue is upgrades from older versions.
>
> I see the common scenario as follows:
> + redundancy is deprecated and only the number of instances remains.
> + an old VR is replicated in memory by a redundancy-enabled version, which
> will be in a state of running but inactive.
> - the old one will be destroyed while a ping is running
> - as soon as the ping fails more than three times in a row (this might have
> to have a hypervisor-specific implementation or require a helper VM)
> + the new one is activated (a rough sketch of this switchover follows below)
>
> after this upgrade Wei's and/or Remi's code will do the work for any
> following upgrade.
>
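> A rough sketch of that switchover step (illustrative only; the activation
> callback stands in for whatever hypervisor-specific mechanism, or helper
> VM, ends up doing the real work):
>
> import subprocess, time
>
> def old_vr_alive(ip):
>     # a single ICMP echo with a 1-second timeout
>     return subprocess.call(["ping", "-c", "1", "-W", "1", ip],
>                            stdout=subprocess.DEVNULL) == 0
>
> def switch_over(old_vr_ip, activate_new_vr, max_failures=3):
>     failures = 0
>     while failures < max_failures:
>         failures = 0 if old_vr_alive(old_vr_ip) else failures + 1
>         time.sleep(1)
>     activate_new_vr()  # bring the pre-provisioned, inactive VR into service
>
> # e.g. switch_over("10.1.1.1", lambda: print("activate the new VR"))
>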
> flames, please
>
>
>
> On Wed, Feb 7, 2018 at 12:17 PM, Nux! <n...@li.nux.ro> wrote:
>
> > +1 too
> >
> > --
> > Sent from the Delta quadrant using Borg technology!
> >
> > Nux!
> > www.nux.ro
> >
> >
>
> ----- Original Message -----
> > > From: "Rene Moser" <m...@renemoser.net>
> > > To: "dev" <dev@cloudstack.apache.org>
> > > Sent: Wednesday, 7 February, 2018 10:11:45
> > > Subject: Re: [DISCUSS] VR upgrade downtime reduction
> >
> > > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> > >> Hi Daan,
> > >>
> > >> In my opinion the biggest issue is the fact that there are a lot of
> > >> different code paths: VPC versus non-VPC, VPC versus redundant VPC,
> > >> etc. That's why you cannot simply switch from a single VPC to a
> > >> redundant VPC, for example.
> > >>
> > >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a
> > >> VPC with a single tier and made sure all features are supported. Next
> > >> we merged the single and redundant VPC code paths. The idea here is
> > >> that redundancy or not should only be a difference in the number of
> > >> routers. The code should be the same. A single router is also
> > >> "master", there just is no "backup".
> > >>
> > >> That simplifies things A LOT, as keepalived is now the master of the
> > >> whole thing. No more assigning IP addresses in Python; leave that to
> > >> keepalived instead. Lots of code deleted. Easier to maintain, way more
> > >> stable. We just released Cosmic 6, which has this feature, and are now
> > >> rolling it out in production. Looking good so far. This change unlocks
> > >> a lot of possibilities, like live upgrading from a single VPC to a
> > >> redundant one (and back). In the end, if the redundant VPC is rock
> > >> solid, you most likely won't even want single VPCs any more. But that
> > >> will come.
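> > >>
> > >> To make that concrete, a minimal sketch (illustrative only, not the
> > >> actual Cosmic templates; the instance name, VRID, priority and VIPs
> > >> are made up) of rendering one keepalived config for every router, so
> > >> a single router and a redundant pair differ only in how many routers
> > >> the config is pushed to:
> > >>
> > >> # Every router gets the same template; with one router there is simply
> > >> # no peer to fail over to, so it becomes "master" on its own.
> > >> def keepalived_conf(vrid, priority, vips, interface="eth0"):
> > >>     body = "\n        ".join(vips)
> > >>     return ("vrrp_instance guest_net {\n"
> > >>             "    state BACKUP\n"
> > >>             "    nopreempt\n"
> > >>             "    interface %s\n"
> > >>             "    virtual_router_id %d\n"
> > >>             "    priority %d\n"
> > >>             "    advert_int 1\n"
> > >>             "    virtual_ipaddress {\n"
> > >>             "        %s\n"
> > >>             "    }\n"
> > >>             "}\n" % (interface, vrid, priority, body))
> > >>
> > >> # same call for a lone router or for each member of a redundant pair
> > >> print(keepalived_conf(51, 100, ["10.1.1.1/24 dev eth0"]))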
> > >>
> > >> As I said, we're rolling this out as we speak. In a few weeks, when
> > >> everything is upgraded, I can share what we learned and how well it
> > >> works. CloudStack could use a similar approach.
> > >
> > > +1 Pretty much this.
> > >
> > > René
> >
>
>
>
> --
> Daan
>



--
Daan
