Re: [DISCUSS] VR upgrade downtime reduction

Rohit Yadav Tue, 01 May 2018 03:09:07 -0700

All,


A short-term solution for reducing downtime during VR upgrade or network 
restart (with cleanup=true) has been implemented:


- The strategy for redundant VRs builds on top of Wei's original patch, where 
backup routers are removed and replaced on a rolling basis. The downtime I saw 
was usually 0-2 seconds, and the theoretical downtime lies in the range [0, 
3*advertisement interval + skew seconds], or 0-10 seconds (with CloudStack's 
default of 1s advertisement interval).


- For non-redundant routers, I've implemented a strategy where a new VR is 
deployed first, then the old VR is powered off/destroyed, and finally the new 
VR is re-programmed. With this strategy, two identical VRs may be up for a 
brief moment (a few seconds) during which both can serve traffic; however, the 
new VR performs an arping on its interfaces to update neighbours. After the 
old VR is removed, the new VR is re-programmed, which among many other things 
performs another arping. The theoretical downtime is therefore limited by the 
ARP cache refresh, which can take up to 30 seconds. In my experiments against 
various VMware, KVM and XenServer versions, I found that the downtime was 
indeed less than 30s, usually between 5 and 20 seconds. Compared to older ACS 
versions, especially in cases where VR deployment requires a full volume copy 
(as in VMware), a 10x-12x improvement was seen.
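
For the redundant-VR case above, the "3*advertisement interval + skew seconds" 
term matches the Master_Down_Interval timer defined for VRRPv2 in RFC 3768. 
The sketch below only computes that detection timer; it is not taken from the 
PR, the function names are made up for illustration, and the priority of 100 
is an assumed example value. It does not model any other delay that may be 
included in the 0-10 second figure.

```python
# Sketch of the VRRPv2 (RFC 3768) failover-detection timer.
# Function names are illustrative, not CloudStack or keepalived APIs.


def skew_time(priority: int) -> float:
    """Skew_Time in seconds: (256 - Priority) / 256."""
    if not 1 <= priority <= 254:
        raise ValueError("backup router priority must be in 1..254")
    return (256 - priority) / 256


def master_down_interval(advert_interval: float, priority: int) -> float:
    """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    return 3 * advert_interval + skew_time(priority)


if __name__ == "__main__":
    # advert_interval=1.0 is the 1s default mentioned above;
    # priority=100 is an assumption chosen for the example.
    print(master_down_interval(1.0, 100))
```

A higher backup priority gives a smaller skew, so the highest-priority backup 
times out first and takes over.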


Please review and test the following PR, which has test details, benchmarks, 
and some screenshots:

https://github.com/apache/cloudstack/pull/2508


Future work can be driven towards making all VRs redundancy-enabled by 
default, which would allow for a firewall+connections state transfer 
(conntrackd + VRRP2/3 based) during rolling reboots.


- Rohit




________________________________
From: Daan Hoogland <daan.hoogl...@gmail.com>
Sent: Thursday, February 8, 2018 3:11:51 PM
To: dev
Subject: Re: [DISCUSS] VR upgrade downtime reduction

to stop the vote and continue the discussion. I personally want unification
of all router vms: VR, 'shared network', rVR, VPC, rVPC, and eventually the
one we want to create for 'enterprise topology hand-off points'. And I
think we have some level of consensus on that but the path there is a
concern for Wido and for some of my colleagues as well, and rightly so. One
issue is upgrades from older versions.

I see the common scenario as follows:
+ redundancy is deprecated and only the number of instances remains.
+ an old VR is replicated in memory by a redundancy-enabled version, which
will be in a state of running but inactive.
- the old one will be destroyed while a ping is running
- as soon as the ping fails more than three times in a row (this might have
to have a hypervisor-specific implementation or require a helper VM)
+ the new one is activated

after this upgrade Wei's and/or Remi's code will do the work for any
following upgrade.
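
The "ping fails more than three times in a row" trigger in the scenario above
could look roughly like the sketch below. This is only an illustration under
stated assumptions: the function name and the threshold parameter are
hypothetical, and how the ping is actually issued (per hypervisor or via a
helper VM) is deliberately left out.

```python
from typing import Iterable


def should_activate(ping_results: Iterable[bool], threshold: int = 3) -> bool:
    """Return True once more than `threshold` pings have failed in a row.

    `ping_results` yields True for a successful ping to the old VR and
    False for a failed one. Any success resets the failure counter.
    """
    consecutive_failures = 0
    for ok in ping_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures > threshold:
            return True
    return False
```

With the default threshold, the new router would be activated on the fourth
consecutive failed ping, and a single successful ping in between restarts the
count.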

flames, please



On Wed, Feb 7, 2018 at 12:17 PM, Nux! <n...@li.nux.ro> wrote:

> +1 too
>
> --
> Sent from the Delta quadrant using Borg technology!
>
> Nux!
> www.nux.ro
>
> 

----- Original Message -----
> > From: "Rene Moser" <m...@renemoser.net>
> > To: "dev" <dev@cloudstack.apache.org>
> > Sent: Wednesday, 7 February, 2018 10:11:45
> > Subject: Re: [DISCUSS] VR upgrade downtime reduction
>
> > On 02/06/2018 02:47 PM, Remi Bergsma wrote:
> >> Hi Daan,
> >>
> >> In my opinion the biggest issue is the fact that there are a lot of
> >> different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> >> That's why you cannot simply switch from a single VPC to a redundant VPC
> >> for example.
> >>
> >> For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> >> with a single tier and made sure all features are supported. Next we
> >> merged the single and redundant VPC code paths. The idea here is that
> >> redundancy or not should only be a difference in the number of routers.
> >> Code should be the same. A single router, is also "master" but there just
> >> is no "backup".
> >>
> >> That simplifies things A LOT, as keepalived is now the master of the
> >> whole thing. No more assigning ip addresses in Python, but leave that to
> >> keepalived instead. Lots of code deleted. Easier to maintain, way more
> >> stable. We just released Cosmic 6 that has this feature and are now
> >> rolling it out in production. Looking good so far. This change unlocks a
> >> lot of possibilities, like live upgrading from a single VPC to a
> >> redundant one (and back). In the end, if the redundant VPC is rock solid,
> >> you most likely don't even want single VPCs any more. But that will come.
> >>
> >> As I said, we're rolling this out as we speak. In a few weeks when
> >> everything is upgraded I can share what we learned and how well it works.
> >> CloudStack could use a similar approach.
> >
> > +1 Pretty much this.
> >
> > René
>



--
Daan
