Re: Caching modes

2018-02-21 Thread Andrija Panic
Rafael, I just successfully merged (strange?)
https://github.com/andrijapanic/cloudstack-docs/pull/1 and I can see
changes are publicly available on
http://docs.cloudstack.apache.org/en/latest/networking/vxlan.html#important-note-on-max-number-of-multicast-groups-and-thus-vxlan-intefaces

Is it normal that I can merge my own pull request on the cloudstack-docs repo?
There are no limitations (I was able to open a PR and merge it myself).

On 21 February 2018 at 00:58, Andrija Panic  wrote:

> Please also merge https://github.com/apache/cloudstack-docs-admin/pull/48
>
> It just corrects the code block syntax (to display the code properly)
>
> On 20 February 2018 at 21:02, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
>> Thanks, we will proceed with reviewing
>>
>> On Tue, Feb 20, 2018 at 3:12 PM, Andrija Panic 
>> wrote:
>>
>> > Here it is:
>> >
>> > https://github.com/apache/cloudstack-docs-admin/pull/47
>> >
>> > Added KVM online storage migration (atm only CEPH/NFS to SolidFire, new
>> in
>> > 4.11 release)
>> > Added KVM cache mode setup and limitations.
>> >
>> >
>> > Cheers
>> >
>> > On 20 February 2018 at 16:49, Rafael Weingärtner <
>> > rafaelweingart...@gmail.com> wrote:
>> >
>> > > If you are willing to write it down, please do so, and open a PR. We
>> will
>> > > review and merge it afterwards.
>> > >
>> > > On Tue, Feb 20, 2018 at 12:41 PM, Andrija Panic <
>> andrija.pa...@gmail.com
>> > >
>> > > wrote:
>> > >
>> > > > I advise (or not... depends on the point of view) that it stay that
>> > > > way - because when you activate write-back cache, live migrations will
>> > > > stop working, and this makes *Enable maintenance mode (put host into
>> > > > maintenance)* impossible.
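>> > > >
>> > > > For illustration, libvirt itself will refuse such a migration unless it
>> > > > is forced (exact wording differs between versions; the host names here
>> > > > are only examples):
>> > > >
>> > > > root@kvm-host:~# virsh migrate --live i-2-10-VM qemu+ssh://otherhost/system
>> > > > error: Unsafe migration: Migration may lead to data corruption if disks use cache != none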
>> > > >
>> > > > I would perhaps suggest adding documentation for "advanced users" or
>> > > > similar, saying "it is possible to enable this via a DB hack, but be
>> > > > warned about the live-migration consequences, etc.", since this will
>> > > > cause more problems if people start using it.
>> > > >
>> > > > If you choose to do so, let me know - I can write that documentation
>> > > > briefly.
>> > > >
>> > > > Not to mention it can be unsafe (a power failure is less likely, I
>> > > > guess, but a rare kernel panic etc. might have its consequences, I
>> > > > assume).
>> > > >
>> > > > It does indeed increase performance on NFS a lot, but not necessarily
>> > > > on CEPH (if you are using the librbd cache on the client side, as
>> > > > explained above).
>> > > >
>> > > > On 20 February 2018 at 15:48, Rafael Weingärtner <
>> > > > rafaelweingart...@gmail.com> wrote:
>> > > >
>> > > > > Yes. Weirdly enough, the code uses the value from the database if it
>> > > > > is provided there, but there is no easy way for users to change that
>> > > > > configuration in the database. ¯\_(ツ)_/¯
>> > > > >
>> > > > > On Tue, Feb 20, 2018 at 11:45 AM, Andrija Panic <
>> > > andrija.pa...@gmail.com
>> > > > >
>> > > > > wrote:
>> > > > >
>> > > > > > So it seems that passing the cachemode value via the API is either
>> > > > > > missing or somehow messed up, but the deployVM process definitely
>> > > > > > reads the DB values from the disk_offering table and applies them
>> > > > > > to the XML file for KVM. This is on ACS versions above 4.8.x.
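>> > > > > >
>> > > > > > A quick way to see what is currently set (illustrative only -
>> > > > > > offering IDs and names will differ per deployment):
>> > > > > >
>> > > > > > SELECT id, name, cache_mode FROM `cloud`.`disk_offering` WHERE cache_mode IS NOT NULL;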
>> > > > > >
>> > > > > >
>> > > > > > On 20 February 2018 at 15:44, Andrija Panic <
>> > andrija.pa...@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I have edited the disk_offering table - in the cache_mode column
>> > > > > > > just enter "writeback". Stop and start the VM, and it will pick
>> > > > > > > up/inherit the cache_mode from its parent offering.
>> > > > > > > This also applies to the Compute/Service offering, again inside
>> > > > > > > the disk_offering table - I just tested both.
>> > > > > > >
>> > > > > > > i.e.
>> > > > > > >
>> > > > > > > UPDATE `cloud`.`disk_offering` SET `cache_mode`='writeback' WHERE `id`=102; # Compute Offering (Service offering)
>> > > > > > > UPDATE `cloud`.`disk_offering` SET `cache_mode`='writeback' WHERE `id`=114; # data disk offering
>> > > > > > >
>> > > > > > > Before SQL:
>> > > > > > >
>> > > > > > > root@ix1-c7-4:~# virsh dumpxml i-2-10-VM | grep cache -A2
>> > > > > > >   
>> > > > > > >   
>> > > > > > >   
>> > > > > > > --
>> > > > > > >   
>> > > > > > >   
>> > > > > > >   
>> > > > > > > --
>> > > > > > >
>> > > > > > > STOP and START VM = after SQL
>> > > > > > >
>> > > > > > > root@ix1-c7-4:~# virsh dumpxml i-2-10-VM | grep cache -A2
>> > > > > > >   
>> > > > > > >   
>> > > > > > >   
>> > > > > > > --
>> > > > > > >   
>> > > > > > >   
>> > > > > > >   
>> > > > > > > --
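>> > > > > > >
>> > > > > > > For reference, the disk's driver line typically changes roughly
>> > > > > > > like this (illustrative only - the name/type attributes depend on
>> > > > > > > the disk format):
>> > > > > > >
>> > > > > > >   before: <driver name='qemu' type='qcow2' cache='none'/>
>> > > > > > >   after:  <driver name='qemu' type='qcow2' cache='writeback'/>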
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On 20 February 2018 at 14:03, Rafael Weingärtner <
>> > > > > > > rafaelweingart...@gmail.com> wrote:
>> > > > > > >
>> > > > > > >> I have no idea how it can change the performance. If you
>> look at
>> 

Re: Caching modes

2018-02-21 Thread Rafael Weingärtner
You merged from branch andrijapanic-patch-1 into master. However, this
process happened in your own repository, not in the ACS one.
You had this PR opened: https://github.com/apache/cloudstack-docs/pull/22,
and it was merged on Nov 9, 2017.

That is the content you are seeing at
http://docs.cloudstack.apache.org/en/latest/networking/vxlan.html#important-note-on-max-number-of-multicast-groups-and-thus-vxlan-intefaces



On Wed, Feb 21, 2018 at 9:08 AM, Andrija Panic 
wrote:

> Rafael, I just successfully merged (strange?)
> https://github.com/andrijapanic/cloudstack-docs/pull/1 and I can see
> changes are publicly available on
> http://docs.cloudstack.apache.org/en/latest/networking/vxlan.html#important-note-on-max-number-of-multicast-groups-and-thus-vxlan-intefaces
>
> Is it normal that I can merge my own pull request on the cloudstack-docs
> repo? There are no limitations (I was able to open a PR and merge it myself).
>

RE: HA issues

2018-02-21 Thread Sean Lair
Thanks so much for the info - we'll look at that line also!

I'll let you know when we create a PR for our changes - in case you want to 
review them for your environment

-Original Message-
From: Andrija Panic [mailto:andrija.pa...@gmail.com] 
Sent: Tuesday, February 20, 2018 5:16 PM
To: dev 
Subject: Re: HA issues

That is good to hear (no NFS issues causing Agent Disconnect).

I assume you are using a "normal" NFS solution with proper HA and no ZFS
(kernel panics etc.), but anyway, be aware of this one:

https://github.com/apache/cloudstack/blob/e532b574ddb186a117da638fb6059356fe7c266c/scripts/vm/hypervisor/kvm/kvmheartbeat.sh#L161



We used to comment out this line, because we did have some issues with the
communication link, and having it commented out saved our a$$ a few times :)
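
(From memory, the fencing action around that line boils down to forcing an
immediate host reset when heartbeat writes to primary storage keep failing,
roughly:

  sync
  echo b > /proc/sysrq-trigger

Check the linked line for the exact code. Commenting it out trades automatic
self-fencing for a higher split-brain risk, so pair it with another locking or
fencing mechanism.)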

Cheers

On 20 February 2018 at 20:50, Sean Lair  wrote:

> Hi Andrija
>
> We are currently running XenServer in production.  We are working on 
> moving to KVM and have it deployed in a development environment.
>
> The team is putting CloudStack + KVM through its paces, and that is when
> it was discovered how broken VM HA is in 4.9.3.  Initially our patches
> fixed VM HA, but that just caused VMs to get started on two hosts during
> failure testing.  The libvirt lockd has solved that issue thus far.
>
> Short answer to your question is :-), we were not having problems with
> Agent Disconnects in a production environment.  It was our testing/QA
> that revealed the issues.  Our NFS has been stable so far - no issues
> with the agent crashing/stopping that weren't initiated by the team's testing.
>
> Thanks
> Sean
>
>
> -Original Message-
> From: Andrija Panic [mailto:andrija.pa...@gmail.com]
> Sent: Saturday, February 17, 2018 1:49 PM
> To: dev 
> Subject: Re: HA issues
>
> Hi Sean,
>
> (we have 2 threads interleaving on the libvirt lockd...) - so, did you
> manage to understand what causes the Agent Disconnect in most cases, for
> you specifically? Is there any software (CloudStack) root cause
> (disregarding e.g. networking issues etc.)?
>
> Just our examples, which you should probably not have:
>
> We had a CEPH cluster running (with ACS), and there any exception in
> librbd would crash the JVM and the agent, but this has mostly been fixed.
> Now we get e.g. an agent disconnect when ACS tries to delete a volume on
> CEPH (and for some reason does not succeed within 30 minutes, so the
> volume deletion fails) - then libvirt gets completely stuck (even virsh
> list doesn't work)... so the agent gets disconnected eventually.
>
> It would be good to get rid of agent disconnections in general, obviously
> :) so that is why I'm asking (you are on NFS, so I would like to hear
> about your experience here).
>
> Thanks
>
> On 16 February 2018 at 21:52, Sean Lair  wrote:
>
> > We were in the same situation as Nux.
> >
> > In our test environment we hit the issue with VMs not getting fenced and
> > coming up on two hosts because of VM HA.   However, we updated some of
> the
> > logic for VM HA and turned on libvirtd's locking mechanism.  Now we 
> > are working great w/o IPMI.  The locking stops the VMs from starting 
> > elsewhere, and everything recovers very nicely when the host starts
> responding again.
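> >
> > For anyone wanting to try the same, enabling libvirt's lockd on each KVM
> > host is roughly the following (a sketch - verify paths and service names
> > for your distro; the prompt/host name is just an example):
> >
> > root@kvm-host:~# echo 'lock_manager = "lockd"' >> /etc/libvirt/qemu.conf
> > root@kvm-host:~# systemctl enable --now virtlockd
> > root@kvm-host:~# systemctl restart libvirtd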
> >
> > We are on 4.9.3 and haven't started testing with 4.11 yet, but it may
> > work alongside IPMI just fine - it would just affect the fencing.
> > However, we *currently* prefer how we are doing it now, because if the
> > agent stops responding but the host is still up, the VMs continue
> > running and no actual downtime is incurred.  Even when VM HA attempts to
> > power on the VMs on another host, it just fails the power-up and the VMs
> > continue to run on the "agent disconnected" host.  The host goes into
> > alarm state and our NOC can look into what is wrong with the agent on
> > the host.  If IPMI were enabled, it sounds like it would power off the
> > host (fence) and force downtime for us even if the VMs were actually
> > running OK and just the agent is unreachable.
> >
> > I plan on submitting our updates via a pull request at some point.
> > But I can also send the updated code to anyone that wants to do some 
> > testing before then.
> >
> > -Original Message-
> > From: Marcus [mailto:shadow...@gmail.com]
> > Sent: Friday, February 16, 2018 11:27 AM
> > To: dev@cloudstack.apache.org
> > Subject: Re: HA issues
> >
> > From your other emails it sounds as though you do not have IPMI 
> > configured, nor host HA enabled, correct? In this case, the correct 
> > thing to do is nothing. If CloudStack cannot guarantee the VM state 
> > (as is the case with an unreachable hypervisor), it should do 
> > nothing, for fear of causing a split brain and corrupting the VM 
> > disk (VM running
> on two hosts).
> >
> > Clustering and fencing is a tricky proposition. When CloudStack (or 
> > any other cluster manager) is not configured to or cannot guarantee 
> > state then things will simply lock up, in this case your HA VM on 
> > your br

FINAL REMINDER: CFP for Apache EU Roadshow Closes 25th February

2018-02-21 Thread Sharan F

Hello Apache Supporters and Enthusiasts

This is your FINAL reminder that the Call for Papers (CFP) for the 
Apache EU Roadshow is closing soon. Our Apache EU Roadshow will focus on 
Cloud, IoT, Apache Tomcat, Apache Http and will run from 13-14 June 2018 
in Berlin.
Note that the CFP deadline has been extended to 25th February, and it will be
your final opportunity to submit a talk for this event.


Please make your submissions at http://apachecon.com/euroadshow18/

Also note that early bird ticket registrations to attend FOSS Backstage,
including the Apache EU Roadshow, have also been extended and will be
available until 23rd February. Please register at
https://foss-backstage.de/tickets


We look forward to seeing you in Berlin!

Thanks
Sharan Foga, VP Apache Community Development

PLEASE NOTE: You are receiving this message because you are subscribed 
to a user@ or dev@ list of one or more Apache Software Foundation projects.




Re: [PROPOSAL] reducing VR downtime on upgrade

2018-02-21 Thread Wido den Hollander



On 02/15/2018 04:36 PM, Daan Hoogland wrote:

The intention of this proposal is to have a way forward for reducing maintenance
downtime for virtual routers. There are two parts to this proposal:

   1.  Dealing with legacy routers and replacing them before shutting down.
   2.  Unifying router embodiments and making use of redundancy mechanisms to
quickly fail over from the old to the new.

Ad 1. It will always be possible that a router is too old and will not be able
to talk to the new version that is to replace it. This might be due to a
keepalived update or replacement, or just because it is very old. So although
unifying the routers and making them redundancy-enabled will solve a lot of use
cases, it will never cover every conceivable situation, not even in systems
upgraded to a version in which all intended functionality has been implemented.
Dealing with any older router is to work as follows (a rough sketch of the
sequence follows below the list):

   1.  A check will be done to make sure the old VR is still up.
  *   If it is not, there is no consideration: it will be replaced as quickly
as possible. Possible improvements here are the iptables configuration speedup
and other generic optimisations, unrelated to the upgrade itself.
  *   If it is, we need to walk on eggshells when provisioning the new one 😉
   2.  A new VR will be instantiated.
   3.  Configuration data will be sent but not applied.
   4.  The interfaces will be added and, if need be, brought down.
   5.  All configuration is applied.
   6.  The old VR is killed.
   7.  The interfaces on the new VR are brought up.
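
A rough sketch of that sequence in pseudo-shell (every function below is just
an echo placeholder and the names are made up for illustration - the real
orchestration would live in the management server):

  #!/bin/sh
  OLD_VR_IP=${OLD_VR_IP:-169.254.0.10}      # hypothetical control address
  deploy_new_vr()       { echo "deploy new VR $*"; }
  push_config()         { echo "send configuration, do not apply"; }
  add_interfaces()      { echo "plug interfaces $*"; }
  apply_config()        { echo "apply all configuration"; }
  destroy_old_vr()      { echo "kill the old VR"; }
  bring_interfaces_up() { echo "bring the new VR interfaces up"; }

  if ! ping -c1 -W2 "$OLD_VR_IP" >/dev/null 2>&1; then
      deploy_new_vr "as quickly as possible"   # old VR already gone (step 1)
  else
      deploy_new_vr                            # step 2
      push_config                              # step 3
      add_interfaces "--link-down"             # step 4
      apply_config                             # step 5
      destroy_old_vr                           # step 6
      bring_interfaces_up                      # step 7
  fi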



Looks good! We might want the VR to send out its version as well over the
local socket. Using that 'version' you could see if it supports various
things.


You could even have the VR send out 'features' so that you know what 
it's capable of.



Ad 2. This is a long-term goal. At the moment we have five (or debatably six)
different incarnations of the virtual router:

   *   Basic zone dhcp server
   *   Shared network ‘router’
   *   VR
   *   rVR
   *   VPC
   *   rVPC


Don't forget the metadata/password server it runs in almost all cases.

Wido


A first set of steps will be to reduce this to:

   *   shared networks (where a basic zone is an automatic implementation of a
single shared network in a zone)
   *   VR (which is always redundancy-enabled but may have only one instance)
   *   VPC (as above)

Then the next step is to unify VR and VPC, as a VR is really only a VPC with
just one network. The final step is then to unify a shared network with a VPC;
this one is so far ahead that I don't want to make too many statements about it
now. We will have to find the exact implementation hazards that we will face in
this step along the way. I think we are talking at least a year in before we
reach this point.

As ShapeBlue we will be starting a short PoC on the first part. We will try to
figure out whether the process under 1. is feasible, or whether we need to wait
with configuring the interfaces until the last moment and then do a 'blind'
start.

daan.hoogl...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue