[Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Marc-Aurèle Brothier
Hi everyone,

I was wondering how many of you are running CloudStack with a cluster of
management servers. I would think most of you, but it would be nice to hear
everyone's voice. And do you get hosts going over their capacity limits?

We discovered that during VM allocation, if you get a lot of parallel
requests to create new VMs, most notably with large profiles, the capacity
increase happens too long after the host capacity checks, and hosts end up
going over their capacity limits. To detail the steps: the deployment
planner checks the cluster/host capacity and picks one deployment plan
(zone, cluster, host). The plan is stored in the database under a VMwork
job, and another thread picks up that entry and starts the deployment,
increasing the host capacity and sending the commands. There is a time
gap of a couple of seconds between the host being picked and the capacity
increase for that host, which is more than enough to go over the capacity
on one or more hosts. A few VMwork jobs can be added to the DB queue
targeting the same host before one gets picked up.

To fix this issue, we're using ZooKeeper to act as the multi-JVM lock
manager, via the Curator library (
https://curator.apache.org/curator-recipes/shared-lock.html). We also
changed when the capacity is increased: it now happens pretty much right
after the deployment plan is found, inside the ZooKeeper lock. This
ensures we don't go over the capacity of any host, and it has worked
reliably for a month in our management server cluster.
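
For illustration, here is a minimal sketch of the idea using Curator's
InterProcessSemaphoreMutex. The connection string, lock path and the
capacity helpers below are placeholders, not our actual code:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.concurrent.TimeUnit;

public class HostCapacityLockSketch {
    public static void main(String[] args) throws Exception {
        // Example connection string for a 3-node ensemble (placeholder hosts).
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        long hostId = 42L; // hypothetical host chosen by the deployment planner
        InterProcessSemaphoreMutex lock = new InterProcessSemaphoreMutex(
                zk, "/cloudstack/locks/host-capacity/" + hostId);

        // Hold the lock across the capacity check and the capacity increase so
        // that no other management server can allocate on this host in between.
        if (lock.acquire(30, TimeUnit.SECONDS)) {
            try {
                // checkHostCapacity(hostId);    // placeholder for the capacity check
                // increaseHostCapacity(hostId); // placeholder for the capacity increase
            } finally {
                lock.release();
            }
        }
        zk.close();
    }
}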

This adds another potential requirement, which should be discussed before
proposing a PR. Today the code also works seamlessly without ZK, to ensure
it's not a hard requirement, for example in a lab.

Comments?

Kind regards,
Marc-Aurèle


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Daan Hoogland
Are you proposing to add ZooKeeper as an optional requirement, Marc-Aurèle?
Or just Curator? And what is the decision mechanism for including it or not?

On Mon, Dec 18, 2017 at 9:33 AM, Marc-Aurèle Brothier 
wrote:

> Hi everyone,
>
> I was wondering how many of you are running CloudStack with a cluster of
> management servers. I would think most of you, but it would be nice to hear
> everyone voices. And do you get hosts going over their capacity limits?
>
> We discovered that during the VM allocation, if you get a lot of parallel
> requests to create new VMs, most notably with large profiles, the capacity
> increase is done too far after the host capacity checks and results in
> hosts going over their capacity limits. To detail the steps: the deployment
> planner checks for cluster/host capacity and pick up one deployment plan
> (zone, cluster, host). The plan is stored in the database under a VMwork
> job and another thread picks that entry and starts the deployment,
> increasing the host capacity and sending the commands. Here there's a time
> gap between the host being picked up and the capacity increase for that
> host of a couple of seconds, which is well enough to go over the capacity
> on one or more hosts. A few VMwork job can be added in the DB queue
> targeting the same host before one gets picked up.
>
> To fix this issue, we're using Zookeeper to act as the multi JVM lock
> manager thanks to their curator library (
> https://curator.apache.org/curator-recipes/shared-lock.html). We also
> changed the time when the capacity is increased, which occurs now pretty
> much after the deployment plan is found and inside the zookeeper lock. This
> ensure we don't go over the capacity of any host, and it has been proven
> efficient since a month in our management server cluster.
>
> This adds another potential requirement which should be discuss before
> proposing a PR. Today the code works seamlessly without ZK too, to ensure
> it's not a hard requirement, for example in a lab.
>
> Comments?
>
> Kind regards,
> Marc-Aurèle
>



-- 
Daan


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Ivan Kudryavtsev
Hello, Marc-Aurele, I strongly believe that all MySQL locks should be
removed in favour of a true DLM solution like ZooKeeper. The performance of
a 3-node ZK ensemble should be enough to handle up to 1000-2000 locks per
second, and it would help us move to a truly clustered MySQL setup like
Galera, without a single master server.

2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :

> Hi everyone,
>
> I was wondering how many of you are running CloudStack with a cluster of
> management servers. I would think most of you, but it would be nice to hear
> everyone voices. And do you get hosts going over their capacity limits?
>
> We discovered that during the VM allocation, if you get a lot of parallel
> requests to create new VMs, most notably with large profiles, the capacity
> increase is done too far after the host capacity checks and results in
> hosts going over their capacity limits. To detail the steps: the deployment
> planner checks for cluster/host capacity and pick up one deployment plan
> (zone, cluster, host). The plan is stored in the database under a VMwork
> job and another thread picks that entry and starts the deployment,
> increasing the host capacity and sending the commands. Here there's a time
> gap between the host being picked up and the capacity increase for that
> host of a couple of seconds, which is well enough to go over the capacity
> on one or more hosts. A few VMwork job can be added in the DB queue
> targeting the same host before one gets picked up.
>
> To fix this issue, we're using Zookeeper to act as the multi JVM lock
> manager thanks to their curator library (
> https://curator.apache.org/curator-recipes/shared-lock.html). We also
> changed the time when the capacity is increased, which occurs now pretty
> much after the deployment plan is found and inside the zookeeper lock. This
> ensure we don't go over the capacity of any host, and it has been proven
> efficient since a month in our management server cluster.
>
> This adds another potential requirement which should be discuss before
> proposing a PR. Today the code works seamlessly without ZK too, to ensure
> it's not a hard requirement, for example in a lab.
>
> Comments?
>
> Kind regards,
> Marc-Aurèle
>



-- 
With best regards, Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/ 


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
How hard is it to configure ZooKeeper and get everything up and running?
BTW: what would ZooKeeper be managing? The CloudStack management servers or
the MySQL nodes?

On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev 
wrote:

> Hello, Marc-Aurele, I strongly believe that all mysql locks should be
> removed in favour of truly DLM solution like Zookeeper. The performance of
> 3node ZK ensemble should be enough to hold up to 1000-2000 locks per second
> and it helps to move to truly clustered MySQL like galera without single
> master server.
>
> 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :
>
> > Hi everyone,
> >
> > I was wondering how many of you are running CloudStack with a cluster of
> > management servers. I would think most of you, but it would be nice to
> hear
> > everyone voices. And do you get hosts going over their capacity limits?
> >
> > We discovered that during the VM allocation, if you get a lot of parallel
> > requests to create new VMs, most notably with large profiles, the
> capacity
> > increase is done too far after the host capacity checks and results in
> > hosts going over their capacity limits. To detail the steps: the
> deployment
> > planner checks for cluster/host capacity and pick up one deployment plan
> > (zone, cluster, host). The plan is stored in the database under a VMwork
> > job and another thread picks that entry and starts the deployment,
> > increasing the host capacity and sending the commands. Here there's a
> time
> > gap between the host being picked up and the capacity increase for that
> > host of a couple of seconds, which is well enough to go over the capacity
> > on one or more hosts. A few VMwork job can be added in the DB queue
> > targeting the same host before one gets picked up.
> >
> > To fix this issue, we're using Zookeeper to act as the multi JVM lock
> > manager thanks to their curator library (
> > https://curator.apache.org/curator-recipes/shared-lock.html). We also
> > changed the time when the capacity is increased, which occurs now pretty
> > much after the deployment plan is found and inside the zookeeper lock.
> This
> > ensure we don't go over the capacity of any host, and it has been proven
> > efficient since a month in our management server cluster.
> >
> > This adds another potential requirement which should be discuss before
> > proposing a PR. Today the code works seamlessly without ZK too, to ensure
> > it's not a hard requirement, for example in a lab.
> >
> > Comments?
> >
> > Kind regards,
> > Marc-Aurèle
> >
>
>
>
> --
> With best regards, Ivan Kudryavtsev
> Bitworks Software, Ltd.
> Cell: +7-923-414-1515
> WWW: http://bitworks.software/ 
>



-- 
Rafael Weingärtner


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Ivan Kudryavtsev
Rafael,

- It's easy to configure and run ZK, either as a single node or as a cluster.
- ZooKeeper should replace the MySQL locking mechanism used inside the ACS
code (the places where ACS locks tables or rows).

On the other hand, I don't think that moving from MySQL locks to ZK locks
is an easy, lightweight (or even implementable) undertaking.

2017-12-18 16:20 GMT+07:00 Rafael Weingärtner :

> How hard is it to configure Zookeeper and get everything up and running?
> BTW: what zookeeper would be managing? CloudStack management servers or
> MySQL nodes?
>
> On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> kudryavtsev...@bw-sw.com>
> wrote:
>
> > Hello, Marc-Aurele, I strongly believe that all mysql locks should be
> > removed in favour of truly DLM solution like Zookeeper. The performance
> of
> > 3node ZK ensemble should be enough to hold up to 1000-2000 locks per
> second
> > and it helps to move to truly clustered MySQL like galera without single
> > master server.
> >
> > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :
> >
> > > Hi everyone,
> > >
> > > I was wondering how many of you are running CloudStack with a cluster
> of
> > > management servers. I would think most of you, but it would be nice to
> > hear
> > > everyone voices. And do you get hosts going over their capacity limits?
> > >
> > > We discovered that during the VM allocation, if you get a lot of
> parallel
> > > requests to create new VMs, most notably with large profiles, the
> > capacity
> > > increase is done too far after the host capacity checks and results in
> > > hosts going over their capacity limits. To detail the steps: the
> > deployment
> > > planner checks for cluster/host capacity and pick up one deployment
> plan
> > > (zone, cluster, host). The plan is stored in the database under a
> VMwork
> > > job and another thread picks that entry and starts the deployment,
> > > increasing the host capacity and sending the commands. Here there's a
> > time
> > > gap between the host being picked up and the capacity increase for that
> > > host of a couple of seconds, which is well enough to go over the
> capacity
> > > on one or more hosts. A few VMwork job can be added in the DB queue
> > > targeting the same host before one gets picked up.
> > >
> > > To fix this issue, we're using Zookeeper to act as the multi JVM lock
> > > manager thanks to their curator library (
> > > https://curator.apache.org/curator-recipes/shared-lock.html). We also
> > > changed the time when the capacity is increased, which occurs now
> pretty
> > > much after the deployment plan is found and inside the zookeeper lock.
> > This
> > > ensure we don't go over the capacity of any host, and it has been
> proven
> > > efficient since a month in our management server cluster.
> > >
> > > This adds another potential requirement which should be discuss before
> > > proposing a PR. Today the code works seamlessly without ZK too, to
> ensure
> > > it's not a hard requirement, for example in a lab.
> > >
> > > Comments?
> > >
> > > Kind regards,
> > > Marc-Aurèle
> > >
> >
> >
> >
> > --
> > With best regards, Ivan Kudryavtsev
> > Bitworks Software, Ltd.
> > Cell: +7-923-414-1515
> > WWW: http://bitworks.software/ 
> >
>
>
>
> --
> Rafael Weingärtner
>



-- 
With best regards, Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/ 


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
So, how does that work?
I mean, instead of opening a transaction with the database and executing
locks, what would we need to do in the code?

On Mon, Dec 18, 2017 at 7:24 AM, Ivan Kudryavtsev 
wrote:

> Rafael,
>
> - It's easy to configure and run ZK either in single node or cluster
> - zookeeper should replace mysql locking mechanism used inside ACS code
> (places where ACS locks tables or rows).
>
> I don't think from the other size, that moving from MySQL locks to ZK locks
> is easy and light and (even implemetable) way.
>
> 2017-12-18 16:20 GMT+07:00 Rafael Weingärtner  >:
>
> > How hard is it to configure Zookeeper and get everything up and running?
> > BTW: what zookeeper would be managing? CloudStack management servers or
> > MySQL nodes?
> >
> > On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> > kudryavtsev...@bw-sw.com>
> > wrote:
> >
> > > Hello, Marc-Aurele, I strongly believe that all mysql locks should be
> > > removed in favour of truly DLM solution like Zookeeper. The performance
> > of
> > > 3node ZK ensemble should be enough to hold up to 1000-2000 locks per
> > second
> > > and it helps to move to truly clustered MySQL like galera without
> single
> > > master server.
> > >
> > > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :
> > >
> > > > Hi everyone,
> > > >
> > > > I was wondering how many of you are running CloudStack with a cluster
> > of
> > > > management servers. I would think most of you, but it would be nice
> to
> > > hear
> > > > everyone voices. And do you get hosts going over their capacity
> limits?
> > > >
> > > > We discovered that during the VM allocation, if you get a lot of
> > parallel
> > > > requests to create new VMs, most notably with large profiles, the
> > > capacity
> > > > increase is done too far after the host capacity checks and results
> in
> > > > hosts going over their capacity limits. To detail the steps: the
> > > deployment
> > > > planner checks for cluster/host capacity and pick up one deployment
> > plan
> > > > (zone, cluster, host). The plan is stored in the database under a
> > VMwork
> > > > job and another thread picks that entry and starts the deployment,
> > > > increasing the host capacity and sending the commands. Here there's a
> > > time
> > > > gap between the host being picked up and the capacity increase for
> that
> > > > host of a couple of seconds, which is well enough to go over the
> > capacity
> > > > on one or more hosts. A few VMwork job can be added in the DB queue
> > > > targeting the same host before one gets picked up.
> > > >
> > > > To fix this issue, we're using Zookeeper to act as the multi JVM lock
> > > > manager thanks to their curator library (
> > > > https://curator.apache.org/curator-recipes/shared-lock.html). We
> also
> > > > changed the time when the capacity is increased, which occurs now
> > pretty
> > > > much after the deployment plan is found and inside the zookeeper
> lock.
> > > This
> > > > ensure we don't go over the capacity of any host, and it has been
> > proven
> > > > efficient since a month in our management server cluster.
> > > >
> > > > This adds another potential requirement which should be discuss
> before
> > > > proposing a PR. Today the code works seamlessly without ZK too, to
> > ensure
> > > > it's not a hard requirement, for example in a lab.
> > > >
> > > > Comments?
> > > >
> > > > Kind regards,
> > > > Marc-Aurèle
> > > >
> > >
> > >
> > >
> > > --
> > > With best regards, Ivan Kudryavtsev
> > > Bitworks Software, Ltd.
> > > Cell: +7-923-414-1515
> > > WWW: http://bitworks.software/ 
> > >
> >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
>
>
> --
> With best regards, Ivan Kudryavtsev
> Bitworks Software, Ltd.
> Cell: +7-923-414-1515
> WWW: http://bitworks.software/ 
>



-- 
Rafael Weingärtner


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Marc-Aurèle Brothier
We added the ZK lock to fix this issue, but we plan to eventually replace
all current locks with ZK ones. The ZK lock is already encapsulated in a
project with an interface, but more work should be done to have a proper
interface for locks which could be implemented with the "tool" you want:
either a DB lock for simplicity, or ZK for more advanced scenarios.

@Daan: you will need to add the ZK libraries in CS and have a running ZK
server somewhere. The configuration value is read from server.properties;
if the line is empty, the ZK client is not created and any lock request
returns immediately (without holding any lock).

@Rafael: ZK is pretty easy to set up and keep running, as long as you don't
put too much data in it. For our scenario here, with only locks, it's
easy. ZK would only be the gatekeeper for locks in the code, ensuring that
multiple JVMs can request a true lock.
From the code point of view, you open a connection to a ZK node (any node
of the cluster) and create a new InterProcessSemaphoreMutex, which handles
the locking mechanism.
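
To give a rough idea of the kind of pluggable lock interface I mean, here is
a sketch; the names below are illustrative, not the actual interface in our
branch:

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;

// Illustrative only: a small lock abstraction with a ZK-backed and a no-op
// implementation, so ZK stays optional (empty config -> no-op locks).
interface DistributedLock {
    boolean acquire(long timeout, TimeUnit unit) throws Exception;
    void release() throws Exception;
}

class ZkLock implements DistributedLock {
    private final InterProcessSemaphoreMutex mutex;

    ZkLock(CuratorFramework client, String path) {
        this.mutex = new InterProcessSemaphoreMutex(client, path);
    }

    @Override
    public boolean acquire(long timeout, TimeUnit unit) throws Exception {
        return mutex.acquire(timeout, unit);
    }

    @Override
    public void release() throws Exception {
        mutex.release();
    }
}

// Used when the ZK connection string in server.properties is empty:
// every lock request returns immediately without holding anything.
class NoOpLock implements DistributedLock {
    @Override
    public boolean acquire(long timeout, TimeUnit unit) { return true; }

    @Override
    public void release() { }
}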

On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev  wrote:

> Rafael,
>
> - It's easy to configure and run ZK either in single node or cluster
> - zookeeper should replace mysql locking mechanism used inside ACS code
> (places where ACS locks tables or rows).
>
> I don't think from the other size, that moving from MySQL locks to ZK locks
> is easy and light and (even implemetable) way.
>
> 2017-12-18 16:20 GMT+07:00 Rafael Weingärtner  >:
>
> > How hard is it to configure Zookeeper and get everything up and running?
> > BTW: what zookeeper would be managing? CloudStack management servers or
> > MySQL nodes?
> >
> > On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> > kudryavtsev...@bw-sw.com>
> > wrote:
> >
> > > Hello, Marc-Aurele, I strongly believe that all mysql locks should be
> > > removed in favour of truly DLM solution like Zookeeper. The performance
> > of
> > > 3node ZK ensemble should be enough to hold up to 1000-2000 locks per
> > second
> > > and it helps to move to truly clustered MySQL like galera without
> single
> > > master server.
> > >
> > > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :
> > >
> > > > Hi everyone,
> > > >
> > > > I was wondering how many of you are running CloudStack with a cluster
> > of
> > > > management servers. I would think most of you, but it would be nice
> to
> > > hear
> > > > everyone voices. And do you get hosts going over their capacity
> limits?
> > > >
> > > > We discovered that during the VM allocation, if you get a lot of
> > parallel
> > > > requests to create new VMs, most notably with large profiles, the
> > > capacity
> > > > increase is done too far after the host capacity checks and results
> in
> > > > hosts going over their capacity limits. To detail the steps: the
> > > deployment
> > > > planner checks for cluster/host capacity and pick up one deployment
> > plan
> > > > (zone, cluster, host). The plan is stored in the database under a
> > VMwork
> > > > job and another thread picks that entry and starts the deployment,
> > > > increasing the host capacity and sending the commands. Here there's a
> > > time
> > > > gap between the host being picked up and the capacity increase for
> that
> > > > host of a couple of seconds, which is well enough to go over the
> > capacity
> > > > on one or more hosts. A few VMwork job can be added in the DB queue
> > > > targeting the same host before one gets picked up.
> > > >
> > > > To fix this issue, we're using Zookeeper to act as the multi JVM lock
> > > > manager thanks to their curator library (
> > > > https://curator.apache.org/curator-recipes/shared-lock.html). We
> also
> > > > changed the time when the capacity is increased, which occurs now
> > pretty
> > > > much after the deployment plan is found and inside the zookeeper
> lock.
> > > This
> > > > ensure we don't go over the capacity of any host, and it has been
> > proven
> > > > efficient since a month in our management server cluster.
> > > >
> > > > This adds another potential requirement which should be discuss
> before
> > > > proposing a PR. Today the code works seamlessly without ZK too, to
> > ensure
> > > > it's not a hard requirement, for example in a lab.
> > > >
> > > > Comments?
> > > >
> > > > Kind regards,
> > > > Marc-Aurèle
> > > >
> > >
> > >
> > >
> > > --
> > > With best regards, Ivan Kudryavtsev
> > > Bitworks Software, Ltd.
> > > Cell: +7-923-414-1515
> > > WWW: http://bitworks.software/ 
> > >
> >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
>
>
> --
> With best regards, Ivan Kudryavtsev
> Bitworks Software, Ltd.
> Cell: +7-923-414-1515
> WWW: http://bitworks.software/ 
>


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
Do we have a framework to do this kind of locking in ZK?
I mean, you said "create a new InterProcessSemaphoreMutex which handles
the locking mechanism". This sounds like we would still have to open
and close these transactions manually, which is what causes a lot of our
headaches with transactions (it is not entirely the fault of MySQL locks,
but of our code structure).

On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier 
wrote:

> We added ZK lock for fix this issue but we will remove all current locks in
> ZK in favor of ZK one. The ZK lock is already encapsulated in a project
> with an interface, but more work should be done to have a proper interface
> for locks which could be implemented with the "tool" you want, either a DB
> lock for simplicity, or ZK for more advanced scenarios.
>
> @Daan you will need to add the ZK libraries in CS and have a running ZK
> server somewhere. The configuration value is read from the
> server.properties. If the line is empty, the ZK client is not created and
> any lock request will immediately return (not holding any lock).
>
> @Rafael: ZK is pretty easy to setup and have running, as long as you don't
> put too much data in it. Regarding our scenario here, with only locks, it's
> easy. ZK would be only the gatekeeper to locks in the code, ensuring that
> multi JVM can request a true lock.
> For the code point of view, you're opening a connection to a ZK node (any
> of a cluster) and you create a new InterProcessSemaphoreMutex which handles
> the locking mechanism.
>
> On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev <
> kudryavtsev...@bw-sw.com
> > wrote:
>
> > Rafael,
> >
> > - It's easy to configure and run ZK either in single node or cluster
> > - zookeeper should replace mysql locking mechanism used inside ACS code
> > (places where ACS locks tables or rows).
> >
> > I don't think from the other size, that moving from MySQL locks to ZK
> locks
> > is easy and light and (even implemetable) way.
> >
> > 2017-12-18 16:20 GMT+07:00 Rafael Weingärtner <
> rafaelweingart...@gmail.com
> > >:
> >
> > > How hard is it to configure Zookeeper and get everything up and
> running?
> > > BTW: what zookeeper would be managing? CloudStack management servers or
> > > MySQL nodes?
> > >
> > > On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> > > kudryavtsev...@bw-sw.com>
> > > wrote:
> > >
> > > > Hello, Marc-Aurele, I strongly believe that all mysql locks should be
> > > > removed in favour of truly DLM solution like Zookeeper. The
> performance
> > > of
> > > > 3node ZK ensemble should be enough to hold up to 1000-2000 locks per
> > > second
> > > > and it helps to move to truly clustered MySQL like galera without
> > single
> > > > master server.
> > > >
> > > > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier :
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I was wondering how many of you are running CloudStack with a
> cluster
> > > of
> > > > > management servers. I would think most of you, but it would be nice
> > to
> > > > hear
> > > > > everyone voices. And do you get hosts going over their capacity
> > limits?
> > > > >
> > > > > We discovered that during the VM allocation, if you get a lot of
> > > parallel
> > > > > requests to create new VMs, most notably with large profiles, the
> > > > capacity
> > > > > increase is done too far after the host capacity checks and results
> > in
> > > > > hosts going over their capacity limits. To detail the steps: the
> > > > deployment
> > > > > planner checks for cluster/host capacity and pick up one deployment
> > > plan
> > > > > (zone, cluster, host). The plan is stored in the database under a
> > > VMwork
> > > > > job and another thread picks that entry and starts the deployment,
> > > > > increasing the host capacity and sending the commands. Here
> there's a
> > > > time
> > > > > gap between the host being picked up and the capacity increase for
> > that
> > > > > host of a couple of seconds, which is well enough to go over the
> > > capacity
> > > > > on one or more hosts. A few VMwork job can be added in the DB queue
> > > > > targeting the same host before one gets picked up.
> > > > >
> > > > > To fix this issue, we're using Zookeeper to act as the multi JVM
> lock
> > > > > manager thanks to their curator library (
> > > > > https://curator.apache.org/curator-recipes/shared-lock.html). We
> > also
> > > > > changed the time when the capacity is increased, which occurs now
> > > pretty
> > > > > much after the deployment plan is found and inside the zookeeper
> > lock.
> > > > This
> > > > > ensure we don't go over the capacity of any host, and it has been
> > > proven
> > > > > efficient since a month in our management server cluster.
> > > > >
> > > > > This adds another potential requirement which should be discuss
> > before
> > > > > proposing a PR. Today the code works seamlessly without ZK too, to
> > > ensure
> > > > > it's not a hard requirement, for example in a lab.
> > > > >
> > > > > Comments?
> > > 

Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Marc-Aurèle Brothier
@Rafael: yes, there is a framework (Curator); it's the link I posted in my
first message: https://curator.apache.org/curator-recipes/shared-lock.html
This framework helps handle all the complexity of ZK.

The ZK client stays connected all the time (like the DB connection pool), and
only one connection (ZKClient) is needed to communicate with the ZK server.
The framework handles reconnection as well.

Have a look at the Curator website to understand its goal:
https://curator.apache.org/
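
For reference, creating that single long-lived client looks roughly like the
sketch below; the connection string and timeouts are example values:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkClientFactory {
    // Create the single long-lived client at server startup; Curator takes
    // care of retries and reconnection according to the retry policy.
    public static CuratorFramework createClient(String connectString) {
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString(connectString)   // e.g. "zk1:2181,zk2:2181,zk3:2181"
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .sessionTimeoutMs(30000)
                .build();
        client.start();
        return client;
    }
}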

On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> Do we have framework to do this kind of looking in ZK?
> I mean, you said " create a new InterProcessSemaphoreMutex which handles
> the locking mechanism.". This feels that we would have to continue opening
> and closing this transaction manually, which is what causes a lot of our
> headaches with transactions (it is not MySQL locks fault entirely, but our
> code structure).
>
> On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier 
> wrote:
>
> > We added ZK lock for fix this issue but we will remove all current locks
> in
> > ZK in favor of ZK one. The ZK lock is already encapsulated in a project
> > with an interface, but more work should be done to have a proper
> interface
> > for locks which could be implemented with the "tool" you want, either a
> DB
> > lock for simplicity, or ZK for more advanced scenarios.
> >
> > @Daan you will need to add the ZK libraries in CS and have a running ZK
> > server somewhere. The configuration value is read from the
> > server.properties. If the line is empty, the ZK client is not created and
> > any lock request will immediately return (not holding any lock).
> >
> > @Rafael: ZK is pretty easy to setup and have running, as long as you
> don't
> > put too much data in it. Regarding our scenario here, with only locks,
> it's
> > easy. ZK would be only the gatekeeper to locks in the code, ensuring that
> > multi JVM can request a true lock.
> > For the code point of view, you're opening a connection to a ZK node (any
> > of a cluster) and you create a new InterProcessSemaphoreMutex which
> handles
> > the locking mechanism.
> >
> > On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev <
> > kudryavtsev...@bw-sw.com
> > > wrote:
> >
> > > Rafael,
> > >
> > > - It's easy to configure and run ZK either in single node or cluster
> > > - zookeeper should replace mysql locking mechanism used inside ACS code
> > > (places where ACS locks tables or rows).
> > >
> > > I don't think from the other size, that moving from MySQL locks to ZK
> > locks
> > > is easy and light and (even implemetable) way.
> > >
> > > 2017-12-18 16:20 GMT+07:00 Rafael Weingärtner <
> > rafaelweingart...@gmail.com
> > > >:
> > >
> > > > How hard is it to configure Zookeeper and get everything up and
> > running?
> > > > BTW: what zookeeper would be managing? CloudStack management servers
> or
> > > > MySQL nodes?
> > > >
> > > > On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> > > > kudryavtsev...@bw-sw.com>
> > > > wrote:
> > > >
> > > > > Hello, Marc-Aurele, I strongly believe that all mysql locks should
> be
> > > > > removed in favour of truly DLM solution like Zookeeper. The
> > performance
> > > > of
> > > > > 3node ZK ensemble should be enough to hold up to 1000-2000 locks
> per
> > > > second
> > > > > and it helps to move to truly clustered MySQL like galera without
> > > single
> > > > > master server.
> > > > >
> > > > > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier  >:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I was wondering how many of you are running CloudStack with a
> > cluster
> > > > of
> > > > > > management servers. I would think most of you, but it would be
> nice
> > > to
> > > > > hear
> > > > > > everyone voices. And do you get hosts going over their capacity
> > > limits?
> > > > > >
> > > > > > We discovered that during the VM allocation, if you get a lot of
> > > > parallel
> > > > > > requests to create new VMs, most notably with large profiles, the
> > > > > capacity
> > > > > > increase is done too far after the host capacity checks and
> results
> > > in
> > > > > > hosts going over their capacity limits. To detail the steps: the
> > > > > deployment
> > > > > > planner checks for cluster/host capacity and pick up one
> deployment
> > > > plan
> > > > > > (zone, cluster, host). The plan is stored in the database under a
> > > > VMwork
> > > > > > job and another thread picks that entry and starts the
> deployment,
> > > > > > increasing the host capacity and sending the commands. Here
> > there's a
> > > > > time
> > > > > > gap between the host being picked up and the capacity increase
> for
> > > that
> > > > > > host of a couple of seconds, which is well enough to go over the
> > > > capacity
> > > > > > on one or more hosts. A few VMwork job can be added in the DB
> queue
> > > > > > targeting the same host before one gets picked up.
> > > > > >
> > > > > > To fix this iss

Re: MySQL HA

2017-12-18 Thread Rafael Weingärtner
Here is a fix:
https://www.dropbox.com/s/kgakhs3v05uz88x/cloud-framework-cluster-4.9.3.0.jar?dl=1
You need to replace this jar file in your CloudStack installation. You should
also back up the original jar and restore it as soon as you finish testing.
To replace the JARs, you need to stop ACS, swap the files, and only then start it again.

If everything works fine, I will open a PR against master, and with a bit
of luck we can push it into 4.11.

On Sat, Dec 16, 2017 at 8:03 AM, Alireza Eskandari 
wrote:

> I'm using CS 4.9.3.0-shapeblue0
>
> On Sat, Dec 16, 2017 at 12:49 PM, Rafael Weingärtner
>  wrote:
> > Awesome!
> > I found one method that might seem the cause of the problem.
> > What is the version of ACS that you are using?
> >
> > On Sat, Dec 16, 2017 at 4:10 AM, Alireza Eskandari <
> astro.alir...@gmail.com>
> > wrote:
> >
> >> Hi
> >>
> >> Gabriel,
> >> My configuration is same as your suggestion, but I get the errors.
> >>
> >> Rafael,
> >> You are right. I confirm that CS works normally but I get those
> warnings.
> >> I would make me happy to help you for this fix :)
> >>
> >>
> >> On Tue, Dec 12, 2017 at 3:30 PM, Rafael Weingärtner
> >>  wrote:
> >> > Alireza,
> >> > This is a warning and should not cause you much trouble. I have been
> >> trying
> >> > to pin point this problem for quite some time now.
> >> > If I generate a fix, would you be willing to test it?
> >> >
> >> > On Tue, Dec 12, 2017 at 8:56 AM, Gabriel Beims Bräscher <
> >> > gabrasc...@gmail.com> wrote:
> >> >
> >> >> Hi Alireza,
> >> >>
> >> >> I have production environments with Master to Master replication and
> >> >> we have no problems. We may need more details of your configuration.
> >> >> Have you configured the slave database? Are you sure that you
> configured
> >> >> correctly the ha heuristic?
> >> >>
> >> >> Considering that you already configured replication and "my.cnf", I
> will
> >> >> focus on the CloudSack db.properties file.
> >> >>
> >> >> When configuring Master-Master replication, you should have at
> >> >> /etc/cloudstack/management/db.properties something like:
> >> >> -
> >> >> db.cloud.autoReconnectForPools=true
> >> >>
> >> >> #High Availability And Cluster Properties
> >> >> db.ha.enabled=true
> >> >>
> >> >> db.cloud.queriesBeforeRetryMaster=5000
> >> >> db.usage.failOverReadOnly=false
> >> >> db.cloud.slaves=acs-db-02
> >> >>
> >> >> cluster.node.IP=
> >> >>
> >> >> db.usage.autoReconnect=true
> >> >>
> >> >> db.cloud.host=acs-db-01
> >> >> db.usage.host=acs-db-01
> >> >>
> >> >> #db.ha.loadBalanceStrategy=com.mysql.jdbc.SequentialBalanceStrategy
> >> >> db.ha.loadBalanceStrategy=com.cloud.utils.db.StaticStrategy
> >> >>
> >> >> db.cloud.failOverReadOnly=false
> >> >> db.usage.slaves=acs-db-02
> >> >> -
> >> >>
> >> >> "db.ha.loadBalanceStrategy" is confiugured with the heuristic
> >> >> "com.cloud.utils.db.StaticStrategy"
> >> >>
> >> >> "db.ha.enabled" need to be “true”
> >> >>
> >> >> The primary database is configured with the variable “db.cloud.host”.
> >> The
> >> >> secondary database(s) is(are) configured with the variable
> >> >> “db.usage.slaves”. One variable that is different from both Apache
> >> >> CloudStack servers is “cluster.node.IP”, being the ACS mgt IP.
> >> >> Additionally, you will need to create a folder
> >> >> “/usr/share/cloudstack-mysql-ha/lib/” and move the jar file
> >> >> “cloud-plugin-database-mysqlha-4.9.3.0.jar” into the new folder.
> >> >>
> >> >> -
> >> >> mkdir -p /usr/share/cloudstack-mysql-ha/lib/
> >> >> cp
> >> >> /usr/share/cloudstack-management/webapps/client/WEB-
> >> >> INF/lib/cloud-plugin-database-mysqlha-4.9.3.0.jar
> >> >> /usr/share/cloudstack-mysql-ha/lib/
> >> >> -
> >> >>
> >> >> Cheers,
> >> >> Gabriel.
> >> >>
> >> >> 2017-12-12 6:30 GMT-02:00 Alireza Eskandari  >:
> >> >>
> >> >> > I have opened a new jira ticket about this problem:
> >> >> > https://issues.apache.org/jira/browse/CLOUDSTACK-10186
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Rafael Weingärtner
> >>
> >
> >
> >
> > --
> > Rafael Weingärtner
>



-- 
Rafael Weingärtner


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
I did not check the link before. Sorry about that.

Reading some of the pages there, I see Curator more as a client library,
such as the MySQL JDBC client.

When I mentioned a framework, I was thinking of something like Spring Data.
So, we could simply rely on the framework to manage connections and
transactions. For instance, we could define a pattern that opens
connections with a read-only transaction. And then, we could annotate
methods that write to the database with something like
@Transactional(readOnly = false). If we are going for a change like this, we
need to remove manually opened connections and transactions. Also, we have to
remove the transaction management code from our code base.

I would like to see something like this [1] in our future: no manually
written transaction code, and no transaction management in our code base;
just simple annotation usage or a transaction pattern in Spring XML files.

[1]
https://github.com/rafaelweingartner/daily-tasks/blob/master/src/main/java/br/com/supero/desafio/services/TaskService.java
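
As a rough illustration of that declarative style (the service and
repository names below are made up, not existing CloudStack classes):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical repository used only to make the example self-contained.
interface HostCapacityRepository {
    long findUsedCapacity(long hostId);
    void increaseUsedCapacity(long hostId, long delta);
}

// No manual transaction handling: Spring opens, commits and rolls back
// the transaction around each annotated method.
@Service
@Transactional(readOnly = true) // default for the class: read-only transactions
public class VmCapacityService {

    private final HostCapacityRepository repository;

    public VmCapacityService(HostCapacityRepository repository) {
        this.repository = repository;
    }

    public long findUsedCapacity(long hostId) {
        return repository.findUsedCapacity(hostId); // runs in a read-only transaction
    }

    @Transactional // readOnly = false by default: a writing transaction
    public void increaseCapacity(long hostId, long delta) {
        repository.increaseUsedCapacity(hostId, delta);
    }
}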

On Mon, Dec 18, 2017 at 8:32 AM, Marc-Aurèle Brothier 
wrote:

> @rafael, yes there is a framework (curator), it's the link I posted in my
> first message: https://curator.apache.org/curator-recipes/shared-lock.html
> This framework helps handling all the complexity of ZK.
>
> The ZK client stays connected all the time (as the DB connection pool), and
> only one connection (ZKClient) is needed to communicate with the ZK server.
> The framework handles reconnection as well.
>
> Have a look at ehc curator website to understand its goal:
> https://curator.apache.org/
>
> On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > Do we have framework to do this kind of looking in ZK?
> > I mean, you said " create a new InterProcessSemaphoreMutex which handles
> > the locking mechanism.". This feels that we would have to continue
> opening
> > and closing this transaction manually, which is what causes a lot of our
> > headaches with transactions (it is not MySQL locks fault entirely, but
> our
> > code structure).
> >
> > On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier  >
> > wrote:
> >
> > > We added ZK lock for fix this issue but we will remove all current
> locks
> > in
> > > ZK in favor of ZK one. The ZK lock is already encapsulated in a project
> > > with an interface, but more work should be done to have a proper
> > interface
> > > for locks which could be implemented with the "tool" you want, either a
> > DB
> > > lock for simplicity, or ZK for more advanced scenarios.
> > >
> > > @Daan you will need to add the ZK libraries in CS and have a running ZK
> > > server somewhere. The configuration value is read from the
> > > server.properties. If the line is empty, the ZK client is not created
> and
> > > any lock request will immediately return (not holding any lock).
> > >
> > > @Rafael: ZK is pretty easy to setup and have running, as long as you
> > don't
> > > put too much data in it. Regarding our scenario here, with only locks,
> > it's
> > > easy. ZK would be only the gatekeeper to locks in the code, ensuring
> that
> > > multi JVM can request a true lock.
> > > For the code point of view, you're opening a connection to a ZK node
> (any
> > > of a cluster) and you create a new InterProcessSemaphoreMutex which
> > handles
> > > the locking mechanism.
> > >
> > > On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev <
> > > kudryavtsev...@bw-sw.com
> > > > wrote:
> > >
> > > > Rafael,
> > > >
> > > > - It's easy to configure and run ZK either in single node or cluster
> > > > - zookeeper should replace mysql locking mechanism used inside ACS
> code
> > > > (places where ACS locks tables or rows).
> > > >
> > > > I don't think from the other size, that moving from MySQL locks to ZK
> > > locks
> > > > is easy and light and (even implemetable) way.
> > > >
> > > > 2017-12-18 16:20 GMT+07:00 Rafael Weingärtner <
> > > rafaelweingart...@gmail.com
> > > > >:
> > > >
> > > > > How hard is it to configure Zookeeper and get everything up and
> > > running?
> > > > > BTW: what zookeeper would be managing? CloudStack management
> servers
> > or
> > > > > MySQL nodes?
> > > > >
> > > > > On Mon, Dec 18, 2017 at 7:13 AM, Ivan Kudryavtsev <
> > > > > kudryavtsev...@bw-sw.com>
> > > > > wrote:
> > > > >
> > > > > > Hello, Marc-Aurele, I strongly believe that all mysql locks
> should
> > be
> > > > > > removed in favour of truly DLM solution like Zookeeper. The
> > > performance
> > > > > of
> > > > > > 3node ZK ensemble should be enough to hold up to 1000-2000 locks
> > per
> > > > > second
> > > > > > and it helps to move to truly clustered MySQL like galera without
> > > > single
> > > > > > master server.
> > > > > >
> > > > > > 2017-12-18 15:33 GMT+07:00 Marc-Aurèle Brothier <
> ma...@exoscale.ch
> > >:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I was wondering how many of you are running CloudStack with a
> > > clus

Re: MySQL HA

2017-12-18 Thread L Radhakrishna Rao
On 18-Dec-2017 4:03 PM, "Rafael Weingärtner" 
wrote:

> Here is a fix:
> https://www.dropbox.com/s/kgakhs3v05uz88x/cloud-
> framework-cluster-4.9.3.0.jar?dl=1
> You need to replace this jar file in CloudStack installation. You should
> also backup the original jar and restore it as soon as you finish testing.
> To replace the JARs, you need to stop ACS, and just then start it.
>
> If everything works fine, I will open a PR against master, and with a bit
> of luck we can push it into 4.11
>
> On Sat, Dec 16, 2017 at 8:03 AM, Alireza Eskandari <
> astro.alir...@gmail.com>
> wrote:
>
> > I'm using CS 4.9.3.0-shapeblue0
> >
> > On Sat, Dec 16, 2017 at 12:49 PM, Rafael Weingärtner
> >  wrote:
> > > Awesome!
> > > I found one method that might seem the cause of the problem.
> > > What is the version of ACS that you are using?
> > >
> > > On Sat, Dec 16, 2017 at 4:10 AM, Alireza Eskandari <
> > astro.alir...@gmail.com>
> > > wrote:
> > >
> > >> Hi
> > >>
> > >> Gabriel,
> > >> My configuration is same as your suggestion, but I get the errors.
> > >>
> > >> Rafael,
> > >> You are right. I confirm that CS works normally but I get those
> > warnings.
> > >> I would make me happy to help you for this fix :)
> > >>
> > >>
> > >> On Tue, Dec 12, 2017 at 3:30 PM, Rafael Weingärtner
> > >>  wrote:
> > >> > Alireza,
> > >> > This is a warning and should not cause you much trouble. I have been
> > >> trying
> > >> > to pin point this problem for quite some time now.
> > >> > If I generate a fix, would you be willing to test it?
> > >> >
> > >> > On Tue, Dec 12, 2017 at 8:56 AM, Gabriel Beims Bräscher <
> > >> > gabrasc...@gmail.com> wrote:
> > >> >
> > >> >> Hi Alireza,
> > >> >>
> > >> >> I have production environments with Master to Master replication
> and
> > >> >> we have no problems. We may need more details of your
> configuration.
> > >> >> Have you configured the slave database? Are you sure that you
> > configured
> > >> >> correctly the ha heuristic?
> > >> >>
> > >> >> Considering that you already configured replication and "my.cnf", I
> > will
> > >> >> focus on the CloudSack db.properties file.
> > >> >>
> > >> >> When configuring Master-Master replication, you should have at
> > >> >> /etc/cloudstack/management/db.properties something like:
> > >> >> -
> > >> >> db.cloud.autoReconnectForPools=true
> > >> >>
> > >> >> #High Availability And Cluster Properties
> > >> >> db.ha.enabled=true
> > >> >>
> > >> >> db.cloud.queriesBeforeRetryMaster=5000
> > >> >> db.usage.failOverReadOnly=false
> > >> >> db.cloud.slaves=acs-db-02
> > >> >>
> > >> >> cluster.node.IP=
> > >> >>
> > >> >> db.usage.autoReconnect=true
> > >> >>
> > >> >> db.cloud.host=acs-db-01
> > >> >> db.usage.host=acs-db-01
> > >> >>
> > >> >> #db.ha.loadBalanceStrategy=com.mysql.jdbc.
> SequentialBalanceStrategy
> > >> >> db.ha.loadBalanceStrategy=com.cloud.utils.db.StaticStrategy
> > >> >>
> > >> >> db.cloud.failOverReadOnly=false
> > >> >> db.usage.slaves=acs-db-02
> > >> >> -
> > >> >>
> > >> >> "db.ha.loadBalanceStrategy" is confiugured with the heuristic
> > >> >> "com.cloud.utils.db.StaticStrategy"
> > >> >>
> > >> >> "db.ha.enabled" need to be “true”
> > >> >>
> > >> >> The primary database is configured with the variable
> “db.cloud.host”.
> > >> The
> > >> >> secondary database(s) is(are) configured with the variable
> > >> >> “db.usage.slaves”. One variable that is different from both Apache
> > >> >> CloudStack servers is “cluster.node.IP”, being the ACS mgt IP.
> > >> >> Additionally, you will need to create a folder
> > >> >> “/usr/share/cloudstack-mysql-ha/lib/” and move the jar file
> > >> >> “cloud-plugin-database-mysqlha-4.9.3.0.jar” into the new folder.
> > >> >>
> > >> >> -
> > >> >> mkdir -p /usr/share/cloudstack-mysql-ha/lib/
> > >> >> cp
> > >> >> /usr/share/cloudstack-management/webapps/client/WEB-
> > >> >> INF/lib/cloud-plugin-database-mysqlha-4.9.3.0.jar
> > >> >> /usr/share/cloudstack-mysql-ha/lib/
> > >> >> -
> > >> >>
> > >> >> Cheers,
> > >> >> Gabriel.
> > >> >>
> > >> >> 2017-12-12 6:30 GMT-02:00 Alireza Eskandari <
> astro.alir...@gmail.com
> > >:
> > >> >>
> > >> >> > I have opened a new jira ticket about this problem:
> > >> >> > https://issues.apache.org/jira/browse/CLOUDSTACK-10186
> > >> >> >
> > >> >>
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Rafael Weingärtner
> > >>
> > >
> > >
> > >
> > > --
> > > Rafael Weingärtner
> >
>
>
>
> --
> Rafael Weingärtner
>


Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Marc-Aurèle Brothier
I understand your point, but there isn't any "transaction" in ZK.
Transactions and commits are really a DB concept and are not part of ZK. All
entries (if you start writing data into some nodes) are versioned. For
example, you could enforce that, to overwrite a node's value, you must submit
the node data together with its last version id, to ensure you are
overwriting the latest value/state of that node. Bear in mind that you
should not put too much data into ZK; it's not a database replacement,
nor a NoSQL DB.

The ZK client (a CuratorFramework object) is started at server startup, and
you only need to pass it along with your calls so that the connection is
reused or retried, depending on its state. Nothing has to be done manually;
it's all handled by the Curator library.
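
A rough example of such a versioned overwrite with Curator (the node path
and data are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

public class VersionedWriteSketch {
    // The update is rejected with a BadVersionException if someone else
    // modified the node between the read and the write.
    static void updateNode(CuratorFramework zk, String path, byte[] newData) throws Exception {
        Stat stat = new Stat();
        zk.getData().storingStatIn(stat).forPath(path); // read current value and version

        try {
            zk.setData().withVersion(stat.getVersion()).forPath(path, newData);
        } catch (KeeperException.BadVersionException e) {
            // A newer version was written in the meantime; re-read and retry if needed.
        }
    }
}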

On Mon, Dec 18, 2017 at 11:44 AM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> I did not check the link before. Sorry about that.
>
> Reading some of the pages there, I see curator more like a client library
> such as MySQL JDBC client.
>
> When I mentioned framework, I was looking for something like Spring-data.
> So, we could simply rely on the framework to manage connections and
> transactions. For instance, we could define a pattern that would open
> connection with a read-only transaction. And then, we could annotate
> methods that would write in the database something with
> @Transactional(readonly = false). If we are going to a change like this we
> need to remove manually open connections and transactions. Also, we have to
> remove the transaction management code from our code base.
>
> I would like to see something like this [1] in our future. No manually
> written transaction code, and no transaction management in our code base.
> Just simple annotation usage or transaction pattern in Spring XML files.
>
> [1]
> https://github.com/rafaelweingartner/daily-tasks/
> blob/master/src/main/java/br/com/supero/desafio/services/TaskService.java
>
> On Mon, Dec 18, 2017 at 8:32 AM, Marc-Aurèle Brothier 
> wrote:
>
> > @rafael, yes there is a framework (curator), it's the link I posted in my
> > first message: https://curator.apache.org/curator-recipes/shared-lock.
> html
> > This framework helps handling all the complexity of ZK.
> >
> > The ZK client stays connected all the time (as the DB connection pool),
> and
> > only one connection (ZKClient) is needed to communicate with the ZK
> server.
> > The framework handles reconnection as well.
> >
> > Have a look at ehc curator website to understand its goal:
> > https://curator.apache.org/
> >
> > On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
> > rafaelweingart...@gmail.com> wrote:
> >
> > > Do we have framework to do this kind of looking in ZK?
> > > I mean, you said " create a new InterProcessSemaphoreMutex which
> handles
> > > the locking mechanism.". This feels that we would have to continue
> > opening
> > > and closing this transaction manually, which is what causes a lot of
> our
> > > headaches with transactions (it is not MySQL locks fault entirely, but
> > our
> > > code structure).
> > >
> > > On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier <
> ma...@exoscale.ch
> > >
> > > wrote:
> > >
> > > > We added ZK lock for fix this issue but we will remove all current
> > locks
> > > in
> > > > ZK in favor of ZK one. The ZK lock is already encapsulated in a
> project
> > > > with an interface, but more work should be done to have a proper
> > > interface
> > > > for locks which could be implemented with the "tool" you want,
> either a
> > > DB
> > > > lock for simplicity, or ZK for more advanced scenarios.
> > > >
> > > > @Daan you will need to add the ZK libraries in CS and have a running
> ZK
> > > > server somewhere. The configuration value is read from the
> > > > server.properties. If the line is empty, the ZK client is not created
> > and
> > > > any lock request will immediately return (not holding any lock).
> > > >
> > > > @Rafael: ZK is pretty easy to setup and have running, as long as you
> > > don't
> > > > put too much data in it. Regarding our scenario here, with only
> locks,
> > > it's
> > > > easy. ZK would be only the gatekeeper to locks in the code, ensuring
> > that
> > > > multi JVM can request a true lock.
> > > > For the code point of view, you're opening a connection to a ZK node
> > (any
> > > > of a cluster) and you create a new InterProcessSemaphoreMutex which
> > > handles
> > > > the locking mechanism.
> > > >
> > > > On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev <
> > > > kudryavtsev...@bw-sw.com
> > > > > wrote:
> > > >
> > > > > Rafael,
> > > > >
> > > > > - It's easy to configure and run ZK either in single node or
> cluster
> > > > > - zookeeper should replace mysql locking mechanism used inside ACS
> > code
> > > > > (places where ACS locks tables or rows).
> > > > >
> > > > > I don't think from the other size, that moving from MySQL locks to
> ZK
> > > > locks
> > > > > is easy and light and (even implemetable) way.
> > >

Re: MySQL HA

2017-12-18 Thread Alireza Eskandari
Thank you Rafael,
I tested your fix and it seems that I got the expected result. You
can see the exception raised during the database failover below.
I should note that I replaced the file for both cloudstack-management and
cloudstack-usage:
/usr/share/cloudstack-usage/lib/cloud-framework-cluster-4.9.3.0.jar
/usr/share/cloudstack-management/webapps/client/WEB-INF/lib/cloud-framework-cluster-4.9.3.0.jar


Logs:

WARN  [c.c.c.d.ManagementServerHostDaoImpl]
(Cluster-Heartbeat-1:ctx-073cca55) (logid:e652d00b) Unexpected
exception,
com.cloud.utils.exception.CloudRuntimeException: Unable to commit or
close the connection.
at 
com.cloud.utils.db.TransactionLegacy.commit(TransactionLegacy.java:740)
at 
com.cloud.cluster.dao.ManagementServerHostDaoImpl.update(ManagementServerHostDaoImpl.java:140)
at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
at 
com.cloud.utils.db.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:34)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
at 
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at 
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy203.update(Unknown Source)
at 
com.cloud.cluster.ClusterManagerImpl$4.runInContext(ClusterManagerImpl.java:555)
at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at 
org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:473)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException:
Deadlock found when trying to get lock; try restarting transaction
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
... 46 more
INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
(AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) Begin cleanup
expired async-jobs
INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
(AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) End cleanup
expired async-jobs
ERROR [c.c.u.d.ConnectionConcierge]
(ConnectionConcierge-1:ctx-d3460aeb) (logid:b8c62262) Unable to keep
the db connection for LockMaster1
java.sql.SQLException: Connection was killed
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2625)
at 
com.mysql.jdbc.LoadBalancedMySQLConnection.execSQL(LoadBalancedMySQLConnection.java:155)
at 
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2119)
at 
com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2283)
at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
  

Re: MySQL HA

2017-12-18 Thread Rafael Weingärtner
So, this fixed the problem?
Can you keep this running for a while longer, just to make sure? Then I
can open a PR to fix it in master.

On Mon, Dec 18, 2017 at 9:02 AM, Alireza Eskandari 
wrote:

> Thank you Rafael,
> I test your fix and it seems that I have got the expected result. You
> can see the exception raised for database failover.
> I should notice I replace the file for cloudstack-mnagement and
> cloudstack-usage:
> /usr/share/cloudstack-usage/lib/cloud-framework-cluster-4.9.3.0.jar
> /usr/share/cloudstack-management/webapps/client/WEB-
> INF/lib/cloud-framework-cluster-4.9.3.0.jar
>
>
> Logs:
>
> WARN  [c.c.c.d.ManagementServerHostDaoImpl]
> (Cluster-Heartbeat-1:ctx-073cca55) (logid:e652d00b) Unexpected
> exception,
> com.cloud.utils.exception.CloudRuntimeException: Unable to commit or
> close the connection.
> at com.cloud.utils.db.TransactionLegacy.commit(
> TransactionLegacy.java:740)
> at com.cloud.cluster.dao.ManagementServerHostDaoImpl.update(
> ManagementServerHostDaoImpl.java:140)
> at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.springframework.aop.support.AopUtils.
> invokeJoinpointUsingReflection(AopUtils.java:317)
> at org.springframework.aop.framework.ReflectiveMethodInvocation.
> invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> at org.springframework.aop.framework.ReflectiveMethodInvocation.
> proceed(ReflectiveMethodInvocation.java:150)
> at com.cloud.utils.db.TransactionContextInterceptor.invoke(
> TransactionContextInterceptor.java:34)
> at org.springframework.aop.framework.ReflectiveMethodInvocation.
> proceed(ReflectiveMethodInvocation.java:161)
> at org.springframework.aop.interceptor.
> ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
> at org.springframework.aop.framework.ReflectiveMethodInvocation.
> proceed(ReflectiveMethodInvocation.java:172)
> at org.springframework.aop.framework.JdkDynamicAopProxy.
> invoke(JdkDynamicAopProxy.java:204)
> at com.sun.proxy.$Proxy203.update(Unknown Source)
> at com.cloud.cluster.ClusterManagerImpl$4.runInContext(
> ClusterManagerImpl.java:555)
> at org.apache.cloudstack.managed.context.
> ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
> at org.apache.cloudstack.managed.context.impl.
> DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> at org.apache.cloudstack.managed.context.impl.
> DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> at org.apache.cloudstack.managed.context.impl.
> DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> at org.apache.cloudstack.managed.context.
> ManagedContextRunnable.run(ManagedContextRunnable.java:46)
> at java.util.concurrent.Executors$RunnableAdapter.
> call(Executors.java:473)
> at java.util.concurrent.FutureTask.runAndReset(
> FutureTask.java:304)
> at java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> at java.util.concurrent.ScheduledThreadPoolExecutor$
> ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1152)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:622)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackExcept
> ion:
> Deadlock found when trying to get lock; try restarting transaction
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> DelegatingConstructorAccessorImpl.java:45)
> ... 46 more
> INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
> (AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) Begin cleanup
> expired async-jobs
> INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
> (AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) End cleanup
> expired async-jobs
> ERROR [c.c.u.d.ConnectionConcierge]
> (ConnectionConcierge-1:ctx-d3460aeb) (logid:b8c62262) Unable to keep
> the db connection for LockMaster1
> java.sql.SQLException: Connection was killed
> at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3597)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3529)
> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1990)
> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2151)
> at com.mysql.jdbc.ConnectionImpl.execSQL

Re: MySQL HA

2017-12-18 Thread Alireza Eskandari
Yes, I'll keep it and do some stress tests on it to be sure about its
functionality.

On Dec 18, 2017 14:53, "Rafael Weingärtner" 
wrote:

> So, this fixed the problem?
> Can you keep this running for a while longer? Just to make sure. Then, I
> can open a PR to fix it in master.
>
> On Mon, Dec 18, 2017 at 9:02 AM, Alireza Eskandari <
> astro.alir...@gmail.com>
> wrote:
>
> > Thank you Rafael,
> > I test your fix and it seems that I have got the expected result. You
> > can see the exception raised for database failover.
> > I should notice I replace the file for cloudstack-mnagement and
> > cloudstack-usage:
> > /usr/share/cloudstack-usage/lib/cloud-framework-cluster-4.9.3.0.jar
> > /usr/share/cloudstack-management/webapps/client/WEB-
> > INF/lib/cloud-framework-cluster-4.9.3.0.jar
> >
> >
> > Logs:
> >
> > WARN  [c.c.c.d.ManagementServerHostDaoImpl]
> > (Cluster-Heartbeat-1:ctx-073cca55) (logid:e652d00b) Unexpected
> > exception,
> > com.cloud.utils.exception.CloudRuntimeException: Unable to commit or
> > close the connection.
> > at com.cloud.utils.db.TransactionLegacy.commit(
> > TransactionLegacy.java:740)
> > at com.cloud.cluster.dao.ManagementServerHostDaoImpl.update(
> > ManagementServerHostDaoImpl.java:140)
> > at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
> > at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> > DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at org.springframework.aop.support.AopUtils.
> > invokeJoinpointUsingReflection(AopUtils.java:317)
> > at org.springframework.aop.framework.ReflectiveMethodInvocation.
> > invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> > at org.springframework.aop.framework.ReflectiveMethodInvocation.
> > proceed(ReflectiveMethodInvocation.java:150)
> > at com.cloud.utils.db.TransactionContextInterceptor.invoke(
> > TransactionContextInterceptor.java:34)
> > at org.springframework.aop.framework.ReflectiveMethodInvocation.
> > proceed(ReflectiveMethodInvocation.java:161)
> > at org.springframework.aop.interceptor.
> > ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
> > at org.springframework.aop.framework.ReflectiveMethodInvocation.
> > proceed(ReflectiveMethodInvocation.java:172)
> > at org.springframework.aop.framework.JdkDynamicAopProxy.
> > invoke(JdkDynamicAopProxy.java:204)
> > at com.sun.proxy.$Proxy203.update(Unknown Source)
> > at com.cloud.cluster.ClusterManagerImpl$4.runInContext(
> > ClusterManagerImpl.java:555)
> > at org.apache.cloudstack.managed.context.
> > ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
> > at org.apache.cloudstack.managed.context.impl.
> > DefaultManagedContext$1.call(DefaultManagedContext.java:56)
> > at org.apache.cloudstack.managed.context.impl.
> > DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
> > at org.apache.cloudstack.managed.context.impl.
> > DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
> > at org.apache.cloudstack.managed.context.
> > ManagedContextRunnable.run(ManagedContextRunnable.java:46)
> > at java.util.concurrent.Executors$RunnableAdapter.
> > call(Executors.java:473)
> > at java.util.concurrent.FutureTask.runAndReset(
> > FutureTask.java:304)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$
> > ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> > at java.util.concurrent.ScheduledThreadPoolExecutor$
> > ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1152)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> > ThreadPoolExecutor.java:622)
> > at java.lang.Thread.run(Thread.java:748)
> > Caused by: com.mysql.jdbc.exceptions.jdbc4.
> MySQLTransactionRollbackExcept
> > ion:
> > Deadlock found when trying to get lock; try restarting transaction
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance(
> > NativeConstructorAccessorImpl.java:57)
> > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> > DelegatingConstructorAccessorImpl.java:45)
> > ... 46 more
> > INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
> > (AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) Begin cleanup
> > expired async-jobs
> > INFO  [o.a.c.f.j.i.AsyncJobManagerImpl]
> > (AsyncJobMgr-Heartbeat-1:ctx-5ef0f4d1) (logid:4bfa48b2) End cleanup
> > expired async-jobs
> > ERROR [c.c.u.d.ConnectionConcierge]
> > (ConnectionConcierge-1:ctx-d3460aeb) (logid:b8c62262) Unable to keep
> > the db connection for LockMaster1
> > java.sql.SQLException: Connection was killed
> > at com.mysql.jdbc.SQLError.createSQLExc

Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
So, we would need to change every piece of code that opens and uses
connections and transactions over to the ZK model? I mean, to direct the
flow to ZK.

On Mon, Dec 18, 2017 at 8:55 AM, Marc-Aurèle Brothier 
wrote:

> I understand your point, but there isn't any "transaction" in ZK. The
> transaction and commit stuff are really for DB and not part of ZK. All
> entries (if you start writing data in some nodes) are versioned. For
> example you could enforce that to overwrite a node value you must submit
> the node data having the same last version id to ensure you were
> overwriting from the latest value/state of that node. Bear in mind that you
> should not put too much data into your ZK, it's not a database replacement,
> neither a nosql db.
>
> The ZK client (CuratorFramework object) is started on the server startup,
> and you only need to pass it along your calls so that the connection is
> reused, or retried, depending on the state. Nothing manual has to be done,
> it's all in this curator library.
>
> On Mon, Dec 18, 2017 at 11:44 AM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > I did not check the link before. Sorry about that.
> >
> > Reading some of the pages there, I see curator more like a client library
> > such as MySQL JDBC client.
> >
> > When I mentioned framework, I was looking for something like Spring-data.
> > So, we could simply rely on the framework to manage connections and
> > transactions. For instance, we could define a pattern that would open
> > connection with a read-only transaction. And then, we could annotate
> > methods that would write in the database something with
> > @Transactional(readonly = false). If we are going to a change like this
> we
> > need to remove manually open connections and transactions. Also, we have
> to
> > remove the transaction management code from our code base.
> >
> > I would like to see something like this [1] in our future. No manually
> > written transaction code, and no transaction management in our code base.
> > Just simple annotation usage or transaction pattern in Spring XML files.
> >
> > [1]
> > https://github.com/rafaelweingartner/daily-tasks/
> > blob/master/src/main/java/br/com/supero/desafio/services/
> TaskService.java
> >
> > On Mon, Dec 18, 2017 at 8:32 AM, Marc-Aurèle Brothier  >
> > wrote:
> >
> > > @rafael, yes there is a framework (curator), it's the link I posted in
> my
> > > first message: https://curator.apache.org/curator-recipes/shared-lock.
> > html
> > > This framework helps handling all the complexity of ZK.
> > >
> > > The ZK client stays connected all the time (as the DB connection pool),
> > and
> > > only one connection (ZKClient) is needed to communicate with the ZK
> > server.
> > > The framework handles reconnection as well.
> > >
> > > Have a look at ehc curator website to understand its goal:
> > > https://curator.apache.org/
> > >
> > > On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
> > > rafaelweingart...@gmail.com> wrote:
> > >
> > > > Do we have framework to do this kind of looking in ZK?
> > > > I mean, you said " create a new InterProcessSemaphoreMutex which
> > handles
> > > > the locking mechanism.". This feels that we would have to continue
> > > opening
> > > > and closing this transaction manually, which is what causes a lot of
> > our
> > > > headaches with transactions (it is not MySQL locks fault entirely,
> but
> > > our
> > > > code structure).
> > > >
> > > > On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier <
> > ma...@exoscale.ch
> > > >
> > > > wrote:
> > > >
> > > > > We added ZK lock for fix this issue but we will remove all current
> > > locks
> > > > in
> > > > > ZK in favor of ZK one. The ZK lock is already encapsulated in a
> > project
> > > > > with an interface, but more work should be done to have a proper
> > > > interface
> > > > > for locks which could be implemented with the "tool" you want,
> > either a
> > > > DB
> > > > > lock for simplicity, or ZK for more advanced scenarios.
> > > > >
> > > > > @Daan you will need to add the ZK libraries in CS and have a
> running
> > ZK
> > > > > server somewhere. The configuration value is read from the
> > > > > server.properties. If the line is empty, the ZK client is not
> created
> > > and
> > > > > any lock request will immediately return (not holding any lock).
> > > > >
> > > > > @Rafael: ZK is pretty easy to setup and have running, as long as
> you
> > > > don't
> > > > > put too much data in it. Regarding our scenario here, with only
> > locks,
> > > > it's
> > > > > easy. ZK would be only the gatekeeper to locks in the code,
> ensuring
> > > that
> > > > > multi JVM can request a true lock.
> > > > > For the code point of view, you're opening a connection to a ZK
> node
> > > (any
> > > > > of a cluster) and you create a new InterProcessSemaphoreMutex which
> > > > handles
> > > > > the locking mechanism.
> > > > >
> > > > > On Mon, Dec 18, 2017 at 10:24 AM, Ivan Kudryavtsev <
> > > > > kudrya

Re: Clean up old and obsolete branches

2017-12-18 Thread Rafael Weingärtner
Guys, this is the moment to give your opinion here. Since nobody has
commented on the protocol, I will just add some more steps before
deletion.

   1. Only maintain the master and major release branches. We currently
   have a system of X.Y.Z.S. I define major release here as a release that
   changes either ((X or Y) or (X and Y));
   2. We will use tags for versioning. Therefore, all versions we release
   are tagged accordingly, including minor and security releases;
   3. When releasing, the “SNAPSHOT” is removed and the branch of the
   version is created (if the version is being cut from master). Rule (1)
   is applied here; therefore, only major releases will receive branches.
   Every release must have a tag in the format X.Y.Z.S. After releasing, we
   bump the pom of the version to the next available SNAPSHOT;
   4. If there's a need to fix an old version, we work on the HEAD of the
   corresponding release branch;
   5. People should avoid using the official apache repository to store
   working branches. If we want to work together on some issues, we can set up
   a fork and give permission to interested parties (the official repository
   is restricted to committers). If one uses the official repository, the
   branch used must be cleaned right after merging;
   6. Branches not following these rules will be removed if they have not
   received attention (commits) for over 6 (six) months;
   7. Before the removal of a branch in the official repository it is
   mandatory to create a Jira ticket and send a notification email to
   CloudStack’s dev mailing list. If there are no objections, the branch can
   be deleted seven (7) business days after the notification email is sent;
   8. After the branch removal, the Jira ticket must be closed.


 I will wait two more days. If we do not get any further comments here, I will
call for a vote, and then if there are no objections I will write the
protocol on our Wiki. Afterwards, we can start removing branches (following
the defined protocol).

On Thu, Dec 14, 2017 at 5:08 PM, Daan Hoogland 
wrote:

> sounds lime a lazy consensus vote; +1 from me
>
> On Thu, Dec 14, 2017 at 7:07 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > Guys,
> >
> > Khosrow has done a great job here, but we still need to move this forward
> > and create a standard/guidelines on how to use the official repo. Looking
> > at the list in [1] we can clearly see that things are messy.  This is a
> > minor discussion, but in my opinion, we should finish it.
> >
> > [1] https://issues.apache.org/jira/browse/CLOUDSTACK-10169
> >
> > I will propose the following regarding the use of the official
> repository.
> > I will be waiting for your feedback, and then proceed with a vote.
> >
> >1. Only maintain the master and major release branches. We currently
> >have a system of X.Y.Z.S. I define major release here as a release
> that
> >changes either ((X or Y) or (X and Y));
> >2. We will use tags for versioning. Therefore, all versions we release
> >are tagged accordingly, including minor and security releases;
> >3. When releasing the “SNAPSHOT” is removed and the branch of the
> >version is created (if the version is being cut from master). Rule (1)
> > one
> >is applied here; therefore, only major releases will receive branches.
> >Every release must have a tag in the format X.Y.Z.S. After releasing,
> we
> >bump the pom of the version to next available SNAPSHOT;
> >4. If there's a need to fix an old version, we work on HEAD of
> >corresponding release branch;
> >5. People should avoid using the official apache repository to store
> >working branches. If we want to work together on some issues, we can
> > set up
> >a fork and give permission to interested parties (the official
> > repository
> >is restricted to committers). If one uses the official repository, the
> >branch used must be cleaned right after merging;
> >6. Branches not following these rules will be removed if they have not
> >received attention (commits) for over 6 (six) months.
> >
> > I think that is all. Do you guys have additions/removals/proposals so we
> > can move this forward?
> >
> > On Mon, Dec 4, 2017 at 7:20 PM, Khosrow Moossavi  >
> > wrote:
> >
> > > I agree Erik. I updated the list in CLOUDSTACK-10169 with more
> > information
> > > (last updated, last commit, HEAD on master and PR status/number) to
> give
> > us
> > > more immediate visibility of the status of those branches. So any
> > branches
> > > can
> > > be deleted if:
> > >
> > > - which its HEAD exists on master
> > > - its PR was merged or closed (which surprisingly are not so many)
> > > - it's old (last updated in 2015 or before?)
> > >
> > > and the rest of them can be deleted after more examination (if need
> be).
> > >
> > >
> > > On Mon, Dec 4, 2017 at 6:37 AM, Rafael Weingärtner <
> > > rafaelweingart...@gmail.com> wrote:
> > >
> > > > I thought someo

Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Marc-Aurèle Brothier
Sorry about the confusion. It's not going to replace the DB transactions at
the DAO level. Today we can say that there are two types of locks in CS: either
a pure transactional one, with the select for update which locks a row against
any operation by other threads, or a more programmatic one, with the op_lock
table holding entries for a pure locking mechanism used by the Merovingian
class. Zookeeper could be used to replace the latter, but wouldn't be a
good candidate for the former.

To give a more precise example of the replacement: it could be used to replace
the lock on VM operations, when only one operation at a time must be
performed on a VM. It should not be used to replace locks in DAOs which
lock a VO entry to update some of its fields.

Rafael, does that clarify your thoughts and concerns about transactions and
connections?
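
As an illustration of the idea (a rough sketch only, with a made-up connection string and lock path, not code from the branch mentioned above), acquiring such a per-VM lock with Curator's InterProcessSemaphoreMutex looks roughly like this:

import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkVmLockExample {
    public static void main(String[] args) throws Exception {
        // One client per JVM, created at server startup and reused for every lock request.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // One znode path per VM serialises operations on that VM across all management servers.
        InterProcessSemaphoreMutex lock =
                new InterProcessSemaphoreMutex(client, "/cloudstack/locks/vm/" + "i-2-42-VM");
        if (lock.acquire(30, TimeUnit.SECONDS)) {
            try {
                // perform the VM operation here
            } finally {
                lock.release();
            }
        } else {
            // could not get the lock in time: fail or reschedule the job
        }
        client.close();
    }
}

Because the lock lives in ZooKeeper rather than in the database, it is held across all management server JVMs, which is what the op_lock-style usage needs.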

On Mon, Dec 18, 2017 at 1:10 PM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> So, we would need to change every piece of code that opens and uses
> connections and transactions to change to ZK model? I mean, to direct the
> flow to ZK.
>
> On Mon, Dec 18, 2017 at 8:55 AM, Marc-Aurèle Brothier 
> wrote:
>
> > I understand your point, but there isn't any "transaction" in ZK. The
> > transaction and commit stuff are really for DB and not part of ZK. All
> > entries (if you start writing data in some nodes) are versioned. For
> > example you could enforce that to overwrite a node value you must submit
> > the node data having the same last version id to ensure you were
> > overwriting from the latest value/state of that node. Bear in mind that
> you
> > should not put too much data into your ZK, it's not a database
> replacement,
> > neither a nosql db.
> >
> > The ZK client (CuratorFramework object) is started on the server startup,
> > and you only need to pass it along your calls so that the connection is
> > reused, or retried, depending on the state. Nothing manual has to be
> done,
> > it's all in this curator library.
> >
> > On Mon, Dec 18, 2017 at 11:44 AM, Rafael Weingärtner <
> > rafaelweingart...@gmail.com> wrote:
> >
> > > I did not check the link before. Sorry about that.
> > >
> > > Reading some of the pages there, I see curator more like a client
> library
> > > such as MySQL JDBC client.
> > >
> > > When I mentioned framework, I was looking for something like
> Spring-data.
> > > So, we could simply rely on the framework to manage connections and
> > > transactions. For instance, we could define a pattern that would open
> > > connection with a read-only transaction. And then, we could annotate
> > > methods that would write in the database something with
> > > @Transactional(readonly = false). If we are going to a change like this
> > we
> > > need to remove manually open connections and transactions. Also, we
> have
> > to
> > > remove the transaction management code from our code base.
> > >
> > > I would like to see something like this [1] in our future. No manually
> > > written transaction code, and no transaction management in our code
> base.
> > > Just simple annotation usage or transaction pattern in Spring XML
> files.
> > >
> > > [1]
> > > https://github.com/rafaelweingartner/daily-tasks/
> > > blob/master/src/main/java/br/com/supero/desafio/services/
> > TaskService.java
> > >
> > > On Mon, Dec 18, 2017 at 8:32 AM, Marc-Aurèle Brothier <
> ma...@exoscale.ch
> > >
> > > wrote:
> > >
> > > > @rafael, yes there is a framework (curator), it's the link I posted
> in
> > my
> > > > first message: https://curator.apache.org/
> curator-recipes/shared-lock.
> > > html
> > > > This framework helps handling all the complexity of ZK.
> > > >
> > > > The ZK client stays connected all the time (as the DB connection
> pool),
> > > and
> > > > only one connection (ZKClient) is needed to communicate with the ZK
> > > server.
> > > > The framework handles reconnection as well.
> > > >
> > > > Have a look at ehc curator website to understand its goal:
> > > > https://curator.apache.org/
> > > >
> > > > On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
> > > > rafaelweingart...@gmail.com> wrote:
> > > >
> > > > > Do we have framework to do this kind of looking in ZK?
> > > > > I mean, you said " create a new InterProcessSemaphoreMutex which
> > > handles
> > > > > the locking mechanism.". This feels that we would have to continue
> > > > opening
> > > > > and closing this transaction manually, which is what causes a lot
> of
> > > our
> > > > > headaches with transactions (it is not MySQL locks fault entirely,
> > but
> > > > our
> > > > > code structure).
> > > > >
> > > > > On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothier <
> > > ma...@exoscale.ch
> > > > >
> > > > > wrote:
> > > > >
> > > > > > We added ZK lock for fix this issue but we will remove all
> current
> > > > locks
> > > > > in
> > > > > > ZK in favor of ZK one. The ZK lock is already encapsulated in a
> > > project
> > > > > > with an interface, but more work should be done to have a proper
> > > 

Re: [Discuss] Management cluster / Zookeeper holding locks

2017-12-18 Thread Rafael Weingärtner
Now, yes! Thanks for the clarification.

On Mon, Dec 18, 2017 at 11:16 AM, Marc-Aurèle Brothier 
wrote:

> Sorry about the confusion. It's not going to replace the DB transactions in
> the DAO way. Today we can say that there are 2 types of locks in CS, either
> a pure transaction one, with the select for update which locks a row for
> any operation by other threads, or a more programmatic one with the op_lock
> table holding entries for pure locking mechanism used by the Merovigian
> class. Zookeeper could be used to replace the latter, and wouldn't be a
> good candidate for the other one.
>
> To give more precise example of the replacement, it could be use to replace
> the lock on VM operations, when only one opertion at a time must be
> performed on a VM. It should not be used to replace locks in DAOs which
> lock a VO entry to update some of its field.
>
> Rafael,does that clarifies you thoughts and concerns about transactions,
> connections ?
>
> On Mon, Dec 18, 2017 at 1:10 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > So, we would need to change every piece of code that opens and uses
> > connections and transactions to change to ZK model? I mean, to direct the
> > flow to ZK.
> >
> > On Mon, Dec 18, 2017 at 8:55 AM, Marc-Aurèle Brothier  >
> > wrote:
> >
> > > I understand your point, but there isn't any "transaction" in ZK. The
> > > transaction and commit stuff are really for DB and not part of ZK. All
> > > entries (if you start writing data in some nodes) are versioned. For
> > > example you could enforce that to overwrite a node value you must
> submit
> > > the node data having the same last version id to ensure you were
> > > overwriting from the latest value/state of that node. Bear in mind that
> > you
> > > should not put too much data into your ZK, it's not a database
> > replacement,
> > > neither a nosql db.
> > >
> > > The ZK client (CuratorFramework object) is started on the server
> startup,
> > > and you only need to pass it along your calls so that the connection is
> > > reused, or retried, depending on the state. Nothing manual has to be
> > done,
> > > it's all in this curator library.
> > >
> > > On Mon, Dec 18, 2017 at 11:44 AM, Rafael Weingärtner <
> > > rafaelweingart...@gmail.com> wrote:
> > >
> > > > I did not check the link before. Sorry about that.
> > > >
> > > > Reading some of the pages there, I see curator more like a client
> > library
> > > > such as MySQL JDBC client.
> > > >
> > > > When I mentioned framework, I was looking for something like
> > Spring-data.
> > > > So, we could simply rely on the framework to manage connections and
> > > > transactions. For instance, we could define a pattern that would open
> > > > connection with a read-only transaction. And then, we could annotate
> > > > methods that would write in the database something with
> > > > @Transactional(readonly = false). If we are going to a change like
> this
> > > we
> > > > need to remove manually open connections and transactions. Also, we
> > have
> > > to
> > > > remove the transaction management code from our code base.
> > > >
> > > > I would like to see something like this [1] in our future. No
> manually
> > > > written transaction code, and no transaction management in our code
> > base.
> > > > Just simple annotation usage or transaction pattern in Spring XML
> > files.
> > > >
> > > > [1]
> > > > https://github.com/rafaelweingartner/daily-tasks/
> > > > blob/master/src/main/java/br/com/supero/desafio/services/
> > > TaskService.java
> > > >
> > > > On Mon, Dec 18, 2017 at 8:32 AM, Marc-Aurèle Brothier <
> > ma...@exoscale.ch
> > > >
> > > > wrote:
> > > >
> > > > > @rafael, yes there is a framework (curator), it's the link I posted
> > in
> > > my
> > > > > first message: https://curator.apache.org/
> > curator-recipes/shared-lock.
> > > > html
> > > > > This framework helps handling all the complexity of ZK.
> > > > >
> > > > > The ZK client stays connected all the time (as the DB connection
> > pool),
> > > > and
> > > > > only one connection (ZKClient) is needed to communicate with the ZK
> > > > server.
> > > > > The framework handles reconnection as well.
> > > > >
> > > > > Have a look at ehc curator website to understand its goal:
> > > > > https://curator.apache.org/
> > > > >
> > > > > On Mon, Dec 18, 2017 at 11:01 AM, Rafael Weingärtner <
> > > > > rafaelweingart...@gmail.com> wrote:
> > > > >
> > > > > > Do we have framework to do this kind of looking in ZK?
> > > > > > I mean, you said " create a new InterProcessSemaphoreMutex which
> > > > handles
> > > > > > the locking mechanism.". This feels that we would have to
> continue
> > > > > opening
> > > > > > and closing this transaction manually, which is what causes a lot
> > of
> > > > our
> > > > > > headaches with transactions (it is not MySQL locks fault
> entirely,
> > > but
> > > > > our
> > > > > > code structure).
> > > > > >
> > > > > > On Mon, Dec 18, 2017 at 7:47 AM, Marc-Aurèle Brothi

Re: Clean up old and obsolete branches

2017-12-18 Thread Marc-Aurèle Brothier
+1 for me

On point 5, since people can work together on forks, I
would simply state that no branches other than the official ones can be
in the project repository, removing: "If one uses the official repository,
the branch used must be cleaned right after merging;"

On Mon, Dec 18, 2017 at 2:05 PM, Rafael Weingärtner <
rafaelweingart...@gmail.com> wrote:

> Guys, this is the moment to give your opinion here. Since nobody has
> commented anything on the protocol. I will just add some more steps before
> deletion.
>
>1. Only maintain the master and major release branches. We currently
>have a system of X.Y.Z.S. I define major release here as a release that
>changes either ((X or Y) or (X and Y));
>2. We will use tags for versioning. Therefore, all versions we release
>are tagged accordingly, including minor and security releases;
>3. When releasing the “SNAPSHOT” is removed and the branch of the
>version is created (if the version is being cut from master). Rule (1)
> one
>is applied here; therefore, only major releases will receive branches.
>Every release must have a tag in the format X.Y.Z.S. After releasing, we
>bump the pom of the version to next available SNAPSHOT;
>4. If there's a need to fix an old version, we work on HEAD of
>corresponding release branch;
>5. People should avoid using the official apache repository to store
>working branches. If we want to work together on some issues, we can
> set up
>a fork and give permission to interested parties (the official
> repository
>is restricted to committers). If one uses the official repository, the
>branch used must be cleaned right after merging;
>6. Branches not following these rules will be removed if they have not
>received attention (commits) for over 6 (six) months;
>7. Before the removal of a branch in the official repository it is
>mandatory to create a Jira ticket and send a notification email to
>CloudStack’s dev mailing list. If there are no objections, the branch
> can
>be deleted seven (7) business days after the notification email is sent;
>8. After the branch removal, the Jira ticket must be closed.
>
>
>  I will wait more two days. If we do not get comments here anymore, I will
> call for a vote, and then if there are no objections I will write the
> protocol on our Wiki. Afterwards, we can start removing branches (following
> the defined protocol).
>
> On Thu, Dec 14, 2017 at 5:08 PM, Daan Hoogland 
> wrote:
>
> > sounds lime a lazy consensus vote; +1 from me
> >
> > On Thu, Dec 14, 2017 at 7:07 PM, Rafael Weingärtner <
> > rafaelweingart...@gmail.com> wrote:
> >
> > > Guys,
> > >
> > > Khosrow has done a great job here, but we still need to move this
> forward
> > > and create a standard/guidelines on how to use the official repo.
> Looking
> > > at the list in [1] we can clearly see that things are messy.  This is a
> > > minor discussion, but in my opinion, we should finish it.
> > >
> > > [1] https://issues.apache.org/jira/browse/CLOUDSTACK-10169
> > >
> > > I will propose the following regarding the use of the official
> > repository.
> > > I will be waiting for your feedback, and then proceed with a vote.
> > >
> > >1. Only maintain the master and major release branches. We currently
> > >have a system of X.Y.Z.S. I define major release here as a release
> > that
> > >changes either ((X or Y) or (X and Y));
> > >2. We will use tags for versioning. Therefore, all versions we
> release
> > >are tagged accordingly, including minor and security releases;
> > >3. When releasing the “SNAPSHOT” is removed and the branch of the
> > >version is created (if the version is being cut from master). Rule
> (1)
> > > one
> > >is applied here; therefore, only major releases will receive
> branches.
> > >Every release must have a tag in the format X.Y.Z.S. After
> releasing,
> > we
> > >bump the pom of the version to next available SNAPSHOT;
> > >4. If there's a need to fix an old version, we work on HEAD of
> > >corresponding release branch;
> > >5. People should avoid using the official apache repository to store
> > >working branches. If we want to work together on some issues, we can
> > > set up
> > >a fork and give permission to interested parties (the official
> > > repository
> > >is restricted to committers). If one uses the official repository,
> the
> > >branch used must be cleaned right after merging;
> > >6. Branches not following these rules will be removed if they have
> not
> > >received attention (commits) for over 6 (six) months.
> > >
> > > I think that is all. Do you guys have additions/removals/proposals so
> we
> > > can move this forward?
> > >
> > > On Mon, Dec 4, 2017 at 7:20 PM, Khosrow Moossavi <
> kmooss...@cloudops.com
> > >
> > > wrote:
> > >
> > > > I agree Erik. I updated the list in CLOUDSTACK-10169 with more
> > > informati

Re: Clean up old and obsolete branches

2017-12-18 Thread Daan Hoogland
Any workable procedure (including yours, Rafael) will do, but let's be
extremely patient and lenient. I think we can start by deleting a lot of old
branches (RC branches and merged PRs to start with).

On Mon, Dec 18, 2017 at 2:23 PM, Marc-Aurèle Brothier 
wrote:

> +1 for me
>
> On the point 5, since you can have people working together on forks, I
> would simply state that no other branches except the official ones can be
> in the project repository, removing: "If one uses the official repository,
> the branch used must be cleaned right after merging;"
>
> On Mon, Dec 18, 2017 at 2:05 PM, Rafael Weingärtner <
> rafaelweingart...@gmail.com> wrote:
>
> > Guys, this is the moment to give your opinion here. Since nobody has
> > commented anything on the protocol. I will just add some more steps
> before
> > deletion.
> >
> >1. Only maintain the master and major release branches. We currently
> >have a system of X.Y.Z.S. I define major release here as a release
> that
> >changes either ((X or Y) or (X and Y));
> >2. We will use tags for versioning. Therefore, all versions we release
> >are tagged accordingly, including minor and security releases;
> >3. When releasing the “SNAPSHOT” is removed and the branch of the
> >version is created (if the version is being cut from master). Rule (1)
> > one
> >is applied here; therefore, only major releases will receive branches.
> >Every release must have a tag in the format X.Y.Z.S. After releasing,
> we
> >bump the pom of the version to next available SNAPSHOT;
> >4. If there's a need to fix an old version, we work on HEAD of
> >corresponding release branch;
> >5. People should avoid using the official apache repository to store
> >working branches. If we want to work together on some issues, we can
> > set up
> >a fork and give permission to interested parties (the official
> > repository
> >is restricted to committers). If one uses the official repository, the
> >branch used must be cleaned right after merging;
> >6. Branches not following these rules will be removed if they have not
> >received attention (commits) for over 6 (six) months;
> >7. Before the removal of a branch in the official repository it is
> >mandatory to create a Jira ticket and send a notification email to
> >CloudStack’s dev mailing list. If there are no objections, the branch
> > can
> >be deleted seven (7) business days after the notification email is
> sent;
> >8. After the branch removal, the Jira ticket must be closed.
> >
> >
> >  I will wait more two days. If we do not get comments here anymore, I
> will
> > call for a vote, and then if there are no objections I will write the
> > protocol on our Wiki. Afterwards, we can start removing branches
> (following
> > the defined protocol).
> >
> > On Thu, Dec 14, 2017 at 5:08 PM, Daan Hoogland 
> > wrote:
> >
> > > sounds lime a lazy consensus vote; +1 from me
> > >
> > > On Thu, Dec 14, 2017 at 7:07 PM, Rafael Weingärtner <
> > > rafaelweingart...@gmail.com> wrote:
> > >
> > > > Guys,
> > > >
> > > > Khosrow has done a great job here, but we still need to move this
> > forward
> > > > and create a standard/guidelines on how to use the official repo.
> > Looking
> > > > at the list in [1] we can clearly see that things are messy.  This
> is a
> > > > minor discussion, but in my opinion, we should finish it.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/CLOUDSTACK-10169
> > > >
> > > > I will propose the following regarding the use of the official
> > > repository.
> > > > I will be waiting for your feedback, and then proceed with a vote.
> > > >
> > > >1. Only maintain the master and major release branches. We
> currently
> > > >have a system of X.Y.Z.S. I define major release here as a release
> > > that
> > > >changes either ((X or Y) or (X and Y));
> > > >2. We will use tags for versioning. Therefore, all versions we
> > release
> > > >are tagged accordingly, including minor and security releases;
> > > >3. When releasing the “SNAPSHOT” is removed and the branch of the
> > > >version is created (if the version is being cut from master). Rule
> > (1)
> > > > one
> > > >is applied here; therefore, only major releases will receive
> > branches.
> > > >Every release must have a tag in the format X.Y.Z.S. After
> > releasing,
> > > we
> > > >bump the pom of the version to next available SNAPSHOT;
> > > >4. If there's a need to fix an old version, we work on HEAD of
> > > >corresponding release branch;
> > > >5. People should avoid using the official apache repository to
> store
> > > >working branches. If we want to work together on some issues, we
> can
> > > > set up
> > > >a fork and give permission to interested parties (the official
> > > > repository
> > > >is restricted to committers). If one uses the official repository,
> > the
> > > >branch used must be cleaned right 

[DISCUSS] Management server (pre-)shutdown to avoid killing jobs

2017-12-18 Thread Marc-Aurèle Brothier
Hi everyone,

Another point, another thread. Currently, when shutting down a management
server, despite all the "stop()" methods not being called as far as I know,
the server could be in the middle of processing an async job task. This will
lead to a failed job, since the response won't be delivered to the correct
management server even though the job might have succeeded on the agent. To
overcome this limitation, driven by our weekly production upgrades, we added a
pre-shutdown mechanism which works alongside HA-proxy. The management
server keeps an eye on a file "lb-agent" in which some keywords can be
written, following the HA-proxy agent-check documentation (
https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check).
When it finds "maint", "stopped" or "drain", it stops these threads:
 - AsyncJobManager._heartbeatScheduler: responsible for fetching and starting
the execution of AsyncJobs
 - AlertManagerImpl._timer: responsible for sending capacity check commands
 - StatsCollector._executor: responsible for scheduling stats commands

Then the management server stops most of its scheduled tasks. The correct
thing to do before shutting down the server would be to send
"rebalance/reconnect" commands to all agents connected to that management
server, to ensure that commands won't go through this server at all.

Here, HA-proxy is responsible for no longer routing API requests to the
corresponding server, with the help of this local agent check.

In case you want to cancel the maintenance shutdown, you could write
"up/ready" in the file and the different schedulers will be restarted.

This is really more an operational change around CS, for people doing live
upgrades on a regular basis, so I'm unsure whether the community would want such
a change in the code base. It also goes a bit in the opposite direction of the
change removing the need for HA-proxy:
https://github.com/apache/cloudstack/pull/2309

If there is enough positive feedback for such a change, I will port it to
match the upstream branch in a PR.

Kind regards,
Marc-Aurèle


Re: Clean up old and obsolete branches

2017-12-18 Thread Rafael Weingärtner
@Marc, I like this idea. However, some folks believe it might be useful to
use the official repo to work in groups (group of committers). I did not
want to push this without a broader discussion; that is why I am proposing
that people can use the official repository, as long as they remove the
branch after the merge. If they do not remove the branch right after
merging, according to the set of rules I wrote, we would be able to remove
it sic (6) months after the work is done (after following the other
procedures to remove a branch). Therefore, we do not have the risk of
someone deleting things that should not be deleted.

@Daan, that is the idea. A protocol is good to make the rules clear to
everybody; and as we state there, one cannot delete branches right away.
There are certain criteria that have to be met, and notice has to be given
on the dev mailing list.

On Mon, Dec 18, 2017 at 11:39 AM, Daan Hoogland 
wrote:

> any workable procedure (including yours, Rafael) will do but let's be
> extremely patient and lenient. I think we can start deleting a lot of old
> branches (RC-branches and merged PRs to start with)
>
> On Mon, Dec 18, 2017 at 2:23 PM, Marc-Aurèle Brothier 
> wrote:
>
> > +1 for me
> >
> > On the point 5, since you can have people working together on forks, I
> > would simply state that no other branches except the official ones can be
> > in the project repository, removing: "If one uses the official
> repository,
> > the branch used must be cleaned right after merging;"
> >
> > On Mon, Dec 18, 2017 at 2:05 PM, Rafael Weingärtner <
> > rafaelweingart...@gmail.com> wrote:
> >
> > > Guys, this is the moment to give your opinion here. Since nobody has
> > > commented anything on the protocol. I will just add some more steps
> > before
> > > deletion.
> > >
> > >1. Only maintain the master and major release branches. We currently
> > >have a system of X.Y.Z.S. I define major release here as a release
> > that
> > >changes either ((X or Y) or (X and Y));
> > >2. We will use tags for versioning. Therefore, all versions we
> release
> > >are tagged accordingly, including minor and security releases;
> > >3. When releasing the “SNAPSHOT” is removed and the branch of the
> > >version is created (if the version is being cut from master). Rule
> (1)
> > > one
> > >is applied here; therefore, only major releases will receive
> branches.
> > >Every release must have a tag in the format X.Y.Z.S. After
> releasing,
> > we
> > >bump the pom of the version to next available SNAPSHOT;
> > >4. If there's a need to fix an old version, we work on HEAD of
> > >corresponding release branch;
> > >5. People should avoid using the official apache repository to store
> > >working branches. If we want to work together on some issues, we can
> > > set up
> > >a fork and give permission to interested parties (the official
> > > repository
> > >is restricted to committers). If one uses the official repository,
> the
> > >branch used must be cleaned right after merging;
> > >6. Branches not following these rules will be removed if they have
> not
> > >received attention (commits) for over 6 (six) months;
> > >7. Before the removal of a branch in the official repository it is
> > >mandatory to create a Jira ticket and send a notification email to
> > >CloudStack’s dev mailing list. If there are no objections, the
> branch
> > > can
> > >be deleted seven (7) business days after the notification email is
> > sent;
> > >8. After the branch removal, the Jira ticket must be closed.
> > >
> > >
> > >  I will wait more two days. If we do not get comments here anymore, I
> > will
> > > call for a vote, and then if there are no objections I will write the
> > > protocol on our Wiki. Afterwards, we can start removing branches
> > (following
> > > the defined protocol).
> > >
> > > On Thu, Dec 14, 2017 at 5:08 PM, Daan Hoogland <
> daan.hoogl...@gmail.com>
> > > wrote:
> > >
> > > > sounds lime a lazy consensus vote; +1 from me
> > > >
> > > > On Thu, Dec 14, 2017 at 7:07 PM, Rafael Weingärtner <
> > > > rafaelweingart...@gmail.com> wrote:
> > > >
> > > > > Guys,
> > > > >
> > > > > Khosrow has done a great job here, but we still need to move this
> > > forward
> > > > > and create a standard/guidelines on how to use the official repo.
> > > Looking
> > > > > at the list in [1] we can clearly see that things are messy.  This
> > is a
> > > > > minor discussion, but in my opinion, we should finish it.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/CLOUDSTACK-10169
> > > > >
> > > > > I will propose the following regarding the use of the official
> > > > repository.
> > > > > I will be waiting for your feedback, and then proceed with a vote.
> > > > >
> > > > >1. Only maintain the master and major release branches. We
> > currently
> > > > >have a system of X.Y.Z.S. I define major release here as a
> release
> > > > 

RE: Master Blockers and Criticals

2017-12-18 Thread Paul Angus
Hi All, here is an updated summary of the open Critical and Blocker Issues in 
Jira.
If you are working on any of these issues, please confirm whether you believe
that you will have the issue closed by 8th Jan.

@Jayapal Reddy please respond to the pings on the subject of the blocker that 
you have raised and are working on.

- CLOUDSTACK-9885 (Blocker): VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER
  Assignee: Jayapal Reddy, Reporter: Jayapal Reddy
- CLOUDSTACK-10127 (Critical): 4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?
  Assignee: Frank Maximus, Reporter: Sven Vogel
- CLOUDSTACK-9892 (Critical): Primary storage resource check is broken when using root disk size override to deploy VM
  Assignee: Koushik Das, Reporter: Koushik Das
- CLOUDSTACK-9862 (Critical): list template with id= no longer work as domain admin
  Assignee: Unassigned, Reporter: Pierre-Luc Dion
- CLOUDSTACK-10128 (Critical): Template from snapshot not merging vhd files
  Assignee: Rafael Weingärtner, Reporter: Marcelo Lima
- CLOUDSTACK-9964 (Critical): Snapahots are getting deleted if VM is assigned to another user
  Assignee: Pavan Kumar Aravapalli, Reporter: Pavan Kumar Aravapalli
- CLOUDSTACK-9855 (Critical): VPC RVR when master guest interface is down backup VR is not switched to Master
  Assignee: Jayapal Reddy, Reporter: Jayapal Reddy
- CLOUDSTACK-9837 (Critical): Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy service on the device is not being started properly
  Assignee: Unassigned, Reporter: Mani Prashanth Varma Manthena

project = CLOUDSTACK AND issuetype = Bug AND status in (Open, "In Progress", 
Reopened) AND priority in (Blocker, Critical) AND affectedVersion in (4.10.0.0, 
4.10.1.0, 4.11.0.0, Future) ORDER BY priority DESC, updated DESC



Kind regards,

Paul Angus

paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-Original Message-
From: Paul Angus [mailto:paul.an...@shapeblue.com] 
Sent: 11 December 2017 11:00
To: dev@cloudstack.apache.org
Cc: Boris Stoyanov ; Rohit Yadav 

Subject: Master Blockers and Criticals

Hi All,
Please find a summary of the open critical and blocker bugs in 4.11. If you know 
of one which is missing, please update Jira accordingly. I will chase Assignees 
individually (via ML) to get status updates...


- CLOUDSTACK-9885 (Blocker): VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER
  Assignee: Jayapal Reddy, Reporter: Jayapal Reddy
- CLOUDSTACK-10164 (Blocker): UI - not able to create a VPC
  Assignee: Sigert Goeminne, Reporter: Sigert Goeminne
- CLOUDSTACK-10127 (Critical): 4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?
  Assignee: Frank Maximus, Reporter: Sven Vogel
- CLOUDSTACK-9892 (Critical): Primary storage resource check is broken when using root disk size override to deploy VM
  Assignee: Koushik Das, Reporter: Koushik Das
- CLOUDSTACK-10140 (Critical): When template is created from snapshot template.properties are corrupted
  Assignee: Unassigned, Reporter: Ivan Kudryavtsev
- CLOUDSTACK-10128 (Critical): Template from snapshot not merging vhd files
  Assignee: Unassigned, Reporter: Marcelo Lima
- CLOUDSTACK-9964 (Critical): Snapahots are getting deleted if VM is assigned to another user
  Assignee: Pavan Kumar Aravapalli, Reporter: Pavan Kumar Aravapalli
- CLOUDSTACK-9855 (Critical): VPC RVR when master guest interface is down backup VR is not switched to Master
  Assignee: Jayapal Reddy, Reporter: Jayapal Reddy
- CLOUDSTACK-9862 (Critical): list template with id= no longer work as domain admin
  Assignee: Unassigned, Reporter: Pierre-Luc Dion
- CLOUDSTACK-9837: Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy servi

RE: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

2017-12-18 Thread Paul Angus
Hi Marc-Aurèle,

Personally, my utopia would be to be able to pass async jobs between mgmt. 
servers.
So rather than waiting an indeterminate time for a snapshot to complete, 
monitoring of the job is passed to another management server. 

I would LOVE it if something like Zookeeper monitored the state of the mgmt. 
servers, so that 'other' management servers could take over the async jobs in 
the (unlikely) event that a management server becomes unavailable.



Kind regards,

Paul Angus

paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-Original Message-
From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch] 
Sent: 18 December 2017 13:56
To: dev@cloudstack.apache.org
Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

Hi everyone,

Another point, another thread. Currently when shutting down a management 
server, despite all the "stop()" method not being called as far as I know, the 
server could be in the middle of processing an async job task. It will lead to 
a failed job since the response won't be delivered to the correct management 
server even though the job might have succeed on the agent. To overcome this 
limitation due to our weekly production upgrades, we added a pre-shutdown 
mechanism which works along side HA-proxy. The management server keeps a eye 
onto a file "lb-agent" in which some keywords can be written following the HA 
proxy guide ( 
https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check).
When it finds "maint", "stopped" or "drain", it stops those threads:
 - AsyncJobManager._heartbeatScheduler: responsible to fetch and start 
execution of AsyncJobs
 - AlertManagerImpl._timer: responsible to send capacity check commands
 - StatsCollector._executor: responsible to schedule stats command

Then the management server stops most of its scheduled tasks. The correct thing 
to do before shutting down the server would be to send "rebalance/reconnect" 
commands to all agents connected on that management server to ensure that 
commands won't go through this server at all.

Here, HA-proxy is responsible to stop sending API requests to the corresponding 
server with the help of this local agent check.

In case you want to cancel the maintenance shutdown, you could write "up/ready" 
in the file and the different schedulers will be restarted.

This is really more a change for operation around CS for people doing live 
upgrade on a regular basis, so I'm unsure if the community would want such a 
change in the code base. It goes a bit in the opposite direction of the change 
for removing the need of HA-proxy
https://github.com/apache/cloudstack/pull/2309

If there is enough positive feedback for such a change, I will port them to 
match with the upstream branch in a PR.

Kind regards,
Marc-Aurèle


Re: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

2017-12-18 Thread ilya musayev
I very much agree with Paul; we should consider moving to a resilient model
with as little external dependence as possible, i.e. on HA-proxy.

Sending a notification to a partner MS to take over the job management would be
ideal.

On Mon, Dec 18, 2017 at 9:28 AM Paul Angus  wrote:

> Hi Marc-Aurèle,
>
> Personally, my utopia would be to be able to pass async jobs between mgmt.
> servers.
> So rather than waiting in indeterminate time for a snapshot to complete,
> monitoring the job is passed to another management server.
>
> I would LOVE that something like Zookeeper monitored the state of the
> mgmt. servers, so that 'other' management servers could take over the async
> jobs in the (unlikely) event that a management server becomes unavailable.
>
>
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>
> -Original Message-
> From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch]
> Sent: 18 December 2017 13:56
> To: dev@cloudstack.apache.org
> Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs
>
> Hi everyone,
>
> Another point, another thread. Currently when shutting down a management
> server, despite all the "stop()" method not being called as far as I know,
> the server could be in the middle of processing an async job task. It will
> lead to a failed job since the response won't be delivered to the correct
> management server even though the job might have succeed on the agent. To
> overcome this limitation due to our weekly production upgrades, we added a
> pre-shutdown mechanism which works along side HA-proxy. The management
> server keeps a eye onto a file "lb-agent" in which some keywords can be
> written following the HA proxy guide (
> https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-check
> ).
> When it finds "maint", "stopped" or "drain", it stops those threads:
>  - AsyncJobManager._heartbeatScheduler: responsible to fetch and start
> execution of AsyncJobs
>  - AlertManagerImpl._timer: responsible to send capacity check commands
>  - StatsCollector._executor: responsible to schedule stats command
>
> Then the management server stops most of its scheduled tasks. The correct
> thing to do before shutting down the server would be to send
> "rebalance/reconnect" commands to all agents connected on that management
> server to ensure that commands won't go through this server at all.
>
> Here, HA-proxy is responsible to stop sending API requests to the
> corresponding server with the help of this local agent check.
>
> In case you want to cancel the maintenance shutdown, you could write
> "up/ready" in the file and the different schedulers will be restarted.
>
> This is really more a change for operation around CS for people doing live
> upgrade on a regular basis, so I'm unsure if the community would want such
> a change in the code base. It goes a bit in the opposite direction of the
> change for removing the need of HA-proxy
> https://github.com/apache/cloudstack/pull/2309
>
> If there is enough positive feedback for such a change, I will port them
> to match with the upstream branch in a PR.
>
> Kind regards,
> Marc-Aurèle
>


Re: [UPDATE] Debian 9 "stretch" systemvmtemplate for master

2017-12-18 Thread Rohit Yadav
All,

Thanks for your feedback.

We're getting close to completion now. All smoketests are passing on KVM, 
XenServer and VMware. There are, however, a few intermittent failures on VMware 
being looked into. The rVR smoketest failures on VMware have been fixed as 
well.

The systemvmtemplate build has now been migrated to packer, making it easier 
for anyone to build systemvm templates. Overall, VRs are now 2x to 3x faster and 
lighter (disk size reduced by 1.2GB), require no reboot after patching, and ship 
improved systemvm python code; the strongswan-provided vpn/ipsec is more robust, 
as is the rVR functionality on KVM and XenServer, with good support for VMware 
(which still needs further improvements). PR 2211 also aims to stabilize the 
master branch. The outstanding tasks are to improve some tests to avoid 
environment-introduced failures and to update the sql/db upgrade path, which is 
ongoing.

Given the current state, and smoketests passing, I would like to request 
your comments and reviews on pull request 2211: 
https://github.com/apache/cloudstack/pull/2211


Regards.

From: Wido den Hollander 
Sent: Saturday, December 9, 2017 12:32:24 AM
To: dev@cloudstack.apache.org; Rohit Yadav
Cc: us...@cloudstack.apache.org
Subject: Re: [UPDATE] Debian 9 "stretch" systemvmtemplate for master

Awesome work!

More replies below

On 12/08/2017 03:58 PM, Rohit Yadav wrote:
> All,
>
>
> Our effort to move to Debian9 systemvmtemplate seems to be soon coming to 
> conclusion, the following high-level goals have been achieved so far:
>
>
> - Several infra improvements such as faster patching (no reboots on 
> patching), smaller setup/patch scripts and even smaller cloud-early-config, 
> old file cleanups and directory/filesystem refactorings
>
> - Tested and boots/runs on KVM, VMware, XenServer and HyperV (thanks to Paul 
> for hyperv)
>
> - Boots, patches, runs systemvm/VR in about 10s (tested with KVM/XenServer 
> and NFS+SSDs) with faster console-proxy (cloud) service launch
>
> - Disk size reduced to 2GB from the previous 3+GB with still bigger /var/log 
> partition
>
> - Migration to systemd based cloud-early-config, cloud services etc (thanks 
> Wido!)
>
> - Strongswan provided vpn/ipsec improvements (ports based on work from 
> Will/Syed)
>
> - Several fixes to redundant virtual routers and scripts for VPC (ports from 
> Remi json/gzip PR and additional fixes/improvements to execute update_config 
> faster)
>
> - Packages installation improvements (thanks to Rene for review)
>
> - Several integration test fixes -- all smoke tests passing on KVM and most 
> on XenServer, work on fixing VMware test failures is on-going
>
> - Several UI/UX improvements and systemvm python codebase linting/unit tests 
> added to Travis
>
>
> Here's the pull request:
>
> https://github.com/apache/cloudstack/pull/2211
>
>
> I've temporarily hosted the templates here:
>
> http://hydra.yadav.xyz/debian9/
>
>
> Outstanding tasks/issues:
>
> - Should we skip rVR related tests for VMware, noting a reference to a jira 
> ticket to re-enable them once the feature is supported for VMware?
>
> - Fix intermittent failures for XenServer and test failures on VMware
>
> - Misc issues and items (full checklist available on the PR)
>
> - Review and additional test effort from community
>
>
> After your due review, if we're able to show that the test results are on par 
> with the previous 4.9.2.0/4.9.3.0 smoke test results (i.e. most are passing) 
> on XenServer, KVM, and VMware I would proceed with merging the PR by end of 
> this month. Thoughts, comments?
>

We might want to look/verify that this works:

- Running with VirtIO-SCSI under KVM (allows disk trimming)
- Make sure the Qemu Guest Agent works

If those two things work we can keep the footprint of the SSVM rather small.

Wido
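
A couple of quick, hedged checks for those two items (the device name and system VM name below are placeholders):

    # inside the system VM: confirm the root disk sits on virtio-scsi and advertises TRIM
    lsblk -d -o NAME,HCTL,DISC-GRAN,DISC-MAX /dev/sda
    fstrim -v /

    # from the KVM host: confirm the guest agent channel answers
    virsh qemu-agent-command <systemvm-name> '{"execute":"guest-ping"}'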

>
> Regards.
>
> rohit.ya...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>

rohit.ya...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 



Re: [UPDATE] Debian 9 "stretch" systemvmtemplate for master

2017-12-18 Thread Rohit Yadav
Hi Wido,

Thanks. I've verified that virtio-scsi seems to work for me. The Qemu guest agent 
also works; I was able to write PoC code to get rid of patchviasocket.py as well. 
Can you help review and test the PR?

Regards.
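
For context, the guest-agent-based replacement for patchviasocket.py would look roughly like this (domain name and guest path are placeholders; the actual PoC in the PR may differ):

    # open a file inside the guest through the agent, instead of writing to the
    # host-side serial socket as patchviasocket.py does today
    virsh qemu-agent-command r-123-VM \
      '{"execute":"guest-file-open","arguments":{"path":"/var/cache/cloud/cmdline","mode":"w"}}'
    # ...then guest-file-write with a base64 payload and guest-file-close using the returned handle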

From: Wido den Hollander 
Sent: Saturday, December 9, 2017 12:32:24 AM
To: dev@cloudstack.apache.org; Rohit Yadav
Cc: us...@cloudstack.apache.org
Subject: Re: [UPDATE] Debian 9 "stretch" systemvmtemplate for master

Awesome work!

More replies below

On 12/08/2017 03:58 PM, Rohit Yadav wrote:
> All,
>
>
> Our effort to move to Debian9 systemvmtemplate seems to be soon coming to 
> conclusion, the following high-level goals have been achieved so far:
>
>
> - Several infra improvements such as faster patching (no reboots on 
> patching), smaller setup/patch scripts and even smaller cloud-early-config, 
> old file cleanups and directory/filesystem refactorings
>
> - Tested and boots/runs on KVM, VMware, XenServer and HyperV (thanks to Paul 
> for hyperv)
>
> - Boots, patches, runs systemvm/VR in about 10s (tested with KVM/XenServer 
> and NFS+SSDs) with faster console-proxy (cloud) service launch
>
> - Disk size reduced to 2GB from the previous 3+GB with still bigger /var/log 
> partition
>
> - Migration to systemd based cloud-early-config, cloud services etc (thanks 
> Wido!)
>
> - Strongswan provided vpn/ipsec improvements (ports based on work from 
> Will/Syed)
>
> - Several fixes to redundant virtual routers and scripts for VPC (ports from 
> Remi json/gzip PR and additional fixes/improvements to execute update_config 
> faster)
>
> - Packages installation improvements (thanks to Rene for review)
>
> - Several integration test fixes -- all smoke tests passing on KVM and most 
> on XenServer, work on fixing VMware test failures is on-going
>
> - Several UI/UX improvements and systemvm python codebase linting/unit tests 
> added to Travis
>
>
> Here's the pull request:
>
> https://github.com/apache/cloudstack/pull/2211
>
>
> I've temporarily hosted the templates here:
>
> http://hydra.yadav.xyz/debian9/
>
>
> Outstanding tasks/issues:
>
> - Should we skip rVR related tests for VMware, noting a reference to a jira 
> ticket to re-enable them once the feature is supported for VMware?
>
> - Fix intermittent failures for XenServer and test failures on VMware
>
> - Misc issues and items (full checklist available on the PR)
>
> - Review and additional test effort from community
>
>
> After your due review, if we're able to show that the test results are on par 
> with the previous 4.9.2.0/4.9.3.0 smoke test results (i.e. most are passing) 
> on XenServer, KVM, and VMware I would proceed with merging the PR by end of 
> this month. Thoughts, comments?
>

We might want to look/verify that this works:

- Running with VirtIO-SCSI under KVM (allows disk trimming)
- Make sure the Qemu Guest Agent works

If those two things work we can keep the footprint of the SSVM rather small.

Wido

>
> Regards.
>
> rohit.ya...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>

rohit.ya...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 



Re: XenServer 7.1 and 7.2

2017-12-18 Thread Rohit Yadav
Thanks Paul, the PR has been merged after reviewing and based on smoketests. 
Would you also like to add support for XenServer 7.3?

Regards.

From: Paul Angus 
Sent: Wednesday, December 13, 2017 11:39:28 PM
To: dev
Cc: Syed Ahmed; Pierre-Luc Dion
Subject: XenServer 7.1 and 7.2

Hi All,

I’ve raised a PR to add XenServer 7.1 and 7.2 to CloudStack’s hypervisor list 
and to add new OS mappings.
I’ve also left deprecated OSes out of the new mappings (if that makes sense).

https://github.com/apache/cloudstack/pull/2346
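
For readers unfamiliar with the mappings, they boil down to rows in the guest_os_hypervisor table; an illustrative (made up) row might look like the following, while the real statements live in the PR's upgrade SQL:

    INSERT INTO cloud.guest_os_hypervisor
        (uuid, hypervisor_type, hypervisor_version, guest_os_name, guest_os_id, created)
    VALUES
        (UUID(), 'Xenserver', '7.1.0', 'Ubuntu Xenial Xerus 16.04', 255, now());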

Tests pass with PR rebased to master as of a couple of days ago.  Only ‘usual’ 
failures being seen in smoke tests.

Can we get some review-love going please… 😊



Kind regards,

Paul Angus


paul.an...@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue




rohit.ya...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 



Re: XenServer 7.1 and 7.2

2017-12-18 Thread Khosrow Moossavi
Apparently XenServer "xen-tools" has been renamed from version 7.0 onward
to "guest-tools".
https://docs.citrix.com/content/dam/docs/en-us/xenserver/xenserver-7-0/downloads/xenserver-7-0-quick-start-guide.pdf
(Section 4.2, point 3)

And this comment:
https://issues.apache.org/jira/browse/CLOUDSTACK-9839?focusedCommentId=16039096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16039096

I'm gonna open a PR today.



On Mon, Dec 18, 2017 at 2:09 PM, Rohit Yadav 
wrote:

> Thanks Paul, the PR has been merged after reviewing and based on
> smoketests. Would you also like to add support for XenServer 7.3?
>
> Regards.
> 
> From: Paul Angus 
> Sent: Wednesday, December 13, 2017 11:39:28 PM
> To: dev
> Cc: Syed Ahmed; Pierre-Luc Dion
> Subject: XenServer 7.1 and 7.2
>
> Hi All,
>
> I’ve raised a PR to add XenServer 7.1 and 7.2 to CloudStack’s hypervisor
> list and to add new OS mappings.
> I’ve also left deprecated OSes out of the new mappings (if that makes
> sense).
>
> https://github.com/apache/cloudstack/pull/2346
>
> Tests pass with PR rebased to master as of a couple of days ago.  Only
> ‘usual’ failures being seen in smoke tests.
>
> Can we get some review-love going please… 😊
>
>
>
> Kind regards,
>
> Paul Angus
>
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>
> rohit.ya...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>


Bug in ViewResponseHelper.java of 4627fb2

2017-12-18 Thread Tutkowski, Mike
Hi,

I noticed an issue today with a fairly recent commit: 4627fb2.

In ViewResponseHelper.java, a NullPointerException can be thrown when 
interacting with a data disk on VMware because the disk chain value in 
cloud.volumes can have a value of NULL.

I can put in a check for NULL and avoid the NullPointerException, but perhaps 
someone knows the history of why this particular field is used in this case and 
can fill me in.

Thanks!
Mike


Re: Master Blockers and Criticals

2017-12-18 Thread Khosrow Moossavi
@Paul you can assign CLOUDSTACK-9862 to me, we already have it fixed in our
own fork.



On Mon, Dec 18, 2017 at 12:05 PM, Paul Angus 
wrote:

> Hi All, here is an updated summary of the open Critical and Blocker Issues
> in Jira.
> If you are working on any of these issues, please indicate whether you believe
> that you will have this issue closed by 8th Jan.
>
> @Jayapal Reddy please respond to the pings on the subject of the blocker
> that you have raised and are working on.
>
> Key               Priority   Summary  [Assignee / Reporter]
> CLOUDSTACK-9885   Blocker    VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-10127  Critical   4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?  [Frank Maximus / Sven Vogel]
> CLOUDSTACK-9892   Critical   Primary storage resource check is broken when using root disk size override to deploy VM  [Koushik Das / Koushik Das]
> CLOUDSTACK-9862   Critical   list template with id= no longer work as domain admin  [Unassigned / Pierre-Luc Dion]
> CLOUDSTACK-10128  Critical   Template from snapshot not merging vhd files  [Rafael Weingärtner / Marcelo Lima]
> CLOUDSTACK-9964   Critical   Snapahots are getting deleted if VM is assigned to another user  [Pavan Kumar Aravapalli / Pavan Kumar Aravapalli]
> CLOUDSTACK-9855   Critical   VPC RVR when master guest interface is down backup VR is not switched to Master  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-9837   Critical   Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy service on the device is not being started properly  [Unassigned / Mani Prashanth Varma Manthena]
>
> project = CLOUDSTACK AND issuetype = Bug AND status in (Open, "In
> Progress", Reopened) AND priority in (Blocker, Critical) AND
> affectedVersion in (4.10.0.0, 4.10.1.0, 4.11.0.0, Future) ORDER BY priority
> DESC, updated DESC
>
>
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> @shapeblue
>
>
>
>
> -Original Message-
> From: Paul Angus [mailto:paul.an...@shapeblue.com]
> Sent: 11 December 2017 11:00
> To: dev@cloudstack.apache.org
> Cc: Boris Stoyanov ; Rohit Yadav <
> rohit.ya...@shapeblue.com>
> Subject: Master Blockers and Criticals
>
> Hi All,
> Please find a summary - of open critical and blocker bugs in 4.11 If you
> know of one which is missing please update Jira accordingly.  I will chase
> Assignees individually (via ML) to get status updates...
>
>
> Key               Priority   Summary  [Assignee / Reporter]
> CLOUDSTACK-9885   Blocker    VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-10164  Blocker    UI - not able to create a VPC  [Sigert Goeminne / Sigert Goeminne]
> CLOUDSTACK-10127  Critical   4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?  [Frank Maximus / Sven Vogel]
> CLOUDSTACK-9892   Critical   Primary storage resource check is broken when using root disk size override to deploy VM  [Koushik Das / Koushik Das]
> CLOUDSTACK-10140  Critical   When template is created from snapshot template.properties are corrupted  [Unassigned / Ivan Kudryavtsev]
> CLOUDSTACK-10128  Critical   Template from snapshot not merging vhd files  [Unassigned / Marcelo Lima]
> CLOUDSTACK-9964   Critical   Snapahots are getting deleted if VM is assigned to another user  [Pavan Kumar Aravapalli / Pavan Kumar Aravapalli]
> CLOUDSTACK-9855   Critical   VPC RVR when master guest interface is down backup VR is not switched to Master  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-9862   Critical   list template with id= no longer work as domain admin  [Unassigned / Pierre-Luc Dion]
> CLOUDSTACK-9837   Critical   Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy service on the device is not being started properly  [Unassigned / Mani Prashanth Varma Manthena]
>
>
> Kind regards,
>
> Paul Angus
>
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>
>
>
>


Re: Bug in ViewResponseHelper.java of 4627fb2

2017-12-18 Thread Rafael Weingärtner
What is the line in that class that may generate a NPE? 291?
Please do open a PR to propose a fix for this situation.

On Mon, Dec 18, 2017 at 6:38 PM, Tutkowski, Mike 
wrote:

> Hi,
>
> I noticed an issue today with a fairly recent commit: 4627fb2.
>
> In ViewResponseHelper.java, a NullPointerException can be thrown when
> interacting with a data disk on VMware because the disk chain value in
> cloud.volumes can have a value of NULL.
>
> I can put in a check for NULL and avoid the NullPointerException, but
> perhaps someone knows the history of why this particular field is used in
> this case and can fill me in.
>
> Thanks!
> Mike
>



-- 
Rafael Weingärtner


Re: Bug in ViewResponseHelper.java of 4627fb2

2017-12-18 Thread Tutkowski, Mike
I’m not at my computer now, so I don’t know the exact line number.

I can open up a PR with my fix.

The problem is that you shouldn’t pass in null for the key of a 
ConcurrentHashMap, but the code can do this for data disks on VMware (hence the 
NullPointerException).
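
A minimal, self-contained illustration of that failure mode and the guard (the class and variable names below are made up for the demo, not the actual ViewResponseHelper code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DiskChainNpeDemo {
        public static void main(String[] args) {
            Map<String, String> diskInfoByChain = new ConcurrentHashMap<>();
            // what cloud.volumes can hold for a VMware data disk:
            String diskChain = null;

            // diskInfoByChain.put(diskChain, "volume-uuid");  // throws NullPointerException
            if (diskChain != null) {                           // the kind of guard the fix adds
                diskInfoByChain.put(diskChain, "volume-uuid");
            }
            System.out.println(diskInfoByChain);               // prints {} when the chain is null
        }
    }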

On Dec 18, 2017, at 4:48 PM, Rafael Weingärtner <rafaelweingart...@gmail.com> wrote:

What is the line in that class that may generate a NPE? 291?
Please do open a PR to propose a fix for this situation.

On Mon, Dec 18, 2017 at 6:38 PM, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:

Hi,

I noticed an issue today with a fairly recent commit: 4627fb2.

In ViewResponseHelper.java, a NullPointerException can be thrown when
interacting with a data disk on VMware because the disk chain value in
cloud.volumes can have a value of NULL.

I can put in a check for NULL and avoid the NullPointerException, but
perhaps someone knows the history of why this particular field is used in
this case and can fill me in.

Thanks!
Mike




--
Rafael Weingärtner


RE: Master Blockers and Criticals

2017-12-18 Thread Paul Angus
Thank you Khosrow,

Do you have an Apache Jira ID, so that I can assign it in Jira also?


Kind regards,

Paul Angus

paul.an...@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HSUK
@shapeblue
  
 


-Original Message-
From: Khosrow Moossavi [mailto:kmooss...@cloudops.com] 
Sent: 18 December 2017 20:48
To: dev@cloudstack.apache.org
Cc: Boris Stoyanov ; Rohit Yadav 

Subject: Re: Master Blockers and Criticals

@Paul you can assign CLOUDSTACK-9862 to me, we already have it fixed in our own 
fork.



On Mon, Dec 18, 2017 at 12:05 PM, Paul Angus 
wrote:

> Hi All, here is an updated summary of the open Critical and Blocker 
> Issues in Jira.
> If you are working on any of these issues, please indicate whether you believe 
> that you will have this issue closed by 8th Jan.
>
> @Jayapal Reddy please respond to the pings on the subject of the 
> blocker that you have raised and are working on.
>
> Key               Priority   Summary  [Assignee / Reporter]
> CLOUDSTACK-9885   Blocker    VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-10127  Critical   4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?  [Frank Maximus / Sven Vogel]
> CLOUDSTACK-9892   Critical   Primary storage resource check is broken when using root disk size override to deploy VM  [Koushik Das / Koushik Das]
> CLOUDSTACK-9862   Critical   list template with id= no longer work as domain admin  [Unassigned / Pierre-Luc Dion]
> CLOUDSTACK-10128  Critical   Template from snapshot not merging vhd files  [Rafael Weingärtner / Marcelo Lima]
> CLOUDSTACK-9964   Critical   Snapahots are getting deleted if VM is assigned to another user  [Pavan Kumar Aravapalli / Pavan Kumar Aravapalli]
> CLOUDSTACK-9855   Critical   VPC RVR when master guest interface is down backup VR is not switched to Master  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-9837   Critical   Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy service on the device is not being started properly  [Unassigned / Mani Prashanth Varma Manthena]
>
> project = CLOUDSTACK AND issuetype = Bug AND status in (Open, "In 
> Progress", Reopened) AND priority in (Blocker, Critical) AND 
> affectedVersion in (4.10.0.0, 4.10.1.0, 4.11.0.0, Future) ORDER BY 
> priority DESC, updated DESC
>
>
>
> Kind regards,
>
> Paul Angus
>
> paul.an...@shapeblue.com
> www.shapeblue.com
> 53 Chandos Place, Covent Garden, London  WC2N 4HSUK @shapeblue
>
>
>
>
> -Original Message-
> From: Paul Angus [mailto:paul.an...@shapeblue.com]
> Sent: 11 December 2017 11:00
> To: dev@cloudstack.apache.org
> Cc: Boris Stoyanov ; Rohit Yadav <
> rohit.ya...@shapeblue.com>
> Subject: Master Blockers and Criticals
>
> Hi All,
> Please find a summary - of open critical and blocker bugs in 4.11 If you
> know of one which is missing please update Jira accordingly.  I will chase
> Assignees individually (via ML) to get status updates...
>
>
> Key               Priority   Summary  [Assignee / Reporter]
> CLOUDSTACK-9885   Blocker    VPC RVR: On deleting first tier and configuring Private GW both VRs becoming MASTER  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-10164  Blocker    UI - not able to create a VPC  [Sigert Goeminne / Sigert Goeminne]
> CLOUDSTACK-10127  Critical   4.9 / 4.10 KVM + openvswitch + vpc + static nat / secondary ip on eth2?  [Frank Maximus / Sven Vogel]
> CLOUDSTACK-9892   Critical   Primary storage resource check is broken when using root disk size override to deploy VM  [Koushik Das / Koushik Das]
> CLOUDSTACK-10140  Critical   When template is created from snapshot template.properties are corrupted  [Unassigned / Ivan Kudryavtsev]
> CLOUDSTACK-10128  Critical   Template from snapshot not merging vhd files  [Unassigned / Marcelo Lima]
> CLOUDSTACK-9964   Critical   Snapahots are getting deleted if VM is assigned to another user  [Pavan Kumar Aravapalli / Pavan Kumar Aravapalli]
> CLOUDSTACK-9855   Critical   VPC RVR when master guest interface is down backup VR is not switched to Master  [Jayapal Reddy / Jayapal Reddy]
> CLOUDSTACK-9862   Critical   list template with id= no longer work as domain admin  [Unassigned / Pierre-Luc Dion]
> CLOUDSTACK-9837   Critical   Upon stoping and starting an InternalLbVM device from CloudStack, HAProxy service on the device is not being started properly  [Unassigned / Mani Prashanth Varma Manthena]

Adding Spellchecker to code style validator

2017-12-18 Thread Ivan Kudryavtsev
Hello, devs.

How about adding spell checking to the code style guide? ACS uses a lot of java
introspection, including JSON generation, so typos migrate to the protocol
level. Working on CLOUDSTACK-10168 I found ipv4_adress inside the python code /
dhcp-related json; trying to improve "the camp" I moved to the java code and
found an ipv4Adress private var which is used in the gson serializer, resulting
in a protocol with bad keywords. A spellchecker might be able to catch this.

The same goes for logging messages: I usually search for "address", not for
"adress", so it's really hard to find the relevant message if typos exist.

I'm not enough of a java guru to add a spellchecker myself, and it's a project
policy thing, so maybe it's something worth adopting?
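
To make the gson point concrete, here is a hedged, stand-alone example (the class below is invented for the demo; @SerializedName is just one way to pin the external key while a field name gets fixed):

    import com.google.gson.Gson;
    import com.google.gson.annotations.SerializedName;

    public class TypoLeakDemo {
        // stand-in for a transfer object; gson emits private field names as-is
        static class DemoNicTO {
            private String ipv4Adress = "10.1.1.10";   // the typo becomes a JSON key

            @SerializedName("ipv4_address")            // decouples the JSON key from the field name
            private String ipv4Address = "10.1.1.10";
        }

        public static void main(String[] args) {
            System.out.println(new Gson().toJson(new DemoNicTO()));
            // {"ipv4Adress":"10.1.1.10","ipv4_address":"10.1.1.10"}
        }
    }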


-- 
With best regards, Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/ 


Re: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs

2017-12-18 Thread Marc-Aurèle Brothier
It's definitely a great direction to take and much more robust. ZK would be a
great fit to monitor the state of management servers and agents with the help
of ephemeral nodes. On the other hand, it is discouraged to use it as a
messaging queue; kafka would be a much better fit for that purpose, having
partitions/topics. Doing a quick overview of the architecture, I would see ZK
used as an inter-JVM lock, holding the mgmt & agent status nodes along with
their capacities, with a direct connection from each of them to ZK. Kafka would
be used to exchange the command messages between management servers, and
between management servers & agents. With those two kinds of brokers in the
middle, the system would become super resilient. For example, if a management
server sends a command to stop a VM on a host, but that host's agent is
stopping to perform an upgrade, then when the agent reconnects to the kafka
topic its "stop" message would still be there (if it hadn't expired) and the
command could be processed. Of course it would have taken more time, but still,
it would not return an error message. This would remove the need to create and
manage threads in the management server to handle all the async tasks & checks,
and move it to an event-driven approach. At the same time it adds two
dependencies that require setup & configuration, moving away from the goal of
an easy, almost all-included, installable solution... Trade-offs to be
discussed.
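
For reference, a rough sketch of the ZK inter-JVM lock piece with curator (connection string, znode path and timeout are assumptions, not our production code):

    import java.util.concurrent.TimeUnit;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class HostCapacityLockSketch {
        public static void main(String[] args) throws Exception {
            CuratorFramework zk = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
            zk.start();

            // one lock znode per host serializes capacity checks/updates across all mgmt JVMs
            InterProcessMutex lock = new InterProcessMutex(zk, "/cloudstack/locks/host/42");
            if (lock.acquire(10, TimeUnit.SECONDS)) {
                try {
                    // re-check host capacity and reserve it for the planned VM here
                } finally {
                    lock.release();
                }
            }
            zk.close();
        }
    }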

On Mon, Dec 18, 2017 at 8:06 PM, ilya musayev 
wrote:

> I very much agree with Paul, we should consider moving into resilient model
> with least dependence I.e ha-proxy..
>
> Send a notification to partner MS to take over the job management would be
> ideal.
>
> On Mon, Dec 18, 2017 at 9:28 AM Paul Angus 
> wrote:
>
> > Hi Marc-Aurèle,
> >
> > Personally, my utopia would be to be able to pass async jobs between
> mgmt.
> > servers.
> > So rather than waiting in indeterminate time for a snapshot to complete,
> > monitoring the job is passed to another management server.
> >
> > I would LOVE that something like Zookeeper monitored the state of the
> > mgmt. servers, so that 'other' management servers could take over the
> async
> > jobs in the (unlikely) event that a management server becomes
> unavailable.
> >
> >
> >
> > Kind regards,
> >
> > Paul Angus
> >
> > paul.an...@shapeblue.com
> > www.shapeblue.com
> > 53 Chandos Place, Covent Garden, London  WC2N 4HSUK
> > @shapeblue
> >
> >
> >
> >
> > -Original Message-
> > From: Marc-Aurèle Brothier [mailto:ma...@exoscale.ch]
> > Sent: 18 December 2017 13:56
> > To: dev@cloudstack.apache.org
> > Subject: [DISCUSS] Management server (pre-)shutdown to avoid killing jobs
> >
> > Hi everyone,
> >
> > Another point, another thread. Currently when shutting down a management
> > server, despite all the "stop()" method not being called as far as I
> know,
> > the server could be in the middle of processing an async job task. It
> will
> > lead to a failed job since the response won't be delivered to the correct
> > management server even though the job might have succeed on the agent. To
> > overcome this limitation due to our weekly production upgrades, we added
> a
> > pre-shutdown mechanism which works along side HA-proxy. The management
> > server keeps a eye onto a file "lb-agent" in which some keywords can be
> > written following the HA proxy guide (
> > https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#5.2-agent-
> check
> > ).
> > When it finds "maint", "stopped" or "drain", it stops those threads:
> >  - AsyncJobManager._heartbeatScheduler: responsible to fetch and start
> > execution of AsyncJobs
> >  - AlertManagerImpl._timer: responsible to send capacity check commands
> >  - StatsCollector._executor: responsible to schedule stats command
> >
> > Then the management server stops most of its scheduled tasks. The correct
> > thing to do before shutting down the server would be to send
> > "rebalance/reconnect" commands to all agents connected on that management
> > server to ensure that commands won't go through this server at all.
> >
> > Here, HA-proxy is responsible to stop sending API requests to the
> > corresponding server with the help of this local agent check.
> >
> > In case you want to cancel the maintenance shutdown, you could write
> > "up/ready" in the file and the different schedulers will be restarted.
> >
> > This is really more a change for operation around CS for people doing
> live
> > upgrade on a regular basis, so I'm unsure if the community would want
> such
> > a change in the code base. It goes a bit in the opposite direction of the
> > change for removing the need of HA-proxy
> > https://github.com/apache/cloudstack/pull/2309
> >
> > If there is enough positive feedback for such a change, I will port them
> > to match with the upstream branch in a PR.
> >
> > Kind regards,
> > Marc-Aurèle
> >
>