Re: [openstack-dev] [nova] Averting the Nova crisis by splitting out virt drivers

Sylvain Bauza Thu, 04 Sep 2014 08:20:06 -0700


On 04/09/2014 17:00, Solly Ross wrote:

My only question is about the need to separate out each virt driver into a
separate project, wouldn't you accomplish a lot of the benefit by creating a
single virt project that includes all of the drivers?

I don't think there's particularly a *point* to having all drivers in one repo.  Part of code review is 
looking for code "gotchas", but part of code review is looking for subtle issues that are caused by 
the very nature of the driver.  A HyperV "core" reviewing a libvirt change should certainly be able 
to provide the former, but most likely cannot provide the latter to a sufficient degree (if he or she can, 
then he or she should be a libvirt "core" as well).


A strong +1 to Dan's proposal.  I think this would also make it easier for 
non-core reviewers to get started reviewing, without having a specialized tool 
setup.

As I said previously, I'm also giving a +1 to this proposal. That said, as I think it will take at least one iteration to get this done (look at the scheduler split and how long we've been working on it), I also think we need a short-term solution like the one proposed by Thierry, i.e. what I call "half-cores" - people who help review a code area and free up time for cores, who can then just approve instead of focusing on each iteration.


-Sylvain

Best Regards,
Solly Ross

P.S.

This is a crisis. A large crisis. In fact, if you got a moment, it's
a twelve-storey crisis with a magnificent entrance hall, carpeting
throughout, 24-hour portage, and an enormous sign on the roof,
saying 'This Is a Large Crisis'. A large crisis requires a large
plan.

Ha!

----- Original Message -----

From: "Donald D Dugger" <donald.d.dug...@intel.com>
To: "Daniel P. Berrange" <berra...@redhat.com>, "OpenStack Development Mailing List 
(not for usage questions)"
<openstack-dev@lists.openstack.org>
Sent: Thursday, September 4, 2014 10:33:27 AM
Subject: Re: [openstack-dev] [nova] Averting the Nova crisis by splitting out   
virt drivers

Basically +1 with what Daniel is saying (note that, as mentioned, a side
effect of our effort to split out the scheduler will help but not solve this
problem).

My only question is about the need to separate out each virt driver into a
separate project, wouldn't you accomplish a lot of the benefit by creating a
single virt project that includes all of the drivers?  I wouldn't
necessarily expect a VMware guy to understand the specifics of the HyperV
implementation but both people should understand what a virt driver does,
how it interfaces to Nova and they should be able to intelligently review
each other's code.

--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786

-----Original Message-----
From: Daniel P. Berrange [mailto:berra...@redhat.com]
Sent: Thursday, September 4, 2014 4:24 AM
To: OpenStack Development
Subject: [openstack-dev] [nova] Averting the Nova crisis by splitting out
virt drivers

Position statement
==================

Over the past year I've increasingly come to the conclusion that Nova is
heading for (or probably already at) a major crisis. If steps are not taken
to avert this, the project is likely to lose a non-trivial amount of
talent, both regular code contributors and core team members. That includes
myself. This is not good for Nova's long term health and so should be of
concern to anyone involved in Nova and OpenStack.

For those who don't want to read the whole mail, the executive summary is
that the nova-core team is an unfixable bottleneck in our development
process with our current project structure.
The only way I see to remove the bottleneck is to split the virt drivers out
of tree and let them all have their own core teams in their area of code,
leaving current nova core to focus on all the common code outside the virt
driver impls. I nonetheless urge people to read the whole mail.

Background information
======================

I see many factors coming together to form the crisis

  - Burn out of core team members from over work
  - Difficulty bringing new talent into the core team
  - Long delay in getting code reviewed & merged
  - Marginalization of code areas which aren't popular
  - Increasing size of nova code through new drivers
  - Exclusion of developers without corporate backing

Each item on its own may not seem too bad, but combined they add up to a
big problem.

Core team burn out
------------------

Having been involved in Nova for several dev cycles now, it is clear that the
backlog of code up for review never goes away. Even intensive code review
efforts at various points in the dev cycle make only a small impact on the
backlog. This has a pretty significant impact on core team members, as their
work is never done. At best, the dial is sometimes set to 10, instead of 11.

Many people, myself included, have built tools to help deal with the reviews
in a more efficient manner than plain gerrit allows for. These certainly
help, but they can't ever solve the problem on their own - just make it
slightly more bearable. And this is not even considering that core team
members might have useful contributions to make in ways beyond just code
review. Ultimately the workload is just too high to sustain the levels of
review required, so core team members will eventually burn out (as they have
done many times already).

Even if one person attempts to take the initiative to heavily invest in
review of certain features it is often to no avail.
Unless a second dedicated core reviewer can be found to 'tag team' it is hard
for one person to make a difference. The end result is that a patch is +2d
and then sits idle for weeks or more until a merge conflict requires it to
be reposted at which point even that one +2 is lost. This is a pretty
demotivating outcome for both reviewers & the patch contributor.

New core team talent
--------------------

It can't escape attention that the Nova core team does not grow in size very
often. When Nova was younger and its code base was smaller, it was easier
for contributors to get onto core because the base level of knowledge
required was that much smaller. To get onto core today requires a major
investment in learning Nova over a year or more. Even people who potentially
have the latent skills may not have the time available to invest in learning
the entirety of Nova.

With the number of reviews proposed to Nova, the core team should probably be
at least double its current size[1]. There is plenty of expertise in the
project as a whole but it is typically focused into specific areas of the
codebase. There is nowhere we can find 20 more people with broad knowledge of
the codebase who could be promoted even over the next year, let alone today.
This is ignoring that many existing members of core are relatively inactive
due to burnout and so need replacing. That means we really need another 25-30
people for core. That's not going to happen.

Code review delays
------------------

The obvious result of having too much work for too few reviewers is that code
contributors face major delays in getting their work reviewed and merged.
From personal experience, during Juno, I've probably spent 1 week in
aggregate on actual code development vs 8 weeks on waiting on code review.
You have to constantly be on alert for review comments because unless you can
respond quickly (and repost) while you still have the attention of the
reviewer, they may not look again for days/weeks.

The length of time to get work merged serves as a demotivator to actually do
work in the first place. I've personally avoided doing a lot of code
refactoring & cleanup work that would improve the maintainability of the
libvirt driver in the long term, because I can't face the battle to get it
reviewed & merged. Other people have told me much the same. It is not
uncommon to see changes that have been pending for 2 dev cycles, not because
the code was bad but because they couldn't get people to review it.
Contributors will simply walk away from nova if that happens too often.

Even when fate is on your side and code is reviewed, the chances of it
getting a success result from the CI systems first time around is slim due
to false failures. This really compounds the already poor experience of
submitting code to Nova.

Marginalization of areas
------------------------

Since the core team has far more work to do than it can manage, it has to
prioritize what it looks at. The core team figures out what the overall
project priorities are and will focus more effort in to those areas.
Individual members will also focus their attention in areas where they have
personal interest. Unfortunately the core team is not representative of the
entire Nova codebase. The inevitable result is that the HyperV and VMWare
drivers can often lose out in the battle for attention. In the past we've
said that it is the responsibility of people in those teams to invest in
learning the entirety of Nova so that they have the knowledge required to be
promoted to core. I used to support that approach, but now consider it to be
flawed due to the increased difficulty of *anyone* getting onto core. The
time investment required is simply too great to expect people to undertake
it. The marginalized areas have no freedom to self-organize to solve their
own problems because they are forever dependent on the core team bottleneck.

Increasing size
---------------

There is a long standing policy that the Nova virt driver API is considered
unstable and thus all virt driver implementations should ultimately be part
of the Nova codebase. In Juno it is likely that the Ironic driver will be
merged into Nova. In a future release we may yet see the Docker driver
return to the Nova tree.

The result of merging yet more drivers is that there will be yet more work
for nova reviewers to do. It is far from obvious that merging new drivers
will be accompanied by new members on the core team. So it is likely that
the workload is going to get worse over future releases.

Splitting out the scheduler will be beneficial in reducing the review
backlog, but probably not enough to counter the growth from virt drivers.
Killing off nova-network is unlikely to help at all, since that consumes
little-to-no review time currently [2].

Exclusion of non-corporate devs
-------------------------------

There is a strong push from nova core for everything that is merged into Nova
to be accompanied by CI testing. This certainly makes sense from the POV of
overall product quality and reducing the burden on the core reviewers to
catch all mistakes through code review. What we don't take into account is
that setting up and maintaining such testing infrastructure requires a major
investment in terms of both hardware costs and man power. It has already
been seen that this is too much to bear for some companies who contribute to
Nova, eg with the Docker driver [3]. Developers who are not affiliated with
any company do not stand any realistic chance of meeting the CI testing
needs unless they're lucky that their feature can be covered by an existing
running CI system. This looks like it could effectively prevent support for
a community submitted FreeBSD BHyve driver from being merged, no matter how
useful it might be to users who want it.
NB, now a FreeBSD BHyve driver would probably be done as part of the libvirt
driver, which complicates this particular point I'm trying to make, since I
don't suggest reducing testing of the libvirt driver compared to what it has
today.

I don't want to get into a detailed testing discussion here really, since
that's somewhat of a tangent to the question of our dev and review process.
I am, however, concerned when our testing policy forces maintainers of some
virt drivers into the position of being treated as second class citizens
within the project as a whole, with a different development structure to the
in-tree approved drivers.
That said, Docker probably benefits from being out of tree, since it thus
avoids the painful nova core bottleneck entirely.

Problem summary
---------------

The common thread through most of these problems is that the nova core team
is a massive bottleneck in the development process.
Processes adopted (or under discussion) by the core team are fundamentally
not helping to remove the bottleneck. Rather they are introducing new layers
of bureaucracy so that we can feel justified in telling contributors that we
are going to ignore or reject their work. At best this is going to result in
far less useful work taking place in Nova. At worst this is further reducing
the ability of people to self organize to solve the problems, will cause our
contributors to leave the community and possibly even force some virt drivers
to go out of tree to get their work done. Death by a thousand cuts.

A sub-thread is around the idea that our current structure of one big repo
also has other negative consequences for drivers who may not be able to meet
the same high standards as the rest of the drivers. A driver is either in or
out of the club, and if it's out of the club life is made comparatively
harder for its developers & users. By all means have rules around the
requirements for a release to use the OpenStack trademarks based on CI
testing coverage, but don't let that penalize the actual development process
itself.

Overall Nova is being increasingly hostile to its community of contributors.
I don't mean this as a result of any sense of malice or ill-will. What we're
seeing is merely a symptom of a hard worked team struggling to survive with
a burden they can no longer be reasonably expected to cope with. Nova core
has done an amazing job at surviving for so long as the project grew much
larger & more quickly than anyone probably expected. The time has come for
some radical changes to let nova adapt & evolve to the next level.

This is a crisis. A large crisis. In fact, if you got a moment, it's a
twelve-storey crisis with a magnificent entrance hall, carpeting throughout,
24-hour portage, and an enormous sign on the roof, saying 'This Is a Large
Crisis'. A large crisis requires a large plan.

Proposal / solution
===================

In the past Nova has spun out its volume layer to form the cinder project.
The Neutron project started as an attempt to solve the networking space, and
ultimately replace nova-network. It is likely that the scheduler will be
spun out to a separate project.

Now Neutron itself has grown so large and successful that it is considering
going one step further and spinning its actual drivers out of tree into
standalone add-on projects [4]. I've heard on the grapevine that Ironic is
considering similar steps for hardware drivers.

The radical (?) solution to the nova core team bottleneck is thus to follow
this lead and split the nova virt drivers out into separate projects and
delegate their maintenance to new dedicated teams.

  - Nova becomes the home for the public APIs, RPC system, database
    persistence and the glue that ties all this together with the
    virt driver API.

  - Each virt driver project gets its own core team and is responsible
    for dealing with review, merge & release of their codebase.

Note, I really do mean *all* virt drivers should be separate. I do not want
to see some virt drivers split out and others remain in tree because I feel
that signifies that the out of tree ones are second class citizens. It is
important to set up our dev structure so that every virt driver is treated
equally & so has equal chance to achieve success. As long as one driver
remains in tree there will always be pressure for others to join it, which
is exactly what we're trying to get away from here. By everyone being out of
tree, drivers (like Docker) can take a decision about whether it is the right
time for them to be investing in gating CI systems, without being penalized in
their dev process if they make a decision to not have gate tests right now.

This has quite a few implications for the way development would operate.

  - The Nova core team at least, would be voluntarily giving up a big
    amount of responsibility over the evolution of virt drivers. Due
    to human nature, people are not good at giving up power, so this
    may be painful to swallow. Realistically current nova core are
    not experts in most of the virt drivers to start with, and more
    importantly we clearly do not have sufficient time to do a good job
    of review with everything submitted. Much of the current need
    for core review of virt drivers is to prevent the mis-use of a
    poorly defined virt driver API...which can be mitigated - See
    later point(s)

  - Nova core would/should not have automatic +2 over the virt driver
    repositories since it is unreasonable to assume they have the
    suitable domain knowledge for all virt drivers out there. People
    would of course be able to be members of multiple core teams. For
    example John G would naturally be nova-core and nova-xen-core. I
    would aim for nova-core and nova-libvirt-core, and so on. I do not
    want any +2 responsibility over VMWare/HyperV/Docker drivers since
    they're not my area of expertise - I only look at them today because
    they have no other nova-core representation.

  - Not sure if it implies the Nova PTL would be solely focused on
    Nova common. eg would there continue to be one PTL over all virt
    driver implementation projects, or would each project have its
    own PTL. Maybe this is irrelevant if a Czars approach is chosen
    by virt driver projects for their work. I'd be inclined to say
    that a single PTL should stay as a figurehead to represent all
    the virt driver projects, acting as a point of contact to ensure
    we keep communication / co-operation between the drivers in sync.

  - A fairly significant amount of nova code would need to be
    considered semi-stable API. Certainly everything under nova/virt
    and any object which is passed in/out of the virt driver API.
    Changes to such APIs would have to be done in a backwards
    compatible manner, since it is no longer possible to lock-step
    change all the virt driver impls. In some ways I think this would
    be a good thing as it will encourage people to put more thought
    into the long term maintainability of nova internal code instead
    of relying on being able to rip it apart later, at will.

  - The nova/virt/driver.py class would need to be much better
    specified. All parameters / return values which are opaque dicts
    must be replaced with objects + attributes. Completion of the
    objectification work is mandatory, so there is cleaner separation
    between virt driver impls & the rest of Nova.

  - If changes are required to common code, the virt driver developer
    would first have to get the necessary pieces merged into Nova
    common. Then the follow up virt driver specific changes could be
    proposed to their repo. This implies that some changes to virt
    drivers will still contend for resource in the common nova repo
    and team. This contention should be lower than it is today though
    since the current nova core team should have less code to look
    after per-person on aggregate.

  - Changes submitted to nova common code would trigger running of CI
    tests against the external virt drivers. Each virt driver core team
    would decide whether they want their driver to be tested upon Nova
    common changes. Expect that all would choose to be included to the
    same extent that they are today. So level of validation of nova code
    would remain at least at current level. I don't want to reduce the
    amount of code testing here since that's contrary to the direction
    we're taking wrt testing.

  - Changes submitted to virt drivers would trigger running CI tests
    that are applicable. eg changes to libvirt driver repo would not
    involve running database migration tests, since all database code
    is isolated in nova. libvirt changes would not trigger vmware,
    xenserver, ironic, etc CI systems. Virt driver changes should
    see fewer false positives in the tests as a result, and those
    that do occur should be more explicitly related to the code being
    proposed. eg a change to vmware is not going to trigger a tempest
    run that uses libvirt, so non-deterministic failures in libvirt
    will no longer plague vmware developers reviews. This would also
    make it possible for VMWare CI to be made gating for changes to
    the VMWare virt driver repository, without negatively impacting
    other virt drivers. So this change should increase testing quality
    for non-libvirt virt drivers and reduce pain of false failures
    for everyone.

  - Virt drivers shouldn't use oslo incubator code from nova, since
    that can be replaced any time and isn't upgrade safe. Ideally most
    of the incubator stuff virt drivers need should turn into stable
    oslo APIs. Failing that, virt drivers would need their own copy
    of the incubated code in their module namespace, to avoid clash
    or the need to lock-step upgrade code across separate git repos.
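
To illustrate the point above about replacing opaque dicts with objects +
attributes, here is a minimal Python sketch. The names `DiskSpec`,
`describe_disk_dict` and `describe_disk` are hypothetical and are not taken
from the Nova codebase, and a plain dataclass is only a stand-in for whatever
object model Nova actually uses. It merely shows why a declared object makes a
clearer contract between Nova and an out-of-tree driver than a dict does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskSpec:
    """Hypothetical typed stand-in for an opaque disk-info dict."""
    path: str
    size_gb: int
    # A field added later gets a default, so existing callers keep working.
    bus: str = "virtio"


def describe_disk_dict(disk_info: dict) -> str:
    # Opaque-dict style: the expected keys are only discoverable by reading
    # the implementation, and a misspelled key fails deep inside the driver.
    bus = disk_info.get("bus", "virtio")
    return "%s (%d GB, %s)" % (disk_info["path"], disk_info["size_gb"], bus)


def describe_disk(disk: DiskSpec) -> str:
    # Object style: the attributes are part of the declared interface.
    return "%s (%d GB, %s)" % (disk.path, disk.size_gb, disk.bus)


if __name__ == "__main__":
    print(describe_disk(DiskSpec(path="/dev/vda", size_gb=10)))
    try:
        DiskSpec(path="/dev/vda", sizegb=10)  # misspelled field name
    except TypeError as exc:
        print("rejected at construction:", exc)
```

With the object form, a misspelled field is rejected when the object is
built, and a new optional field can be added without breaking drivers that
live in separate repos, which is the kind of backwards compatibility a
semi-stable API would need.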

Overall the outcome is that

  - Far larger pool of people able to approve changes for merge
    across nova core and the virt driver core teams.

  - Faster review & merge for virt driver patches that don't involve
    changes to common nova code, with less CI system testing pain.

  - Ability to set priority of work in virt drivers without a 3rd
    party being a bottleneck, where the work doesn't involve changes
    to common nova code.

  - Each virt driver team can accept as many features as they feel
    able to deal with, without it negatively impacting amount of
    features that other virt driver teams can accept.

  - Virt drivers have flexibility to set their own policies on testing
    without being penalized in the way they then develop their code.

The migration
-------------

Obviously a proposal such as this is a pretty major undertaking. It should be
clear that it could not be done in a short amount of time.
It is suggested that it be phased in over two dev cycles. In the Kilo release
the focus would be on prep work:

   - Formalizing the separation between the virt driver impls and the
     rest of the nova codebase. Figure out exactly which areas of
     Nova internal code will need to be marked as 'semi-stable' for
     use by virt drivers, and ensure their APIs are sufficiently
     future proof.

   - Discussions with the infrastructure, docs, release, etc teams to
     identify impacts on them and do any required prep work.

   - Identify the teams which will lead the new virt driver projects.
     eg core reviewers, PTL or Czars for each job if applicable

   - Probably more things I can't think of right now

Then at the start of the Lxxxx release, the virt drivers would actually be
split out into separate git repos and start their dev process for the
future. So for bulk of Lxxxx the drivers would be on their own. The two
Lxxxx rc milestones would allow us to ensure our release processes were
working well with the split drivers before the Lxxxx final release.

Final thought
-------------

Overall consider this a vote of no confidence in nova continuing to operate
as it does today. As mentioned above this is not intended to be disrespectful
to the effort every nova core member has put in, just a reflection on the
changed environment we find ourselves in. Fiddling with our processes for
the prioritization of work cannot fix the fundamental fact that nova core
today is a massive single point of failure & bottleneck, increasingly
crippling the project. The only way to address this is by a radical
re-organization of our project to remove the bottlenecks by modularization
of the project & leaders.
Keeping a single team and adding more/changing process is simply akin to
shifting deckchairs on the Titanic and not a viable option to continue with
long term.

Now, I'm realistic. Even with every driver separated out, I expect that each
of them will individually still have more work proposed than their
respective core teams have time to review. The new structure will, however,
make it easier for the individual core teams to grow & adapt in ways that
suit their specific needs. For self-contained virt driver changes it will
mean that acceptance of work by one team will not take away capacity from
another team. Further the burden of knowledge required to make it onto a
virt driver core team would be greatly reduced due to the narrower focus of
each core team, so we'll be able to promote good talent onto virt driver
core teams more quickly.

Thanks for reading so far. Now let's make some real change to prepare us for
future sustainability & even growth.

Regards,
Daniel

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-August/044459.html
[2] There was a ban on changes to nova-network for much of the past two
     cycles. It was relaxed primarily to allow full conversion of nova
     codebase to use objects, not for major new feature development.
[3] http://lists.openstack.org/pipermail/openstack-dev/2014-July/040443.html
[4] http://lists.openstack.org/pipermail/openstack-dev/2014-August/043036.html

--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
