Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

Harald Jensås Wed, 31 Oct 2018 10:17:12 -0700

On Tue, 2018-10-30 at 15:00 -0600, Alex Schultz wrote:
> On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan <[email protected]>
> wrote:
> > 
> > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <
> > > [email protected]> wrote:
> > > > 
> > > > Tagging with tripleo since my suggestion below is specific to
> > > > that project.
> > > > 
> > > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > > Hello everyone,
> > > > > 
> > > > > A little while back I sent email explaining how the gate
> > > > > queues work and how fixing bugs helps us test and merge more
> > > > > code. All of this is still true and we should keep
> > > > > pushing to improve our testing to avoid gate resets.
> > > > > 
> > > > > Last week we migrated Zuul and Nodepool to a new Zookeeper
> > > > > cluster. In the process of doing this we had to restart Zuul
> > > > > which brought in a new logging feature that exposes node
> > > > > resource usage by jobs. Using this data I've been able to
> > > > > generate some report information on where our node demand is
> > > > > going. This change [0] produces this report [1].
> > > > > 
> > > > > As with optimizing software we want to identify which changes
> > > > > will have the biggest impact and to be able to measure
> > > > > whether or not changes have had an impact once we have made
> > > > > them. Hopefully this information is a start at doing that.
> > > > > Currently we can only look back to the point Zuul was
> > > > > restarted, but we have a thirty day log rotation for this
> > > > > service and should be able to look at a month's worth of data
> > > > > going forward.
> > > > > 
> > > > > Looking at the data you might notice that Tripleo is using
> > > > > many more node resources than our other projects. They are
> > > > > aware of this and have a plan [2] to reduce their resource
> > > > > consumption. We'll likely be using this report generator to
> > > > > check progress of this plan over time.
> > > > 
> > > > I know at one point we had discussed reducing the concurrency of
> > > > the tripleo gate to help with this. Since tripleo is still using >50%
> > > > of the resources it seems like maybe we should revisit that, at
> > > > least for the short-term until the more major changes can be made?
> > > > Looking through the merge history for tripleo projects I don't see
> > > > a lot of cases (any, in fact) where more than a dozen patches made
> > > > it through anyway*, so I suspect it wouldn't have a significant
> > > > impact on gate throughput, but it would free up quite a few nodes
> > > > for other uses.
> > > > 
> > > 
> > > It's the failures in gate and resets.  At this point I think it would
> > > be a good idea to turn down the concurrency of the tripleo queue in
> > > the gate if possible. As of late it's been timeouts but we've been
> > > unable to track down why it's timing out specifically.  I personally
> > > have a feeling it's the container download times since we do not have
> > > a local registry available and are only able to leverage the mirrors
> > > for some levels of caching. Unfortunately we don't get the best
> > > information about this out of docker (or the mirrors) and it's really
> > > hard to determine what exactly makes things run a bit slower.
> > 
> > We actually tried this not too long ago 
> > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> >  but decided to revert it because it didn't decrease the check
> > queue backlog significantly. We were still running at several hours
> > behind most of the time.
> > 
> > If we want to set up better monitoring and measuring and try it
> > again we can do that. But we probably want to measure queue sizes
> > with and without a change like that to better understand if it
> > helps.
> > 
> > As for container image download times, can we quantify that via
> > docker logs? Basically sum up the amount of time spent by a job
> > downloading images so that we can see what the impact is but also
> > measure if changes improve that? As for other ideas for improving
> > things, it seems like many of the images that tripleo uses are quite
> > large. I recall seeing a > 600MB image just for rsyslog. Wouldn't
> > it be advantageous for both the gate and tripleo in the real world
> > to trim the size of those images (which should improve download
> > times)? In any case quantifying the size of the downloads and
> > trimming those if possible is likely also worthwhile.
> > 
> 
> So it's not that simple as we don't just download all the images in a
> distinct task and there isn't any information provided around
> size/speed AFAIK.  Additionally we aren't doing anything special with
> the images (it's mostly kolla built containers with a handful of
> tweaks) so that's just the size of the containers.  I am currently
> working on reducing any tripleo specific dependencies (ie removal of
> instack-undercloud, etc) in hopes that we'll shave off some of the
> dependencies but it seems that there's a larger (bloat) issue around
> containers in general.  I have no idea why the rsyslog container
> would be 600M, but yea that does seem excessive.
>


We add this to all images:

https://github.com/openstack/tripleo-common/blob/d35af75b0d8c4683a677660646e535cf972c98ef/container-images/tripleo_kolla_template_overrides.j2#L35

/bin/sh -c yum -y install iproute iscsi-initiator-utils lvm2 python
socat sudo which openstack-tripleo-common-container-base rsync cronie
crudini openstack-selinux ansible python-shade puppet-tripleo
python2-kubernetes && yum clean all && rm -rf /var/cache/yum    276 MB

Is the additional 276 MB reasonable here?
openstack-selinux <- This package runs relabeling. Does that kind of
touching of the filesystem increase the size because of docker layers?
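
One way to see which build steps account for most of an image's size is
to rank its layers by size. The following is only a rough sketch: it
assumes the input is one layer per line in the form "size, tab, command"
(the shape one might get from docker history with a format template
using .Size and .CreatedBy), and that sizes use decimal units as docker
prints them. The function names are made up for illustration.

```python
import re

# Decimal multipliers, matching the units docker prints (kB, MB, GB).
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a size such as '276MB', '276 MB' or '1.2kB' to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([kKMGT]?B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(round(float(number) * _UNITS[unit.upper()]))


def largest_layers(history_text, top=3):
    """Return the `top` largest (size_in_bytes, command) pairs.

    Assumes each non-empty line is '<size><TAB><command>'.
    """
    layers = []
    for line in history_text.splitlines():
        if not line.strip():
            continue
        size, _, command = line.partition("\t")
        layers.append((parse_size(size), command.strip()))
    return sorted(layers, key=lambda layer: layer[0], reverse=True)[:top]
```

Feeding it the history of a few images side by side would show whether
the yum transaction above dominates each of them.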

Also: python2-kubernetes is a fairly large package (18007990 bytes). Do
we use it in every image? I don't see any tripleo-related repos
importing from it when searching on Hound. The original commit
message [1] adding it states it is for future convenience.

On my undercloud we have 101 images. If we are downloading those 18 MB
for every image, that's about 1.8 GB for a package we don't use. (I hope
it's not like this? With docker layers, we only download that 276 MB
transaction once? Or?)
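
The two cases in that question can be put into numbers. This is a
minimal sketch of the arithmetic only, assuming 18007990 is a size in
bytes and 276 MB is decimal megabytes. Whether the layer is really
shared depends on every image referencing the identical layer, which is
not verified here. The names are illustrative.

```python
IMAGES = 101                 # images on the undercloud, per the message
PKG_BYTES = 18_007_990       # python2-kubernetes, assumed to be bytes
LAYER_BYTES = 276 * 10**6    # the 276 MB yum layer, assumed decimal MB


def download_bytes(images, layer_bytes, shared):
    """Bytes fetched for one layer across `images` images.

    shared=True:  all images reference the same layer, so it is pulled once.
    shared=False: worst case, every image carries its own copy.
    """
    return layer_bytes if shared else images * layer_bytes


def as_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 10**9


if __name__ == "__main__":
    for label, size in (("python2-kubernetes", PKG_BYTES),
                        ("whole yum layer", LAYER_BYTES)):
        worst = download_bytes(IMAGES, size, shared=False)
        best = download_bytes(IMAGES, size, shared=True)
        print(f"{label}: {as_gb(worst):.2f} GB unshared, "
              f"{as_gb(best):.2f} GB shared")
```

If the layer is not shared, the whole 276 MB transaction repeated over
101 images would be the bigger concern, well beyond the 18 MB package
alone.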


[1] https://review.openstack.org/527927



> > Clark
> > 
> > ___________________________________________________________________
> > _______
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: [email protected]?subject:unsu
> > bscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 


