On Wed, 2018-10-31 at 11:35 -0600, Alex Schultz wrote:
> On Wed, Oct 31, 2018 at 11:16 AM Harald Jensås <hjen...@redhat.com>
> wrote:
> > 
> > On Tue, 2018-10-30 at 15:00 -0600, Alex Schultz wrote:
> > > On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan <cboy...@sapwetik.org>
> > > wrote:
> > > > 
> > > > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > > > > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec
> > > > > <openst...@nemebean.com> wrote:
> > > > > > 
> > > > > > Tagging with tripleo since my suggestion below is specific
> > > > > > to that project.
> > > > > > 
> > > > > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > > > > Hello everyone,
> > > > > > > 
> > > > > > > A little while back I sent an email explaining how the
> > > > > > > gate queues work and how fixing bugs helps us test and
> > > > > > > merge more code. All of this is still true, and we should
> > > > > > > keep pushing to improve our testing to avoid gate resets.
> > > > > > > 
> > > > > > > Last week we migrated Zuul and Nodepool to a new Zookeeper
> > > > > > > cluster. In the process of doing this we had to restart
> > > > > > > Zuul, which brought in a new logging feature that exposes
> > > > > > > node resource usage by jobs. Using this data I've been
> > > > > > > able to generate some report information on where our node
> > > > > > > demand is going. This change [0] produces this report [1].
> > > > > > > 
> > > > > > > As with optimizing software, we want to identify which
> > > > > > > changes will have the biggest impact and to be able to
> > > > > > > measure whether or not changes have had an impact once we
> > > > > > > have made them. Hopefully this information is a start at
> > > > > > > doing that. Currently we can only look back to the point
> > > > > > > Zuul was restarted, but we have a thirty day log rotation
> > > > > > > for this service and should be able to look at a month's
> > > > > > > worth of data going forward.
> > > > > > > 
> > > > > > > Looking at the data you might notice that TripleO is using
> > > > > > > many more node resources than our other projects. They are
> > > > > > > aware of this and have a plan [2] to reduce their resource
> > > > > > > consumption. We'll likely be using this report generator
> > > > > > > to check progress of this plan over time.
> > > > > > 
> > > > > > I know at one point we had discussed reducing the
> > > > > > concurrency of the tripleo gate to help with this. Since
> > > > > > tripleo is still using more than 50% of the resources, it
> > > > > > seems like maybe we should revisit that, at least for the
> > > > > > short term until the more major changes can be made? Looking
> > > > > > through the merge history for tripleo projects I don't see a
> > > > > > lot of cases (any, in fact) where more than a dozen patches
> > > > > > made it through anyway*, so I suspect it wouldn't have a
> > > > > > significant impact on gate throughput, but it would free up
> > > > > > quite a few nodes for other uses.
> > > > > 
> > > > > It's the failures in gate and the resets. At this point I
> > > > > think it would be a good idea to turn down the concurrency of
> > > > > the tripleo queue in the gate if possible. Lately it has been
> > > > > timeouts, but we've been unable to track down why things are
> > > > > timing out specifically. I personally have a feeling it's the
> > > > > container download times, since we do not have a local
> > > > > registry available and can only leverage the mirrors for some
> > > > > levels of caching. Unfortunately we don't get the best
> > > > > information about this out of docker (or the mirrors), and
> > > > > it's really hard to determine what exactly makes things run a
> > > > > bit slower.
> > > > 
> > > > We actually tried this not too long ago
> > > > https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> > > > but decided to revert it because it didn't decrease the check
> > > > queue backlog significantly. We were still running several hours
> > > > behind most of the time.
> > > > 
> > > > If we want to set up better monitoring and measuring and try it
> > > > again, we can do that. But we probably want to measure queue
> > > > sizes with and without a change like that to better understand
> > > > whether it helps.
> > > > 
> > > > As for container image download times, can we quantify that via
> > > > docker logs? Basically sum up the amount of time a job spends
> > > > downloading images, so that we can see what the impact is and
> > > > also measure whether changes improve it. As for other ideas for
> > > > improving things, it seems like many of the images that tripleo
> > > > uses are quite large. I recall seeing a more than 600MB image
> > > > just for rsyslog. Wouldn't it be advantageous for both the gate
> > > > and tripleo in the real world to trim the size of those images
> > > > (which should improve download times)? In any case, quantifying
> > > > the size of the downloads and trimming them where possible is
> > > > likely also worthwhile.
> > > 
> > > So it's not that simple, as we don't download all the images in a
> > > single distinct task, and there isn't any information provided
> > > around size/speed AFAIK. Additionally, we aren't doing anything
> > > special with the images (it's mostly kolla-built containers with a
> > > handful of tweaks), so that's just the size of the containers. I
> > > am currently working on reducing any tripleo-specific dependencies
> > > (i.e. removal of instack-undercloud, etc.) in hopes that we'll
> > > shave off some of the dependencies, but it seems that there's a
> > > larger (bloat) issue around containers in general. I have no idea
> > > why the rsyslog container would be 600M, but yeah, that does seem
> > > excessive.
> > > 
> > 
> > We add this to all images:
> > 
> > https://github.com/openstack/tripleo-common/blob/d35af75b0d8c4683a677660646e535cf972c98ef/container-images/tripleo_kolla_template_overrides.j2#L35
> > 
> > /bin/sh -c yum -y install iproute iscsi-initiator-utils lvm2 python
> > socat sudo which openstack-tripleo-common-container-base rsync
> > cronie crudini openstack-selinux ansible python-shade puppet-tripleo
> > python2-kubernetes && yum clean all && rm -rf /var/cache/yum  276 MB
> > 
> > Is the additional 276 MB reasonable here?
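(As an aside on quantifying this: something along the lines below should
give a rough per-image pull time and a per-layer size breakdown. It is
only a sketch; the image name is a placeholder, not what the CI jobs
actually pull.)

  # Rough sketch only -- the image name is a placeholder.
  img=docker.io/tripleomaster/centos-binary-rsyslog:current-tripleo

  # Wall-clock time spent downloading the image.
  /usr/bin/time -f "pull took %e seconds" docker pull "$img" > /dev/null

  # Size of each layer and the command that created it,
  # to see where the megabytes actually go.
  docker history --no-trunc --format '{{.Size}}\t{{.CreatedBy}}' "$img"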
> > openstack-selinux <- This package runs relabeling; does that kind
> > of touching of the filesystem impact the size, due to docker
> > layers?
> > 
> > Also: python2-kubernetes is a fairly large package (18007990
> > bytes). Do we use that in every image? I don't see any tripleo
> > related repos importing from it when searching on Hound. The
> > original commit message [1] adding it states it is for future
> > convenience.
> > 
> > On my undercloud we have 101 images. If we are downloading that
> > 18 MB per image, that's almost 1.8 GB for a package we don't use?
> > (I hope it's not like this? With docker layers, we only download
> > that 276 MB transaction once? Or?)
> > 
> 
> So this is a single layer that is updated once and shared by all the
> containers that inherit from it. I did notice the same thing and
> have proposed a change in the layering of these packages last night.
> 

Thanks, that's a relief then!
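If anyone wants to double-check the sharing locally, something like
this should list the layer digests two images have in common (just a
sketch, and the image names are only examples):

  # Example image names only -- substitute whatever is in the local registry.
  a=docker.io/tripleomaster/centos-binary-nova-api:current-tripleo
  b=docker.io/tripleomaster/centos-binary-neutron-server:current-tripleo

  # List the layer digests of each image, then show the ones they share.
  docker inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' "$a" | sort > a.layers
  docker inspect --format '{{range .RootFS.Layers}}{{println .}}{{end}}' "$b" | sort > b.layers
  comm -12 a.layers b.layers

If that big yum transaction really is one shared layer, its digest
should show up in the common list and only be downloaded once per node.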
> https://review.openstack.org/#/c/614371/
> 

cool, +1

> In general this does raise a point about dependencies of services
> and what the actual impact of adding new ones to projects is,
> especially in the container world where this might be duplicated N
> times depending on the number of services deployed. With the move to
> containers, much of the sharedness that being on a single host
> provided has been lost, at a cost of increased bandwidth, memory,
> and storage usage.
> 
> Thanks,
> -Alex
> 
> > 
> > [1] https://review.openstack.org/527927
> > 
> > > > 
> > > > Clark

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev