Thanks Joan and Bertrand.

> The number of failed builds in our stream that are directly related to this "tragedy of the commons" far exceeds the number of successful builds at this point, and unfortunately Travis CI is having parallel capacity issues that prevent us from moving to them wholesale as well.

This has been my experience. At one point years ago I moved my work off the shared pool at the ASF as an individual contributor and have been funding the testing I personally do on EC2 out of pocket. This isn't a general solution for our project, though: it depends on my time and ability to contribute, and it covers only what I happen to be working on at the moment, not necessarily what the project would most like to see happen. I will look into a targeted donation at my employer but am not optimistic.

It might be better to decommission some, if not most, of the overutilized fixed test resources and use on-demand executors launched on public clouds instead. I'm not sure whether the ASF is set up to manage on-demand billing for test resources, but it could be advantageous: billing would track actual usage rather than fixed costs. To avoid budget overruns there would be caps and limits. Eventually demand would hit this new ceiling, but the impact would be longer queue wait times rather than job failures due to environmental stress, so that would be an improvement. Each job would run in its own virtual server or container, so it would be free of many of the environmental issues we see now.

Or, to get much of the same improvement on the resources we have now, limit executor parallelism. Better to have a job wait in the queue than to run and fail anyway because the host environment is under stress.

For what it's worth.
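To make the parallelism idea a bit more concrete, here is a minimal sketch of what per-job limits could look like in a declarative Jenkinsfile. This is only an illustration under assumptions: the 'Hadoop' label, the six-hour timeout, and the Maven step are placeholders chosen for the example, not our actual nightly job configuration.

    // Illustrative sketch only. The label, timeout, and test command are
    // assumed placeholders, not the real HBase nightly configuration.
    pipeline {
        agent { label 'Hadoop' }             // run only on nodes carrying this label
        options {
            disableConcurrentBuilds()        // at most one run of this job holds executors at a time
            timeout(time: 6, unit: 'HOURS')  // give the executors back if a run wedges
        }
        stages {
            stage('unit-tests') {
                steps {
                    sh 'mvn -B test'         // placeholder for the real test suite invocation
                }
            }
        }
    }

The same pressure can also be relieved from the Infra side by reducing the number of executors configured on each agent, so that fewer jobs share a host at the same time.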
On Wed, Jul 25, 2018 at 10:20 AM Joan Touzet <jo...@lrtw.org> wrote:

> I'll speak to CouchDB - the donation is directly in the form of a Jenkins build agent with our tag; no money changed hands. The donor received a letter from fundraising@a.o allowing a tax deduction on the amount that leasing the machine would have cost the ASF for a year. We have 24x7 support on the node from the provider, who performs all sysadmin (rather than burdening Infra with having to run puppet on our build machine). This was arranged so we could have a FreeBSD node in the build array.

> We have another donor in the wings who will be adding a build node for us; at that point, we expect to move all of our builds to our own Jenkins build agents and won't be in the common pool any longer. The number of failed builds in our stream that are directly related to this "tragedy of the commons" far exceeds the number of successful builds at this point, and unfortunately Travis CI is having parallel capacity issues that prevent us from moving to them wholesale as well.

> -Joan

> ----- Original Message -----
> From: "Andrew Purtell" <apurt...@apache.org>
> To: ipv6g...@gmail.com
> Cc: "Andrew Purtell" <apurt...@apache.org>, "dev" <d...@hbase.apache.org>, builds@apache.org
> Sent: Wednesday, July 25, 2018 12:22:08 PM
> Subject: Re: HBase nightly job failing forever

> How does a targeted hardware donation work? I was under the impression that targeted donations are not accepted by the ASF. Maybe it is different for infrastructure, but this is the first time I've heard of it. Who makes the donation for those projects? DataStax for Cassandra? Who for CouchDB? Google for Beam? By what process are the donations made, and how are they audited to confirm the donation is spent on the desired resources? Can we get a contact at one of them for a testimonial regarding this process? Is this process documented?
> On Tue, Jul 24, 2018 at 4:27 PM Gav <ipv6g...@gmail.com> wrote:

> > Hi Andrew,

> > On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell <apurt...@apache.org> wrote:

> >> Thanks for this note.

> >> I'm release managing the 1.4 release. I have been running the unit test suite on reasonably endowed EC2 instances and there are no observed always-failing tests. A few can be flaky. In comparison, the Apache test resources have been heavily resource constrained for years and frequently suffer from environmental effects like botched settings, disk space issues, and contention with other test executors.

> > Our Jenkins nodes are configured via puppet these days and are pretty stable; which settings do you know of that might (still) be botched? Yes, resources are shared and on occasion run to capacity. This is one reason for my initial mail - these HBase builds are consuming 10 or more executors -at the same time- and are starving executors for other builds. The fact that these tests have been failing for well over a month, and that you mention below you will be ignoring them, does not make for good cross-ASF community spirit; we are all in this together and every little bit helps. This is not targeted at one project; others will be getting a similar note, and I hope we can come to a resolution suitable for all.

> > Disk space issues, yes, but not on most of the Hadoop and related projects' nodes - H0-H12 do not have disk space issues. As a Hadoop-related project, HBase should really be concentrating its builds there.

> >> I think a 1.4 release will happen regardless of the job test results on Apache infrastructure. I tend to ignore them as noisy and low signal. Others in the HBase community don't necessarily feel the same, so please don't take my viewpoint as particularly representative. We could try Allen's suggestion first, before ignoring them outright.

> > No problem

> >> Has anyone given thought toward expanding the pool of test build resources? Or roping in cloud instances on demand? Jenkins has support for that.

> > We currently have 19 Hadoop-specific nodes available, H0-H19, and another 28 or so general-use 'ubuntu' nodes for all to use. In addition we have projects that have targeted donated resources; the likes of Cassandra, CouchDB and Beam all have multiple nodes on which they have priority. I'll throw an idea out there that perhaps HBase could do something similar to increase our node pool and at the same time have priority on a few nodes of their own via a targeted hardware donation.

> > Cloud on demand was tried a year or two ago; we will also revisit this soon.

> > Summary, then: we currently have over 80 nodes connected to our Jenkins master - what figure did you have in mind when you say 'expanding the pool of test build resources'?

> > Thanks

> > Gav...

> >> On Tue, Jul 24, 2018 at 9:16 AM Allen Wittenauer <a...@effectivemachines.com.invalid> wrote:

> >>> I suspect the bigger issue is that the hbase tests are running on the 'ubuntu' machines. Since they only have ~300GB for workspaces, the hbase tests are eating a significant majority of it and likely could be dying randomly due to space issues. [All the hbase workspace directories + the yetus-m2 shared mvn cache dirs easily consume 20%+ of the space. Significantly more than the 50 or so other jobs that run on those machines.]

> >>> By comparison, most of the 'Hadoop' nodes have 2-3TB for the big jobs to consume….

> >>> > On Jul 24, 2018, at 8:58 AM, Josh Elser <els...@apache.org> wrote:

> >>> > Yep, sadly this is a very long tent-pole for us. There are many involved who have invested countless hours in making this better.

> >>> > Specific to that job you linked earlier, 3 test failures out of our total 4958 tests (0.06% failure rate) is all but "green" in my mind. I would ask that you keep that in mind, too.

> >>> > To that extent, others have also built another job specifically to find tests which are failing intermittently: https://builds.apache.org/job/HBase-Find-Flaky-Tests/25513/artifact/dashboard.html . I mention this as evidence to prove to you that this is not a baseless request from the HBase PMC ;)

> >>> > On 7/24/18 3:14 AM, Gav wrote:

> >>> >> Ok, good enough, will wait. Please also note the 'master' branch and a few others have been failing for over a month as well.

> >>> >> I will check in again next month to see how things are progressing.

> >>> >> Thanks

> >>> >> Gav...

> >>> >> On Tue, Jul 24, 2018 at 1:19 AM Josh Elser <els...@apache.org> wrote:

> >>> >>> Hi Gav,

> >>> >>> Looking at the most recent results, I see that the job failed because of two unit test failures. These are something that will be looked at prior to the next 1.4.x release, which is about to get off the ground.

> >>> >>> I'd kindly request that you not disable the job. Thanks for trying to find extra resources on these nodes.

> >>> >>> On 7/23/18 12:22 AM, Gavin McDonald wrote:

> >>> >>>> https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/

> >>> >>>> Can someone take a look into this? The job isn't much good if it is failing all the time, and even worse if it is being ignored.

> >>> >>>> Otherwise I'll disable the job in a few days to release these wasted resources to builds that matter.

> >> --
> >> Best regards,
> >> Andrew
> >>
> >> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
> >> - A23, Crosstalk

> > --
> > Gav...

> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
> - A23, Crosstalk

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
- A23, Crosstalk