Re: HBase nightly job failing forever

Gav Tue, 24 Jul 2018 16:28:16 -0700

Hi Andrew,

On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell <[email protected]> wrote:


> Thanks for this note.
>
> I'm release managing the 1.4 release. I have been running the unit test
> suite on reasonably endowed EC2 instances and there are no observed always
> failing tests. A few can be flaky. In comparison the Apache test resources
> have been heavily resource constrained for years and frequently suffer from
> environmental effects like botched settings, disk space issues, and
> contention with other test executors.
>

Our Jenkins nodes are configured via puppet these days and are pretty
stable. Which settings do you know of that might (still) be botched?
Yes, resources are shared and on occasion run to capacity. This is one
reason for my initial mail: these HBase builds are consuming 10 or more
executors -at the same time- and are starving other builds of executors.
The fact that these tests have been failing for well over a month, and
that you mention below you will be ignoring them, does not make for good
cross-ASF community spirit. We are all in this together and every little
bit helps. This is not targeted at one project; others will be getting a
similar note and I hope we can come to a resolution suitable for all.

Disk space issues, yes, but not on most of the Hadoop and related projects
nodes: H0-H12 do not have disk space issues. As a Hadoop related project,
HBase should really be concentrating its builds there.


> I think a 1.4 release will happen regardless of the job test results on
> Apache infrastructure. I tend to ignore them as noisy and low signal.
> Others in the HBase community don't necessarily feel the same, so please
> don't take my viewpoint as particularly representative. We could try Alan's
> suggestion first, before ignoring them outright.
>

No problem


> Has anyone given thought toward expanding the pool of test build
> resources? Or roping in cloud instances on demand? Jenkins has support for
> that.
>

We currently have 19 Hadoop specific nodes available, H0-H19, and another
28 or so general use 'ubuntu' nodes for all to use. In addition, we have
projects that have targeted donated resources: the likes of Cassandra,
CouchDB and Beam all have multiple nodes on which they have priority. I'll
throw an idea out there that perhaps HBase could do something similar, to
increase our node pool and at the same time have priority on a few nodes
of their own via a targeted hardware donation.
Cloud on demand was tried a year or two ago; we will also revisit this
soon.

In summary, we currently have over 80 nodes connected to our Jenkins
master. What figure did you have in mind when you say 'expanding the pool
of test build resources'?

Thanks

Gav...


>
> On Tue, Jul 24, 2018 at 9:16 AM Allen Wittenauer
> <[email protected]> wrote:
>
>>         I suspect the bigger issue is that the hbase tests are running on
>> the ‘ubuntu’ machines. Since they only have ~300GB for workspaces, the
>> hbase tests are eating a significant majority of it and likely could be
>> dying randomly due to space issues.  [All the hbase workspace directories +
>> the yetus-m2 shared mvn cache dirs easily consume 20%+ of the space.
>> Significantly more than the 50 or so other jobs that run on those
>> machines.]
>>
>>         By comparison, most of the ‘Hadoop’ nodes have 2-3TB for the big
>> jobs to consume….
>>
>>
>> > On Jul 24, 2018, at 8:58 AM, Josh Elser <[email protected]> wrote:
>> >
>> > Yep, sadly this is a very long tent-pole for us. There are many
>> > involved who have invested countless hours in making this better.
>> >
>> > Specific to that job you linked earlier, 3 test failures out of our
>> > total 4958 tests (0.06% failure rate) is all but "green" in my mind. I
>> > would ask that you keep that in mind, too.
>> >
>> > To that extent, others have also built another job specifically to find
>> > tests which are failing intermittently:
>> > https://builds.apache.org/job/HBase-Find-Flaky-Tests/25513/artifact/dashboard.html.
>> > I mention this as evidence to prove to you that this is not a baseless
>> > request from the HBase PMC ;)
>> >
>> > On 7/24/18 3:14 AM, Gav wrote:
>> >> Ok, good enough, will wait, please also note 'master' branch and a few
>> >> others have been failing for over a month also.
>> >> I will check in again next month to see how things are progressing
>> >> Thanks
>> >> Gav...
>> >> On Tue, Jul 24, 2018 at 1:19 AM Josh Elser <[email protected]> wrote:
>> >>> Hi Gav,
>> >>>
>> >>> Looking at the most recent results, I see that the job failed because
>> >>> of two unit test failures. These are something that will be looked at
>> >>> prior to the next 1.4.x release which is about to get off the ground.
>> >>>
>> >>> I'd kindly request that you not disable the job. Thanks for trying to
>> >>> find extra resources on these nodes.
>> >>>
>> >>> On 7/23/18 12:22 AM, Gavin McDonald wrote:
>> >>>> https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/
>> >>>>
>> >>>> Can someone take a look into this? The job isn't much good if it is
>> >>>> failing all the time, and even worse if it is being ignored.
>> >>>>
>> >>>> Otherwise I'll disable the job in a few days to release these wasted
>> >>>> resources to builds that matter.
>> >>>>
>> >>>>
>> >>>
>>
>>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>


-- 
Gav...
