Hi Andrew,

On Wed, Jul 25, 2018 at 3:21 AM Andrew Purtell <apurt...@apache.org> wrote:

> Thanks for this note.
>
> I'm release managing the 1.4 release. I have been running the unit test
> suite on reasonably endowed EC2 instances and there are no observed always
> failing tests. A few can be flaky. In comparison the Apache test resources
> have been heavily resource constrained for years and frequently suffer from
> environmental effects like botched settings, disk space issues, and
> contention with other test executors.
>

Our Jenkins nodes are configured via Puppet these days and are pretty stable. Which settings do you know of that might (still) be botched?

Yes, resources are shared and on occasion run to capacity. This is one reason for my initial mail: these HBase builds are consuming 10 or more executors at the same time and are starving other builds of executors. The fact that these tests have been failing for well over a month, and that you mention below you will be ignoring them, does not make for good cross-ASF community spirit. We are all in this together and every little bit helps. This is not targeted at one project; others will be getting a similar note, and I hope we can come to a resolution suitable for all.

Disk space issues, yes, but not on most of the Hadoop and related projects' nodes: H0-H12 do not have disk space issues. As a Hadoop-related project, HBase should really be concentrating its builds there.

> I think a 1.4 release will happen regardless of the job test results on
> Apache infrastructure. I tend to ignore them as noisy and low signal.
> Others in the HBase community don't necessarily feel the same, so please
> don't take my viewpoint as particularly representative. We could try Allen's
> suggestion first, before ignoring them outright.
>

No problem.

> Has anyone given thought toward expanding the pool of test build
> resources? Or roping in cloud instances on demand? Jenkins has support for
> that.
>

We currently have 19 Hadoop-specific nodes available (H0-H19) and another 28 or so general-use 'ubuntu' nodes for all to use.
In addition, we have projects that have targeted donated resources; the likes of Cassandra, CouchDB and Beam all have multiple nodes on which they have priority. I'll throw an idea out there: perhaps HBase could do something similar, to increase our node pool and at the same time have priority on a few nodes of their own via a targeted hardware donation.

Cloud on demand was tried a year or two ago; we will revisit this soon as well.

In summary, we currently have over 80 nodes connected to our Jenkins master. What figure did you have in mind when you say 'expanding the pool of test build resources'?

Thanks

Gav...

>
> On Tue, Jul 24, 2018 at 9:16 AM Allen Wittenauer
> <a...@effectivemachines.com.invalid> wrote:
>
>> I suspect the bigger issue is that the hbase tests are running on
>> the ‘ubuntu’ machines. Since they only have ~300GB for workspaces, the
>> hbase tests are eating a significant majority of it and likely could be
>> dying randomly due to space issues. [All the hbase workspace directories +
>> the yetus-m2 shared mvn cache dirs easily consume 20%+ of the space.
>> Significantly more than the 50 or so other jobs that run on those
>> machines.]
>>
>> By comparison, most of the ‘Hadoop’ nodes have 2-3TB for the big
>> jobs to consume….
>>
>>
>> > On Jul 24, 2018, at 8:58 AM, Josh Elser <els...@apache.org> wrote:
>> >
>> > Yep, sadly this is a very long tent-pole for us. There are many involved who have invested countless hours in making this better.
>> >
>> > Specific to that job you linked earlier, 3 test failures out of our total 4958 tests (0.06% failure rate) is all but "green" in my mind. I would ask that you keep that in mind, too.
>> >
>> > To that extent, others have also built another job specifically to find tests which are failing intermittently:
>> > https://builds.apache.org/job/HBase-Find-Flaky-Tests/25513/artifact/dashboard.html.
>> > I mention this as evidence to prove to you that this is not a baseless request from the HBase PMC ;)
>> >
>> > On 7/24/18 3:14 AM, Gav wrote:
>> >> Ok, good enough, will wait, please also note 'master' branch and a few
>> >> others have been failing for over a month also.
>> >> I will check in again next month to see how things are progressing
>> >> Thanks
>> >> Gav...
>> >> On Tue, Jul 24, 2018 at 1:19 AM Josh Elser <els...@apache.org> wrote:
>> >>> Hi Gav,
>> >>>
>> >>> Looking at the most recent results, I see that the job failed because of
>> >>> two unit test failures. These are something that will be looked at prior
>> >>> to the next 1.4.x release which is about to get off the ground.
>> >>>
>> >>> I'd kindly request that you not disable the job. Thanks for trying to
>> >>> find extra resources on these nodes.
>> >>>
>> >>> On 7/23/18 12:22 AM, Gavin McDonald wrote:
>> >>>> https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/
>> >>>>
>> >>>> can someone take a look into this, the job isn't much good if it is failing
>> >>>> all the time and even worse if it is being ignored.
>> >>>>
>> >>>> Otherwise I'll disable the job in a few days to release these wasted
>> >>>> resources
>> >>>> to builds that matter.
>> >>>>
>> >>>>
>> >>>
>> >>
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>

--
Gav...