Hi,

On Fri, Aug 23, 2019 at 4:07 PM Allen Wittenauer
<a...@effectivemachines.com.invalid> wrote:
> > On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ipv6g...@gmail.com> wrote:
> >
> > The issue is, and I have seen this multiple times over the last few
> > weeks, that Hadoop pre-commit builds, HBase pre-commit, HBase nightly,
> > HBase flaky tests and similar are running on multiple nodes at the
> > same time.
>
> The precommit jobs are exercising potential patches/PRs… of course
> there are going to be multiples running on different nodes
> simultaneously. That's how CI systems work.

Yes, I understand how CI systems work.

> > It seems that one PR or one commit is triggering a job or jobs that
> > split into part jobs that run on multiple nodes.
>
> Unless there is a misconfiguration (and I haven't been directly
> involved with Hadoop in a year+), that's incorrect. There is just that
> much traffic on these big projects. To put this in perspective, the
> last time I did some analysis, in March of this year, it worked out to
> be ~10 new JIRAs with patches attached for Hadoop _a day_. (Assuming an
> equal distribution across the year/month/week/day, which of course
> isn't true: weekdays are higher, weekends lower.) If there are multiple
> iterations on those 10, well… and then there are the PRs...

OK, I will dig deeper on this.

> > Just yesterday I saw Hadoop and HBase taking up nearly 45 of 50 H*
> > nodes. Some of these jobs take many hours. Some of these jobs that
> > take many hours are triggered on a PR or a commit that could be
> > something as trivial as a typo. This is unacceptable.
>
> The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave
> the ASF machine resources. (I guess that may have happened before you
> were part of INFRA.)

Nope, I have been a part of INFRA since 2008 and have been maintaining
Jenkins since that time, so I know it and its history with Y! very well.

> Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced:
> the full test suite is about 20 hours. Big projects are just that, big.

> > HBase in particular is a Hadoop related project and should be
> > limiting its jobs to the Hadoop labelled nodes H0-H21, but they are
> > running on any and all nodes.
>
> Then you should take that up with the HBase project.

That I will. I mention it here as information for everyone, and because
some HBase folks are likely subscribed here. If there is no response I
will contact the PMC directly. (A minimal sketch of what that node
restriction looks like is further down this mail.)

> > It is all too familiar to see one job running on a dozen or more
> > executors; the build queue is now constantly in the hundreds, despite
> > the fact we have nearly 100 nodes. This must stop.
>
> 'nearly 100 nodes': but how many of those are dedicated to specific
> projects? 1/3 of them are just for Cassandra and Beam.

OK, so around 20 nodes for Hadoop + related projects and around 30
general purpose ubuntu labelled.

> Also, take a look at the input on the jobs rather than just looking at
> the job names.

Erm, of course!

> It's probably also worth pointing out that since INFRA mucked with the
> GitHub pull request builder settings, they've caused a stampeding herd
> problem.

'Mucked around with'? What are you implying here? What INFRA did was
completely necessary.

> As soon as someone runs a scan on the project, ALL of the PRs get
> triggered at once, regardless of whether there has been an update to
> the PR or not.

This needs more investigation of the Cloudbees PR plugin we are using,
then.
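For reference, two of the project-side issues raised above (jobs running
on any and all nodes instead of the Hadoop labelled H0-H21, and
multi-hour suites triggered by changes as trivial as a typo) could be
addressed roughly like this in a declarative Jenkinsfile. This is a
minimal sketch, not HBase's actual job definition; the 'Hadoop' label
name, the path glob and the script name are illustrative assumptions:

  pipeline {
      // Run only on the Hadoop-labelled nodes instead of 'any', so the
      // general purpose 'ubuntu' slots stay free for other projects.
      agent { label 'Hadoop' }

      stages {
          stage('Full test suite') {
              // Skip the multi-hour suite when the change only touches
              // docs; the glob here is illustrative.
              when {
                  not { changeset '**/*.md' }
              }
              steps {
                  sh './dev-support/run-tests.sh'   // hypothetical script
              }
          }
      }
  }

The changeset condition only sees the changes recorded for that
particular build, so it is a blunt instrument, but it shows the idea:
doc-only changes need not occupy an H* node for hours.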
> > Meanwhile, Chris informs me his single job to deploy to Nexus has
> > been waiting for 3 days.
>
> It sure sounds like Chris' job is doing something weird though, given
> it appears it is switching nodes and such mid-job based upon their
> description. That's just begging to starve.

Sure, his job config needs looking at.

> ===
>
> Also, looking at the queue this morning (~11AM EDT), a few
> observations:
>
> * The 'ubuntu' queue is pretty busy while 'hadoop' has quite a few
> open slots.

Having HBase limit itself to the hadoop nodes might reverse that stat,
making the ubuntu slots available for the rest of the projects.

> * There are lots of jobs in the queue that don't support multiple
> runs, so they are self-starving, and the problem lies with the project,
> not the infrastructure.

Agree.

> * A quick pass shows that some of the jobs in the queue are tied to
> specific nodes, or have such a limited set of nodes as possible hosts
> that _of course_ they are going to get starved out. Again, a
> project-level problem.

Agree.

> * Just looking at the queue size is clearly not going to provide any
> real data as to what the problems are without also looking into why
> those jobs are in the queue to begin with.

Of course.

-- 
Gav...
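P.S. For anyone wondering what the 'self-starving' pattern above looks
like in a job definition, here is a minimal, hypothetical sketch (not
any specific project's configuration) that will sit in the queue no
matter how many executors are idle elsewhere:

  pipeline {
      // Tied to one specific node: if H2 is busy, this job queues even
      // while dozens of other executors sit idle.
      agent { label 'H2' }

      options {
          // Only one build at a time: every new trigger queues behind
          // the job's own previous run, so the job starves itself.
          disableConcurrentBuilds()
      }

      stages {
          stage('Build') {
              steps {
                  sh 'mvn -B verify'   // placeholder build step
              }
          }
      }
  }

Widening the label expression (e.g. label 'Hadoop || ubuntu') and
allowing concurrent builds where a job can support them addresses both
of those observations at the project level.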