Hi,

On Fri, Aug 23, 2019 at 4:07 PM Allen Wittenauer
<a...@effectivemachines.com.invalid> wrote:

>
> > On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ipv6g...@gmail.com> wrote:
> > The issue, and I have seen this multiple times over the last few
> > weeks, is that Hadoop pre-commit builds, HBase pre-commit, HBase
> > nightly, HBase flaky tests and similar are running on multiple nodes
> > at the same time.
>
>         The precommit jobs are exercising potential patches/PRs… of course
> there are going to be multiples running on different nodes simultaneously.
> That’s how CI systems work.
>

Yes, I understand how CI systems work.


>
> > It seems that one PR or one commit is triggering a job or jobs that
> > split into part jobs that run on multiple nodes.
>
>         Unless there is a misconfiguration (and I haven’t been directly
> involved with Hadoop in a year+), that’s incorrect.  There is just that
> much traffic on these big projects.  To put this in perspective, the last
> time I did some analysis in March of this year, it works out to be ~10 new
> JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal
> distribution across the year/month/week/day. Which of course isn’t true.
> Weekdays are higher, weekends lower.)  If there are multiple iterations on
> those 10, well….  and then there are the PRs...
>

OK, I will dig deeper into this.


>
> > Just yesterday I saw Hadoop and HBase
> > taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
> > Some of these jobs that take many hours are triggered on a PR or a commit
> > that could be something as trivial as a typo. This is unacceptable.
>
>         The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath
> gave the ASF machine resources. (I guess that may have happened before you
> were part of INFRA.)


Nope, I have been a part of INFRA since 2008 and have been maintaining
Jenkins since that time, so I know it and its history with Y! very well.


> Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced:
> the full test suite is about 20 hours.  Big projects are just that, big.
>
> > HBase in particular is a Hadoop related project and should be limiting
> > its jobs to Hadoop labelled nodes H0-H21, but they are running on any
> > and all nodes.
>
>         Then you should take that up with the HBase project.
>

That I will. I mention it here as information for everyone, on the
likelihood that some HBase folks are subscribed here. If there is no
response, I will contact the PMC directly.
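
For reference, restricting a pipeline to the Hadoop-labelled nodes is a
one-line change in the Jenkinsfile. A minimal sketch, assuming the label
is literally 'Hadoop' and using a made-up test script as the entry point:

    pipeline {
        // Run only on executors carrying the 'Hadoop' label (H0-H21),
        // rather than grabbing any available node in the cluster.
        agent { label 'Hadoop' }
        stages {
            stage('precommit') {
                steps {
                    // Hypothetical entry point; substitute the project's
                    // real dev-support script.
                    sh './dev-support/run-tests.sh'
                }
            }
        }
    }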


>
> > It is all too familiar to see one job running on a dozen or more
> > executors, the build queue is now constantly in the hundreds, despite
> > the fact we have nearly 100 nodes. This must stop.
>
>         ’nearly 100 nodes’: but how many of those are dedicated to
> specific projects?  1/3 of them are just for Cassandra and Beam.
>

OK, so around 20 nodes for Hadoop and related projects, and around 30
general-purpose ubuntu-labelled nodes.


>
>         Also, take a look at the input on the jobs rather than just
> looking at the job names.
>

erm, of course!


>
>         It’s probably also worth pointing out that since INFRA mucked with
> the GitHub pull request builder settings, they’ve caused a stampeding herd
> problem.


'Mucked with'? What are you implying here? What INFRA did was
completely necessary.

>         As soon as someone runs a scan on the project, ALL of the PRs
> get triggered at once regardless of whether there has been an update to
> the PR or not.
>

Then this needs more investigation of the CloudBees PR plugin we are using.
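
In the meantime, one mitigation worth checking is whether the affected
Jenkinsfiles can fast-exit runs whose only cause is a branch-indexing
scan, so a repository rescan does not re-run every open PR. A sketch of
the usual pattern at the top of a multibranch Jenkinsfile (this is an
assumption about the job setup, not a confirmed fix for our plugin):

    // Abort early if this run was triggered by the branch-indexing
    // scan rather than an actual PR update or push event.
    def indexingCauses =
        currentBuild.getBuildCauses('jenkins.branch.BranchIndexingCause')
    if (indexingCauses.size() > 0) {
        currentBuild.result = 'NOT_BUILT'
        error('Triggered by repository scan only; skipping this run.')
    }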


>
> > Meanwhile, Chris informs me his single job to deploy to Nexus has been
> > waiting for 3 days.
>
>         It sure sounds like Chris’ job is doing something weird though,
> given it appears it is switching nodes and such mid-job based upon their
> description.  That’s just begging to starve.
>

Sure, his job config needs looking at.
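
For anyone with a similarly starved deploy job: acquiring a single
executor up front and running every stage on it avoids re-entering the
queue mid-run. A minimal sketch (the Maven commands are stand-ins, not
Chris' actual configuration):

    pipeline {
        // Hold one node for the whole run; per-stage agent blocks
        // would put the job back into the queue between stages.
        agent { label 'ubuntu' }
        stages {
            stage('build')  { steps { sh 'mvn -B clean package' } }
            stage('deploy') { steps { sh 'mvn -B deploy' } }
        }
    }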


>
> ===
>
>         Also, looking at the queue this morning (~11AM EDT), a few
> observations:
>
> * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open
> slots.
>

Having HBase limit its jobs to the Hadoop-labelled nodes might reverse
that, making the ubuntu slots available for the rest of the projects.


>
> * There are lots of jobs in the queue that don’t support multiple runs.
> So they are self-starving and the problem lies with the project, not the
> infrastructure.
>

Agree
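
Projects can make that explicit in the job configuration so duplicate
triggers queue behind each other cleanly instead of overlapping. A
minimal declarative sketch (job contents are hypothetical):

    pipeline {
        agent { label 'ubuntu' }
        options {
            // Prevent two runs of this job from executing at once;
            // later triggers wait in the queue instead of overlapping.
            disableConcurrentBuilds()
        }
        stages {
            stage('test') { steps { sh 'echo one run at a time' } }
        }
    }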


>
> * A quick pass shows that some of the jobs in the queue are tied to
> specific nodes or have such a limited set of nodes as possible hosts that
> _of course_ they are going to get starved out.  Again, a project-level
> problem.
>

Agree
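
Widening the candidate pool with a label expression, rather than pinning
to a single host, is the usual fix there. A hedged sketch (labels as on
our cluster; stage contents hypothetical):

    pipeline {
        // A label expression matches any node carrying either label,
        // instead of tying the job to one specific host such as 'H21'.
        agent { label 'Hadoop || ubuntu' }
        stages {
            stage('test') { steps { sh 'echo widened node pool' } }
        }
    }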


>
> * Just looking at the queue size is clearly not going to provide any
> real data as to what the problems are without also looking into why
> those jobs are in the queue to begin with.


of course.


-- 
Gav...
