Hi all,

well, I agree that we could possibly split the job up into multiple separate 
builds. 

However, this makes running the Jenkins Multibranch Pipeline plugin quite a bit 
more difficult.

And the thing is, our setup has been working fine for about 2 years and we are 
only recently having these problems. 
So I didn't want to just configure the actual problem away, because I think 
splitting the job up into multiple separate jobs will just bring other problems, 
and in the end our deploy jobs will still hang for many, many hours. 
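
For context, that kind of split would mean the multibranch Jenkinsfile hands 
off to a separately configured deploy job, roughly like the sketch below (the 
downstream job name 'our-project-deploy', the BRANCH parameter and the label 
are made up for illustration):

pipeline {
    agent { label 'ubuntu' }   // label is an assumption
    stages {
        stage('Build & Test') {
            steps {
                sh './build.sh'   // placeholder for our real build
            }
        }
        stage('Hand off to deploy job') {
            steps {
                // The deploy would live in a separate, non-multibranch job,
                // which is exactly the extra moving part I would rather avoid.
                build job: 'our-project-deploy',
                      parameters: [string(name: 'BRANCH', value: env.BRANCH_NAME)],
                      wait: false
            }
        }
    }
}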

Chris


On 23.08.19, 17:40, "Gavin McDonald" <ipv6g...@gmail.com> wrote:

    Hi,
    
    On Fri, Aug 23, 2019 at 4:07 PM Allen Wittenauer
    <a...@effectivemachines.com.invalid> wrote:
    
    >
    > > On Aug 23, 2019, at 9:44 AM, Gavin McDonald <ipv6g...@gmail.com> wrote:
    > > The issue is, and I have seen this multiple times over the last few
    > > weeks, that Hadoop pre-commit builds, HBase pre-commit, HBase nightly,
    > > HBase flaky tests and similar are running on multiple nodes at the same time.
    >
    >         The precommit jobs are exercising potential patches/PRs… of course
    > there are going to be multiples running on different nodes simultaneously.
    > That’s how CI systems work.
    >
    
    Yes, I understand how CI systems work.
    
    
    >
    > > It seems that one PR or 1 commit is triggering a job or jobs that split
    > > into part jobs that run on multiple nodes.
    >
    >         Unless there is a misconfiguration (and I haven’t been directly
    > involved with Hadoop in a year+), that’s incorrect.  There is just that
    > much traffic on these big projects.  To put this in perspective, the last
    > time I did some analysis in March of this year, it works out to be ~10 new
    > JIRAs with patches attached for Hadoop _a day_.  (Assuming an equal
    > distribution across the year/month/week/day. Which of course isn’t true.
    > Weekdays are higher, weekends lower.)  If there are multiple iterations on
    > those 10, well….  and then there are the PRs...
    >
    
    ok, I will dig deeper on this.
    
    
    >
    > > Just yesterday I saw Hadoop and HBase
    > > taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
    > > Some of these jobs that take many hours are triggered on a PR or a commit
    > > that could be something as trivial as a typo. This is unacceptable.
    >
    >         The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath
    > gave the ASF machine resources. (I guess that may have happened before you
    > were part of INFRA.)
    
    
    Nope, I have been a part of INFRA since 2008 and have been maintaining
    Jenkins since that time, so I know it and its history with Y! very well.
    
    
    > Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced:
    > the full test suite is about 20 hours.  Big projects are just that, big.
    >
    > > HBase in particular is a Hadoop related project and should be limiting
    > > its jobs to Hadoop labelled nodes H0-H21, but they are running on any
    > > and all nodes.
    >
    >         Then you should take that up with the HBase project.
    >
    
    That I will. I mention it here as information for everyone, given the
    likelihood that some HBase folks are subscribed here. If there is no
    response then I will contact the PMC directly.
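    
    (For the record, pinning a job to those nodes is a one-line change in a
    Declarative Jenkinsfile. A rough sketch only; 'Hadoop' as the label string
    and the mvn step are assumptions on my part, check the actual node config:)
    
    pipeline {
        // Restrict the whole job to the Hadoop-labelled nodes (H0-H21).
        // The exact label string is an assumption; verify it on the nodes.
        agent { label 'Hadoop' }
        stages {
            stage('Test') {
                steps {
                    sh 'mvn -B test'   // placeholder for the real test run
                }
            }
        }
    }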
    
    
    >
    > > It is all too familiar to see one job running on a dozen or more
    > > executors; the build queue is now constantly in the hundreds, despite
    > > the fact we have nearly 100 nodes. This must stop.
    >
    >         ’nearly 100 nodes’: but how many of those are dedicated to
    > specific projects?  1/3 of them are just for Cassandra and Beam.
    >
    
    OK, so around 20 nodes for Hadoop and related projects, and around 30
    general-purpose ubuntu-labelled ones.
    
    
    >
    >         Also, take a look at the input on the jobs rather than just
    > looking at the job names.
    >
    
    erm, of course!
    
    
    >
    >         It’s probably also worth pointing out that since INFRA mucked with
    > the GitHub pull request builder settings, they’ve caused a stampeding herd
    > problem.
    
    
    'mucked around with'? What are you implying here? What INFRA did was
    completely necessary.
    
    > As soon as someone runs a scan on the project, ALL of the PRs get triggered
    > at once regardless of whether there has been an update to the PR or not.
    >
    
    This needs more investigation of the CloudBees PR plugin we are using, then.
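    
    (One mitigation I have seen, offered as a sketch only and not verified
    against the CloudBees plugin we run: have the Jenkinsfile skip stages whose
    only cause is the branch-indexing scan. The label and build step here are
    placeholders.)
    
    pipeline {
        agent { label 'ubuntu' }   // assumed label, purely for illustration
        stages {
            stage('Build') {
                when {
                    // Only run when the build was not triggered purely by a
                    // branch-indexing scan, so re-scanning the repository does
                    // not rebuild every open PR.
                    expression {
                        currentBuild.getBuildCauses('jenkins.branch.BranchIndexingCause').size() == 0
                    }
                }
                steps {
                    sh './build.sh'   // placeholder for the real build
                }
            }
        }
    }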
    
    
    >
    > > Meanwhile, Chris informs me his single job to deploy to Nexus has been
    > > waiting for 3 days.
    >
    >         It sure sounds like Chris’ job is doing something weird though,
    > given it appears it is switching nodes and such mid-job based upon their
    > description.  That’s just begging to starve.
    >
    
    Sure, his job config needs looking at.
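    
    (For reference, the shape that avoids the mid-job node switch is to hold a
    single executor for the whole run. A sketch only; the label and the stage
    contents are made up:)
    
    pipeline {
        // One top-level agent keeps every stage on the same executor, so the
        // job is not re-queued for a fresh node between build and deploy.
        agent { label 'ubuntu' }   // assumed label
        stages {
            stage('Build')  { steps { sh './build.sh' } }    // placeholder
            stage('Deploy') { steps { sh './deploy.sh' } }   // placeholder
        }
    }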
    
    
    >
    > ===
    >
    >         Also, looking at the queue this morning (~11AM EDT), a few
    > observations:
    >
    > * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open
    > slots.
    >
    
    Having HBase limit its jobs to the hadoop nodes might reverse that stat,
    making the ubuntu slots available for the rest of the projects.
    
    
    >
    > * There are lots of jobs in the queue that don’t support multiple runs.
    > So they are self-starving and the problem lies with the project, not the
    > infrastructure.
    >
    
    Agree
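    
    (For anyone checking their own config: a job opts out of parallel runs with
    the disableConcurrentBuilds option. A generic sketch, not taken from any
    particular project:)
    
    pipeline {
        agent { label 'ubuntu' }   // assumed label
        options {
            // With this set, queued triggers stack up behind the running build,
            // so a busy job starves itself rather than the wider cluster.
            disableConcurrentBuilds()
        }
        stages {
            stage('Build') { steps { sh './build.sh' } }   // placeholder
        }
    }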
    
    
    >
    > * A quick pass shows that some of the jobs in the queue are tied to
    > specific nodes or have such a limited set of nodes as possible hosts that
    > _of course_ they are going to get starved out.  Again, a project-level
    > problem.
    >
    
    Agree
    
    
    >
    > * Just looking at the queue size is clearly not going to provide any real
    > data as to what the problems are without also looking into why those jobs
    > are in the queue to begin with.
    
    
    of course.
    
    
    -- 
    Gav...
    
