I've been having a separate discussion on the proposal doc, which is ready
for another round of reviews.
Change summary:
- Changed the fast requirement to be < 30 minutes and simplified the check
to an aggregate for each precommit job type.
- Updated slowness notification methods to include automated ones: a
precommit check result on GitHub and a filed bug.
- Merged in the metrics design doc.
- Added detailed design section.
- Added list of deliverables.

What I would like is consensus regarding:
- How fast we want precommit runs to be. I propose 30m; a sketch of how
the check could be computed follows this list.
- Deadline for fixing a slow test before it is temporarily removed from
precommit. I propose 24 hours.
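
Whatever threshold we land on, the check itself stays simple. Here is a
minimal sketch of the percentile computation in Kotlin (the sample
durations are invented; real input would be per-run times pulled from
Jenkins):

    // p-th percentile of precommit durations over a trailing window,
    // using nearest-rank on a zero-based index (no interpolation).
    fun percentile(durationsMinutes: List<Double>, p: Double): Double {
        require(durationsMinutes.isNotEmpty()) { "need at least one run" }
        val sorted = durationsMinutes.sorted()
        val index = ((p / 100.0) * (sorted.size - 1)).toInt()
        return sorted[index]
    }

    fun main() {
        val lastFourWeeks = listOf(42.0, 55.0, 61.0, 38.0, 95.0, 27.0, 47.0)
        val p95 = percentile(lastFourWeeks, 95.0)
        println("p95 = $p95 min, within 30m target: ${p95 <= 30.0}")
    }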


Replying to the thread:

1. I like the idea of using the Jenkins Job Cacher Plugin to skip
unaffected tests (BEAM-4400); a sketch of the Gradle-side caching this
would build on follows below.

2. Java precommit tests include integration tests (example:
<https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>).
We could split these out to get much faster results, i.e., a separate
precommit just for basic integration tests (which would still need to run
in <30m). Perhaps lint checks for Python could be split out as well.

I'll add these suggestions to the doc tomorrow.
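
For concreteness, a rough sketch of the Gradle side of the caching idea,
as a settings.gradle.kts fragment. This is only an illustration: Beam's
build scripts are Groovy, and the cache URL below is a made-up
placeholder, not a real node.

    // Shared remote build cache so Jenkins runs can reuse each other's
    // task outputs instead of rebuilding everything per workspace.
    buildCache {
        local {
            isEnabled = true  // per-executor cache, survives clean checkouts
        }
        remote<HttpBuildCache> {
            // Placeholder URL; a real setup needs a cache node reachable
            // from all Jenkins executors.
            url = uri("https://builds.example.org/gradle-cache/")
            // Let CI populate the cache; PR builds mostly just read.
            isPush = System.getenv("JENKINS_URL") != null
        }
    }

The open question is hosting the cache node; the Job Cacher Plugin is one
way to persist it on the Jenkins side.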

On Thu, May 24, 2018 at 9:25 AM Scott Wegner <[email protected]> wrote:

> So, it sounds like there's agreement that we should improve precommit
> times by only running necessary tests, and configuring Jenkins Job
> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
> [1] to follow-up on this.
>
> Getting back to Udi's original proposal [2]: I see value in defining a
> metric and target for overall pre-commit timing. The proposal for an
> initial "2 hour" target is helpful as a guardrail: we're already hitting
> it, but if we drift to a point where we're not, that should trigger some
> action to be taken to get back to a healthy state.
>
> I wouldn't mind separately setting a more aspirational goal of getting the
> pre-commits even faster (e.g. 15-30 mins), but I suspect that would require
> a concerted effort to evaluate and improve existing tests across the
> codebase. One idea would be to ensure the metric reporting can show the
> trend, and which tests are responsible for the most walltime, so that we
> know where to invest any efforts to improve tests.
>
>
> [1] https://issues.apache.org/jira/browse/BEAM-4400
> [2]
> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>
>
> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles <[email protected]> wrote:
>
>> With regard to the Job Cacher Plugin: I think it is an infra ticket to
>> install? And I guess we need it longer term when we move to containerized
>> builds anyhow? One thing I've experienced with the Travis-CI cache is that
>> the time spent uploading & downloading the remote cache - in that case,
>> all the pip-installed dependencies - negated the benefits. For Beam the
>> benefit will probably be greater if we can skip most of the build.
>>
>> Regarding integration tests in precommit: I think it is OK to run maybe
>> one Dataflow job in precommit, but it should be in parallel with the unit
>> tests and just a smoke test that takes 5 minutes, not a suite that takes 35
>> minutes. So IMO that is low-hanging fruit. If this would make postcommit
>> unstable, then it also means precommit is unstable. Both are troublesome.
>>
>> In the shorter term, some possible hacks:
>>
>>  - Point Gradle to cache outside the git workspace. We already did this
>> for .m2 and it helped a lot.
>>  - Intersect touched files with projects. Our nonstandard project names
>> might be a pain here. Not sure if fixing that is on the roadmap.
>>
>> Kenn
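
On Kenn's second hack, a minimal illustration of the intersection idea in
Kotlin. The directory-to-project mapping is invented for the example; the
real one would have to encode our nonstandard project names.

    // Map files touched by a PR to the Gradle projects containing them,
    // so only those projects' checks need to run.
    fun affectedProjects(
        touchedFiles: List<String>,
        projectDirs: Map<String, String>  // directory prefix -> project path
    ): Set<String> =
        touchedFiles.mapNotNull { file ->
            projectDirs.entries
                .filter { (dir, _) -> file.startsWith("$dir/") }
                .maxByOrNull { (dir, _) -> dir.length }  // most specific wins
                ?.value
        }.toSet()

    fun main() {
        val projectDirs = mapOf(  // hypothetical entries
            "sdks/java/io/kafka" to ":beam-sdks-java-io-kafka",
            "sdks/python" to ":beam-sdks-python"
        )
        val touched = listOf("sdks/java/io/kafka/src/main/java/KafkaIO.java")
        println(affectedProjects(touched, projectDirs))
        // prints [:beam-sdks-java-io-kafka]
    }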
>>
>> On Wed, May 23, 2018 at 9:31 AM Ismaël Mejía <[email protected]> wrote:
>>
>>> I second Robert's idea of ‘intelligently’ running only the affected
>>> tests: there is probably no need to run Java for a Go fix (any issue can
>>> eventually be caught in postcommit), and the same goes for a dev who just
>>> fixed something in KafkaIO and has to wait for other IO tests to pass. I
>>> suppose that languages, IOs and extensions are ‘easy’ to isolate, so
>>> maybe we can start with those.
>>>
>>> Earlier signals are definitely great to have too, but I'm not sure how
>>> we can get those with the current infra.
>>>
>>> From a quick look, the biggest chunk of time is consumed by the examples
>>> module, probably because the examples run on Dataflow with real IOs. That
>>> module alone takes ~35 minutes, so maybe moving it to postcommit will gain
>>> us some quick improvement. On the other hand, we should probably not
>>> dismiss the consequences of moving more stuff to postcommit, given that
>>> our current postcommit is neither the most stable nor the quickest: the
>>> Dataflow suite alone takes 1h30!
>>>
>>>
>>> On Tue, May 22, 2018 at 12:01 AM Mikhail Gryzykhin <[email protected]>
>>> wrote:
>>>
>>> > What we can do here is estimate how much effort we want to put in and
>>> > set a remote target, such as:
>>> > Third quarter 2018 -- 1hr SLO
>>> > Fourth quarter 2018 -- 30min SLO
>>> > etc.
>>>
>>> > Combined with a policy for newly added tests, this can give us a goal
>>> > to aim for.
>>>
>>> > --Mikhail
>>>
>>> > Have feedback?
>>>
>>>
>>> > On Mon, May 21, 2018 at 2:06 PM Scott Wegner <[email protected]>
>>> > wrote:
>>>
>>> >> Thanks for the proposal, I left comments in the doc. Overall I think
>>> >> it's a great idea.
>>>
>>> >> I've seen other projects with much faster pre-commits, and it requires
>>> >> strict guidelines on unit test design and keeping tests isolated
>>> >> in-memory as much as possible. That's not currently the case in Java;
>>> >> we have pre-commits which submit pipelines to the Dataflow service.
>>>
>>> >> I don't know if it's feasible to get Java down to 15-20 mins in the
>>> >> short term, but a good starting point would be to document the
>>> >> requirements for a test to run as pre-commit, and start enforcing them
>>> >> for new tests.
>>>
>>>
>>> >> On Fri, May 18, 2018 at 3:25 PM Henning Rohde <[email protected]>
>>> >> wrote:
>>>
>>> >>> Good proposal. I think it should be considered in tandem with the "No
>>> >>> commit on red post-commit" proposal and could be far more ambitious
>>> >>> than 2 hours. For example, something in the <15-20 mins range, say,
>>> >>> would be much less of an inconvenience to the development effort. Go
>>> >>> takes ~3 mins, which means that it is practical to wait until a PR is
>>> >>> green before asking anyone to look at it. If I need to wait for a Java
>>> >>> or Python pre-commit, I task switch and come back later. If the
>>> >>> post-commits are enforced to be green, we could possibly gain a much
>>> >>> more productive flow at the cost of the occasional post-commit break,
>>> >>> compared to now. Maybe IOs can be less extensively tested pre-commit,
>>> >>> for example, or only if actually changed?
>>>
>>> >>> I also like Robert's suggestion of splitting up pre-commits into
>>> >>> something more fine-grained to get a clear partial signal quicker. If
>>> >>> we have an adequate number of Jenkins slots, it might also speed
>>> >>> things up overall.
>>>
>>> >>> Thanks,
>>> >>>    Henning
>>>
>>> >>> On Fri, May 18, 2018 at 12:30 PM Scott Wegner <[email protected]>
>>> >>> wrote:
>>>
>>> >>>> re: intelligently skipping tests for code that doesn't change (e.g.
>>> >>>> Java tests on a Python PR): this should be possible. We already have
>>> >>>> build-caching enabled in Gradle, but I believe it is local to the git
>>> >>>> workspace and doesn't persist between Jenkins runs.
>>>
>>> >>>> With a quick search, I see there is a Jenkins Job Cacher Plugin [1]
>>> >>>> that hooks into the Gradle build cache and does exactly what we need.
>>> >>>> Does anybody know whether we could get this enabled on our Jenkins?
>>>
>>> >>>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>>
>>> >>>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw
>>> >>>> <[email protected]> wrote:
>>>
>>> >>>>> [somehow my email got garbled...]
>>>
>>> >>>>> Now that we're using Gradle, perhaps we could be more intelligent
>>> >>>>> about only running the affected tests? E.g. when you touch Python
>>> >>>>> (or Go) you shouldn't need to run the Java precommit at all, which
>>> >>>>> would reduce the latency for those PRs and also the time spent in
>>> >>>>> the queue. Presumably this could even be applied per-module for the
>>> >>>>> Java tests. (Maybe a large, shared build cache could help here as
>>> >>>>> well...)
>>>
>>> >>>>> I also wouldn't be opposed to a quicker immediate signal, plus more
>>> >>>>> extensive tests before actually merging. It's also nice not to have
>>> >>>>> to wait an hour to see that you have a lint error; quick stuff like
>>> >>>>> that could be signaled early, before a contributor loses context.
>>>
>>> >>>>> - Robert
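
Robert's quick-signal idea could start as nothing more than an aggregate
Gradle task that Jenkins reports separately. A hedged sketch in Gradle's
Kotlin DSL; the task paths here are hypothetical, not our actual task
graph.

    // Fast first signal: run only cheap static checks, report in minutes.
    tasks.register("quickChecks") {
        group = "verification"
        description = "Lint/style checks reported before the full precommit."
        dependsOn(":sdks:python:lint", ":sdks:java:core:checkstyleMain")
    }

A separate Jenkins job running `./gradlew quickChecks` could then post its
result as its own GitHub check, well before the full suites finish.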
>>>
>>>
>>>
>>> >>>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles <[email protected]>
>>> >>>>> wrote:
>>>
>>> >>>>>> I like the idea. I think it is a good time for the project to
>>> >>>>>> start tracking this and keeping it usable.
>>>
>>> >>>>>> Certainly 2 hours is more than enough, is that not so? The Java
>>> >>>>>> precommit seems to take <=40 minutes, while Python takes ~20 and Go
>>> >>>>>> is so fast it doesn't matter. Do we have enough stragglers that we
>>> >>>>>> don't make it in the 95th percentile? Is the time spent in the
>>> >>>>>> Jenkins queue?
>>>
>>> >>>>>> For our current coverage, I'd be willing to go for:
>>>
>>> >>>>>>    - 1 hr hard cap (someone better at stats could choose %ile)
>>> >>>>>>    - roll back, or remove the test from precommit, if the fix
>>> >>>>>>      looks like more than 1 week of work (roll back if it is a perf
>>> >>>>>>      degradation; remove the test from precommit if it is
>>> >>>>>>      additional coverage that just doesn't fit in the time)
>>>
>>> >>>>>> There's a longer-term issue: doing a full build each time is
>>> >>>>>> expected to scale up linearly with the size of our repo (it is the
>>> >>>>>> monorepo problem, but for a minirepo), so no cap is feasible until
>>> >>>>>> we have effective cross-build caching. My long-term goal would be
>>> >>>>>> <30 minutes. At the latency of opening a pull request and then
>>> >>>>>> checking your email that's not burdensome, but an hour is.
>>>
>>> >>>>>> Kenn
>>>
>>> >>>>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri <[email protected]>
>>> >>>>>> wrote:
>>>
>>> >>>>>>> Hi,
>>> >>>>>>> I have a proposal to improve contributor experience by keeping
>>> >>>>>>> precommit times low.
>>>
>>> >>>>>>> I'm looking to get community consensus and approval about:
>>> >>>>>>> 1. How long precommits should take. 2 hours @ the 95th percentile
>>> >>>>>>> over the past 4 weeks is the current proposal.
>>> >>>>>>> 2. The process for dealing with slowness. Do we fix, roll back, or
>>> >>>>>>> remove a test from precommit? Rolling back if a fix is estimated
>>> >>>>>>> to take longer than 2 weeks is the current proposal.
>>>
>>>
>>>
>>> >>>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>
>>
