Re: Proposal: keeping precommit times fast

Udi Meiri Thu, 07 Jun 2018 09:29:18 -0700

Would I need a vote on installing this plugin, or can I just open a ticket
to infra?


On Wed, Jun 6, 2018, 16:18 Robert Bradshaw <[email protected]> wrote:

> Even if it's not perfect, seems like it'd surely be a net win (and
> probably a large one). Also, the build cache should look back at more than
> just the single previous build, so if any previous jobs (up to the cache
> size limit) built/tested artifacts unchanged by the current PR, the results
> would live in the cache.
>
> I would look at (a) and (b) only if this isn't already good enough.
>
> On Wed, Jun 6, 2018 at 3:50 PM Udi Meiri <[email protected]> wrote:
>
>> To follow up on the Jenkins Job Cacher Plugin:
>>
>> The idea is to use a Jenkins plugin to save and reuse the Gradle cache
>> across successive precommit jobs.
>> The problem with this approach is that successive precommit runs on a
>> Jenkins server are unrelated to each other.
>> Say you have 2 PRs, A and B, and the precommit job for B reuses the cache
>> left by the job for A.
>> The diff between the two will cause (at least) every test affected by
>> either A or B to be rerun.
>> If A modifies Python code, then the job for B must rerun ALL Python tests
>> (since Gradle doesn't do dependency tracking for Python).
>>
>> Proposal:
>> a. The cache plugin is still useful for successive Java precommit jobs,
>> but not for Python. (Go, I have no idea)
>> We could use it exclusively for Java precommits.
>> b. To avoid running precommit jobs for code not touched by a PR, look at
>> the paths of files changed.
>> For example, a PR touching only files under sdks/python/... need only run
>> Python precommit tests.
>>
>> On Tue, Jun 5, 2018 at 7:24 PM Udi Meiri <[email protected]> wrote:
>>
>>> I've been having a separate discussion on the proposal doc, which is
>>> ready for another round of reviews.
>>> Change summary:
>>> - Changed fast requirement to be < 30 minutes and simplified the check
>>> as an aggregate for each precommit job type.
>>> - Updated slowness notification methods to include automated methods: as
>>> a precommit check result type on GitHub, as a bug.
>>> - Merged in the metrics design doc.
>>> - Added detailed design section.
>>> - Added list of deliverables.
>>>
>>> What I would like is consensus regarding:
>>> - How fast we want precommit runs to be. I propose 30m.
>>> - Deadline for fixing a slow test before it is temporarily removed from
>>> precommit. I propose 24 hours.
>>>
>>>
>>> Replying to the thread:
>>>
>>> 1. I like the idea of using the Jenkins Job Cacher Plugin to skip
>>> unaffected tests (BEAM-4400).
>>>
>>> 2. Java Precommit tests include integration tests (example
>>> <https://builds.apache.org/view/A-D/view/Beam/job/beam_PreCommit_Java_GradleBuild/lastCompletedBuild/testReport/org.apache.beam.examples/>
>>> ).
>>> We could split these out to get much faster results, i.e., a separate
>>> precommit just for basic integration tests (which will still need to run in
>>> <30m).
>>> Perhaps lint checks for Python could be split out as well.
>>>
>>> I'll add these suggestions to the doc tomorrow.
>>>
>>> On Thu, May 24, 2018 at 9:25 AM Scott Wegner <[email protected]> wrote:
>>>
>>>> So, it sounds like there's agreement that we should improve precommit
>>>> times by only running necessary tests, and configuring Jenkins Job
>>>> Caching + Gradle build cache is a path to get there. I've filed BEAM-4400
>>>> [1] to follow-up on this.
>>>>
>>>> Getting back to Udi's original proposal [2]: I see value in defining a
>>>> metric and target for overall pre-commit timing. The proposal for an
>>>> initial "2 hour" target is helpful as a guardrail: we're already hitting
>>>> it, but if we drift to a point where we're not, that should trigger some
>>>> action to be taken to get back to a healthy state.
>>>>
>>>> I wouldn't mind separately setting a more aspirational goal of getting
>>>> the pre-commits even faster (i.e. 15-30 mins), but I suspect that would
>>>> require a concerted effort to evaluate and improve existing tests across
>>>> the codebase. One idea would be to ensure the metric reporting can
>>>> show the trend, and which tests are responsible for the most walltime, so
>>>> that we know where to invest any efforts to improve tests.
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/BEAM-4400
>>>> [2]
>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>
>>>>
>>>> On Wed, May 23, 2018 at 11:46 AM Kenneth Knowles <[email protected]>
>>>> wrote:
>>>>
>>>>> With regard to the Job Cacher Plugin: I think it is an infra ticket to
>>>>> install? And I guess we need it longer term when we move to containerized
>>>>> builds anyhow? One thing I've experienced with the Travis-CI cache is that
>>>>> the time spent uploading & downloading the remote cache - in that case of
>>>>> all the pip installed dependencies - negated the benefits. Probably for
>>>>> Beam it will have a greater benefit if we can skip most of the build.
>>>>>
>>>>> Regarding integration tests in precommit: I think it is OK to run
>>>>> maybe one Dataflow job in precommit, but it should be in parallel with the
>>>>> unit tests and just a smoke test that takes 5 minutes, not a suite that
>>>>> takes 35 minutes. So IMO that is low-hanging fruit. If this would make
>>>>> postcommit unstable, then it also means precommit is unstable. Both are
>>>>> troublesome.
>>>>>
>>>>> More short term, some possible hacks:
>>>>>
>>>>>  - Point gradle to cache outside the git workspace. We already did
>>>>> this for .m2 and it helped a lot.
>>>>>  - Intersect touched files with projects. Our nonstandard project
>>>>> names might be a pain here. Not sure if fixing that is on the roadmap.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Wed, May 23, 2018 at 9:31 AM Ismaël Mejía <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I second Robert's idea of ‘intelligently’ running only the affected
>>>>>> tests. There is probably no need to run Java for a Go fix (and if
>>>>>> there is any issue, it can eventually be caught in postcommit); the
>>>>>> same goes for a dev who just fixed something in KafkaIO and has to
>>>>>> wait for other IO tests to pass. I suppose that languages, IOs and
>>>>>> extensions are ‘easy’ to isolate, so maybe we can start with those.
>>>>>>
>>>>>> Earlier signals are also definitely great to have, but I'm not sure
>>>>>> how we can have those with the current infra.
>>>>>>
>>>>>> From a quick look, the biggest time is consumed by the examples
>>>>>> module, probably because they run in Dataflow with real IOs, no? That
>>>>>> module alone takes ~35 minutes, so maybe moving it to postcommit will
>>>>>> gain us some quick improvement. On the other hand, we should probably
>>>>>> not dismiss the consequences of moving more stuff to postcommit, given
>>>>>> that our current postcommit is not the most stable or the quickest:
>>>>>> the Dataflow suite alone takes 1h30!
>>>>>>
>>>>>>
>>>>>> On Tue, May 22, 2018 at 12:01 AM Mikhail Gryzykhin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> > What we can do here is estimate how much effort we want to put in
>>>>>> > and set a remote target.
>>>>>> > Such as:
>>>>>> > Third quarter 2018 -- 1hr SLO
>>>>>> > Fourth quarter 2018 -- 30min SLO,
>>>>>> > etc.
>>>>>>
>>>>>> > Combined with a policy for newly added tests, this can give us some
>>>>>> > goal to aim for.
>>>>>>
>>>>>> > --Mikhail
>>>>>>
>>>>>> > Have feedback?
>>>>>>
>>>>>>
>>>>>> > On Mon, May 21, 2018 at 2:06 PM Scott Wegner <[email protected]>
>>>>>> > wrote:
>>>>>>
>>>>>> >> Thanks for the proposal, I left comments in the doc. Overall I
>>>>>> >> think it's a great idea.
>>>>>>
>>>>>> >> I've seen other projects with much faster pre-commits, and it
>>>>>> >> requires strict guidelines on unit test design and keeping tests
>>>>>> >> isolated in-memory as much as possible. That's not currently the
>>>>>> >> case in Java; we have pre-commits which submit pipelines to
>>>>>> >> Dataflow service.
>>>>>>
>>>>>> >> I don't know if it's feasible to get Java down to 15-20 mins in the
>>>>>> >> short term, but a good starting point would be to document the
>>>>>> >> requirements for a test to run as pre-commit, and start enforcing
>>>>>> >> it for new tests.
>>>>>>
>>>>>>
>>>>>> >> On Fri, May 18, 2018 at 3:25 PM Henning Rohde <[email protected]>
>>>>>> >> wrote:
>>>>>>
>>>>>> >>> Good proposal. I think it should be considered in tandem with the
>>>>>> >>> "No commit on red post-commit" proposal and could be far more
>>>>>> >>> ambitious than 2 hours. For example, something in the <15-20 mins
>>>>>> >>> range, say, would be much less of an inconvenience to the
>>>>>> >>> development effort. Go takes ~3 mins, which means that it is
>>>>>> >>> practical to wait until a PR is green before asking anyone to look
>>>>>> >>> at it. If I need to wait for a Java or Python pre-commit, I task
>>>>>> >>> switch and come back later. If the post-commits are enforced to be
>>>>>> >>> green, we could possibly gain a much more productive flow at the
>>>>>> >>> cost of the occasional post-commit break, compared to now. Maybe
>>>>>> >>> IOs can be less extensively tested pre-commit, for example, or
>>>>>> >>> only if actually changed?
>>>>>>
>>>>>> >>> I also like Robert's suggestion of splitting up pre-commits into
>>>>>> >>> something more fine-grained to get a clear partial signal quicker.
>>>>>> >>> If we have an adequate number of Jenkins slots, it might also
>>>>>> >>> speed things up overall.
>>>>>>
>>>>>> >>> Thanks,
>>>>>> >>>    Henning
>>>>>>
>>>>>> >>> On Fri, May 18, 2018 at 12:30 PM Scott Wegner <[email protected]>
>>>>>> >>> wrote:
>>>>>>
>>>>>> >>>> re: intelligently skipping tests for code that doesn't change
>>>>>> >>>> (i.e. Java tests on Python PR): this should be possible. We
>>>>>> >>>> already have build-caching enabled in Gradle, but I believe it is
>>>>>> >>>> local to the git workspace and doesn't persist between Jenkins
>>>>>> >>>> runs.
>>>>>>
>>>>>> >>>> With a quick search, I see there is a Jenkins Job Cacher Plugin
>>>>>> >>>> [1] that hooks into Gradle build cache and does exactly what we
>>>>>> >>>> need. Does anybody know whether we could get this enabled on our
>>>>>> >>>> Jenkins?
>>>>>>
>>>>>> >>>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>>>>>
>>>>>> >>>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw
>>>>>> >>>> <[email protected]> wrote:
>>>>>>
>>>>>> >>>>> [somehow  my email got garbled...]
>>>>>>
>>>>>> >>>>> Now that we're using gradle, perhaps we could be more intelligent
>>>>>> >>>>> about only running the affected tests? E.g. when you touch Python
>>>>>> >>>>> (or Go) you shouldn't need to run the Java precommit at all, which
>>>>>> >>>>> would reduce the latency for those PRs and also the time spent in
>>>>>> >>>>> queue. Presumably this could even be applied per-module for the
>>>>>> >>>>> Java tests. (Maybe a large, shared build cache could help here as
>>>>>> >>>>> well...)
>>>>>>
>>>>>> >>>>> I also wouldn't be opposed to a quicker immediate signal, plus
>>>>>> >>>>> more extensive tests before actually merging. It's also nice to
>>>>>> >>>>> not have to wait an hour to see that you have a lint error; quick
>>>>>> >>>>> stuff like that could be signaled quickly before a contributor
>>>>>> >>>>> loses context.
>>>>>>
>>>>>> >>>>> - Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles <[email protected]>
>>>>>> >>>>> wrote:
>>>>>>
>>>>>> >>>>>> I like the idea. I think it is a good time for the project to
>>>>>> >>>>>> start tracking this and keeping it usable.
>>>>>>
>>>>>> >>>>>> Certainly 2 hours is more than enough, is that not so? The Java
>>>>>> >>>>>> precommit seems to take <=40 minutes while Python takes ~20 and
>>>>>> >>>>>> Go is so fast it doesn't matter. Do we have enough stragglers
>>>>>> >>>>>> that we don't make it in the 95th percentile? Is the time spent
>>>>>> >>>>>> in the Jenkins queue?
>>>>>>
>>>>>> >>>>>> For our current coverage, I'd be willing to go for:
>>>>>>
>>>>>> >>>>>>    - 1 hr hard cap (someone better at stats could choose %ile)
>>>>>> >>>>>>    - roll back or remove test from precommit if fix looks like
>>>>>> >>>>>>      more than 1 week (roll back if it is perf degradation, remove
>>>>>> >>>>>>      test from precommit if it is additional coverage that just
>>>>>> >>>>>>      doesn't fit in the time)
>>>>>>
>>>>>> >>>>>> There's a longer-term issue that doing a full build each time
>>>>>> >>>>>> is expected to linearly scale up with the size of our repo (it
>>>>>> >>>>>> is the monorepo problem but for a minirepo) so there is no cap
>>>>>> >>>>>> that is feasible until we have effective cross-build caching.
>>>>>> >>>>>> And my long-term goal would be <30 minutes. At the latency of
>>>>>> >>>>>> opening a pull request and then checking your email that's not
>>>>>> >>>>>> burdensome, but an hour is.
>>>>>>
>>>>>> >>>>>> Kenn
>>>>>>
>>>>>> >>>>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri <[email protected]>
>>>>>> >>>>>> wrote:
>>>>>>
>>>>>> >>>>>>> Hi,
>>>>>> >>>>>>> I have a proposal to improve contributor experience by keeping
>>>>>> >>>>>>> precommit times low.
>>>>>>
>>>>>> >>>>>>> I'm looking to get community consensus and approval about:
>>>>>> >>>>>>> 1. How long should precommits take. 2 hours @95th percentile
>>>>>> >>>>>>> over the past 4 weeks is the current proposal.
>>>>>> >>>>>>> 2. The process for dealing with slowness. Do we: fix, roll
>>>>>> >>>>>>> back, remove a test from precommit?
>>>>>> >>>>>>> Rolling back if a fix is estimated to take longer than 2
>>>>>> >>>>>>> weeks is the current proposal.
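To make the "2 hours @95th percentile over the past 4 weeks" check concrete, here is a rough sketch. It assumes run durations are available as (finish time, minutes) pairs and uses the nearest-rank percentile definition; both are assumptions rather than something specified in the proposal, and the function names are hypothetical.

```python
import math
from datetime import datetime, timedelta
from typing import Iterable, Sequence, Tuple


def percentile_nearest_rank(values: Iterable[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of data at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values")
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def precommit_is_fast(
    runs: Sequence[Tuple[datetime, float]],
    now: datetime,
    target_minutes: float = 120.0,  # the proposed 2 hour target
    pct: float = 95.0,
    window: timedelta = timedelta(weeks=4),
) -> bool:
    """Check whether the pct-th percentile of recent run durations meets the target."""
    recent = [minutes for finished, minutes in runs
              if now - window <= finished <= now]
    if not recent:
        # Assumption: with no runs in the window there is nothing to flag.
        return True
    return percentile_nearest_rank(recent, pct) <= target_minutes


if __name__ == "__main__":
    now = datetime(2018, 6, 7)
    sample = [(now - timedelta(days=d), 40.0) for d in range(1, 20)]
    print(precommit_is_fast(sample, now))
```

A percentile target tolerates a few stragglers, which a hard cap on every run would not.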
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>>>>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>>>>
>>>>>
