I don't see a difference at first glance, and no difference is expected. We never used concurrent jobs originally, because the job took ~1 hour and was triggered once every 6 hours. At some point I added triggering the job whenever a new commit is available, and that started launching a job in parallel for each commit, which is unnecessary overhead for post-commits. With concurrent runs removed, a single post-commit job is triggered for all the commits that accumulated while the previous job was running.
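For reference, this is roughly what the two knobs look like in Jenkins Job DSL: disabling concurrent builds outright, or capping them with throttleConcurrentBuilds (which Scott mentions further down). The job name, schedule and limit below are placeholders, not our actual job definitions.

    // Hypothetical Job DSL sketch, not the real Beam job definition.
    job('beam_PostCommit_Example') {
      // Option A: never run two builds of this job at once; commits that
      // arrive while a build is running are picked up by the next build.
      concurrentBuild(false)

      // Option B (alternative): keep concurrency, but cap it.
      // throttleConcurrentBuilds {
      //   maxTotal(3)  // placeholder limit
      // }

      triggers {
        scm('H/30 * * * *')  // poll for new commits; placeholder schedule
      }
    }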
I believe you are talking about triggering test cases concurrently within a single Jenkins job. That was not changed.

--Mikhail

Have feedback <http://go/migryz-feedback>?

On Mon, Aug 6, 2018 at 2:44 PM Lukasz Cwik <[email protected]> wrote:

> How much slower did the post-commits become after removing concurrency?
>
> On Thu, Aug 2, 2018 at 2:32 PM Mikhail Gryzykhin <[email protected]> wrote:
>
>> I've disabled concurrency for the auto-triggered post-commit job. That
>> should reduce job scheduling considerably.
>>
>> I believe this change should resolve the quota issue we have seen this
>> time. I'll monitor whether the problem reappears.
>>
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>> On Wed, Aug 1, 2018 at 9:40 AM Pablo Estrada <[email protected]> wrote:
>>
>>> It feels to me like a peak of 60 jobs per minute is pretty high. If I
>>> understand correctly, we run up to 20 Dataflow jobs in parallel per test
>>> suite? Or what's the number here?
>>>
>>> It is also true that most of our tests are simple NeedsRunner tests that
>>> test a couple of elements, so the whole pipeline overhead is in startup.
>>> This may be improved by lumping tests together (though might we lose
>>> debuggability?). Our average number of jobs is, I hope, muuuch smaller
>>> than 60 per minute...
>>>
>>> With all these considerations, I would lean more towards having a retry
>>> policy as the immediate solution.
>>> -P.
>>>
>>> On Wed, Aug 1, 2018 at 9:07 AM Andrew Pilloud <[email protected]> wrote:
>>>
>>>> I like 1 and 2. How do credentials get into Jenkins? Could we create a
>>>> user per Jenkins host?
>>>>
>>>> On Tue, Jul 31, 2018 at 4:33 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> There was also a proposal to lump multiple tests into a single Dataflow
>>>>> job instead of spinning up a separate Dataflow job for each test.
>>>>>
>>>>> On Tue, Jul 31, 2018 at 4:26 PM Mikhail Gryzykhin <[email protected]> wrote:
>>>>>
>>>>>> I synced with Rafael. Below is a summary of the discussion.
>>>>>>
>>>>>> The quota in question is CreateRequestsPerMinutePerUser, and it is 60
>>>>>> requests per minute per user by default.
>>>>>>
>>>>>> I've created Jira BEAM-5053
>>>>>> (https://issues.apache.org/jira/browse/BEAM-5053) for this.
>>>>>>
>>>>>> I see the following options we can utilize:
>>>>>> 1. Add retry logic. This still limits us to 1 Dataflow job start per
>>>>>> second for the whole of Jenkins, and in the long run one test job can be
>>>>>> blocked if other jobs take all the slots.
>>>>>> 2. Utilize different users to spin up Dataflow jobs.
>>>>>> 3. Find a way to raise the quota limit on Dataflow. By default this
>>>>>> field is limited to 60 requests per minute.
>>>>>> 4. Longer-term general suggestion: limit the number of Dataflow jobs we
>>>>>> spin up and move tests toward unit or component tests.
>>>>>>
>>>>>> Please add any insights or ideas you have on this.
>>>>>>
>>>>>> Regards,
>>>>>> --Mikhail
>>>>>>
>>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>>
>>>>>> On Tue, Jul 31, 2018 at 3:55 PM Mikhail Gryzykhin <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Everyone,
>>>>>>>
>>>>>>> It seems that we hit the quota issue again:
>>>>>>> https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull
>>>>>>>
>>>>>>> Can someone share information on how this was triaged last time, or
>>>>>>> guide me on possible follow-up actions?
>>>>>>>
>>>>>>> Regards,
>>>>>>> --Mikhail
>>>>>>>
>>>>>>> Have feedback <http://go/migryz-feedback>?
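(Aside on option 1 above: roughly what client-side retry with exponential backoff around Dataflow job submission could look like. This is not existing Beam test infrastructure; RateLimitException, submitJob and the timing constants are placeholders, sized only to stay under the 60-requests-per-minute CreateRequestsPerMinutePerUser limit.)

    // Hypothetical sketch of retry-with-backoff around Dataflow job creation.
    // RateLimitException stands in for a quota error (e.g. HTTP 429) returned
    // by the Dataflow API; submitJob is whatever call the test harness uses.
    class RateLimitException extends RuntimeException {
      RateLimitException(String msg) { super(msg) }
    }

    def submitWithBackoff(Closure submitJob,
                          int maxAttempts = 6,
                          long initialBackoffMillis = 2000) {
      long backoff = initialBackoffMillis
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return submitJob()
        } catch (RateLimitException e) {
          if (attempt == maxAttempts) {
            throw e
          }
          // Exponential backoff with jitter so parallel Jenkins executors
          // don't all retry in the same second.
          long jitter = (long) (Math.random() * 1000)
          Thread.sleep(backoff + jitter)
          backoff *= 2
        }
      }
    }

Usage would be along the lines of submitWithBackoff { /* create the Dataflow job */ }. Note that this only smooths bursts; it doesn't raise the ceiling, so one of options 2-4 is still needed longer term.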
>>>>>>>
>>>>>>> On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <[email protected]> wrote:
>>>>>>>
>>>>>>>> Summary for all folks following this story -- and many thanks for
>>>>>>>> explaining configs to me and pointing me to files and such.
>>>>>>>>
>>>>>>>> - Scott made changes to the config and we can now run 3
>>>>>>>> ValidatesRunner.Dataflow suites in parallel (each run is about 2 hours)
>>>>>>>> - With the latest quota changes, we peaked at ~70% capacity in
>>>>>>>> concurrent Dataflow jobs when running those
>>>>>>>> - I've been keeping an eye on quota peaks for all resources today and
>>>>>>>> have not seen any worrisome limits overall.
>>>>>>>> - Also note there are improvements planned for the
>>>>>>>> ValidatesRunner.Dataflow test so various items get batched and the test
>>>>>>>> itself runs faster -- I believe it's on Alan's radar
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> r
>>>>>>>>
>>>>>>>> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Done!
>>>>>>>>>
>>>>>>>>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1].
>>>>>>>>>> Can you take a look? I've filed BEAM-4722:
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860.
>>>>>>>>>>> Quotas should not be a problem; if they are, please file a JIRA
>>>>>>>>>>> under gcp-quota.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> r
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> One thing that is nice when you do this is to be able to share
>>>>>>>>>>>> your results. Though if all you are sharing is "they passed", then
>>>>>>>>>>>> I guess we don't have to insist on evidence.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> A few thoughts:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * The Jenkins job getting backed up is
>>>>>>>>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>>>>>>>>> Mikhail refactored the Jenkins jobs, this only runs when explicitly
>>>>>>>>>>>>> requested via "Run Dataflow ValidatesRunner", and it only has 8
>>>>>>>>>>>>> total runs. So this job is idle more often than backlogged.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>>>>>>>>> Dataflow jobs get launched from various Jenkins jobs that have
>>>>>>>>>>>>> different parallelism configurations. If we have budget, we could
>>>>>>>>>>>>> enable concurrent execution of this job and increase our quota
>>>>>>>>>>>>> enough to give some breathing room. If we do this, I recommend
>>>>>>>>>>>>> limiting the max concurrency via throttleConcurrentBuilds [2] to
>>>>>>>>>>>>> some reasonable limit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> * This test suite is meant to be an exhaustive post-commit
>>>>>>>>>>>>> validation of the Dataflow runner, and it tests a lot of different
>>>>>>>>>>>>> aspects of a runner.
>>>>>>>>>>>>> It would be more efficient to run locally only the tests affected
>>>>>>>>>>>>> by your change. Note that this requires access to a GCP project
>>>>>>>>>>>>> with billing, but most Dataflow developers probably have that
>>>>>>>>>>>>> already. The command for this is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
>>>>>>>>>>>>>     -PdataflowProject=myGcpProject \
>>>>>>>>>>>>>     -PdataflowTempRoot=gs://myGcsTempRoot \
>>>>>>>>>>>>>     --tests "org.apache.beam.MyTestClass"
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>>>>>>>>> [2] https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The ValidatesRunner test parallelism is controlled here and is
>>>>>>>>>>>>>> currently set to "unlimited":
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Each test fork is run on a different Gradle worker, so the number
>>>>>>>>>>>>>> of parallel test runs is limited to the maximum number of workers
>>>>>>>>>>>>>> configured, which is controlled here:
>>>>>>>>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>>>>>>>>> It is currently configured to 3 * the number of CPU cores.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>>>>>>>>> - Where are those settings?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests.
>>>>>>>>>>>>>>>> We currently allow only one of these to run at a time, to
>>>>>>>>>>>>>>>> control usage of Dataflow and GCE quota. Other types of tests do
>>>>>>>>>>>>>>>> not suffer from this issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to see if it's possible to increase the Dataflow
>>>>>>>>>>>>>>>> quota so we can run more of these in parallel. It took me 8
>>>>>>>>>>>>>>>> hours end to end to run these tests (about 6 hours for the run
>>>>>>>>>>>>>>>> to be scheduled). If there had been a failure, I would have had
>>>>>>>>>>>>>>>> to repeat the whole process; in the worst case, this could have
>>>>>>>>>>>>>>>> taken days. While this is not as pressing as some other issues
>>>>>>>>>>>>>>>> (most people don't need to run the Dataflow tests on every PR),
>>>>>>>>>>>>>>>> fixing it would make such changes much easier to manage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reuven
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +Reuven Lax <[email protected]> told me yesterday that he was
>>>>>>>>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6
>>>>>>>>>>>>>>>>> hours or so. I would like to help reduce these wait times by
>>>>>>>>>>>>>>>>> increasing parallelism. I need help understanding our baseline
>>>>>>>>>>>>>>>>> continuous usage. It seems the following is true:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - There always seem to be 16 Jenkins machines on (16 CPUs each)
>>>>>>>>>>>>>>>>> - There seem to be three GKE machines always on (1 CPU each)
>>>>>>>>>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and seem to run
>>>>>>>>>>>>>>>>> one at a time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs
>>>>>>>>>>>>>>>>> to execute, with 75% of CPU capacity.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> r
>>>>>>>>>>>>>>>>>
>>> --
>>> Got feedback? go/pabloem-feedback
>>> <https://goto.google.com/pabloem-feedback>
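(One more aside, tying together Lukasz's and Rafael's points above: the practical cap on parallel Dataflow jobs currently comes from the Gradle worker count set in the Jenkins job (3 * CPU cores), since the build script leaves fork parallelism unlimited. If we ever want the cap in the build script instead, it would look roughly like the sketch below; the value is a placeholder, not our actual configuration.)

    // Hypothetical build.gradle sketch, not the real Beam configuration.
    // Each forked test JVM launches its own Dataflow jobs, so capping forks
    // here bounds per-task Dataflow concurrency regardless of how many
    // Gradle workers the Jenkins job allows.
    tasks.withType(Test) {
      maxParallelForks = 4  // placeholder; pick a value that fits quota
    }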
