Hi Everyone,

It seems that we have hit a quota issue again:
https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull

Can someone share how this was triaged last time, or guide me on possible
follow-up actions?

Regards,
--Mikhail

Have feedback <http://go/migryz-feedback>?

On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <[email protected]> wrote:

> Summary for all folks following this story -- and many thanks for
> explaining configs to me and pointing me to files and such.
>
> - Scott made changes to the config and we can now run 3
>   ValidatesRunner.Dataflow runs in parallel (each run is about 2 hours)
> - With the latest quota changes, we peaked at ~70% capacity in concurrent
>   Dataflow jobs when running those
> - I've been keeping an eye on quota peaks for all resources today and have
>   not seen any worrisome limits overall.
> - Also note there are improvements planned to the ValidatesRunner.Dataflow
>   test so various items get batched and the test itself runs faster -- I
>   believe it's on Alan's radar
>
> Cheers,
> r
>
> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <[email protected]>
> wrote:
>
>> Done!
>>
>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <[email protected]> wrote:
>>
>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>>> you take a look? I've filed [BEAM-4722]:
>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>
>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>
>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <[email protected]>
>>> wrote:
>>>
>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>>> should not be a problem; if they are, please file a JIRA under gcp-quota.
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <[email protected]> wrote:
>>>>
>>>>> One thing that is nice when you do this is to be able to share your
>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>> don't have to insist on evidence.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <[email protected]> wrote:
>>>>>
>>>>>> A few thoughts:
>>>>>>
>>>>>> * The Jenkins job getting backed up is
>>>>>> beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this
>>>>>> job is idle more often than backlogged.
>>>>>>
>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>> execution of this job and increase our quota enough to give some
>>>>>> breathing room. If we do this, I recommend limiting the max concurrency
>>>>>> via throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>
>>>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>>>> of the Dataflow runner, and tests a lot of different aspects of a runner.
>>>>>> It would be more efficient to run locally only the tests affected by your
>>>>>> change. Note that this requires having access to a GCP project with
>>>>>> billing, but most Dataflow developers probably have access to this
>>>>>> already.
>>>>>>
>>>>>> The command for this is:
>>>>>>
>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
>>>>>>     -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot \
>>>>>>     --tests "org.apache.beam.MyTestClass"
>>>>>>
>>>>>> [1]
>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>> [2]
>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <[email protected]> wrote:
>>>>>>
>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>> currently set to be "unlimited":
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>
>>>>>>> Each test fork is run on a different Gradle worker, so the number of
>>>>>>> parallel test runs is limited to the max number of workers configured,
>>>>>>> which is controlled here:
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>
>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>> - Where are those settings?
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>>>> currently allow only one of these to run at a time, to control usage
>>>>>>>>> of Dataflow and of GCE quota. Other types of tests do not suffer from
>>>>>>>>> this issue.
>>>>>>>>>
>>>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>>>> we can run more of these in parallel. It took me 8 hours end to end
>>>>>>>>> to run these tests (about 6 hours for the run to be scheduled). If
>>>>>>>>> there had been a failure, I would have had to repeat the whole
>>>>>>>>> process. In the worst case, this process could have taken me days.
>>>>>>>>> While this is not as pressing as some other issues (as most people
>>>>>>>>> don't need to run the Dataflow tests on every PR), fixing it would
>>>>>>>>> make such changes much easier to manage.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +Reuven Lax <[email protected]> told me yesterday that he was
>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours
>>>>>>>>>> or so. I would like to help reduce these wait times by increasing
>>>>>>>>>> parallelism. I need help understanding the continuous minimum of
>>>>>>>>>> what we use. It seems the following is true:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - There seem to always be 16 Jenkins machines on (16 CPUs each)
>>>>>>>>>> - There seem to be three GKE machines always on (1 CPU each)
>>>>>>>>>> - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>>   run one-at-a-time <-- I think we can safely parallelize this
>>>>>>>>>>   to 20.
>>>>>>>>>>
>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs to
>>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>>
>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> r

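As a rough sanity check on the parallelism figures quoted in this thread, the arithmetic can be written as a minimal Python sketch. It assumes the ValidatesRunner job runs on one of the 16-CPU Jenkins machines mentioned by Rafael, and that the worker limit is 3 * number of CPU cores as Lukasz describes. The function names are hypothetical and are not part of the Beam build. The figure for several concurrent builds is only a theoretical upper bound under those assumptions, not a measured value.

```python
def max_parallel_test_forks(cpu_cores: int, workers_per_core: int = 3) -> int:
    """Upper bound on parallel test forks, assuming one fork per Gradle worker
    and a worker limit of workers_per_core * cpu_cores."""
    if cpu_cores < 1 or workers_per_core < 1:
        raise ValueError("cpu_cores and workers_per_core must be positive")
    return cpu_cores * workers_per_core


def max_concurrent_dataflow_jobs(concurrent_builds: int, cpu_cores: int) -> int:
    """Theoretical ceiling if every concurrent build saturates its workers.
    Assumption: each build runs on its own machine with cpu_cores cores."""
    if concurrent_builds < 1:
        raise ValueError("concurrent_builds must be positive")
    return concurrent_builds * max_parallel_test_forks(cpu_cores)


if __name__ == "__main__":
    # One build on a 16-CPU machine: matches the "up to 48" figure in the thread.
    print(max_parallel_test_forks(16))
    # Three builds in parallel (the limit after the config change): upper bound only.
    print(max_concurrent_dataflow_jobs(3, 16))
```

Comparing such a ceiling against the project's concurrent-job quota gives a first estimate of how much headroom a given throttle setting leaves. Actual usage will be lower whenever the test forks do not all launch Dataflow jobs at the same time.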