Hi Everyone,

It seems that we hit a quota issue again:
https://builds.apache.org/job/beam_PostCommit_Go_GradleBuild/553/consoleFull

Can someone share information on how this was triaged last time, or guide me
on possible follow-up actions?

Regards,
--Mikhail

Have feedback <http://go/migryz-feedback>?


On Tue, Jul 3, 2018 at 9:12 PM Rafael Fernandez <[email protected]> wrote:

> Summary for all folks following this story -- and many thanks for
> explaining configs to me and pointing me to files and such.
>
> - Scott made changes to the config and we can now run 3
> ValidatesRunner.Dataflow jobs in parallel (each run is about 2 hours)
> - With the latest quota changes, we peaked at ~70% capacity in concurrent
> Dataflow jobs when running those
> - I've been keeping an eye on quota peaks for all resources today and have
> not seen any worrisome limits overall.
> - Also note there are improvements planned to the ValidatesRunner.Dataflow
> test so various items get batched and the test itself runs faster -- I
> believe it's on Alan's radar
>
> Cheers,
> r
>
> On Mon, Jul 2, 2018 at 4:23 PM Rafael Fernandez <[email protected]>
> wrote:
>
>> Done!
>>
>> On Mon, Jul 2, 2018 at 4:10 PM Scott Wegner <[email protected]> wrote:
>>
>>> Hey Rafael, looks like we need more 'INSTANCE_TEMPLATES' quota [1]. Can
>>> you take a look? I've filed [BEAM-4722]:
>>> https://issues.apache.org/jira/browse/BEAM-4722
>>>
>>> [1] https://github.com/apache/beam/pull/5861#issuecomment-401963630
>>>
>>> On Mon, Jul 2, 2018 at 11:33 AM Rafael Fernandez <[email protected]>
>>> wrote:
>>>
>>>> OK, Scott just sent https://github.com/apache/beam/pull/5860 . Quotas
>>>> should not be a problem; if they are, please file a JIRA under gcp-quota.
>>>>
>>>> Cheers,
>>>> r
>>>>
>>>> On Mon, Jul 2, 2018 at 10:06 AM Kenneth Knowles <[email protected]> wrote:
>>>>
>>>>> One thing that is nice when you do this is to be able to share your
>>>>> results. Though if all you are sharing is "they passed" then I guess we
>>>>> don't have to insist on evidence.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jul 2, 2018 at 9:25 AM Scott Wegner <[email protected]> wrote:
>>>>>
>>>>>> A few thoughts:
>>>>>>
>>>>>> * The Jenkins job getting backed up
>>>>>> is beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR [1]. Since
>>>>>> Mikhail refactored Jenkins jobs, this only runs when explicitly requested
>>>>>> via "Run Dataflow ValidatesRunner", and only has 8 total runs. So this job
>>>>>> is idle more often than backlogged.
>>>>>>
>>>>>> * It's difficult to reason about our exact quota needs because
>>>>>> Dataflow jobs get launched from various Jenkins jobs that have different
>>>>>> parallelism configurations. If we have budget, we could enable concurrent
>>>>>> execution of this job and increase our quota enough to give some breathing
>>>>>> room. If we do this, I recommend limiting the max concurrency via
>>>>>> throttleConcurrentBuilds [2] to some reasonable limit.
>>>>>>
>>>>>> * This test suite is meant to be an exhaustive post-commit validation
>>>>>> of the Dataflow runner, and tests a lot of different aspects of a runner.
>>>>>> It would be more efficient to run locally only the tests affected by your
>>>>>> change. Note that this requires having access to a GCP project with
>>>>>> billing, but most Dataflow developers probably have access to this already.
>>>>>> The command for this is:
>>>>>>
>>>>>> ./gradlew :beam-runners-google-cloud-dataflow-java:validatesRunner \
>>>>>>   -PdataflowProject=myGcpProject -PdataflowTempRoot=gs://myGcsTempRoot \
>>>>>>   --tests "org.apache.beam.MyTestClass"
>>>>>>
>>>>>> [1]
>>>>>> https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Gradle_PR/buildTimeTrend
>>>>>> [2]
>>>>>> https://jenkinsci.github.io/job-dsl-plugin/#method/javaposse.jobdsl.dsl.jobs.FreeStyleJob.throttleConcurrentBuilds
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 2, 2018 at 8:33 AM Lukasz Cwik <[email protected]> wrote:
>>>>>>
>>>>>>> The validates runner test parallelism is controlled here and is
>>>>>>> currently set to be "unlimited":
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/runners/google-cloud-dataflow-java/build.gradle#L115
>>>>>>>
>>>>>>> Each test fork is run on a different gradle worker, so the number of
>>>>>>> parallel test runs is limited to the max number of workers configured,
>>>>>>> which is controlled here:
>>>>>>>
>>>>>>> https://github.com/apache/beam/blob/fbfe6ceaea9d99cb1c8964087aafaa2bc2297a03/.test-infra/jenkins/job_PostCommit_Java_ValidatesRunner_Dataflow.groovy#L50
>>>>>>> It is currently configured to 3 * number of CPU cores.
>>>>>>>
>>>>>>> We are already running up to 48 Dataflow jobs in parallel.
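
For anyone following the numbers, here is a rough sketch of where the 48 could come from. It assumes the job runs on one of the 16-CPU Jenkins machines mentioned elsewhere in this thread; the function name and the constant are made up for illustration and are not part of the Beam build.

```python
def max_gradle_workers(cpu_cores: int, workers_per_core: int = 3) -> int:
    """Upper bound on parallel test forks, one fork per Gradle worker.

    Mirrors the "3 * number of CPU cores" setting described above.
    """
    if cpu_cores < 1 or workers_per_core < 1:
        raise ValueError("cpu_cores and workers_per_core must be positive")
    return cpu_cores * workers_per_core


# Assumption: a Jenkins machine with 16 CPUs, as reported in this thread.
JENKINS_CPUS = 16

if __name__ == "__main__":
    # Each test fork can launch one Dataflow job, so this is also the
    # upper bound on concurrent Dataflow jobs from a single run.
    print(max_gradle_workers(JENKINS_CPUS))
```

With test parallelism left "unlimited", the worker cap is the only thing bounding how many Dataflow jobs one run can launch at once.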
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Jun 30, 2018 at 9:51 AM Rafael Fernandez <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> - How many resources do ValidatesRunner tests use?
>>>>>>>> - Where are those settings?
>>>>>>>>
>>>>>>>> On Sat, Jun 30, 2018 at 9:50 AM Reuven Lax <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The specific issue only affects Dataflow ValidatesRunner tests. We
>>>>>>>>> currently allow only one of these to run at a time, to control usage of
>>>>>>>>> Dataflow and of GCE quota. Other types of tests do not suffer from this
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> I would like to see if it's possible to increase Dataflow quota so
>>>>>>>>> we can run more of these in parallel. It took me 8 hours end to end to
>>>>>>>>> run these tests (about 6 hours for the run to be scheduled). If there was
>>>>>>>>> a failure, I would have had to repeat the whole process. In the worst
>>>>>>>>> case, this process could have taken me days. While this is not as
>>>>>>>>> pressing as some other issues (as most people don't need to run the
>>>>>>>>> Dataflow tests on every PR), fixing it would make such changes much
>>>>>>>>> easier to manage.
>>>>>>>>>
>>>>>>>>> Reuven
>>>>>>>>>
>>>>>>>>> On Sat, Jun 30, 2018 at 9:32 AM Rafael Fernandez <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +Reuven Lax <[email protected]> told me yesterday that he was
>>>>>>>>>> waiting for some test to be scheduled and run, and it took 6 hours or so.
>>>>>>>>>> I would like to help reduce these wait times by increasing parallelism. I
>>>>>>>>>> need help understanding the continuous minimum of what we use. It seems
>>>>>>>>>> the following is true:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    - There always seem to be 16 Jenkins machines on (16 CPUs
>>>>>>>>>>    each)
>>>>>>>>>>    - There seem to be three GKE machines always on (1 CPU each)
>>>>>>>>>>    - Most (if not all) unit tests run on 1 machine, and seem to
>>>>>>>>>>    run one-at-a-time <-- I think we can safely parallelize this to 20.
>>>>>>>>>>
>>>>>>>>>> With current quotas, if we parallelize to 20 concurrent unit
>>>>>>>>>> tests, we still have room for 80 other concurrent Dataflow jobs to
>>>>>>>>>> execute, with 75% of CPU capacity.
>>>>>>>>>>
>>>>>>>>>> Thoughts? Additional data?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> r
>>>>>>>>>>
>>>>>>>>>
