The reasoning for the selective checks is here: https://github.com/apache/airflow/blob/master/PULL_REQUEST_WORKFLOW.rst (correct link)
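To make the idea concrete for anyone skimming the thread, below is a minimal, illustrative sketch of what "selective checks" boils down to: map the files a PR touches to the test groups that actually need to run, and fall back to the full matrix when core/build files change. The path patterns, group names and base branch are made up for the example - they are not Airflow's actual rules (those live in the document linked above):

```python
#!/usr/bin/env python3
"""Illustrative sketch of "selective checks": map the files changed in a PR to
the CI test groups that actually need to run. Patterns and group names are
made up for the example - they are not Airflow's real rules."""
from __future__ import annotations

import fnmatch
import subprocess
import sys

# Hypothetical mapping of path patterns to the test groups they require.
RULES: list[tuple[str, set[str]]] = [
    ("docs/*", {"docs-build"}),
    ("*.md", {"docs-build"}),
    ("chart/*", {"helm-tests"}),
    ("tests/*", {"unit-tests"}),
    ("src/*", {"unit-tests", "integration-tests"}),
]
# Changes to these files invalidate everything: run the full matrix.
FULL_MATRIX_TRIGGERS = ["setup.py", "Dockerfile", ".github/workflows/*"]
ALL_GROUPS = {"docs-build", "helm-tests", "unit-tests", "integration-tests"}


def changed_files(base_ref: str) -> list[str]:
    """Files changed on this branch relative to the base branch (needs a git checkout)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def select_groups(files: list[str]) -> set[str]:
    """Return the smallest set of test groups that covers the changed files."""
    groups: set[str] = set()
    for path in files:
        if any(fnmatch.fnmatch(path, pat) for pat in FULL_MATRIX_TRIGGERS):
            return set(ALL_GROUPS)  # core/build change: be safe, run everything
        for pattern, needed in RULES:
            if fnmatch.fnmatch(path, pattern):
                groups |= needed
    return groups or set(ALL_GROUPS)  # unknown files also default to the full matrix


if __name__ == "__main__":
    base = sys.argv[1] if len(sys.argv) > 1 else "origin/master"
    # A CI workflow would read this output and skip the jobs that are not listed.
    print(" ".join(sorted(select_groups(changed_files(base)))))
```

In practice something like this runs as an early job in the workflow and its output gates (or builds the matrix for) the later jobs; the PULL_REQUEST_WORKFLOW.rst document linked above describes the real decision rules in detail.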
On Tue, Feb 9, 2021 at 7:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> | The real hard problem is knowing when a change requires full regression and integration testing of all possible platforms.
>
> And here I absolutely agree too. Even more than that - I am a hard practitioner of it. This is what we have already implemented in Airflow (the whole reasoning for how and why it is implemented is here: https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#) (still, we have a few optimisations left). We call it "selective checks".
>
> And this is what I have already proposed that the Pulsar team implement too - just take a look at chapter 4) in their document: https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#
>
> In Airflow it helped a lot at some point. We got ~70% less load on the queue - mainly thanks to selective checks.
>
> Unfortunately - with the shared queue of the ASF, it only helped for some two weeks - precisely because we were the only "obese" project that had done it, so while being gentle to others, we did not get the love back. But - this is not a complaint; I think it is a natural thing when you have shared resources that you do not pay for. People will not optimise, as it is a huge investment for them (not only the cost of doing it, but also the increased complexity). It's been mentioned several times that Airflow's CI is over-engineered, but I think it is simply heavily optimized (which brings necessary complexity).
>
> Again - there is no way (and it would not even be fair, TBH) to enforce this optimisation of their processes for the benefit of others if they have no incentives. This is simply a consequence of the "free shared motorway" model. No matter how hard you try - you will eventually end up with traffic jams.
>
> J.
>
>
> On Tue, Feb 9, 2021 at 6:40 PM Dave Fisher <wave4d...@comcast.net> wrote:
>
>> The real hard problem is knowing when a change requires full regression and integration testing of all possible platforms.
>>
>> I think projects are allowing lazy engineering if those making changes don’t know the level of testing needed for their changes.
>>
>> Now with easy lightweight branches all being fully tested ....
>>
>> This is my 10,000 meter view.
>>
>> But then I’m old school, and on my first job the mainframe printout included how much the run I made was costing my boss in $.
>>
>> Best Regards,
>> Dave
>>
>> Sent from my iPhone
>>
>> > On Feb 9, 2021, at 9:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Absolutely agree, Matt. Throwing more hardware at "all of the projects" is definitely not going to help - I have been saying that from the beginning - it is like building free motorways: the more you build, the more traffic flows, and the traffic jams remain. That's why I think a reasonable self-hosted solution that every project owns (including getting the credits for it) is the only viable solution IMHO - only then do you really start optimising things, because you own both the problem and the solution (and you do not uncontrollably impact other projects).
>> >
>> > We have just opened up the self-hosted solution in Airflow today - the announcement from Ash is here: https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E - and we will be working out any "teething problems" first.
>> >
>> > Once we are past that, we are on our way to achieving the goal from the first paragraph - i.e. being able to control both the problem and the solution on a per-project basis. And once we have some learnings, I am sure we will share our solution and findings more widely with other projects, so that they can apply similar solutions. This is especially about the missing "security piece", which has been a "blocker" so far, but also about the auto-scaling and tmpfs-optimisation results - which are a nice side effect if we can eventually get the 10x improvement in feedback time (and it seems we can get there).
>> >
>> > We love data @Airflow, so we will gather some stats that everyone will be able to analyse and see how much they can gain - not only from removing the queue bottleneck but also from improving the most important (in my opinion) metric for CI, which is feedback time. I personally think there are only two important metrics in CI: reliability and feedback time. Nothing else (including cost) matters. But if we get all three improved, that would be something we will be happy for other projects to benefit from as well.
>> >
>> > J.
>> >
>> >
>> >
>> >> On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <boa...@gmail.com> wrote:
>> >>
>> >> To be honest, this sounds exactly like the usual CI problem on every platform. As your project scales up, CI becomes a Hard Problem. I don’t think throwing hardware at it indefinitely works, though your research here is finding most of the useful things.
>> >>
>> >>> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> The report shows only the top contenders. And yes - we know it is flawed - because it shows workflows, not jobs (if you read the disclaimers - we simply do not have enough API call quota to get detailed information for all projects).
>> >>>
>> >>> So this is anecdotal. I also get no queue when I submit a PR at 11 pm. Actually, the whole Airflow committer team had to switch to the "night shift" because of that. And the most "traffic-heavy" projects - Spark, Pulsar, Superset, Beam, Airflow - I think some of the top "traffic" projects experience the same issues and several-hour queues when they run during the EMEA day/US morning. And we all try to help each other (for example, yesterday I helped the Pulsar team implement the most aggressive way of cancelling their workflows: https://github.com/apache/pulsar/pull/9503 - you can find a pretty good explanation there of why and how it was implemented this way). We are also working together with the Pulsar team to optimize their workflow - there is a document https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit where several people are adding their suggestions (including myself, based on Airflow experiences).
>> >>>
>> >>> And with Yetus' 12 (!) workflow runs over the last 2 months (!)
>> >>> https://pasteboard.co/JNwGLiR.png - indeed there is a high chance you have not experienced it, especially as you are the only person committing there. This is hardly representative of other projects that have 100s of committers and 100s of PRs a day. I am not sure if you are aware of that, but those are the most valuable projects for the ASF - as those are the ones that actually build community (following the "community over code" motto). If you have 3 PRs in 3 months and there are 200 other projects using GA, I think Yetus is not going to show up in any meaningful statistics.
>> >>>
>> >>> I am not sure that a project with 3 PRs in 2 months is the best basis for drawing conclusions about the overall Apache organisation. I think drawing conclusions from the experiences of 5 actually active projects, some with even 100 PRs a day, is probably better justified (yep - there are such projects).
>> >>>
>> >>> So I would probably agree it has little influence on projects that have no traffic - but an enormous influence on projects that actually have traffic. You have several teams of people scrambling now to somehow manage their CI, as it is unbearable. Is this serious? I'd say so.
>> >>>
>> >>> When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens.
>> >>>
>> >>> I am already spending a TON of my private time trying to help others in the community. I would really appreciate a little help from your side. So maybe you could just submit 2-3 PRs yourself any time Monday - Friday, 12pm CET -> 8pm CET - this is when bottlenecks regularly happen. Please let everyone know your findings.
>> >>>
>> >>> J.
>> >>>
>> >>>
>> >>> On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer <a...@effectivemachines.com.invalid> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>>>
>> >>>>>> I'm not convinced this is true. I have yet to see any of my PRs for "non-big" projects getting queued while Spark, Airflow, and others are. Thus why I think there are only a handful of projects that are getting upset about this, but the rest of us are like "meh, whatever."
>> >>>>>
>> >>>>> Do you have any data on that? Or is it just anecdotal evidence?
>> >>>>
>> >>>> Totally anecdotal. Like when I literally ran a Yetus PR during the builds meeting as you were complaining about Airflow having an X-deep queue. My PR ran fine, no pause.
>> >>>>
>> >>>>> You can see some analysis and actually even charts here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
>> >>>>
>> >>>> Yes, and I don't even see Yetus showing up. I wonder how many other projects are getting dropped from the dataset....
>> >>>>
>> >>>>> Maybe you have a very tiny "PR traffic" and it is mostly in the time zone that is not affected?
>> >>>>
>> >>>> True, it has very tiny PR traffic right now. (Sep/Oct/Nov was different, though.) But if it was one big FIFO queue, our PR jobs would also get queued. They aren't, even when I go look at one of the other projects that does have queued jobs.
>> >>>>
>> >>>> When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens.
>> >>>>
>> >>>> All I'm saying is: right now, that document feels like it is _greatly_ overstating the problem and, now that you point it out, clearly dropping data. It is a problem, to be sure, but not all GitHub Actions projects are suffering. (I wouldn't be surprised if smaller projects are actually fast-tracked through the build queue in order to avoid a tyranny-of-the-majority/resource-starvation problem... which would be ironic given how much of an issue that is at the ASF.)
>> >>>
>> >>> --
>> >>> +48 660 796 129
>> >>>
>> >>
>> >
>> > --
>> > +48 660 796 129
>
>
> --
> +48 660 796 129
>

--
+48 660 796 129
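PS. For anyone curious about the cancellation change mentioned above (https://github.com/apache/pulsar/pull/9503): the PR does this inside GitHub Actions itself (see the PR for the exact mechanism), but the underlying idea is simply "when a newer run exists for the same branch and workflow, cancel the older queued/in-progress ones". A rough sketch of that idea against the GitHub REST API - the repository name, branch and token handling below are placeholders, not what the PR actually uses:

```python
#!/usr/bin/env python3
"""Sketch of the idea behind cancelling superseded GitHub Actions runs: keep
only the newest queued/in-progress run per workflow on a branch and cancel the
rest via the REST API. The Pulsar PR referenced in the thread does this with a
GitHub Action instead; repo name, branch and token handling are placeholders."""
from __future__ import annotations

import os

import requests

API = "https://api.github.com"
REPO = "apache/example"  # placeholder, not a real target
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def list_runs(branch: str, status: str) -> list[dict]:
    """List workflow runs on a branch with the given status ("queued" or "in_progress")."""
    resp = requests.get(
        f"{API}/repos/{REPO}/actions/runs",
        headers=HEADERS,
        params={"branch": branch, "status": status, "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]


def cancel_superseded(branch: str) -> None:
    """Cancel every run on the branch except the newest one of each workflow."""
    candidates = list_runs(branch, "queued") + list_runs(branch, "in_progress")
    newest: dict[int, dict] = {}  # workflow_id -> newest run seen so far
    for run in candidates:
        best = newest.get(run["workflow_id"])
        if best is None or run["run_number"] > best["run_number"]:
            newest[run["workflow_id"]] = run
    keep = {run["id"] for run in newest.values()}
    for run in candidates:
        if run["id"] not in keep:
            requests.post(
                f"{API}/repos/{REPO}/actions/runs/{run['id']}/cancel",
                headers=HEADERS,
                timeout=30,
            ).raise_for_status()


if __name__ == "__main__":
    cancel_superseded("some-feature-branch")  # placeholder branch name
```

(GitHub has since added a built-in `concurrency:` setting with `cancel-in-progress: true` for workflows, which covers the common "newest push wins" case without custom code.)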