> The real hard problem is knowing when a change requires full regression
> and integration testing of all possible platforms.
And here I absolutely agree too. Even more than that - I am a hard
practitioner of it. This is what we have already implemented in Airflow
(the whole reasoning behind how and why it is implemented is here:
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#)
(still we have a few optimisations left). We call it "selective checks"
(there is a rough sketch of the idea below my signature). And this is
what I have already proposed to the Pulsar team too - just take a look
at chapter 4) in their document:
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#

In Airflow it helped a lot at some point. We got ~70% less load on the
queue - mainly thanks to selective checks. Unfortunately - with the
shared queue of the ASF, it only helped for some two weeks - precisely
because we were the only ones who had done that, so while being gentle
to others, we did not get the love back. But this is not a complaint; I
think it is a natural thing when you have shared resources that you do
not pay for. People will not optimise, as it is a huge investment for
them (not only the cost of doing it, but also the increased
complexity). It's been mentioned several times that Airflow's CI is
over-engineered, but I think it is simply heavily optimized (which
brings necessary complexity). Again - there is no way (and it would not
even be fair, TBH) to enforce this optimisation of their processes for
the benefit of others if they have no incentive. This is simply a
consequence of the "free shared motorway" model. No matter how hard you
try, you will eventually end up with traffic jams.

J.
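PS. To make "selective checks" a bit more concrete, here is a minimal
sketch of the idea: map the files changed in a PR to the test suites
that actually need to run, with a conservative fallback to the full
matrix. The file patterns, suite names and RULES mapping below are made
up for illustration - Airflow's real implementation is far more
detailed and is described in the document linked above.

#!/usr/bin/env python3
# Sketch of "selective checks". Patterns and suite names are
# hypothetical, for illustration only.
import fnmatch
import subprocess

# First match wins; files with no matching rule fall back to the
# full matrix below.
RULES = [
    ("docs/**", set()),                      # docs-only: no test suites
    ("airflow/providers/**", {"providers"}),
    ("airflow/www/**", {"www"}),
    ("**/*.py", {"core"}),
]
FULL_MATRIX = {"core", "providers", "www", "integration"}


def changed_files(base="origin/main"):
    """Files touched by this PR relative to the target branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base + "...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()


def suites_to_run(files):
    suites = set()
    for path in files:
        for pattern, needed in RULES:
            if fnmatch.fnmatch(path, pattern):
                suites |= needed
                break
        else:
            # A file we have no rule for: be safe and run everything.
            return FULL_MATRIX
    return suites


if __name__ == "__main__":
    # Print the selection so a CI job can consume it as an output.
    print(" ".join(sorted(suites_to_run(changed_files()))))

The important design choice is that the fallback is conservative: any
file the rules do not recognise triggers the full matrix, so the
selectivity can only ever skip work that is provably unaffected - that
is roughly how it cuts queue load without reducing coverage.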
On Tue, Feb 9, 2021 at 6:40 PM Dave Fisher <wave4d...@comcast.net> wrote:

> The real hard problem is knowing when a change requires full regression
> and integration testing of all possible platforms.
>
> I think projects are allowing lazy engineering if those making changes
> don't know the level of testing needed for their changes.
>
> Now with easy lightweight branches all being fully tested ....
>
> This is my 10,000 meter view.
>
> But then I'm old school and on my first job the mainframe printout
> included how much the run I made was costing my boss in $.
>
> Best Regards,
> Dave
>
> Sent from my iPhone
>
> > On Feb 9, 2021, at 9:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Absolutely agree, Matt. Throwing more hardware at "all of the
> > projects" is definitely not going to help - I have been saying that
> > from the beginning - it is like building free motorways: the more you
> > build, the more traffic flows, and the traffic jams remain. That's
> > why I think a reasonable self-hosted solution that every project owns
> > (including getting the credits for it) is the only viable solution
> > IMHO - only then do you really start optimising stuff, because you
> > own both the problem and the solution (and you do not uncontrollably
> > impact other projects).
> >
> > We've just opened up the self-hosted solution in Airflow today -
> > announcement from Ash here:
> > https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E
> > and we will be working out any "teething problems" first. Once we are
> > past that, we are on our way to achieving the goal from the first
> > paragraph - i.e. being able to control both problem and solution on a
> > per-project basis. And once we get some learnings, I am sure we will
> > share our solution and findings more widely with other projects, so
> > that they could apply similar solutions. This is especially the
> > missing "security piece" which was a "blocker" so far, but also
> > auto-scaling and tmpfs-optimisation results (which is a nice
> > side-effect) - if we can get the 10x improvement in feedback time
> > eventually (as it seems we can get there).
> >
> > We love data @Airflow, so we will gather some stats that everyone
> > will be able to analyse to see how much they can gain - not only from
> > removing the queue bottleneck but also from improving the most
> > important (in my opinion) metric for CI, which is feedback time. I
> > personally think there are only two important metrics in CI:
> > reliability and feedback time. Nothing else (including cost) matters.
> > But if we get all three improved, that would be something that we
> > will be happy other projects can also benefit from.
> >
> > J.
> >
> >
> >
> >> On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <boa...@gmail.com> wrote:
> >>
> >> To be honest, this sounds exactly like the usual CI problem on every
> >> platform. As your project scales up, CI becomes a Hard Problem. I
> >> don't think throwing hardware at it indefinitely works, though your
> >> research here is finding most of the useful things.
> >>
> >>> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>
> >>> The report shows only top contenders. And yes - we know it is
> >>> flawed, because it shows workflows, not jobs (if you read the
> >>> disclaimers - we simply do not have enough API call quota to get
> >>> detailed information for all projects).
> >>>
> >>> So this is anecdotal. I also get no queue when I submit a PR at
> >>> 11 pm. Actually, the whole Airflow committer team had to switch to
> >>> the "night shift" because of that. And I think the most
> >>> "traffic-heavy" projects - Spark, Pulsar, Superset, Beam, Airflow -
> >>> experience the same issues and several-hour queues when they run
> >>> during the EMEA day/US morning. And we are all trying to help each
> >>> other (for example, yesterday I helped the Pulsar team implement
> >>> the most aggressive way of cancelling their workflows:
> >>> https://github.com/apache/pulsar/pull/9503 - you can find a pretty
> >>> good explanation there of why and how it was implemented this
> >>> way). Also, we are working together with the Pulsar team to
> >>> optimize their workflow - there is a document
> >>> https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
> >>> where several people are adding their suggestions (including
> >>> myself, based on Airflow experiences).
> >>>
> >>> And with yetus' 12 (!) workflow runs over the last 2 months (!)
> >>> https://pasteboard.co/JNwGLiR.png - indeed there is a high chance
> >>> you have not experienced it, especially as you are the only person
> >>> committing there. This is hardly representative of other projects
> >>> that have 100s of committers and 100s of PRs a day. I am not sure
> >>> if you are aware of that, but those are the most valuable projects
> >>> for the ASF, as those are the ones that actually build community
> >>> (following the "community over code" motto). If you have 3 PRs in
> >>> 3 months and there are 200 other projects using GA, I think yetus
> >>> is not going to show up in any meaningful statistics.
> >>>
> >>> I am not sure that drawing a conclusion from a project that has
> >>> 3 PRs in 2 months is the best way of drawing conclusions for the
> >>> overall Apache organisation.
> >>> I think drawing a conclusion from the experiences of 5 actually
> >>> active projects with sometimes even 100 PRs a day is probably
> >>> better justified (yep - there are such projects). So I would
> >>> probably agree it has little influence on projects that have no
> >>> traffic - but an enormous influence on projects that actually have
> >>> traffic. Several teams of people are scrambling now to somehow
> >>> manage their CI, as it is unbearable. Is this serious? I'd say so.
> >>>
> >>> When you see Airflow backed up, maybe you should try submitting a
> >>> PR to another project yourself to see what happens.
> >>>
> >>> I am already spending a TON of my private time trying to help
> >>> others in the community. I would really appreciate a little help
> >>> from your side. So maybe you just submit 2-3 PRs yourself any time
> >>> Monday - Friday, 12pm CET - 8pm CET - this is when the bottlenecks
> >>> regularly happen. Please let everyone know your findings.
> >>>
> >>> J.
> >>>
> >>>
> >>> On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
> >>> <a...@effectivemachines.com.invalid> wrote:
> >>>
> >>>>
> >>>>
> >>>>> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>>>
> >>>>>> I'm not convinced this is true. I have yet to see any of my PRs
> >>>>>> for "non-big" projects getting queued while Spark, Airflow, and
> >>>>>> others are. Thus why I think there are only a handful of
> >>>>>> projects that are getting upset about this, but the rest of us
> >>>>>> are like "meh, whatever."
> >>>>>
> >>>>> Do you have any data on that? Or is it just anecdotal evidence?
> >>>>
> >>>> Totally anecdotal. Like when I literally ran a Yetus PR during
> >>>> the builds meeting as you were complaining about Airflow having an
> >>>> X-deep queue. My PR ran fine, no pause.
> >>>>
> >>>>> You can see some analysis and actually even charts here:
> >>>>> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> >>>>
> >>>> Yes, and I don't even see Yetus showing up. I wonder how many
> >>>> other projects are getting dropped from the dataset....
> >>>>
> >>>>> Maybe you have a very tiny "PR traffic" and it is mostly in a
> >>>>> time zone that is not affected?
> >>>>
> >>>> True, it has very tiny PR traffic right now. (Sep/Oct/Nov was
> >>>> different, though.) But if it was one big FIFO queue, our PR jobs
> >>>> would also get queued. They aren't, even when I go look at one of
> >>>> the other projects that does have queued jobs.
> >>>>
> >>>> When you see Airflow backed up, maybe you should try submitting a
> >>>> PR to another project yourself to see what happens.
> >>>>
> >>>> All I'm saying is: right now, that document feels like it is
> >>>> _greatly_ overstating the problem and, now that you point it out,
> >>>> clearly dropping data. It is a problem, to be sure, but not all
> >>>> GitHub Actions projects are suffering. (I wouldn't be surprised if
> >>>> smaller projects are actually fast-tracked through the build queue
> >>>> in order to avoid a tyranny of the majority/resource starvation
> >>>> problem... which would be ironic given how much of an issue that
> >>>> is at the ASF.)
> >>>
> >>>
> >>>
> >>> --
> >>> +48 660 796 129
> >>
> >
> >
> > --
> > +48 660 796 129
>

-- 
+48 660 796 129