The reasoning for the selective checks is here: https://github.com/apache/airflow/blob/master/PULL_REQUEST_WORKFLOW.rst (correct link)
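To make the idea concrete for anyone skimming the thread, below is a minimal, illustrative sketch of what "selective checks" boils down to: map the files a PR touches to the test groups that actually need to run, and fall back to the full matrix when core/build files change. The path patterns, group names and base branch are made up for the example - they are not Airflow's actual rules (those live in the document linked above):

```python
#!/usr/bin/env python3
"""Illustrative sketch of "selective checks": map the files changed in a PR to
the CI test groups that actually need to run. Patterns and group names are
made up for the example - they are not Airflow's real rules."""
from __future__ import annotations

import fnmatch
import subprocess
import sys

# Hypothetical mapping of path patterns to the test groups they require.
RULES: list[tuple[str, set[str]]] = [
    ("docs/*", {"docs-build"}),
    ("*.md", {"docs-build"}),
    ("chart/*", {"helm-tests"}),
    ("tests/*", {"unit-tests"}),
    ("src/*", {"unit-tests", "integration-tests"}),
]
# Changes to these files invalidate everything: run the full matrix.
FULL_MATRIX_TRIGGERS = ["setup.py", "Dockerfile", ".github/workflows/*"]
ALL_GROUPS = {"docs-build", "helm-tests", "unit-tests", "integration-tests"}


def changed_files(base_ref: str) -> list[str]:
    """Files changed on this branch relative to the base branch (needs a git checkout)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def select_groups(files: list[str]) -> set[str]:
    """Return the smallest set of test groups that covers the changed files."""
    groups: set[str] = set()
    for path in files:
        if any(fnmatch.fnmatch(path, pat) for pat in FULL_MATRIX_TRIGGERS):
            return set(ALL_GROUPS)  # core/build change: be safe, run everything
        for pattern, needed in RULES:
            if fnmatch.fnmatch(path, pattern):
                groups |= needed
    return groups or set(ALL_GROUPS)  # unknown files also default to the full matrix


if __name__ == "__main__":
    base = sys.argv[1] if len(sys.argv) > 1 else "origin/master"
    # A CI workflow would read this output and skip the jobs that are not listed.
    print(" ".join(sorted(select_groups(changed_files(base)))))
```

In practice something like this runs as an early job in the workflow and its output gates (or builds the matrix for) the later jobs; the PULL_REQUEST_WORKFLOW.rst document linked above describes the real decision rules in detail.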
On Tue, Feb 9, 2021 at 7:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> | The real hard problem is knowing when a change requires full regression and integration testing of all possible platforms.
>
> And here I absolutely agree too. Even more than that - I am a hard practitioner of it. This is what we have already implemented in Airflow (the whole reasoning for how and why it is implemented is here: https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#) (still, we have a few optimisations left). We call it "selective checks".
>
> And this is what I have already proposed that the Pulsar team implement too - just take a look at chapter 4) in their document: https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#
>
> In Airflow it helped a lot at some point. We got ~70% less load on the queue - mainly thanks to selective checks.
>
> Unfortunately - with the shared queue of the ASF, it only helped for some two weeks - precisely because we were the only "obese" project that had done it, so while being gentle to others, we did not get the love back. But - this is not a complaint; I think it is a natural thing when you have shared resources that you do not pay for. People will not optimise, as it is a huge investment for them (not only the cost of doing it, but also the increased complexity). It's been mentioned several times that Airflow's CI is over-engineered, but I think it is simply heavily optimized (which brings necessary complexity).
>
> Again - there is no way (and it would not even be fair, TBH) to enforce this optimisation of their processes for the benefit of others if they have no incentives. This is simply a consequence of the "free shared motorway" model. No matter how hard you try - you will eventually end up with traffic jams.
>
> J.
>
>
> On Tue, Feb 9, 2021 at 6:40 PM Dave Fisher <wave4d...@comcast.net> wrote:
>
>> The real hard problem is knowing when a change requires full regression and integration testing of all possible platforms.
>>
>> I think projects are allowing lazy engineering if those making changes don’t know the level of testing needed for their changes.
>>
>> Now with easy lightweight branches all being fully tested ....
>>
>> This is my 10,000 meter view.
>>
>> But then I’m old school, and on my first job the mainframe printout included how much the run I made was costing my boss in $.
>>
>> Best Regards,
>> Dave
>>
>> Sent from my iPhone
>>
>> > On Feb 9, 2021, at 9:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Absolutely agree, Matt. Throwing more hardware at "all of the projects" is definitely not going to help - I have been saying that from the beginning - it is like building free motorways: the more you build, the more traffic flows, and the traffic jams remain. That's why I think a reasonable self-hosted solution that every project owns (including getting the credits for it) is the only viable solution IMHO - only then do you really start optimising things, because you own both the problem and the solution (and you do not uncontrollably impact other projects).
>> >
>> > We have just opened up the self-hosted solution in Airflow today - the announcement from Ash is here: https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E - and we will be working out any "teething problems" first.
>> >
>> > Once we are past that, we are on our way to achieving the goal from the first paragraph - i.e. being able to control both the problem and the solution on a per-project basis. And once we have some learnings, I am sure we will share our solution and findings more widely with other projects, so that they can apply similar solutions. This is especially about the missing "security piece", which has been a "blocker" so far, but also about the auto-scaling and tmpfs-optimisation results - which are a nice side effect if we can eventually get the 10x improvement in feedback time (and it seems we can get there).
>> >
>> > We love data @Airflow, so we will gather some stats that everyone will be able to analyse and see how much they can gain - not only from removing the queue bottleneck but also from improving the most important (in my opinion) metric for CI, which is feedback time. I personally think there are only two important metrics in CI: reliability and feedback time. Nothing else (including cost) matters. But if we get all three improved, that would be something we will be happy for other projects to benefit from as well.
>> >
>> > J.
>> >
>> >
>> >
>> >> On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <boa...@gmail.com> wrote:
>> >>
>> >> To be honest, this sounds exactly like the usual CI problem on every platform. As your project scales up, CI becomes a Hard Problem. I don’t think throwing hardware at it indefinitely works, though your research here is finding most of the useful things.
>> >>
>> >>> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> The report shows only the top contenders. And yes - we know it is flawed - because it shows workflows, not jobs (if you read the disclaimers - we simply do not have enough API call quota to get detailed information for all projects).
>> >>>
>> >>> So this is anecdotal. I also get no queue when I submit a PR at 11 pm. Actually, the whole Airflow committer team had to switch to the "night shift" because of that. And the most "traffic-heavy" projects - Spark, Pulsar, Superset, Beam, Airflow - I think some of the top "traffic" projects experience the same issues and several-hour queues when they run during the EMEA day/US morning. And we all try to help each other (for example, yesterday I helped the Pulsar team implement the most aggressive way of cancelling their workflows: https://github.com/apache/pulsar/pull/9503 - you can find a pretty good explanation there of why and how it was implemented this way). We are also working together with the Pulsar team to optimize their workflow - there is a document https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit where several people are adding their suggestions (including myself, based on Airflow experiences).
>> >>>
>> >>> And with Yetus' 12 (!) workflow runs over the last 2 months (!)
>> >>> https://pasteboard.co/JNwGLiR.png - indeed there is a high chance you have not experienced it, especially as you are the only person committing there. This is hardly representative of other projects that have 100s of committers and 100s of PRs a day. I am not sure if you are aware of that, but those are the most valuable projects for the ASF - as those are the ones that actually build community (following the "community over code" motto). If you have 3 PRs in 3 months and there are 200 other projects using GA, I think Yetus is not going to show up in any meaningful statistics.
>> >>>
>> >>> I am not sure that a project with 3 PRs in 2 months is the best basis for drawing conclusions about the overall Apache organisation. I think drawing conclusions from the experiences of 5 actually active projects, some with even 100 PRs a day, is probably better justified (yep - there are such projects).
>> >>>
>> >>> So I would probably agree it has little influence on projects that have no traffic - but an enormous influence on projects that actually have traffic. You have several teams of people scrambling now to somehow manage their CI, as it is unbearable. Is this serious? I'd say so.
>> >>>
>> >>> When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens.
>> >>>
>> >>> I am already spending a TON of my private time trying to help others in the community. I would really appreciate a little help from your side. So maybe you could just submit 2-3 PRs yourself any time Monday - Friday, 12pm CET -> 8pm CET - this is when bottlenecks regularly happen. Please let everyone know your findings.
>> >>>
>> >>> J.
>> >>>
>> >>>
>> >>> On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer <a...@effectivemachines.com.invalid> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>>>
>> >>>>>> I'm not convinced this is true. I have yet to see any of my PRs for "non-big" projects getting queued while Spark, Airflow, and others are. Thus why I think there are only a handful of projects that are getting upset about this, but the rest of us are like "meh, whatever."
>> >>>>>
>> >>>>> Do you have any data on that? Or is it just anecdotal evidence?
>> >>>>
>> >>>> Totally anecdotal. Like when I literally ran a Yetus PR during the builds meeting as you were complaining about Airflow having an X-deep queue. My PR ran fine, no pause.
>> >>>>
>> >>>>> You can see some analysis and actually even charts here: https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
>> >>>>
>> >>>> Yes, and I don't even see Yetus showing up. I wonder how many other projects are getting dropped from the dataset....
>> >>>>
>> >>>>> Maybe you have a very tiny "PR traffic" and it is mostly in the time zone that is not affected?
>> >>>>
>> >>>> True, it has very tiny PR traffic right now. (Sep/Oct/Nov was different, though.) But if it was one big FIFO queue, our PR jobs would also get queued. They aren't, even when I go look at one of the other projects that does have queued jobs.
>> >>>>
>> >>>> When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens.
>> >>>>
>> >>>> All I'm saying is: right now, that document feels like it is _greatly_ overstating the problem and, now that you point it out, clearly dropping data. It is a problem, to be sure, but not all GitHub Actions projects are suffering. (I wouldn't be surprised if smaller projects are actually fast-tracked through the build queue in order to avoid a tyranny-of-the-majority/resource-starvation problem... which would be ironic given how much of an issue that is at the ASF.)
>> >>>
>> >>> --
>> >>> +48 660 796 129
>> >>>
>> >>
>> >
>> > --
>> > +48 660 796 129
>
>
> --
> +48 660 796 129
>

--
+48 660 796 129
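PS. For anyone curious about the cancellation change mentioned above (https://github.com/apache/pulsar/pull/9503): the PR does this inside GitHub Actions itself (see the PR for the exact mechanism), but the underlying idea is simply "when a newer run exists for the same branch and workflow, cancel the older queued/in-progress ones". A rough sketch of that idea against the GitHub REST API - the repository name, branch and token handling below are placeholders, not what the PR actually uses:

```python
#!/usr/bin/env python3
"""Sketch of the idea behind cancelling superseded GitHub Actions runs: keep
only the newest queued/in-progress run per workflow on a branch and cancel the
rest via the REST API. The Pulsar PR referenced in the thread does this with a
GitHub Action instead; repo name, branch and token handling are placeholders."""
from __future__ import annotations

import os

import requests

API = "https://api.github.com"
REPO = "apache/example"  # placeholder, not a real target
HEADERS = {
    "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def list_runs(branch: str, status: str) -> list[dict]:
    """List workflow runs on a branch with the given status ("queued" or "in_progress")."""
    resp = requests.get(
        f"{API}/repos/{REPO}/actions/runs",
        headers=HEADERS,
        params={"branch": branch, "status": status, "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]


def cancel_superseded(branch: str) -> None:
    """Cancel every run on the branch except the newest one of each workflow."""
    candidates = list_runs(branch, "queued") + list_runs(branch, "in_progress")
    newest: dict[int, dict] = {}  # workflow_id -> newest run seen so far
    for run in candidates:
        best = newest.get(run["workflow_id"])
        if best is None or run["run_number"] > best["run_number"]:
            newest[run["workflow_id"]] = run
    keep = {run["id"] for run in newest.values()}
    for run in candidates:
        if run["id"] not in keep:
            requests.post(
                f"{API}/repos/{REPO}/actions/runs/{run['id']}/cancel",
                headers=HEADERS,
                timeout=30,
            ).raise_for_status()


if __name__ == "__main__":
    cancel_superseded("some-feature-branch")  # placeholder branch name
```

(GitHub has since added a built-in `concurrency:` setting with `cancel-in-progress: true` for workflows, which covers the common "newest push wins" case without custom code.)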