Absolutely agree, Matt. Throwing more hardware at "all of the projects" is definitely not going to help - I have been saying that from the beginning. It is like building free motorways: the more you build, the more traffic flows, and the traffic jams remain. That's why I think a reasonable self-hosted solution that every project owns (including getting the credits for it) is the only viable solution IMHO. Only then do you really start optimising stuff, because you own both the problem and the solution (and you do not uncontrollably impact other projects).
We've just opened up the self-hosted solution in Airflow today - announcement from Ash here:
https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E
and we will be working out any "teething problems". Once we are past that, we are on our way to achieving the goal from the first paragraph, i.e. being able to control both the problem and the solution on a per-project basis. And once we get some learnings, I am sure we will share our solution and findings more widely with other projects, so that they can apply similar solutions. This covers especially the missing "security piece", which was a "blocker" so far, but also the auto-scaling and tmpfs-optimisation results (a nice side-effect if we can eventually get the 10x improvement in feedback time, and it seems we can get there).

We love data @Airflow, so we will gather some stats that everyone will be able to analyse to see how much they can gain - not only from removing the queue bottleneck but also from improving the most important (in my opinion) metric for CI, which is feedback time. I personally think there are only two important metrics in CI: reliability and feedback time. Nothing else (including cost) matters. But if we get all three improved, that would be something we will be happy for other projects to benefit from as well.

J.

On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <boa...@gmail.com> wrote:

> To be honest, this sounds exactly like the usual CI problem on every
> platform. As your project scales up, CI becomes a Hard Problem. I don’t
> think throwing hardware at it indefinitely works, though your research
> here is finding most of the useful things.
>
> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > The report shows only top contenders.
> > And yes - we know it is flawed - because it shows workflows, not jobs
> > (if you read the disclaimers - we simply do not have enough API call
> > quota to get detailed information for all projects).
> >
> > So this is anecdotal. I also get no queue when I submit a PR at 11 pm.
> > Actually the whole Airflow committer team had to switch to the "night
> > shift" because of that. And the most "traffic-heavy" projects - Spark,
> > Pulsar, Superset, Beam, Airflow - I think some of the top "traffic"
> > projects experience the same issues and queues of several hours when
> > they run during the EMEA day/US morning. And we all try to help each
> > other (for example, yesterday I helped the Pulsar team implement the
> > most aggressive way of cancelling their workflows in
> > https://github.com/apache/pulsar/pull/9503
> > - you can find a pretty good explanation there of why and how it was
> > implemented this way). We are also working together with the Pulsar
> > team to optimize their workflow - there is a document
> > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
> > where several people are adding their suggestions (including myself,
> > based on Airflow experiences).
> >
> > And with Yetus' 12 (!) workflow runs over the last 2 months (!)
> > https://pasteboard.co/JNwGLiR.png
> > - indeed there is a high chance you have not experienced it, especially
> > as you are the only person committing there. This is hardly
> > representative of other projects that have 100s of committers and 100s
> > of PRs a day. I am not sure if you are aware of that, but those are the
> > most valuable projects for the ASF - as those are the ones that
> > actually build community (following the "community over code" motto).
> > If you have 3 PRs in 3 months and there are 200 other projects using
> > GA, I think Yetus is not going to show up in any meaningful statistics.
> >
> > I am not sure if drawing a conclusion from a project that has 3 PRs in
> > 2 months is the best way of drawing conclusions for the overall Apache
> > organisation. I think drawing a conclusion from the experiences of 5
> > actually active projects with sometimes even 100 PRs a day is probably
> > better justified (yep - there are such projects).
> >
> > So I would probably agree it has little influence on projects that
> > have no traffic. But enormous influence on projects that actually have
> > traffic. You have several teams of people scrambling now to somehow
> > manage their CI, as it is unbearable now. Is this serious? I'd say so.
> >
> > When you see Airflow backed up, maybe you should try submitting a
> > PR to another project yourself to see what happens.
> >
> > I am already spending a TON of my private time trying to help others
> > in the community. I would really appreciate a little help from your
> > side. So maybe you just submit 2-3 PRs yourself any time Monday -
> > Friday 12pm CET -> 8pm CET - this is when bottlenecks regularly
> > happen. Please let everyone know your findings.
> >
> > J.
> >
> >
> > On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
> > <a...@effectivemachines.com.invalid> wrote:
> >
> > >
> > >
> > > > On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > > > I'm not convinced this is true. I have yet to see any of my PRs
> > > > > for "non-big" projects getting queued while Spark, Airflow,
> > > > > others are. Thus why I think there are only a handful of
> > > > > projects that are getting upset about this but the rest of us
> > > > > are like "meh whatever."
> > > >
> > > > Do you have any data on that? Or is it just anecdotal evidence?
> > >
> > > Totally anecdotal. Like when I literally ran a Yetus PR during
> > > the builds meeting as you were complaining about Airflow having an X
> > > deep queue. My PR ran fine, no pause.
> > >
> > > > You can see some analysis and actually even charts here:
> > > > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > >
> > > Yes, and I don't even see Yetus showing up. I wonder how many
> > > other projects are getting dropped from the dataset....
> > >
> > > > Maybe you have a very tiny "PR traffic" and it is mostly in the
> > > > time zone that is not affected?
> > >
> > > True, it has very tiny PR traffic right now. (Sep/Oct/Nov was
> > > different though) But if it was one big FIFO queue, our PR jobs would
> > > also get queued. They aren't even when I go look at one of the other
> > > projects that does have queued jobs.
> > >
> > > When you see Airflow backed up, maybe you should try submitting a
> > > PR to another project yourself to see what happens.
> > >
> > > All I'm saying is: right now, that document feels like it is
> > > _greatly_ overstating the problem and now that you point it out,
> > > clearly dropping data. It is a problem, to be sure, but not all
> > > GitHub Actions projects are suffering. (I wouldn't be surprised if
> > > smaller projects are actually fast tracked through the build queue in
> > > order to avoid a tyranny of the majority/resource starvation
> > > problem... which would be ironic given how much of an issue that is
> > > at the ASF.)
> >
> >
> > --
> > +48 660 796 129
>

--
+48 660 796 129