I really love what Hyukjin has done. I did not have the capacity to participate in this actively, but this is exactly the way to go, I think (with a caveat). Following the motorway metaphor: everyone (every contributor/committer, not every project) has their own lane, and they do not interfere with each other.

One observation we had from implementing self-hosted runners in Airflow is that the faster the builds were, the more they were used as well. It has simply become too easy to use. We have also hit a different problem: we started to pay for every build, and the cost became proportional to the number of committers/PRs. So while we can optimize the builds (and we have a big incentive to do so), we have hit a limit - we cannot go lower than some x USD per build. And the more we grow as a project, the bigger the cost will be.

With the proposal from Hyukjin/Apache Spark, the cost is distributed, and the only common part to pay for is the "merge builds" (but those can run on free infrastructure, as they can usually wait). However, as of now, this is a big hack: it is rather complex to implement and understand, and it has some "brittle" parts - for example, the workflow must not be disabled by the contributor. But I believe it could - likely - be implemented by GitHub in the long term, and that would solve all of these problems for the ASF.

Gavin, maybe we should raise this with the GitHub team - perhaps that is something they could indeed think about implementing? I think it is a great OSS-friendly feature they could add.

How it could look from the GitHub side: the workflow could have a "run-in-fork" flag or similar. Setting this flag would cause any PR coming from a public fork to run in that fork's space (the source repo) rather than in the target repo.

Hyukjin had to implement a number of workarounds to make it work (see the sketch below):
a) specific "if" clauses in the workflow
b) specifying the branches to run in the fork
c) finding the PR for each build and labeling it appropriately
d) adding the status check manually to the PR
e) scheduled scanning of PRs and updating status checks for them

All of this could be implemented in a much more elegant way by GitHub in the underlying GA fabric - then none of the workarounds above would be needed. They are mostly needed because of the permission model implemented in GA.
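To make this more concrete, here is a minimal sketch of what such a workflow could look like. Note that "run-in-fork" is purely imaginary - no such keyword exists in GitHub Actions today - and the "if" condition is only an illustration of the kind of clause Hyukjin had to add, not the exact one used in apache/spark. The timeout-minutes setting, on the other hand, is a real keyword (the default is 360 minutes) and is worth setting explicitly, as discussed further down the thread:

# Minimal sketch only. "run-in-fork" is NOT an existing GitHub Actions keyword;
# it shows the kind of flag GitHub could offer natively. The "if" clause below
# is a simplified stand-in for the workarounds listed above.
name: Build and test

on:
  pull_request:
    branches: [master]

# run-in-fork: true   # imagined flag: run PR builds against the fork's own runner quota

jobs:
  build:
    # Today a similar effect requires explicit conditions, e.g. only run the
    # heavy job when the workflow executes outside the main (target) repo:
    if: github.repository != 'apache/spark'
    runs-on: ubuntu-latest
    timeout-minutes: 120   # real setting; the default job timeout is 360 minutes
    steps:
      - uses: actions/checkout@v2
      - name: Run the build
        run: ./build.sh

Everything the hypothetical flag would replace is exactly the plumbing listed in a) through e); the timeout, by contrast, is something every project can already set today.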
J,

On Mon, Apr 19, 2021 at 9:30 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> Thanks all.
>
> Just to add a bit of a note,
>
> > * Create a wiki page collecting all the practices to reduce the hours
> > (using the PR cancel workflow discussed earlier + timeouts + ...?)
>
> We should probably also mention that Apache Spark managed to distribute the
> workflow runs to forked repositories in pull requests, see the PRs:
> - https://github.com/apache/spark/pull/32092
> - https://github.com/apache/spark/pull/32193
> and the umbrella JIRA: https://issues.apache.org/jira/browse/SPARK-35119
>
> This is still a workaround, but it managed to reduce the overhead
> significantly by leveraging the resources of forked repositories.
>
>
> On Mon, Apr 19, 2021 at 12:41 AM, Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Hi Marton,
> >
> > Thanks a lot for the information you have collected and presented. This
> > is very insightful!
> >
> > On 18/04/2021 at 11:06, Elek, Marton wrote:
> > >
> > > There are signs of misconfiguration of some jobs. For example, in some
> > > projects I found many failed jobs with >15-hour executions even if the
> > > slowest successful (!) execution took only a few hours. It clearly shows
> > > that a job-level timeout is not yet configured.
> >
> > Ok, I'm curious: according to the GHA docs, the default job
> > timeout is 6 hours (360 minutes):
> >
> > https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
> >
> > In Arrow, we didn't change this setting... how come your stats show
> > jobs taking up to 24 hours?
> >
> > Apparently, what's named "jobhours" in your statistics is actually the
> > runtime for an entire workflow (the sum of all job runtimes for that
> > workflow). That's at least what I conclude if I look at this workflow,
> > which your table lists as the longest Arrow "job" with 24 hours of
> > runtime: https://github.com/apache/arrow/actions/runs/699123317
> > None of the jobs in that workflow took more than 6 hours, but combined
> > they indeed add up to around 24 hours... (because 4 jobs timed out at 6 hours)
> >
> > > Also, the 46 or 36 hours of max job execution time sounds very
> > > unrealistic (it's a job, not the full workflow).
> >
> > Well, according to the above, it's the full workflow. It's still
> > unexpected as far as Arrow is concerned, though, and we should implement
> > per-job timeouts reflecting our expectations.
> >
> > > My suggestions:
> > >
> > > * Publish GitHub Actions usage in a central place which is clearly
> > > visible to all Apache projects (I would be happy to volunteer here)
> > >
> > > * Identify an official suggestion for fair usage (monthly hours) per
> > > project (easiest way: available hours / projects using GitHub Actions)
> > >
> > > * Create a wiki page collecting all the practices to reduce the hours
> > > (using the PR cancel workflow discussed earlier + timeouts + ...?)
> > >
> > > * After every month, send a very polite reminder to the projects that
> > > overuse GitHub Actions (using the dev lists), including detailed statistics
> > > and the wiki link to help them improve/reduce their usage.
> >
> > As a member of the Arrow PMC, I say +1 to all of this.
> >
> > Best regards
> >
> > Antoine.
>

--
+48 660 796 129