Also, some comments on the stats. This is good stuff, Marton.

> Apparently, what's named "jobhours" in your statistics is actually the
> runtime for an entire workflow (the sum of all job runtimes for that
> workflow).  That's at least what I conclude if I look at this workflow,
> which your table lists as the longest Arrow "job" with 24 hours of
> runtime: https://github.com/apache/arrow/actions/runs/699123317
> None of the jobs in that workflow took more than 6 hours, but cumulatively
> they indeed end up at around 24 hours... (because 4 jobs timed out at 6 hours)
>

It does look like you have workflows rather than jobs - we ran into very
similar problems when Tobiasz (one of the Airflow contributors) tried to get
the stats for us. The REST API limitations are super painful: there is no
efficient way to dig down to the job level (and unfortunately no GraphQL
version of the Actions API to do it in bulk).
We found that rather than looking at job hours, it's much better to look at
the number of "in-progress" and "queued" workflow runs from each project - it
gives a much better overview of what's going on.
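
To give an idea, here is a minimal Python sketch of that kind of snapshot
against the REST API (the token handling and the repository list are
illustrative, not our actual script):

    import os
    import requests

    API = "https://api.github.com"
    HEADERS = {
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    }

    def run_count(repo: str, status: str) -> int:
        # GET /repos/{owner}/{repo}/actions/runs can filter by status
        # ("queued", "in_progress", ...); "total_count" in the response
        # saves us from paginating through the runs themselves.
        resp = requests.get(
            f"{API}/repos/{repo}/actions/runs",
            headers=HEADERS,
            params={"status": status, "per_page": 1},
        )
        resp.raise_for_status()
        return resp.json()["total_count"]

    for repo in ("apache/airflow", "apache/arrow", "apache/spark"):
        print(repo, run_count(repo, "queued"), run_count(repo, "in_progress"))

Run on a schedule, those two numbers per repo give you a time series of how
contended the shared runners are.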

Together with Gavin and the Infra team, we asked GitHub whether they could
give us some extracts of the stats. Until we have those, we run a "poor
man's" extract regularly, store it in Google BigQuery, and have a simple
DataStudio report on top of it (unfortunately we cannot share the report
with everyone, as it would incur costs if it were used publicly). We do,
however, try to keep screenshots updated in the doc where I track the status
of the current GA integration with ASF Infra:

https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
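
For the curious, the extract itself is conceptually as simple as appending
those snapshots to a BigQuery table - roughly like this (the table name and
schema are made up for illustration; the real extract is richer):

    from datetime import datetime, timezone

    from google.cloud import bigquery

    client = bigquery.Client()
    TABLE = "my-project.gha_stats.run_snapshots"  # hypothetical table id

    def store_snapshot(repo: str, queued: int, in_progress: int) -> None:
        # Streaming insert: a scheduled job calls this for every tracked
        # repo, with the counts taken from a REST API snapshot like the
        # sketch earlier in this mail.
        row = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "repo": repo,
            "queued": queued,
            "in_progress": in_progress,
        }
        errors = client.insert_rows_json(TABLE, [row])
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")

    store_snapshot("apache/airflow", 12, 5)  # example values

The DataStudio report then just charts those rows over time.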

Here are the latest screenshots:

April stats: https://ibb.co/mCL6kZh
March and April stats: https://ibb.co/r2zjNsV

The two above show you the variability.

A quick summary for those who do not like to study graphs: Pulsar's usage
went down quite a bit in March/April, while Arrow became the project using
the most jobs, with Spark in second place (though with Hyukjin's changes I
believe it will go down soon). In the meantime, apisix-dashboard seems to be
on the rise and Pulsar is climbing back up.

Here you can see the peaks in the number of workflows:

https://ibb.co/QCJdLGD

But this one is the most important: the number of ASF projects using
GA since November: https://ibb.co/RpFyQQy

The last one is the most interesting, because as I see it, none of the
proposals below will work on their own - they might help temporarily if some
projects optimize their workflows, but new projects will keep coming. Since
November we have been continuously fighting for jobs at peak times, and
various projects that got fed up with it have been finding workarounds or
moving elsewhere. That will continue.


> >    * Publish Github action usage in a central place which is clearly
> > visible for all Apache projects (I would be happy to volunteer here)
>

Oh yeah. If only we could get good stats, that would be great, but with the
current API limitations that seems very difficult. If you could do it, that
would be great - to be precise, though, what we need are peak-hour stats and
peak-hour limits.


> >    * Identify official suggestion of fair-usage (monthly hours) per
> > project (easiest way: available hours / projects using github actions)
>

The problem is that with the fixed number of concurrent jobs we have, more
projects coming, AND the fact that our problems are at peak times, this
statistic is a) wrong (the build hours do not matter much - the peak hours
do), and b) going to keep trending downwards as more projects arrive. It's
the peak hours we need to limit, not the overall hours.
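
A toy calculation (numbers entirely hypothetical) shows why:

    # Two projects with the same monthly usage put very different loads
    # on the shared pool of concurrent runner slots.
    POOL_SLOTS = 180                      # hypothetical org-wide limit

    # Project A: 600 job-hours spread evenly over a 30-day month.
    even_concurrency = 600 / (30 * 24)    # ~0.8 jobs running at any moment

    # Project B: the same 600 job-hours, concentrated into ~20 peak hours
    # (PR bursts after a release announcement, a big refactor, etc.).
    peak_concurrency = 600 / 20           # 30 jobs at once

    print(f"A: ~{even_concurrency:.1f} concurrent jobs; "
          f"B: ~{peak_concurrency:.0f} concurrent jobs "
          f"({peak_concurrency / POOL_SLOTS:.0%} of a {POOL_SLOTS}-slot pool)")

Both projects report identical "monthly hours", but only one of them starves
everyone else at peak.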

And peak-hour usage is out of the projects' own control. Those peak hours
mostly come from contributors raising new PRs, and there is not much each
project can do to reduce them. It's not only about best practices,
cancelling duplicate runs etc. - the main driver is how many PRs are raised
within a given time window. There isn't much we can do about that, other
than give everyone their own lane (and I mean every contributor, really -
this is what Hyukjin did). No matter how hard the projects try, this can't
really be controlled otherwise.


> >
> >    * Create a wiki page collecting all the practices to reduce the hours
> > (using the pr cancel workflow discussed earlier + timeouts + ...?)


It's already there:
https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
It's a good start, and we can continue improving it.


> >
> > * After every month, send a very polite reminder to the projects that
> > overuse GitHub Actions (using dev lists), including detailed statistics
> > and the wiki link to help them to improve/reduce the usage.


Having good stats is a good starting point for that. But there is only so
much we can do, and with the current growth in usage this is mostly about
deferring the inevitable by a couple of weeks or months, even if everyone
implements all the optimisations.

I think distributing "build hours" per committer is really the only
sustainable long-term approach.

-- 
+48 660 796 129
