In many cases this can be done by choosing a bigger machine with more CPUs
and parallelising, as others mentioned. This works well if your tests are
pure unit tests and you can just add the `-n auto` flag from pytest-xdist
(a pytest plugin that runs your tests in parallel on as many CPUs as you
have) or similar. However, there are cases where the limitation is I/O, or
your tests simply cannot run in parallel because a lot of them rely on
shared resources (say, a database). But even then you can do something
about it.
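
For the simple case, a minimal sketch of what that looks like (assuming a
plain pytest suite with pytest-xdist installed):

    pip install pytest-xdist
    # one worker process per available CPU core
    pytest -n auto
    # or pin the number of workers explicitly
    pytest -n 4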

In Airflow we solved those problems by custom-parallelising our jobs,
using large self-hosted runners, and running everything in memory.

Even though our tests could not be parallelised "per test" (mostly for
historical reasons, a lot of our tests are not pure unit tests and depend
on a database), we split the tests into "test types" (8 of them, soon more)
and run those types in parallel, as many at a time as the machine allows.
Each test type uses its own database instance - this is all orchestrated
with docker-compose.
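
Just to illustrate the idea (the service names and image versions below are
made up for the example, not our actual configuration), the docker-compose
side of "one database per test type" can be as simple as:

    # hypothetical docker-compose.yml snippet - one isolated DB per test type
    services:
      postgres-core:
        image: postgres:15
        environment:
          POSTGRES_PASSWORD: test
      postgres-providers:
        image: postgres:15
        environment:
          POSTGRES_PASSWORD: test

Each group of tests then points its connection string at "its" database, so
the parallel test types never step on each other's data.
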
To avoid the inevitable I/O contention with this setup, everything runs on
a huge tmpfs storage (50 GB or so) - including the Docker instance that
runs the databases, which has tmpfs backing storage, so the databases are
backed by an in-memory filesystem and are therefore super-stable and
super-fast. Thanks to that, our thousands of tests run really fast even if
some of them are not pure unit tests. We run it all on a large self-hosted
runner with 8 CPUs and 64 GB RAM, and as a result our complete test suite
runs in 15 minutes instead of 1.5 hours.
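
Roughly how the in-memory part can be wired up (the paths and sizes below
are illustrative, not our exact configuration):

    # mount a 50 GB in-memory filesystem on the runner
    sudo mkdir -p /mnt/ram
    sudo mount -t tmpfs -o size=50G tmpfs /mnt/ram

    # point the Docker daemon's storage at it (/etc/docker/daemon.json),
    # then restart the daemon
    { "data-root": "/mnt/ram/docker" }

    # and/or give a database container a tmpfs-backed data directory
    # (docker-compose service snippet)
    postgres-core:
      image: postgres:15
      tmpfs:
        - /var/lib/postgresql/data

With that, all database writes stay in RAM, which removes disk I/O
contention entirely (at the cost of needing plenty of memory and losing the
data on shutdown - which is fine for CI).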

Such a setup achieves two optimisation goals: cheap and fast. Yes, we need
bigger, more costly machines, but we need them for a shorter time, and we
use them at 80%-90% utilisation, which is pretty high for such cases (we
keep optimising it regularly and I try to push it ever closer to 100%). As
a result, if your self-hosted runners in the cloud are on-demand/ephemeral
(usually an 80%-90% cost reduction) and you have a fast setup, you can
bring them up for 10 minutes and shut them down when finished, so they cost
a fraction of what small machines running all the time would, especially if
your project has periods where no PRs are being run. Also, optimising the
speed of tests is even more important than optimising their cost, because
getting feedback faster is good for your contributors - but with this setup
we can have our cake and eat it too - the cost is low and the tests are
fast.
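
A rough back-of-the-envelope comparison (the hourly prices are made-up
round numbers; only the 8-CPU size and the 15-minute vs 1.5-hour timings
come from our actual setup):

    2-vCPU runner at ~$0.10/h x 1.5 h per run  = ~$0.15 per full test run
    8-vCPU runner at ~$0.40/h x 0.25 h per run = ~$0.10 per full test run

So even at 4x the hourly price, the bigger machine comes out cheaper per
run and gives feedback 6x faster - and ephemeral/spot pricing shrinks the
bill further.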

J.



On Fri, Apr 14, 2023 at 1:37 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Just dropping a comment. Apache Spark solved it by splitting the job.
>
> As for the number of parallel jobs, Apache Spark built, in the PR builder,
> custom logic to link to the GitHub workflow runs in forked repositories - so
> we reuse the GitHub resources in the PR author's forked repository instead
> of the ones allocated to ASF itself.
>
> On Fri, Apr 14, 2023 at 8:00 AM sebb <seb...@gmail.com> wrote:
>
> > On Thu, 13 Apr 2023 at 20:58, Martin Grigorov <mgrigo...@apache.org>
> > wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Apr 13, 2023 at 7:17 PM Sai Boorlagadda
> > > <sai_boorlaga...@apache.org> wrote:
> > >
> > > > Hey All! I am part of the Apache Geode project and we have been
> > > > migrating our pipelines to GitHub Actions and hit a roadblock: the
> > > > max job execution time on non-self-hosted GitHub workers is set at a
> > > > hard limit
> > > > <https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration>
> > > > of 6 hours, and one of our jobs
> > > > <https://github.com/apache/geode/actions/runs/4639012912> is taking
> > > > more than 6 hours. Are there any pointers on how someone solved
> > > > this? Or does GitHub provide any increases for Apache Foundation
> > > > projects?
> > > >
> > >
> > > The only way to "increase the resources" is to use a self-hosted
> > > runner. But instead of looking at how to use more of the free pool,
> > > you should try to optimize your build to need less!
> > > These free resources are shared with all other Apache projects, so
> > > when your project uses more, another project will have to wait.
> > >
> > > You can start by using a parallel build -
> > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L254
> > > Also tune the maxWorkers -
> > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L256
> > > The Linux VMs have 2 vCPUs. You can try with the macos-latest VM, it
> > > has 3 vCPUs.
> > > Another option is to split this job into a few smaller ones. Each job
> > > has its own 6 hours.
> >
> > Also maybe run some of the jobs manually, rather than on every commit.
> > At present there are two instances running at the same time from
> > subsequent commits.
> > At least one of these is a waste of resources.
> >
> > > Good luck!
> > >
> > > Martin
> >
>
