In many cases it can be done by choosing a bigger machine with more CPUs and parallelising, as others mentioned. This works well if your tests are pure unit tests and you can just add pytest-xdist's `-n auto` flag or similar (pytest-xdist is a pytest plugin that runs your tests in parallel across as many CPUs as you have). However, there are cases where the limitation is I/O, or your tests simply cannot run in parallel because a lot of them rely on shared resources (say, a database). But even then you can attempt to do something about it.
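For the easy, pure-unit-test case above, the whole change can be tiny - a minimal sketch of such a job (the Python version, the install step and the project layout here are placeholder assumptions on my side):

    on: [pull_request]
    jobs:
      unit-tests:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v3
          - uses: actions/setup-python@v4
            with:
              python-version: "3.11"
          # Hypothetical install step - adjust to whatever installs your test dependencies.
          - run: pip install -e ".[test]" pytest-xdist
          # pytest-xdist: "-n auto" starts one worker per available CPU.
          - run: pytest -n auto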
In Airflow we solved those problems by custom-parallelising our jobs, choosing huge self-hosted runners and running everything in-memory. Even though our tests could not be parallelised "per test" (mostly for historical reasons, a lot of our tests are not pure unit tests and depend on a database), we split the tests into "test types" (8 of them, soon more) and run those types in parallel - as many at once as the machine can handle. Each test type uses its own database instance, all orchestrated with docker-compose. To avoid the inevitable I/O contention of such a setup, everything runs on a huge tmpfs (50 GB or so) - including the docker instance that runs the databases, which has tmpfs backing storage, so those databases sit on an in-memory filesystem and are super-stable and super-fast (a rough compose sketch of that idea is at the bottom of this mail). Thanks to that, our thousands of tests run really fast even though some of them are not pure unit tests. We run it all on a large self-hosted runner with 8 CPUs and 64 GB RAM, and our complete test suite finishes in 15 minutes instead of 1.5 hours.

Such a setup achieves two optimisation goals: cheap and fast. Yes, we need much more costly, bigger machines, but we need them for a shorter time and we run them at 80%-90% utilisation, which is pretty high for such cases (we keep optimising regularly and I keep pushing it closer to 100%). As a result, if your self-hosted runners in the cloud are on-demand/ephemeral (usually an 80%-90% cost reduction) and your setup is fast, you can bring them up for 10 minutes and shut them down when finished, so they cost a fraction of what small machines running all the time would, especially if your project has periods when no PRs are running. Also, optimising the speed of tests is even more important than optimising their cost, because faster feedback is good for your contributors - but with this setup we can have our cake and eat it too: the cost is low and the tests are fast.

J.

On Fri, Apr 14, 2023 at 1:37 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Just dropping a comment. Apache Spark solved it by splitting the job.
>
> As for the number of parallel jobs, Apache Spark added, in the PR builder,
> custom logic to link the GitHub workflow run in forked repositories - so we
> reuse the GitHub resources in the PR author's forked repository instead of
> the ones allocated to the ASF itself.
>
> On Fri, Apr 14, 2023 at 8:00 AM sebb <seb...@gmail.com> wrote:
>
> > On Thu, 13 Apr 2023 at 20:58, Martin Grigorov <mgrigo...@apache.org> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Apr 13, 2023 at 7:17 PM Sai Boorlagadda <sai_boorlaga...@apache.org> wrote:
> > >
> > > > Hey All! I am part of the Apache Geode project and we have been migrating
> > > > our pipelines to GitHub Actions and hit a roadblock: the max. job execution
> > > > time on non-self-hosted GitHub workers is set to a hard limit
> > > > <https://docs.github.com/en/actions/learn-github-actions/usage-limits-billing-and-administration>
> > > > of 6 hours, and one of our jobs
> > > > <https://github.com/apache/geode/actions/runs/4639012912> is taking
> > > > more than 6 hours. Are there any pointers on how someone solved this? Or
> > > > does GitHub provide any increases for Apache Foundation projects?
> > >
> > > The only way to "increase the resources" is to use a self-hosted runner.
> > > But instead of looking at how to use more of the free pool, you should try
> > > to optimize your build to need less!
> > > These free resources are shared with all other Apache projects, so when
> > > your project uses more, another project will have to wait.
> > >
> > > You can start by using a parallel build -
> > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L254
> > > Also tune the maxWorkers -
> > > https://github.com/apache/geode/blob/102e24691eacd2d1d6652a070f14af9f5b42dc0d/.github/workflows/gradle.yml#L256
> > > The Linux VMs have 2 vCPUs. You can try with the macos-latest VM, it has
> > > 3 vCPUs.
> > > Another option is to split this job into a few smaller ones. Each job has
> > > its own 6 hours.
> >
> > Also maybe run some of the jobs manually, rather than on every commit.
> > At present there are two instances running at the same time from
> > subsequent commits.
> > At least one of these is a waste of resources.
> >
> > > Good luck!
> > >
> > > Martin
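The tmpfs-backed database idea I mentioned above, as a rough docker-compose sketch (the service name, image and credentials are made up for illustration - our real Airflow setup is more involved):

    services:
      postgres-core:                    # one database container per test type (hypothetical name)
        image: postgres:15
        environment:
          POSTGRES_USER: tests          # throwaway credentials for CI only
          POSTGRES_PASSWORD: tests
        tmpfs:
          - /var/lib/postgresql/data    # Postgres data directory lives in RAM, not on disk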
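And on sebb's point about two runs from subsequent commits: GitHub Actions can cancel the superseded run automatically with a workflow-level concurrency group - a minimal sketch (the group key shown is just one common choice):

    concurrency:
      group: ${{ github.workflow }}-${{ github.ref }}
      cancel-in-progress: true    # cancel the older run when a newer commit triggers the workflow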