Re: External CI Service Limitations

Jarek Potiuk Tue, 02 Jul 2019 21:57:19 -0700

We also experience huge delays for Airflow (it seems that we are the third 
"whale" according to 
https://lists.apache.org/thread.html/af52e2a3e865c01596d46374e8b294f2740587dbd59d85e132429b6c@%3Cbuilds.apache.org%3E).


We are evaluating other options for funding as well (including getting some 
credits from Google for Google Cloud Build / GCP) but it will take time to get 
resources and to switch. 

In the meantime, maybe INFRA can help coordinate some effort between 
Flink/Arrow/Airflow to decrease pressure on Travis? We have considered a few 
options (and are going to implement some of them shortly, I think). Some of 
them are not direct changes to our Travis CI builds but rather 
workflow/infrastructure changes that will decrease pressure on Travis:

* We are going to decrease the matrix of builds we run - currently we have 
several combinations of Airflow builds (postgres/mysql/sqlite) x (Python 3.5/ 
Python 3.6) - but we will only run a subset of those rather than the full 
matrix.

* We are going to combine several of our jobs into one using parallel 
processing. This is mainly for static code analysis - currently we have one job 
for each analysis, which makes them run in parallel on separate VMs. After the 
change - when you include machine boot times and use all processors - the 
overall build time might be even shorter than today, AND there will be far 
fewer VMs to start for the builds.

* We have a separate Kubernetes-related job. It currently runs only one suite 
of tests specific to Kubernetes, as it requires a special setup of the 
environment, but we are looking into the possibility of merging the Kubernetes 
tests into the main tests (and a faster environment setup with docker-compose), 
saving 1 job (25% of our test jobs). The main jobs will run a bit longer, but 
the whole overhead of starting an extra job will be gone.

* We are introducing (the PR is in the final stages of review) an easy way for 
contributors to run static code analysis in their own environment. A lot of our 
PR builds fail because of the static code analysis that is run on Travis. 
Until now it has been a bit convoluted and not easily reproducible to run the 
full analysis locally, but we are moving to a fully dockerised setup for builds 
that will allow contributors to easily run such checks on their machines, and 
we will encourage people to run them locally rather than submit PRs just to 
check if the code is right.

* Even more - we are introducing and encouraging the easy-to-use "pre-commit" 
framework in our developer workflow, where the analysis will be run at commit 
time for only the changes being committed - this might further decrease the 
number of builds submitted by the contributors.

* Lastly - we are introducing an easy-to-use "simplified development 
environment" where developers will be able to run all or a subset of the test 
suites easily on their machines. Currently our setup is fairly convoluted as 
well, but we have a PR in progress to address it and provide a very easy way 
(again - fully dockerised) to reproduce the test environment.
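
To make the first two items above more concrete, here is a rough Python sketch 
of the idea. It is only an illustration, not our actual Travis configuration: 
the function names, the way the subset is picked, and the check commands are 
all assumptions made for the example. It shows (a) picking a subset of the 
(postgres/mysql/sqlite) x (Python 3.5/3.6) matrix that still covers every 
backend and every Python version, and (b) running several checks in parallel 
inside a single job instead of starting one VM per check.

```python
import itertools
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Matrix dimensions taken from the text above.
BACKENDS = ["postgres", "mysql", "sqlite"]
PYTHONS = ["3.5", "3.6"]


def full_matrix():
    """Every backend paired with every Python version (3 x 2 = 6 jobs)."""
    return list(itertools.product(BACKENDS, PYTHONS))


def reduced_matrix():
    """One possible subset (an assumption, not the project's real choice).

    Backends and Python versions are paired by cycling through both lists,
    so each value of each dimension still appears at least once.
    """
    count = max(len(BACKENDS), len(PYTHONS))
    return [
        (BACKENDS[i % len(BACKENDS)], PYTHONS[i % len(PYTHONS)])
        for i in range(count)
    ]


def run_checks_in_parallel(commands):
    """Run several check commands concurrently in one job.

    `commands` maps a check name to its argv list.
    Returns a dict mapping each check name to its exit code.
    """
    def run(item):
        name, argv = item
        return name, subprocess.run(argv, capture_output=True).returncode

    # Threads are enough here: the real work happens in the child processes.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return dict(pool.map(run, commands.items()))


if __name__ == "__main__":
    print(len(full_matrix()), "jobs in the full matrix")
    print(len(reduced_matrix()), "jobs in the reduced matrix:", reduced_matrix())

    # Hypothetical stand-ins for real static analysis commands.
    checks = {
        "check-a": [sys.executable, "-c", "pass"],
        "check-b": [sys.executable, "-c", "pass"],
    }
    results = run_checks_in_parallel(checks)
    failed = sorted(name for name, code in results.items() if code != 0)
    print("failed checks:", failed)
```

With the dimensions above, the reduced matrix starts 3 jobs instead of 6, and 
the combined static analysis job boots a single machine no matter how many 
checks it runs.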

Maybe the committers from Flink and Arrow can also take a look at non-obvious 
ways their projects can decrease pressure on Travis (at least for the time 
being). Maybe there are some quick wins we can apply in a short time, in a 
coordinated way, and buy more time for switching the infrastructure?
