This is with regard to the Kubernetes Scheduler Backend and scaling the process of accepting contributions. Given that we're moving past upstreaming changes from our fork and into getting *new* patches, I wanted to start this discussion sooner rather than later. This is more of a post-2.3 question - not something we're looking to solve right away.
While unit tests are handy, they don't give us nearly as much confidence as a successful run of our integration tests against single- and multi-node K8s clusters. Currently, we have an integration testing setup at https://github.com/apache-spark-on-k8s/spark-integration, and it runs continuously against apache/spark:master in pepperdata-jenkins <http://spark-k8s-jenkins.pepperdata.org:8080/view/upstream%20spark/> (on minikube) and in k8s-testgrid <https://k8s-testgrid.appspot.com/sig-big-data#spark-periodic-default-gke> (in GKE clusters). The question now is: how do we make integration tests part of the PR author's workflow?

1. Keep the integration tests in the separate repo and, as a policy, require that contributors run them and add new tests before their PRs are accepted. Given that minikube <https://github.com/kubernetes/minikube> is easy to set up and can run on a single node, this is certainly possible. The friction, however, is that contributors may have to modify the integration test code hosted in that separate repository whenever they add or change functionality in the scheduler backend. It is also certain to lead to at least brief inconsistencies between the two repositories.

2. Alternatively, check the integration tests in alongside the actual scheduler backend code. This worked really well and is what we did in our fork. The tests would live in a separate package that takes certain parameters (like the cluster endpoint) and runs against a local or remote cluster. It would include at least some code dealing with accessing the cluster, reading results from K8s containers, test fixtures, etc. (A rough sketch of what this might look like is at the end of this mail.)

I see value in adopting (2), since it gives contributors a clearer path and lets us keep the two pieces consistent, but it seems uncommon elsewhere. How do the other backends, i.e. YARN, Mesos and Standalone, deal with accepting patches and ensuring that they do not break existing clusters? Is any automation employed for this thus far?

Would love to get opinions on (1) vs. (2).

Thanks,
Anirudh
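
P.S. To make (2) a bit more concrete, here is a minimal sketch of what a cluster-endpoint-parameterized test might look like, assuming ScalaTest and the fabric8 Kubernetes client. The property name, suite name, and the check itself are purely illustrative - they are not taken from our fork or from the spark-integration repo.

    import io.fabric8.kubernetes.client.{ConfigBuilder, DefaultKubernetesClient}
    import org.scalatest.FunSuite

    class KubernetesIntegrationSuite extends FunSuite {

      // Illustrative parameter: the cluster endpoint is passed in by the build,
      // so the same suite can target a local minikube or a remote GKE cluster.
      private val masterUrl =
        sys.props.getOrElse("spark.kubernetes.test.master", "https://192.168.99.100:8443")

      test("cluster at the configured endpoint is reachable") {
        val config = new ConfigBuilder().withMasterUrl(masterUrl).build()
        val client = new DefaultKubernetesClient(config)
        try {
          // A real test would spark-submit an example job and poll the driver
          // pod's status; this only verifies that we can talk to the cluster.
          val pods = client.pods().inNamespace("default").list()
          assert(pods != null)
        } finally {
          client.close()
        }
      }
    }

The point being that only this one parameter changes between a contributor's local run and the Jenkins/testgrid runs, while the test code lives next to the scheduler backend it exercises.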