Context: we're looking to move away from splitting CI between CircleCI and ASF
CI, and to get ASF CI into a stable state. There are a variety of reasons why
it's flaky (orchestration, heterogeneous hardware, hardware failures, flaky
tests, non-deterministic runs, noisy neighbors, etc.), many of which Mick has
been making great headway on addressing.

If you're curious see:
- Mick's 2023/01/09 email thread on CI:
    https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
- Mick's 2023/04/26 email thread on CI:
    https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
- CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o":
    https://issues.apache.org/jira/browse/CASSANDRA-18137
- CASSANDRA-18133: In-tree build scripts:
    https://issues.apache.org/jira/browse/CASSANDRA-18133

What's fallen out from this: the new reference CI will have the following
logical layers:
1. ant
2. build/test scripts that set up the env. See run-tests.sh and
    run-python-dtests.sh here:
    https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
3. dockerized build/test scripts that containerize the flow of 1 and 2. See:
    https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
4. CI integrations. See generation of unified test report in build.xml:
    https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
5. Optional full CI lifecycle w/Jenkins running in a container (full stack
    setup, run, teardown, pending)
**I want to let everyone know the high level structure of how this is shaping
up, as this is a change that will directly impact the work of *all of us* on
the project.**

The chief goals I'd like to call out in this context are:
* ASF CI needs to be and remain consistent
* contributors need a turnkey way to validate their work before merging that
    they can accelerate by throwing resources at it.

We as a project need to determine what is *required* of a run in a CI
environment for that run to be considered certified for merge. Where Mick and I
landed, after a lot of back and forth, is that the following would be required:
1. used ant / pytest to build and run tests
2. used the reference scripts being changed in CASSANDRA-18133 (in-tree .build/)
    to set up and execute your test environment
3. constrained your runtime environment to the same hardware and time
    constraints we use in ASF CI, within reason (CPU count independent of speed,
    memory size and disk size independent of hardware specs, etc)
4. reported test results in a unified fashion that has all the information we
    normally get from a test run
5. (maybe) parallelized the tests across the same split lines as upstream ASF
    (i.e. no weird env specific neighbor / scheduling flakes)
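
To make requirement 3 a bit more concrete, here's a rough sketch of what checking a runtime environment against the reference constraints could look like. Everything in it is hypothetical: the `ResourceLimits` type, the `constraint_violations` helper, and the numbers are placeholders for illustration, not anything from the in-tree scripts or the actual ASF CI agent specs. It also uses strict equality, where the real rule would need some "within reason" tolerance.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ResourceLimits:
    """Hypothetical description of a CI runtime environment."""
    cpu_count: int        # number of cores, independent of clock speed
    memory_gb: int        # memory size, independent of hardware specs
    disk_gb: int          # disk size, independent of hardware specs
    timeout_minutes: int  # time budget for the run


def constraint_violations(reference: ResourceLimits,
                          actual: ResourceLimits) -> list[str]:
    """Return one message per field where `actual` differs from `reference`."""
    violations = []
    for f in fields(ResourceLimits):
        expected = getattr(reference, f.name)
        got = getattr(actual, f.name)
        if got != expected:
            violations.append(f"{f.name}: expected {expected}, got {got}")
    return violations


if __name__ == "__main__":
    # Placeholder values only; not the real ASF CI constraints.
    asf_reference = ResourceLimits(cpu_count=4, memory_gb=8,
                                   disk_gb=50, timeout_minutes=60)
    my_env = ResourceLimits(cpu_count=8, memory_gb=8,
                            disk_gb=50, timeout_minutes=60)
    for message in constraint_violations(asf_reference, my_env):
        print(message)
```

A run whose environment produces an empty list here would meet requirement 3 under these assumptions; anything else would need to be constrained (or explained) before the run counts as certified.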

Last but not least is the "What do we do with CircleCI?" angle. The current
thought is that we allow people to continue using it, with the stated goal of
migrating the circle config over to the unified build scripts as well and
bringing it into compliance with the above requirements.

For reference, here's a gdoc where we've hashed this out:
    https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing

So my questions for the community here:
1. What's missing from the above conceptualization of the problem?
2. Are the constraints too strong? Too weak? Just right?

Thanks everyone, and happy Friday. ;)

~Josh
