Mick and I have been chatting back and forth a bit about 4.1 beta, ASF CI, 
CircleCI, and the current challenges we face as a project with our CI 
environment(s). Given the consistent difficulty we continue to have with the 
ASF CI env, which has been blocking our beta + GA for months now, I think we 
should revisit some of the verbiage around our release lifecycle and commit 
guidelines to better align our release process with confidence that our code 
is correct rather than confidence that ASF CI is passing (not necessarily the 
same thing!).

The proposal:
----------------
Revise language on release lifecycle to specify (reference: 
https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle):
- For beta: 1 green run on CircleCI or 1 green run on ci-cassandra
   - No new flaky tests on ci-cassandra compared to older branches as a 
snapshot at a point in time
- For GA: 3 consecutive green runs on either CircleCI or ci-cassandra
   - No new flaky tests on ci-cassandra compared to older branches as a 
snapshot at a point in time
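
To make the release criteria above concrete, here is a rough sketch of the gate as I read it. The function and parameter names are made up for illustration, and it assumes the green runs that count are the most recent ones on a given CI system; neither is part of the wiki wording.

```python
from typing import Dict, List, Set

# Green runs required per release stage, per the proposal above.
REQUIRED_GREEN = {"beta": 1, "ga": 3}


def trailing_green(runs: List[bool]) -> int:
    """Count consecutive green runs at the end of a history (oldest first)."""
    count = 0
    for green in reversed(runs):
        if not green:
            break
        count += 1
    return count


def release_gate(
    stage: str,
    runs_by_ci: Dict[str, List[bool]],   # hypothetical: run history per CI system
    flaky_now: Set[str],                 # flaky tests seen on the release branch
    flaky_baseline: Set[str],            # flaky tests already known on older branches
) -> bool:
    """True if either CI system has enough green runs and no new flaky tests exist."""
    needed = REQUIRED_GREEN[stage]
    enough_green = any(trailing_green(r) >= needed for r in runs_by_ci.values())
    new_flaky = flaky_now - flaky_baseline
    return enough_green and not new_flaky
```

For example, a branch whose last CircleCI run is green and whose flaky tests all appear in the baseline would pass the beta gate but not the GA gate.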

Revise committing guidelines to specify (reference: 
https://cassandra.apache.org/_/development/how_to_commit.html):
- Must run either CircleCI or ci-cassandra before commit
- For CI on a patch, run the pre-commit suite and also run the multiplexer with 
250 runs on new, changed, or related tests to ensure they are not flaky
- Include clear instructions on how to do that in the wiki / How to Commit 
page (generate.sh etc)
- In case of ci-cassandra, ensure the run doesn't have any new test failures 
compared to the previous run (i.e. no new flakes attributable to your patch) <- 
only needed while we get ci-cassandra cleaned up
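
As a sketch of the two checks in the guidelines above (this is plain Python for illustration, not how the CircleCI multiplexer or generate.sh actually work; the helper names are hypothetical):

```python
from typing import Callable, Iterable, List


def new_failures(previous_failed: Iterable[str], current_failed: Iterable[str]) -> List[str]:
    """Tests failing in the current run that did not fail in the previous run."""
    return sorted(set(current_failed) - set(previous_failed))


def passes_repeatedly(run_test: Callable[[], bool], repeats: int = 250) -> bool:
    """Run a test `repeats` times and stop at the first failure.

    `run_test` is a stand-in for whatever executes one test iteration
    and reports success; a single failure counts as flaky.
    """
    return all(run_test() for _ in range(repeats))
```

A patch would be acceptable under these rules when `new_failures(...)` is empty and every new, changed, or related test passes all 250 repeats.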

In the next major release lifecycle, commit to working as a project on either 
tidying up our current heterogeneous ASF CI environment or spinning up a new 
homogeneous, well-provisioned environment.
---------------
I think it's worth calling out: relying on CircleCI test runs along with ASF CI 
as an option implies we have the same confidence in a green CircleCI run as we 
have in a green ASF CI run. The above revision effectively serves as a failsafe 
for us to be able to cut releases on a more predictable cadence while we 
continue to improve the ASF CI environment and eventually replace CircleCI with 
just ASF CI for release validation.

So - what do we think?

On Mon, Sep 26, 2022, at 12:48 PM, Mick Semb Wever wrote:
>> We discussed in JIRA + ML a balanced approach to cutting 4.1 GA, which was 
>> having 1 green run to cut beta and 3 green runs in a row to cut GA (correct 
>> me if I'm wrong here Mick). To that end, checking in on 4.1 Butler we see 
>> we're at 6 failures on run 162: 
>> https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.1/cassandra-4.1.
>>  This appears to be a pretty consistent number of failures and the history 
>> in butler shows that the failures are scattershot all over the test suites; 
>> we're probably going to have to continue to whack-a-mole these flaky 
>> failures down as long as we keep "Green ASF CI" as our gatekeeping metric as 
>> this is inherent in the current runtime environment.
> 
> 
> +1
> 
> I would like to extend that^ to say for beta: CircleCI to be green, and no 
> known flaky regressions (or blocking jira tickets).
> 
> I know there is more to come on this topic (Josh), and that it is important 
> we balance being pragmatic versus ensuring we continue to push towards green 
> ci-cassandra and no flakies. But for now this means we are ready to cut 
> 4.1-beta1 (which I shall do in the next few hours).
> 
