> > Perhaps this was needed during j17 stabilization but is no longer required?
No, I only switched the tests from running J8+J11 to running J11+J17. What we tested was decided in the 4.0 era when JDK11 was added, and I was not even part of the community yet :-)

> Any known java-related changes require precommit j11 + j17.

While that is correct - we do need to run both when changing anything Java-related - I think the problematic cases are not always obvious during development. A few examples off the top of my head:

- Every next Java version closes more and more internals. Someone can be tempted to use JDK11 internals which are still open in 11 but closed in later Java versions, and then we are in trouble. I would prefer to see that before committing the patch, not after. Some cases can be considered early in development and save us time. They are not necessarily caught at compile time.
- Not all dependencies support all Java versions, and that is not necessarily immediately obvious.
- Not all issues are exposed as failures on every CI run - we have flakies that are hard to reproduce, with runtime exceptions, especially around things that require object crawling. Off the top of my head: jamm and the leak detector, but it could also come from a dependency we haven't even thought about.

With all due respect to everyone and the extremely valuable discussion here around what we run in CI, I fear we have diverged from what we do about the ticket in hand - which is feasible today - to the long-term discussion of nightly builds, etc. I believe the long-term discussion deserves its own thread, ticket, and work. @Berenguer, thank you for volunteering to open an additional thread and to work on a new suggestion. Does anyone have anything against moving all long-term suggestions not immediately related to this work into a new discussion thread?

Also, in the spirit of the repeatable CI project, it would be good to create a table with the suggested pre-commit and post-commit jobs to run.
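The first bullet above - internals that are open on JDK 11 but closed in later versions - can be illustrated with a small probe. This is only a sketch; `ArrayList.elementData` stands in for any `java.base` internal that our code or a dependency might touch via deep reflection:

```java
import java.lang.reflect.Field;
import java.lang.reflect.InaccessibleObjectException;
import java.util.ArrayList;

public class InternalsProbe {
    // Attempt deep reflection into a java.base internal. This succeeds
    // (with a runtime warning) on JDK 9-15, but throws
    // InaccessibleObjectException on JDK 16+ unless the JVM is started
    // with --add-opens java.base/java.util=ALL-UNNAMED.
    static boolean internalsOpen() {
        try {
            Field f = ArrayList.class.getDeclaredField("elementData");
            f.setAccessible(true);
            return true;
        } catch (InaccessibleObjectException | NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("JDK " + Runtime.version().feature()
                + ": internals " + (internalsOpen() ? "open" : "closed"));
    }
}
```

Note that this compiles cleanly on every JDK; only running it on 17 shows the behavior change, which is exactly why compile-time checks alone are not enough.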
Then we can decide what we want and, as a second step, add/remove jobs in Jenkins, Circle, or whatever other CI people use at the moment, and hopefully converge it soon through the repeatable CI project. Do you think this makes sense?

> Again, I don't see value in running build J11 and J17 runtime additionally to J11 runtime - just pick one unless we change something specific to JVM

All JDK-17 problems I've seen were exposed in both situations - running 17 with an 11 build or with a 17 build. So I am fine with Jacek's suggestion too, but I prefer that we run on every commit whatever we ship with. In the case of 5.0: build with JDK11, run JDK11 tests, run JDK17 tests, and, to help ourselves, also build with JDK17.

> Branimir in this patch has already done some basic cleanup of test variations, so this is not a duplication of the pipeline. It's a significant improvement.
>
> I'm ok with cassandra_latest being committed and added to the pipeline, *if* the authors genuinely believe there's significant time and effort saved in doing so.

I share this sentiment, if people are okay with us now adding new Python and Java distributed test jobs with the new configuration pre-commit, and if this is not going to raise resource consumption a lot. (Python tests are the most resource-heavy tests, though we do not look at upgrade tests.)

> The plan is to ensure both configurations are green pre-commit. This should not increase the CI cost as this replaces extra configurations we were running before (e.g. test-tries).

Branimir, did you also replace any Python tests? I am not worried about unit test consumption, but primarily about the Python tests. Those run on the bigger containers in CircleCI, which burn more credits. Also, Stefan, valid point - does Jenkins currently have enough resources to cover the load? Was this tested?

On Thu, 15 Feb 2024 at 13:59, David Capwell <dcapw...@apple.com> wrote:

> This thread got large quick, yay!
> > is there a reason all guardrails and reliability (aka repair retries) configs are off by default? They are off by default in the normal config for backwards compatibility reasons, but if we are defining a config saying what we recommend, we should enable these things by default IMO.
>
> > This is one more question to be answered by this discussion. Are there other options that should be enabled by the "latest" configuration? To what values should they be set? Is there something that is currently enabled that should not be?
>
> Very likely, we should try to figure that out. We should also answer how conservative we want to be by default. There are many configs we need to flesh out here; glad to help with the configs I authored (prob best for JIRA rather than this thread).
>
> > Should we merge the configs breaking these tests? No…. When we have failing tests people do not spend the time to figure out if their logic caused a regression, and merge, making things more unstable… so when we merge failing tests that leads to people merging even more failing tests...
>
> > In this case this also means that people will not see at all the failures that they introduce in any of the advanced features, as those are not tested at all. Also, since CASSANDRA-19167 and 19168 already have fixes, the non-latest test suite will remain clean after merge. Note that these two problems demonstrate that we have failures in the configuration we ship with, because we are not actually testing it at all. IMHO this is a problem that we should not delay fixing.
>
> I am not arguing we should not get this into CI, but rather that we should fix the known issues before getting it into CI… it's what we normally do; I don't see a reason to special-case this work.
>
> I am 100% cool with blocking 5.0 on these bugs found (even if they are test failures), but I don't feel we should enable this in CI until these issues are resolved; we can add the yaml now, but not the CI pipelines.
> > 1) If there's an "old compatible default" and "latest recommended settings", when does the value in "old compatible default" get updated? Never?
>
> > How about replacing cassandra.yaml with cassandra_latest.yaml on trunk when cutting the cassandra-6.0 branch? Any new default changes on trunk go to cassandra_latest.yaml.
>
> I feel it's dangerous to define this at the file level; we should do it at the config level… I personally see us adding new features disabled by default in cassandra.yaml and the recommended values in cassandra_latest.yaml… If I add a config in 5.1.2, should it get enabled by default in 6.0? I don't feel that's wise.
>
> Maybe it makes sense to annotate the configs with the target version for the default change?
>
> > Let's distinguish the packages of tests that need to be run with CDC enabled / disabled, with commitlog compression enabled / disabled, tests that verify sstable formats (mostly io and index I guess), and leave the other parameters set as in the latest configuration - this is the easiest way I think.
>
> Yes please! I really hate having a pipeline per config; we should annotate this somehow in the tests that matter… junit can param the tests for us so we cover the different configs the test supports… I have written many tests that are costly and run on all these other pipelines but have 0 change in the config… just wasting resources rerunning…
>
> Pushing this to the test is also a better author/maintainer experience… running the test in your IDE and seeing all the params and their results is so much better than monkeying around with yaml files and ant…. My repair simulation tests have a hack flag to try to switch the yaml to make it easier to test against the other configs, and I loathe it so much…
>
> > To me running no-vnodes makes no sense because no-vnodes is just a special case of vnodes=1
>
> And to many in the community the only config for production =).
> In this debate we have 3 camps: no-vnode, vnode <= 4 tokens, vnode > 4 tokens (I am simplifying….)… For those in the no-vnode camp, their tests are focused on this case and get disabled when vnodes are enabled (so not running this config lowers coverage).
>
> I don't think we are going to solve this debate in this thread, but if we can push this to the test to run as a param, I think that's best… avoid having 2 pipelines and push this to the tests that actually support both configs...
>
> On Feb 15, 2024, at 10:20 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>
> For the sake of argument, if we picked one, would you (anyone, not just Stefan) be OK with the JVM that's selected being the one you don't use? I'm willing to bet most folks will have a preference for the JVM they run in production. I think both should be run on some frequent basis, but I can definitely see the reasoning behind not wanting it to block folks on work; it sounds like a lot of wasted days waiting on CI, especially during a bigger multi-cycle review.
>
> I suppose that it wouldn't necessarily need to be consistent - for example, some folks might use 17 and others 11. If this was the route the project goes down, it seems like it would be reasonable for someone to run whichever JVM version they felt like. Hopefully at least a few regular committers would run 17, and that might be enough.
>
> Maybe instead of running the full suite post-commit, it could be run periodically, like once a night, or, if it takes longer than 24h, once a week. If both JVMs get hit b/c different teams opt to use different versions, maybe it ends up being enough coverage.
>
> I'm curious if anyone can think of an issue that affected one JVM and not others, because that would probably help determine the usefulness of 2 test suites.
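The "junit can param the tests" suggestion above - one test body covering every config it supports, instead of a pipeline per config - could look roughly like this. A dependency-free sketch of the idea (JUnit 5's @ParameterizedTest would express the same thing with annotations); the config keys here are made up for illustration, not real cassandra.yaml settings:

```java
import java.util.List;
import java.util.Map;

public class ConfigParamSketch {
    // Illustrative config variants standing in for cassandra.yaml and
    // cassandra_latest.yaml; the keys are invented for this sketch.
    static final List<Map<String, String>> CONFIGS = List.of(
            Map.of("name", "default", "commitlog_compression", "false"),
            Map.of("name", "latest", "commitlog_compression", "true"));

    // One test body, run once per config it declares support for,
    // instead of duplicating the whole pipeline for each yaml.
    static boolean writeAndReadBack(Map<String, String> config) {
        boolean compressed = Boolean.parseBoolean(config.get("commitlog_compression"));
        // ... exercise the code path that actually depends on the setting;
        // the placeholder below stands in for real assertions ...
        return compressed || !compressed;
    }

    public static void main(String[] args) {
        for (Map<String, String> config : CONFIGS)
            System.out.println(config.get("name") + ": "
                    + (writeAndReadBack(config) ? "PASS" : "FAIL"));
    }
}
```

Run in an IDE, this shows each param and its result directly - the author/maintainer experience described above - and a test with "0 change in the config" would simply declare a single variant instead of rerunning everywhere.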
> On Thu, Feb 15, 2024 at 10:01 AM Štefan Miklošovič <stefan.mikloso...@gmail.com> wrote:
>
>> Only requiring building on supported JDKs and running all tests on a single pre-defined version is definitely an improvement in terms of build time. Building is cheap: one worker and 5 minutes.
>>
>> As I already said, just want to reiterate that instead of _running with all Javas_ we might run with one Java version; we would just change it for the runs of the two yamls (default and latest).
>>
>> However, this would put more stress on Jenkins, based on what you described in the post-commit point. Just saying it aloud.
>>
>> On Thu, Feb 15, 2024 at 6:12 PM Josh McKenzie <jmcken...@apache.org> wrote:
>>
>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit?
>>>
>>> Lots and lots of other emails
>>>
>>> ;)
>>>
>>> There's an interesting broad question of: what config do we consider "recommended" going forward, the "conservative" (i.e. old) or the "performant" (i.e. new)? And what JDK do we consider "recommended" going forward, the oldest we support or the newest?
>>>
>>> Since those recommendations apply to new clusters, people need to qualify their setups, and we have a high bar of quality on testing pre-merge, my gut tells me "performant + newest JDK". This would impact what we'd test pre-commit IMO.
>>>
>>> Having been doing a lot of CI stuff lately, some observations:
>>>
>>> - Our True North needs to be releasing a database that's free of defects that violate the core properties we commit to our users. No data loss, no data resurrection, transient or otherwise, due to defects in our code (meteors, tsunamis, etc. notwithstanding).
>>> - The relationship of time spent on CI and the stability of final full *post-commit* runs is asymptotic.
>>> It's not even 90/10; we're probably somewhere like 98% of the value gained from 10% of the work, and the other 2% of "stability" (i.e. green test suites, not "our database works") is a long-tail slog - especially in the current heterogeneous ASF CI env w/its current orchestration.
>>> - Thus: pre-commit and post-commit should be different. The following points all apply to pre-commit:
>>>    - The goal of pre-commit tests should be some number of 9's of no test failures post-commit (i.e. for every 20 green pre-commits we introduce 1 flake post-commit). Not full perfection; it's not worth the compute and complexity.
>>>    - We should *build* all branches on all supported JDKs (8 + 11 for older, 11 + 17 for newer, etc).
>>>    - We should *run* all test suites with the *recommended configuration* against the *highest versioned JDK a branch supports*. And we should formally recommend our users run on that JDK.
>>>    - We should *at least* run all jvm-based test suites on the highest supported JDK version with the "not recommended but still supported" configuration.
>>>    - I'm open to being persuaded that we should at least run jvm-unit tests on the older JDK w/the conservative config pre-commit, but not much beyond that.
>>>
>>> That would leave us with the following distilled:
>>>
>>> *Pre-commit:*
>>>
>>> - Build on all supported JDKs
>>> - All test suites on the highest supported JDK using the recommended config
>>> - Repeat testing of new or changed tests on the highest supported JDK w/the recommended config
>>> - JVM-based test suites on the highest supported JDK using the other config
>>>
>>> *Post-commit:*
>>>
>>> - Run everything: all suites, all supported JDKs, both config files.
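One hypothetical way to express the distilled matrix above as CI configuration data - a sketch only; the job and suite names are illustrative and do not correspond to any actual Jenkins or Circle definition:

```yaml
# Illustrative only - names are invented, not real job definitions.
pre-commit:
  build:
    jdks: [11, 17]             # build on all supported JDKs
  test:
    - suites: all
      jdk: 17                  # highest supported JDK
      config: cassandra_latest.yaml
    - suites: new-or-changed   # repeat runs to shake out flakies early
      jdk: 17
      config: cassandra_latest.yaml
    - suites: jvm-based
      jdk: 17
      config: cassandra.yaml   # supported but not recommended

post-commit:
  test:
    - suites: all
      jdks: [11, 17]
      configs: [cassandra.yaml, cassandra_latest.yaml]
```

Having the matrix written down in one place would also serve as the pre-commit/post-commit table suggested earlier in the thread.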
>>> With Butler + the *jenkins-jira* integration script <https://github.com/apache/cassandra-builds/blob/trunk/jenkins-jira-integration/jenkins_jira_integration.py> (need to dust that off, but it should remain good to go), we should have a pretty clear view as to when any consistent regressions are introduced and why. We'd remain exposed to JDK-specific flake introductions and flakes in unchanged tests, but there's no getting around the second one, and I expect the former to be rare enough not to warrant the compute to prevent it.
>>>
>>> On Thu, Feb 15, 2024, at 10:02 AM, Jon Haddad wrote:
>>>
>>> Would it make sense to only block commits on the test strategy you've listed, and shift the entire massive test suite to post-commit? If there really is only a small % of times the entire suite is useful, this seems like it could unblock the dev cycle but still have the benefit of the full test suite.
>>>
>>> On Thu, Feb 15, 2024 at 3:18 AM Berenguer Blasi <berenguerbl...@gmail.com> wrote:
>>>
>>> On reducing circle ci usage during dev while iterating, not with the intention to replace the pre-commit CI (yet), we could get away with testing only dtests, jvm-dtests, units and cqlsh for a _single_ configuration imo. That would greatly reduce usage. I hacked it together quickly here for illustration purposes: https://app.circleci.com/pipelines/github/bereng/cassandra/1164/workflows/3a47c9ef-6456-4190-b5a5-aea2aff641f1 The good thing is that we have the tooling to dial in whatever we decide atm.
>>>
>>> Changing pre-commit is a different discussion, with which I agree btw. But the above could save time and $ big time during dev, and be done and merged in a matter of days imo.
>>>
>>> I can open a DISCUSS thread if we feel it's worth it.
>>> On 15/2/24 10:24, Mick Semb Wever wrote:
>>>
>>> Mick and Ekaterina (and everyone really) - any thoughts on what test coverage, if any, we should commit to for this new configuration? Acknowledging that we already have *a lot* of CI that we run.
>>>
>>> Branimir in this patch has already done some basic cleanup of test variations, so this is not a duplication of the pipeline. It's a significant improvement.
>>>
>>> I'm ok with cassandra_latest being committed and added to the pipeline, *if* the authors genuinely believe there's significant time and effort saved in doing so.
>>>
>>> How many broken tests are we talking about?
>>> Are they consistently broken or flaky?
>>> Are they ticketed up and 5.0-rc blockers?
>>>
>>> Having to deal with flakies and broken tests is an unfortunate reality of having a pipeline of 170k tests.
>>>
>>> Despite real frustrations, I don't believe the broken-windows analogy is appropriate here – it's more of a leave-the-campground-cleaner situation… That being said, knowingly introducing a few broken tests is not that either, but still, having to deal with a handful of consistently breaking tests for a short period of time is not the same cognitive burden as flakies. There are currently other broken tests in 5.0: VectorUpdateDeleteTest, upgrade_through_versions_test; are these compounding the frustration?
>>>
>>> It's also been questioned why we don't just enable the settings we recommend. These are settings we recommend for new clusters. Our existing cassandra.yaml needs to be tailored for existing clusters being upgraded, where we are very conservative about changing defaults.