> Is there a reason all guardrails and reliability (aka repair retries)
> configs are off by default? They are off by default in the normal config
> for backwards compatibility reasons, but if we are defining a config
> saying what we recommend, we should enable these things by default IMO.

This is one more question to be answered by this discussion. Are there
other options that should be enabled by the "latest" configuration? To
what values should they be set? Is there something that is currently
enabled that should not be?

> Should we merge the configs breaking these tests? No…. When we have
> failing tests people do not spend the time to figure out if their logic
> caused a regression and merge, making things more unstable… so when we
> merge failing tests that leads to people merging even more failing
> tests...

In this case it also means that people will not see any of the failures
they introduce in the advanced features, because those features are not
tested at all. Also, since CASSANDRA-19167 and CASSANDRA-19168 already
have fixes, the non-latest test suite will remain clean after the merge.
Note that these two problems demonstrate that we have failures in the
configuration we ship with, because we are not actually testing it at
all. IMHO this is a problem that we should not delay fixing.

Regards,
Branimir

On Wed, Feb 14, 2024 at 1:07 AM David Capwell <dcapw...@apple.com> wrote:

> > so can cause repairs to deadlock forever
>
> Small correction, I finished fixing the tests in CASSANDRA-19042 and we
> don’t deadlock, we time out and fail repair if any of those messages
> are dropped.
>
> On Feb 13, 2024, at 11:04 AM, David Capwell <dcapw...@apple.com> wrote:
>
> > > and to point potential users that are evaluating the technology to
> > > an optimized set of defaults
> >
> > Left this comment in the GH… is there a reason all guardrails and
> > reliability (aka repair retries) configs are off by default? They are
> > off by default in the normal config for backwards compatibility
> > reasons, but if we are defining a config saying what we recommend, we
> > should enable these things by default IMO.
> >
> > > There are currently a number of test failures when the new options
> > > are selected, some of which appear to be genuine problems. Is the
> > > community okay with committing the patch before all of these are
> > > addressed?
> >
> > I was tagged on CASSANDRA-19042, the paxos repair message handling
> > does not have the repair reliability improvements that 5.0 has, so
> > can cause repairs to deadlock forever (same as current 4.x repairs).
> > Bringing these up to par with the rest of repair would be very much
> > welcome (they are also lacking visibility, so need to fall back to
> > heap dumps to see what’s going on; same as 4.0.x but not 4.1.x), but
> > I doubt I have cycles to do that…. This refactor is not 100% trivial
> > as it has fun subtle concurrency issues to address (message retries
> > and deduping), and making sure this logic works with the existing
> > repair simulation tests does require refactoring how the paxos
> > cleanup state is tracked, which could have subtle consequences.
> >
> > I do think this should be fixed, but should it block 5.0? Not sure…
> > will leave to others….
> >
> > Should we merge the configs breaking these tests? No…. When we have
> > failing tests people do not spend the time to figure out if their
> > logic caused a regression and merge, making things more unstable… so
> > when we merge failing tests that leads to people merging even more
> > failing tests...
> >
> > On Feb 13, 2024, at 8:41 AM, Branimir Lambov <blam...@apache.org> wrote:
> >
> > > Hi All,
> > >
> > > CASSANDRA-18753 introduces a second set of defaults (in a separate
> > > "cassandra_latest.yaml") that enable new features of Cassandra. The
> > > objective is two-fold: to be able to test the database in this
> > > configuration, and to point potential users that are evaluating the
> > > technology to an optimized set of defaults that give a clearer
> > > picture of the expected performance of the database for a new user.
> > > The objective is to get this configuration into 5.0 to have the
> > > extra bit of confidence that we are not releasing (and
> > > recommending) options that have not gone through thorough CI.
> > >
> > > The implementation has already gone through review, but I'd like to
> > > get people's opinion on two things:
> > > - There are currently a number of test failures when the new
> > >   options are selected, some of which appear to be genuine
> > >   problems. Is the community okay with committing the patch before
> > >   all of these are addressed? This should prevent the introduction
> > >   of new failures and make sure we don't release before clearing
> > >   the existing ones.
> > > - I'd like to get an opinion on what's suitable wording and
> > >   documentation for the new defaults set. Currently, the patch
> > >   proposes adding the following text to the yaml (see
> > >   https://github.com/apache/cassandra/pull/2896/files):
> > >
> > > # NOTE:
> > > # This file is provided in two versions:
> > > # - cassandra.yaml: Contains configuration defaults for a "compatible"
> > > #   configuration that operates using settings that are
> > > #   backwards-compatible and interoperable with machines running
> > > #   older versions of Cassandra. This version is provided to
> > > #   facilitate pain-free upgrades for existing users of Cassandra
> > > #   running in production who want to gradually and carefully
> > > #   introduce new features.
> > > # - cassandra_latest.yaml: Contains configuration defaults that
> > > #   enable the latest features of Cassandra, including improved
> > > #   functionality as well as higher performance. This version is
> > > #   provided for new users of Cassandra who want to get the most out
> > > #   of their cluster, and for users evaluating the technology.
> > > # To use this version, simply copy this file over cassandra.yaml, or
> > > # specify it using the -Dcassandra.config system property, e.g. by
> > > # running
> > > #   cassandra -Dcassandra.config=file:/$CASSANDRA_HOME/conf/cassandra_latest.yaml
> > > # /NOTE
> > >
> > > Does this sound sensible? Should we add a pointer to this defaults
> > > set elsewhere in the documentation?
> > >
> > > Regards,
> > > Branimir
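
For the question of which options differ between cassandra.yaml and
cassandra_latest.yaml, one possible approach is to diff the two files
mechanically. The following is a minimal sketch, not part of the patch:
it assumes the settings of interest are flat, top-level `key: value`
lines (nested sections and list items are skipped), the helper names
`top_level_settings` and `diff_settings` are made up for illustration,
and the option names in the demo are hypothetical placeholders rather
than real Cassandra settings.

```python
def top_level_settings(text):
    """Collect flat `key: value` pairs, skipping comments and nested lines."""
    settings = {}
    for line in text.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        # Skip indented (nested) entries and list items; assumption: only
        # top-level scalar options matter for this comparison.
        if line[0] in " \t-":
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Drop a trailing inline comment, if any.
        settings[key.strip()] = value.split(" #", 1)[0].strip()
    return settings


def diff_settings(compat_text, latest_text):
    """Return {key: (compat_value, latest_value)} for every differing key."""
    compat = top_level_settings(compat_text)
    latest = top_level_settings(latest_text)
    changed = {}
    for key in sorted(compat.keys() | latest.keys()):
        old, new = compat.get(key), latest.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    COMPAT = "option_a: false\noption_b: 16\n"
    LATEST = "option_a: true\noption_b: 16\noption_c: on\n"
    for name, (old, new) in diff_settings(COMPAT, LATEST).items():
        print(f"{name}: {old} -> {new}")
```

Running something like this against the two real files would give a
starting list of options to review one by one; a full YAML parser would
be needed to compare nested sections as well.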