Re: [DISCUSS] CEP-10: Cluster and Code Simulations

Paulo Motta Tue, 13 Jul 2021 06:21:13 -0700

> "Where do we do that?" is a more tricky question.

I am fully aware of the importance of this testing infra to fix
CASSANDRA-12126 with a higher confidence and of Benedict's ability to
deliver a correct and safe patch.


The question is whether we want to be repeating old practices of including
potentially disruptive changes in minor versions or if we are committed to
changing our culture, no matter how confident we are the change is correct.
In my view, if we set a precedent with this change, we are basically saying
we will stick to the old practices and not be committed to providing long
term stability to our users.

In my view CEP-10 is not a strict blocker to CASSANDRA-12126 since we can
verify it by other means and add additional verification in 4.1 as
Jeremiah suggested. But even if it were, the community has worked around the
limitations of LWT for several years. Will one more year until we fix these
limitations really make a difference?

On Tue, 13 Jul 2021 at 10:15, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> I tend to agree with Paulo that a major refactoring of some internal
> interfaces sounds like something to be explicitly avoided in a patch
> release.  I thought this was the type of change we all agreed we should
> stop letting in to patch releases, and that we would attempt to release
> more often (once a year) so changes that only go to trunk would get out
> faster?  Are we really wanting to break that promise to ourselves before we
> even release 4.0?  To me “I think we need this feature released faster” is
> not a reason to put it in 4.0, it could be a reason to release 4.1 sooner.
> This is where having a releasable trunk helps, as if we decided as a
> project that some change was worth a new major being released early the
> effort of doing that release is much smaller when trunk is releasable.
>
> Any fix we make in 4.0 would be merged forward into trunk and could be
> fully verified there?  Probably not the best, but would give more
> confidence in a fix than otherwise without adding other major changes to
> 4.0?
>
> -Jeremiah
>
> > On Jul 13, 2021, at 7:59 AM, Benjamin Lerer <b.le...@gmail.com> wrote:
> >
> >>
> >> Furthermore, we introduced a significant performance regression in all
> >> lines of the software by increasing the number of LWT round-trips. Unless
> >> we intend to leave this regression for a further year without _any_ release
> >> offering a solution, we will need suitable verification mechanisms for
> >> whatever fixes we deliver.
> >>
> >> My view is that it is unacceptable to leave such a significant regression
> >> unaddressed in all lines of software we intend to release for the
> >> foreseeable future.
> >
> >
> > I would like to expand a bit on this as I believe it might be important for
> > people to have the full picture. The fix for CASSANDRA-12126
> > <https://issues.apache.org/jira/browse/CASSANDRA-12126> introduced a
> > regression by increasing the number of LWT round-trips. Nevertheless, the
> > patch introduced a flag to allow users to revert to the previous behavior
> > (previous performance + consistency issue).
> >
> > Also, the patch did not address all Paxos consistency issues. There are
> > still some issues during topology changes (and maybe in some other
> > scenarios).
> >
> > My understanding of Benedict's proposal is to fix paxos once and for all
> > without any performance regression.
> >
> > That goal makes total sense to me. "Where do we do that?" is a more tricky
> > question.
> >
> > On Tue, 13 Jul 2021 at 14:46, bened...@apache.org <bened...@apache.org>
> > wrote:
> >
> >> Hmm. It occurs to me I’m not entirely sure how our new release process is
> >> going to work.
> >>
> >> Will we be releasing 4.1 builds immediately, as part of shippable trunk?
> >> Or will 4.0 be our only active line of software for the next year?
> >>
> >> Either way, I bet my bottom dollar there will come some regret if we
> >> introduce such divergence between the two most active branches we maintain,
> >> so early in their lifecycles. If we invest significant resources in
> >> improved testing using this framework (which I very much expect) then
> >> branches that are not compatible will not benefit, likely reducing their
> >> quality; and the risk of backports will increase, due to divergence.
> >>
> >> Altogether, I think it would be a huge mistake. But if we will be shipping
> >> releases soon that can fix these aforementioned regressions, I won’t
> >> campaign for it.
> >>
> >>
> >>
> >> From: bened...@apache.org <bened...@apache.org>
> >> Date: Tuesday, 13 July 2021 at 13:31
> >> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> >> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> >> No change is without risk; we have introduced serious regressions with bug
> >> fixes to patch releases. The overall risk to the release lifecycle is
> >> reduced significantly in my opinion, as we reduce the likelihood of
> >> introducing regressions, and can use the same test infrastructure across
> >> all of the actively developed releases, increasing our confidence in 4.0.x
> >> releases.
> >>
> >> Furthermore, we introduced a significant performance regression in all
> >> lines of the software by increasing the number of LWT round-trips. Unless
> >> we intend to leave this regression for a further year without _any_ release
> >> offering a solution, we will need suitable verification mechanisms for
> >> whatever fixes we deliver.
> >>
> >> My view is that it is unacceptable to leave such a significant regression
> >> unaddressed in all lines of software we intend to release for the
> >> foreseeable future.
> >>
> >>
> >> From: Paulo Motta <pauloricard...@gmail.com>
> >> Date: Tuesday, 13 July 2021 at 13:21
> >> To: Cassandra DEV <dev@cassandra.apache.org>
> >> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> >>> No, in my opinion the target should be 4.0.x. We are reaching for a
> >> shippable trunk and this has no public API impacts. This work is IMO
> >> central to achieving a shippable trunk, either way. The only reason I do
> >> not target 3.x is that it would be too burdensome.
> >>
> >> In my limited view of the proposal, a major refactor of internal
> >> concurrency APIs to support the testing facility potentially risks the
> >> stability of a minor release, something we've been wanting to avoid with
> >> our focus on stability. So I'd prefer this to go in trunk/4.1; otherwise
> >> we will set a precedent for including non-bugfix changes in minor
> >> versions, something I think we should avoid.
> >>
> >> In the past we've been lenient about including seemingly harmless internal
> >> changes that caused client impact and we should be careful to avoid this in
> >> the future. To prevent this I think we should take a strict approach and
> >> only accept bug fixes in minor (i.e. 4.0.x) versions moving forward.
> >>
> >> I'd go one step further and propose that any CEPs, which are generally
> >> about new features, major API changes or internal refactorings, should only
> >> be allowed in subsequent major versions, unless an explicit exception is
> >> granted.
> >>
> >> On Tue, 13 Jul 2021 at 07:11, bened...@apache.org <
> >> bened...@apache.org> wrote:
> >>
> >>> Perhaps it’s worth looking forward at the roadmap that we plan to
> >> develop,
> >>> and consider whether such a facility would be welcome for proving their
> >>> safety, and we can then worry about evolving the specifics of any
> API(s)
> >>> together as we deploy the capability? Looking ahead, there are very few
> >>> major features I wouldn’t want to see exercised with this approach,
> given
> >>> the choice.
> >>>
> >>> The LWT Verifier by itself is an integration test that covers many of
> the
> >>> affected subsystems, including sstables, memtables and repair. But we
> >> will
> >>> have the ability to introduce dedicated verification for each of these
> >>> features and systems, and we will necessarily produce more robust code
> >>> (repair is a great example of a brittle system that would be impossible
> >> to
> >>> produce with such an adversarial test system)
> >>>
> >>>
> >>> *Query side improvements:*
> >>>
> >>>  * Storage Attached Index or SAI. The CEP can be found at
> >>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-7%3A+Storage+Attached+Index
> >>>  * Add support for OR predicates in the CQL where clause
> >>>  * Allow to aggregate by time intervals (CASSANDRA-11871) and allow
> UDFs
> >>> in GROUP BY clause
> >>>  * Ability to read the TTL and WRITE TIME of an element in a collection
> >>> (CASSANDRA-8877)
> >>>  * Multi-Partition LWTs
> >>>  * Materialized views hardening: Addressing the different Materialized
> >>> Views issues (see CASSANDRA-15921 and [1] for some of the work
> involved)
> >>>
> >>> *Security improvements:*
> >>>
> >>>  * SSTables encryption (CASSANDRA-9633)
> >>>  * Add support for Dynamic Data Masking (CEP pending)
> >>>  * Allow the creation of roles that have the ability to assign
> arbitrary
> >>> privileges, or scoped privileges without also granting those roles
> access
> >>> to database objects.
> >>>  * Filter rows from system and system_schema based on users permissions
> >>> (CASSANDRA-15871)
> >>>
> >>> *Performance improvements:*
> >>>
> >>>  * Trie-based index format (CEP pending)
> >>>  * Trie-based memtables (CEP pending)
> >>>  * Paxos improvements: Paxos / LWT implementation that would enable the
> >>> database to serve serial writes with two round-trips and serial reads
> >> with
> >>> one round-trip in the uncontended case
> >>>
> >>> *Safety/Usability improvements:*
> >>>
> >>>  * Guardrails. The CEP can be found at
> >>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/%28DRAFT%29+-+CEP-3%3A+Guardrails
> >>>  * Add ability to track state in repair (CASSANDRA-15399)
> >>>  * Repair coordinator improvements (CASSANDRA-15399)
> >>>  * Make incremental backup configurable per keyspace and table
> >>> (CASSANDRA-15402)
> >>>  * Add ability to blacklist a CQL partition so all requests are ignored
> >>> (CASSANDRA-12106)
> >>>  * Add default and required keyspace replication options
> >> (CASSANDRA-14557)
> >>>  * Transactional Cluster Metadata: Use of transactions to propagate
> >>> cluster metadata
> >>>  * Downgrade-ability: Ability to downgrade in the event that
> >>> a serious issue has been identified
> >>>
> >>> *Pluggability improvements:*
> >>>
> >>>  * Pluggable schema manager (CEP pending)
> >>>  * Pluggable filesystem (CEP pending)
> >>>  * Pluggable authenticator for CQLSH (CASSANDRA-16456). A CEP draft can
> >> be
> >>> found at
> >>>
> >>>
> >>
> https://docs.google.com/document/d/1_G-OZCAEmDyuQuAN2wQUYUtZBEJpMkHWnkYELLhqvKc/edit
> >>>  * Memtable API (CEP pending). The goal being to allow improvements
> such
> >>> as CASSANDRA-13981 to be easily plugged into Cassandra
> >>>
> >>> *Memtable pluggable implementation:*
> >>>
> >>>  * Enable Cassandra for Persistent Memory (CASSANDRA-13981)
> >>>
> >>>
> >>>
> >>>
> >>> From: bened...@apache.org <bened...@apache.org>
> >>> Date: Tuesday, 13 July 2021 at 10:51
> >>> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> >>> Ach, editing code in the email editor isn’t smart when editors all have
> >>> different meanings for key combinations (accidentally hit send), but
> you
> >>> get the idea. The simulator would intercept these thread executions,
> the
> >>> memory accesses for the annotated field, and evaluate them so that in
> >> some
> >>> cases the assertions would fail.
> >>>
> >>> This is obviously a toy example that is not very interesting, but the
> >> main
> >>> real example we have is too complicated to produce a snippet to
> >>> demonstrate. In my view, the long term outcome of this work is likely
> the
> >>> enablement of many unit tests that are a little more complicated than
> >> this,
> >>> on less obvious code.
> >>>
> >>> But the headline goal of the CEP is not. By itself, the LWT Verifier
> >>> demonstrates the power and utility of the work. I don’t believe it is
> >>> terribly helpful to focus on secondary justifications like the example
> I
> >>> gave. For me, the _ability_ to prove the correctness of difficult but
> >>> critical systems is justification enough, whether or not we deliver a
> >>> simple API as part of the CEP.
> >>>
> >>>
> >>>
> >>> From: bened...@apache.org <bened...@apache.org>
> >>> Date: Tuesday, 13 July 2021 at 10:43
> >>> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> >>>> Should target release be 4.1. (not 4.0.x) ?
> >>>
> >>>
> >>>
> >>> No, in my opinion the target should be 4.0.x. We are reaching for a
> >>> shippable trunk and this has no public API impacts. This work is IMO
> >>> central to achieving a shippable trunk, either way. The only reason I
> do
> >>> not target 3.x is that it would be too burdensome.
> >>>
> >>>> My concern is that changing code and tests at the same time risks
> >>> regressions…
> >>>
> >>>
> >>>
> >>> I’ve never heard this position before. Would you care to elaborate? It
> is
> >>> quite normal for us to update tests alongside changes to the code.
> >>>
> >>>> And seconding Benjamin's comments… some documentation on how to write
> a
> >>> test, and a simple test example, that this CEP then allows us to write
> >>> would help a lot (a la "working backwards").
> >>>
> >>> 1) This work is to _enable_ the development of tests, with the only
> test
> >>> originally planned to arrive alongside it the fairly sophisticated LWT
> >>> Verifier. This is something we have sorely needed as a project, as we
> >> have
> >>> had serious correctness violations for multiple years. This broad
> >> category
> >>> of integrated test for verifying correctness is the main goal of the
> work
> >>> and is not easily condensed into an example snippet.
> >>> 2) It is _possible_ that some simple and fluid APIs will be introduced
> in
> >>> a later phase of this work, but they haven’t been designed yet, so I
> >> cannot
> >>> share snippets.
> >>>
> >>> In principle, however, you would be able to do something like:
> >>>
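> >>> // Hypothetical annotation: marks a field whose accesses the simulator interleaves.
> >>> @Nemesis volatile int x = 0;
> >>> int foo() {
> >>>    x = x + 1; // separate volatile read and write, so not atomic
> >>>    return x;
> >>> }
> >>>
> >>> @Test
> >>> void test() throws Exception {
> >>>    Future<Integer> f1 = executor.submit(() -> foo());
> >>>    Future<Integer> f2 = executor.submit(() -> foo());
> >>>    // Fails under interleavings where both calls observe x == 2.
> >>>    Assert.assertTrue(f1.get() == 1 || f2.get() == 1);
> >>> }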
> >>>
> >>>
> >>> From: Mick Semb Wever <m...@apache.org>
> >>> Date: Tuesday, 13 July 2021 at 10:28
> >>> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-10: Cluster and Code Simulations
> >>>>
> >>>> To achieve this, significant modifications will be required to the
> >>> codebase, mostly cleaning up existing abstractions. Specifically, we
> will
> >>> need to be able to mock executors, any blocking concurrency primitives,
> >>> time, filesystem access and internode streaming.
> >>>>
> >>>> The work is – in large part – already complete, with JIRA and PRs to
> >>> follow in the coming weeks. Of course, the work is subject to the usual
> >>> community input and review, so this does not preclude changes to the
> work
> >>> (even significant ones, if they are warranted). I know a lot of
> incoming
> >>> CEP are likely to be backed up by significant off-list development as a
> >>> result of the focus on a shippable 4.0. Hopefully this is just a
> >> temporary
> >>> growing pain, particularly as we move towards a shippable trunk.
> >>>>
> >>>> I hope this work will be of huge value to the project, particularly as
> >>> we race to catch up on years of limited feature development.
> >>>>
> >>>> JIRA and PRs will follow, but I wanted to kick-off discussion in
> >> advance.
> >>>>
> >>>
> >>>
> >>>
> >>> Should target release be 4.1. (not 4.0.x) ?
> >>>
> >>> I'd be interested in seeing a rough timeline/plan of how the proposed
> >>> changes are to be defined in JIRAs and ordered.
> >>>
> >>> I'd like to hear a bit more about the test plan. Not so much about how
> >>> the CEP itself improves testability of the project, but for example
> >>> the testing required to be in place to introduce the changes of the
> >>> CEP (and if it already exists, where). My concern is that changing
> >>> code and tests at the same time risks regressions…
> >>>
> >>> And seconding Benjamin's comments… some documentation on how to write
> >>> a test, and a simple test example, that this CEP then allows us to
> >>> write would help a lot (a la "working backwards").
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >>> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>>
> >>
>
>
>
>
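
A minimal sketch of the idea behind the toy example quoted above (the annotated field `x` and two concurrent calls to `foo()`), assuming nothing about the eventual CEP-10 API, which the thread says has not been designed yet. Instead of intercepting real threads, it hand-models `foo()` as three steps (read `x`, write `x + 1`, read `x`) and replays every interleaving of two such tasks. The names `InterleavingSketch`, `Task`, `runStep` and `failingSchedules` are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class InterleavingSketch {
    /** One simulated caller of foo(), modelled as three separate steps. */
    static final class Task {
        int step = 0;  // index of the next step to run (0, 1 or 2)
        int tmp;       // value of x read in step 0
        int result;    // value "returned" by foo() in step 2
    }

    static int x; // stands in for the annotated shared field (assumption)

    /** Runs the next step of the given task against the shared x. */
    static void runStep(Task t) {
        switch (t.step++) {
            case 0: t.tmp = x; break;      // read x
            case 1: x = t.tmp + 1; break;  // write x + 1
            case 2: t.result = x; break;   // return x
            default: throw new IllegalStateException("task already finished");
        }
    }

    /** Collects every ordering of the remaining 'A' steps and 'B' steps. */
    static void schedules(String prefix, int a, int b, List<String> out) {
        if (a == 0 && b == 0) { out.add(prefix); return; }
        if (a > 0) schedules(prefix + 'A', a - 1, b, out);
        if (b > 0) schedules(prefix + 'B', a, b - 1, out);
    }

    /** Replays one schedule and checks the assertion from the quoted test. */
    static boolean assertionHolds(String schedule) {
        x = 0;
        Task a = new Task();
        Task b = new Task();
        for (char c : schedule.toCharArray()) runStep(c == 'A' ? a : b);
        return a.result == 1 || b.result == 1;
    }

    /** Returns the schedules under which the quoted assertion does not hold. */
    static List<String> failingSchedules() {
        List<String> all = new ArrayList<>();
        schedules("", 3, 3, all);
        List<String> failing = new ArrayList<>();
        for (String s : all) if (!assertionHolds(s)) failing.add(s);
        return failing;
    }

    public static void main(String[] args) {
        for (String s : failingSchedules())
            System.out.println("assertion fails for schedule " + s);
    }
}
```

Running `main` lists the schedules in which both calls observe the second increment, which is where the quoted assertion `f1.get() == 1 || f2.get() == 1` would not hold. A real simulator would presumably explore such schedules on the actual code rather than on a hand-written model.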
