It sounds like we are all pretty interested in seeing this feature land and the branch maintenance is causing overhead that could be spent on finalisation. +1 on merging, particularly given the feature flag work.
Once more unto the breach 💪

On Fri, 7 Mar 2025 at 6:56 PM, Benedict <bened...@apache.org> wrote:

> There are essentially three possible timelines to choose from here:
>
> 1) We agree in the next few days to merge to trunk. We will then prioritise rebasing onto trunk and resolving any pre-merge items starting next week.
> 2) There’s some more debate and agreement to merge to trunk in a week or two. In the meantime we will shift to internal-first development but we’ll likely prioritise the above work as soon as we can, which may be in a few weeks, so we can shift to trunk first development.
> 3) We don’t agree to merge accord anytime soon, so we shift to internal-first development for the time being. I’m not sure when we will prioritise any of the above.
>
> Our resources are finite and we’ve exhausted them (literally), so it’s pretty much pick one of the above. I don’t really mind which you pick, but I won’t personally be prioritising merge after this third attempt.
>
> On 6 Mar 2025, at 22:01, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hmm... I took a look at the cep-15-accord branch in GitHub, it looks like it's several hundred commits behind trunk. Since you'll need to rebase again before merge *anyways*, would it make sense to do it once more, and I can publish easy-cass-lab with the latest branch? If folks have concerns, it's easy to fire up a cluster (I do it constantly) and try it out.
>
> I think if we were to do this, out of consideration we should time box the amount of time for an evaluation and unless someone raises an objection, consider lazy consensus achieved.
>
> Jon
>
> On Thu, Mar 6, 2025 at 12:46 PM Benedict Elliott Smith <bened...@apache.org> wrote:
>
>> Because we want to validate against the latest code in trunk, else we are validating stale behaviours. The cost of rebasing is high, so we do not do it frequently.
>> That means we will likely stop developing OSS-first, as the focus will have to move to our internal branch that satisfies these criteria.
>>
>> Exactly what this might be for upstreaming I cannot say. Personally, I aim to work exclusively on the branch we are stabilising. If that is not trunk, the latency for my contributions being made public might be high, as I have a huge imbalance of over-investment to recoup, and anything unnecessary will be deferred.
>>
>> Since the feature is disabled, and the code is almost entirely isolated, I cannot imagine the cost to the community of removing this work would be very high. But, I do not intend to argue Accord’s case here. I will let you all decide.
>>
>> Please decide soon though, as it shapes our work planning. The positive reception so far had led me to consider prioritising a move to trunk-first development within the next week or two, and the associated work that entails. However, if that was optimistic we will have to shift our plans.
>>
>> On 6 Mar 2025, at 20:16, Jordan West <jw...@apache.org> wrote:
>>
>> The work and effort in accord has been amazing. And I’m sure it sets a new standard for code quality and correctness testing which I’m also entirely behind. I also trust the folks working on it want to take it to a fully production-ready solution. But I’m worried about circumstances out of our control leaving us with a very complex feature that isn’t complete.
>>
>> I do have some questions. Could folks help me better understand why testing real workloads necessitates a merge (my understanding from the original reason is this is the impetus for why we would merge now)? Also I think the performance and schema change caveats are rather large ones. One of accord’s promises was better performance and I think making schema changes with nodes down not being supported is a big gap.
>> Could we have some criteria like “supports all the operations PaxosV2 supports” or “performs as well as or better than PaxosV2 on [workload(s)]”?
>>
>> I understand waiting asks a lot of the authors in terms of bearing the burden of a more complex merge. But I think we also need to consider what merging is asking the community to bear if the worst happens and we are unable to take the feature from its current state to something that can be widely used in production.
>>
>> Jordan
>>
>> On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <bl...@ultrablake.com> wrote:
>>
>>> +1 to merging it
>>>
>>> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>>>
>>> You have my +1
>>>
>>> On Wed, Mar 5, 2025 at 12:16 PM Benedict <bened...@apache.org> wrote:
>>> >
>>> > Correct, these caveats should only apply to tables that have opted in to accord.
>>> >
>>> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <jerem...@apache.org> wrote:
>>> >
>>> > So great to see all this hard work about to pay off!
>>> >
>>> > On the questions/concerns front, the only concern I would have towards merging this to trunk is if any of the caveats apply when someone is not using Accord. Assuming they only apply when the feature flag is enabled, I see no reason not to get this merged into trunk once everyone involved is happy with the state of it.
>>> >
>>> > -Jeremiah
>>> >
>>> > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith <bened...@apache.org> wrote:
>>> >>
>>> >> That depends on all of you lovely people :D
>>> >>
>>> >> I think we should have finished merging everything we want before QA by ~Monday; certainly not much later.
>>> >>
>>> >> I think we have some upgrade and python dtest failures to address as well.
>>> >>
>>> >> So it could be pretty soon if the community is supportive.
>>> >>
>>> >> On 5 Mar 2025, at 17:22, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>> >>
>>> >> What is the timing for starting the merge process? I'm asking because I have (yet another) presentation and this would be a cool update.
>>> >>
>>> >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith <bened...@apache.org> wrote:
>>> >> >
>>> >> > Thanks everyone.
>>> >> >
>>> >> > Jon - your help will be greatly appreciated. We’ll let you know when we’ve got the cycles to invest in performance work (hopefully fairly soon). I expect the first step will be improving visibility so we can better understand what the system is doing (particularly the caching layers), but we can dig in together when ready.
>>> >> >
>>> >> > On 4 Mar 2025, at 18:15, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> >> >
>>> >> > Very exciting!
>>> >> >
>>> >> > I have a client that's very interested in Accord, so I should have budget to dig into it, especially on the performance side of things.
>>> >> >
>>> >> > Jon
>>> >> >
>>> >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>> >> >>
>>> >> >> Thank you to all Accord and TCM contributors, it is really exciting to see a development of such huge and wonderful features moving forward and opening the door to the new Cassandra epoch!
>>> >> >>
>>> >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <bl...@ultrablake.com> wrote:
>>> >> >>>
>>> >> >>> Thanks Benedict!
>>> >> >>>
>>> >> >>> I’m really excited to see accord reach this milestone, even with these caveats.
>>> >> >>> You seem to have left yourself off the list of contributors though, even though you’ve been a central figure in its development :) So thanks to all accord & tcm contributors, including Benedict, for making this possible!
>>> >> >>>
>>> >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote:
>>> >> >>>
>>> >> >>> Hi everyone,
>>> >> >>>
>>> >> >>> It’s been exactly 3.5 years since the first commit to cassandra-accord. Yes, really, it’s been that long.
>>> >> >>>
>>> >> >>> We will be starting to validate the feature against real workloads in the near future, so we can’t sensibly push off merging much longer. The following is a brief run-down of the state of play. There are no known bugs, but there remain a number of caveats we will be incrementally addressing in the run-up to a full release:
>>> >> >>>
>>> >> >>> [1] Accord is likely to be SLOW until further optimisations are implemented
>>> >> >>> [2] Schema changes have a number of hard edges
>>> >> >>> [3] Validation is ongoing, so there are likely still a number of bugs to shake out
>>> >> >>> [4] Many operator visibility/tooling/documentation improvements are pending
>>> >> >>>
>>> >> >>> To expand a little:
>>> >> >>>
>>> >> >>> [1] As of the last experiment we conducted, accord’s throughput was poor - also leading to higher LAN latencies. We have done no WAN experiments to date, but the protocol guarantees should already achieve better round-trip performance, in particular under contention. Improving throughput will be the main focus of attention once we are satisfied the protocol is otherwise stable, but our focus remains validation for the moment.
>>> >> >>> [2] Schema changes have not yet been well integrated with TCM.
>>> >> >>> Dropping a table for instance will currently cause problems if nodes are offline.
>>> >> >>> [3] We have a range of validations we are already performing against cassandra-accord directly, and against its integration with Cassandra in cep-15-accord. We have run hundreds of billions of simulated transactions, and are still discovering some minor fault every few billion simulated transactions or so. There remains a lot more simulated validation to explore, as well as with real clusters serving real workloads.
>>> >> >>> [4] There are already a range of virtual tables for exploring internal state in Accord, and reasonably good metric support. However, tracing is not yet supported, and our metric and virtual table integrations need some further development.
>>> >> >>> [5] There are also other edge cases to address such as ensuring we do not reuse HLCs after restart, supporting ByteOrderPartitioner, and live migration from/to Paxos is undergoing fine-tuning and validation; probably there are some other things I am forgetting.
>>> >> >>>
>>> >> >>> Altogether the feature is fairly mature, despite these caveats. This is the fruit of the labour of a long list of contributors, including Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston, Caleb Rackliffe and David Capwell, and represents a huge undertaking. It also wouldn’t have been possible without the work of Alex Petrov, Marcus Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata. I hope you will join me in thanking them all for their contributions.
>>> >> >>>
>>> >> >>> Alex has also kindly produced some initial overview documentation for developers, that can be found here: https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc. This will be expanded as time permits.
>>> >> >>>
>>> >> >>> Does anyone have any questions or concerns?
>>> >> >>
>>> >> >> --
>>> >> >> Dmitry Konstantinov