I have merged 2 commits to 5.0 to backport CASSANDRA-20092 and CASSANDRA-20396 as discussed in my previous email.
Links are in the JIRAs as well but to make it easier they are here as well: https://github.com/apache/cassandra/commit/f327b63db09a907206749a3c88aba38a4554e548 https://github.com/apache/cassandra/commit/17cb89208c804680ffd4445d6a826171a67edb79 Jordan On Mon, Apr 14, 2025 at 5:00 AM Chris Lohfink <clohfin...@gmail.com> wrote: > +1 > > On Sun, Apr 13, 2025 at 12:32 PM Jordan West <jw...@apache.org> wrote: > >> Hi Folks, >> >> A bit delayed but I have the backport for 20092 ready. The branch can be >> found here: >> https://github.com/apache/cassandra/compare/cassandra-5.0...jrwest:cassandra:jwest/20092-5.0-backport. >> I've run tests and all looked good. I plan to do one more run post a recent >> rebase. Links are in CASSANDRA-20092. Before merging I wanted to sunshine >> and get community input on the following: >> >> * I've also included CASSANDRA-20396 in the backport. This looks to be a >> fix on top of CASSANDRA-20092 that we would want in the backport. I plan to >> leave these as two separate commits on cassandra-5.0, not one merged >> commit. If you feel differently please let me know. >> * CHANGES.txt: i've included both patches in 5.0.5 section of >> CHANGES.txt. They also appear under 5.1 in trunk. Since trunk isn't >> released I don't see any issue here but if folks have strong opinions on >> this lmk. I don't and can make any requested changes. >> >> * Along the way I discovered CASSANDRA-20538, a minor 1-line change >> needed to actually take advantage of 20092+15452 when BTI is enabled. I >> have included the patch for this in the backport as its largely a ninja >> fix. If folks feel strongly about addressing both 5.0 and trunk with 20538 >> instead of one coming via backport and one coming via the JIRA I can remove >> that commit from my backport (I am also not sure the best way to commit >> this in the backport, so I sort of lean to immediately addressing it in >> 20538. its largely moot since both would be included in the release either >> way). >> >> * For the backport I had to adjust one test ( >> https://github.com/apache/cassandra/commit/0039ebf915b66a88234325a74ef6edd18bde6da0). >> From what I can tell this is just because the compression code is less >> likely to find an aligned key on 5.0 (I haven't dug too much into why now >> that the test reliably passes with this minor tweak). Plan to squash this >> into the backport commit for CASSANDRA-20092 unless folks object. >> >> Jordan >> >> On Wed, Feb 19, 2025 at 1:35 PM Ariel Weisberg <ar...@weisberg.ws> wrote: >> >>> Whoops, 5 months ago, not six months ago. Much more reasonable to be >>> making this kind of fix. >>> >>> On Wed, Feb 19, 2025, at 3:56 PM, Ariel Weisberg wrote: >>> >>> Hi, >>> >>> This does not constitute a review, but I looked at both of them to >>> convince myself how they go about solving their respective problems is a >>> good idea. I am weakly +1. The risk reward is there, but 13 months since >>> 5.0 was released feels a little late to trying to improve node density >>> instead of just saying it is what it is. >>> >>> The +1 stems from the fact that if you really operate this at scale and >>> you are committed to node density then it's less of a nice to have and more >>> of a "some cluster falls over periodically now". >>> >>> Ariel >>> >>> On Fri, Feb 14, 2025, at 5:23 PM, Jordan West wrote: >>> >>> Thanks for the write up Mick. I think its is a great evaluation of >>> 15452. A few notes below: >>> >>> * CI links for 15452 might be burried and I may need to link the most >>> recent run (I’ve been using CircleCI since it’s what I’m familiar with — >>> happy to have runs on ASf hardware as well). >>> >>> * 15452 is configurable by YAML property as you mentioned. I’m traveling >>> but I’m 99% sure I made it a hot property as well (personally I believe you >>> shouldn’t have to restart a node to disable something like this, so if I >>> forgot to make it hot I will before merge). I’m traveling without my laptop >>> so I will double check next week (if someone has time to before then it’s >>> much appreciated) >>> >>> * I think “safe-ish” is a fair description of 15452. The new code is >>> fairly isolated but it’s in a very critical path with several call paths >>> leading to it. I think the testing we’ve done (both manual and automated) >>> as well as finding and fixing several bugs with those tests is what gives >>> me the confidence. And as you mentioned we have the fail safe of turning it >>> off otherwise. >>> >>> Jordan >>> >>> On Fri, Feb 14, 2025 at 12:42 Mick Semb Wever <m...@apache.org> wrote: >>> >>> Solid write up Jon! >>> >>> Hoping the committers and PMC members are keeping in mind this (very) >>> recent thread: >>> https://lists.apache.org/thread/h38g6q9d8h1q92h6qzs5tqdxpn2vmnyy >>> >>> This thread needs to also be about evaluating the risk these commits >>> are to a patch version. >>> I'm +1 and here's my thinking over it… >>> >>> Are the performance benefits clear ? >>> Yes (thank you Jon for doing a solid job at demonstrating just how >>> awesome this will be). >>> >>> Do we want this in 5.0 ? >>> That's the dumb question :) of course. >>> >>> Does it fix a bug or a regression, or is an operational improvement ? >>> Kinda, Branimir did mention a performance regression when the chunk >>> cache was disabled. >>> >>> How are the patches with unforeseen risks ? >>> These patches are medium sized and low-level, though isolated to (and >>> creating a better isolation of) the compaction (scan) reader. 15452 >>> looks safe-ish. I can't speak to 20092, it would be nice to hear >>> Branimir and Sylvain chime in. Are we confident that if any bugs do >>> arise in user's 5.0 production deployments we will be quick to provide >>> fixes and releases? I was thinking it's worth putting in a system >>> property that can disable the scan reader, so if anything happens >>> folks can just restart the node with the system property to return to >>> pre-patch behaviour, but that's already there! :) >>> >>> How well tested have they been ? >>> We have one (or a few) reports of it running in production, that's >>> super positive. Reports of lots of manual testing, super positive >>> too. But there's no CI results available for either patch. (Even the >>> one committed to trunk doesn't have pre-commit CI results available.) >>> I think CI results attached to the ticket are a must – I'm working on >>> adding them. >>> >>> >>> >>>