Re: Merging compaction improvements to 5.0

Chris Lohfink Tue, 15 Apr 2025 09:07:30 -0700

+1

On Sun, Apr 13, 2025 at 12:32 PM Jordan West <jw...@apache.org> wrote:


> Hi Folks,
>
> A bit delayed but I have the backport for 20092 ready. The branch can be
> found here:
> https://github.com/apache/cassandra/compare/cassandra-5.0...jrwest:cassandra:jwest/20092-5.0-backport.
> I've run tests and all looked good. I plan to do one more run post a recent
> rebase. Links are in CASSANDRA-20092. Before merging I wanted to sunshine
> and get community input on the following:
>
> * I've also included CASSANDRA-20396 in the backport. This looks to be a
> fix on top of CASSANDRA-20092 that we would want in the backport. I plan to
> leave these as two separate commits on cassandra-5.0, not one merged
> commit. If you feel differently please let me know.
> * CHANGES.txt: i've included both patches in 5.0.5 section of CHANGES.txt.
> They also appear under 5.1 in trunk. Since trunk isn't released I don't see
> any issue here but if folks have strong opinions on this lmk. I don't and
> can make any requested changes.
>
> * Along the way I discovered CASSANDRA-20538, a minor 1-line change needed
> to actually take advantage of 20092+15452 when BTI is enabled. I have
> included the patch for this in the backport as its largely a ninja fix. If
> folks feel strongly about addressing both 5.0 and trunk with 20538 instead
> of one coming via backport and one coming via the JIRA I can remove that
> commit from my backport (I am also not sure the best way to commit this in
> the backport, so I sort of lean to immediately addressing it in 20538. its
> largely moot since both would be included in the release either way).
>
> * For the backport I had to adjust one test (
> https://github.com/apache/cassandra/commit/0039ebf915b66a88234325a74ef6edd18bde6da0).
> From what I can tell this is just because the compression code is less
> likely to find an aligned key on 5.0 (I haven't dug too much into why now
> that the test reliably passes with this minor tweak). Plan to squash this
> into the backport commit for CASSANDRA-20092 unless folks object.
>
> Jordan
>
> On Wed, Feb 19, 2025 at 1:35 PM Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Whoops, 5 months ago, not six months ago. Much more reasonable to be
>> making this kind of fix.
>>
>> On Wed, Feb 19, 2025, at 3:56 PM, Ariel Weisberg wrote:
>>
>> Hi,
>>
>> This does not constitute a review, but I looked at both of them to
>> convince myself how they go about solving their respective problems is a
>> good idea. I am weakly +1. The risk reward is there, but 13 months since
>> 5.0 was released feels a little late to trying to improve node density
>> instead of just saying it is what it is.
>>
>> The +1 stems from the fact that if you really operate this at scale and
>> you are committed to node density then it's less of a nice to have and more
>> of a "some cluster falls over periodically now".
>>
>> Ariel
>>
>> On Fri, Feb 14, 2025, at 5:23 PM, Jordan West wrote:
>>
>> Thanks for the write up Mick. I think its is a great evaluation of 15452.
>> A few notes below:
>>
>> * CI links for 15452 might be burried and I may need to link the most
>> recent run (I’ve been using CircleCI since it’s what I’m familiar with —
>> happy to have runs on ASf hardware as well).
>>
>> * 15452 is configurable by YAML property as you mentioned. I’m traveling
>> but I’m 99% sure I made it a hot property as well (personally I believe you
>> shouldn’t have to restart a node to disable something like this, so if I
>> forgot to make it hot I will before merge). I’m traveling without my laptop
>> so I will double check next week (if someone has time to before then it’s
>> much appreciated)
>>
>>  * I think “safe-ish” is a fair description of 15452. The new code is
>> fairly isolated but it’s in a very critical path with several call paths
>> leading to it. I think the testing we’ve done (both manual and automated)
>> as well as finding and fixing several bugs with those tests is what gives
>> me the confidence. And as you mentioned we have the fail safe of turning it
>> off otherwise.
>>
>> Jordan
>>
>> On Fri, Feb 14, 2025 at 12:42 Mick Semb Wever <m...@apache.org> wrote:
>>
>> Solid write up Jon!
>>
>> Hoping the committers and PMC members are keeping in mind this (very)
>> recent thread:
>> https://lists.apache.org/thread/h38g6q9d8h1q92h6qzs5tqdxpn2vmnyy
>>
>> This thread needs to also be about evaluating the risk these commits
>> are to a patch version.
>> I'm +1 and here's my thinking over it…
>>
>> Are the performance benefits clear ?
>> Yes (thank you Jon for doing a solid job at demonstrating just how
>> awesome this will be).
>>
>> Do we want this in 5.0 ?
>> That's the dumb question :) of course.
>>
>> Does it fix a bug or a regression, or is an operational improvement ?
>> Kinda, Branimir did mention a performance regression when the chunk
>> cache was disabled.
>>
>> How are the patches with unforeseen risks ?
>> These patches are medium sized and low-level, though isolated to (and
>> creating a better isolation of) the compaction (scan) reader.  15452
>> looks safe-ish.  I can't speak to 20092, it would be nice to hear
>> Branimir and Sylvain chime in.  Are we confident that if any bugs do
>> arise in user's 5.0 production deployments we will be quick to provide
>> fixes and releases?  I was thinking it's worth putting in a system
>> property that can disable the scan reader, so if anything happens
>> folks can just restart the node with the system property to return to
>> pre-patch behaviour, but that's already there! :)
>>
>> How well tested have they been ?
>> We have one (or a few) reports of it running in production, that's
>> super positive.  Reports of lots of manual testing, super positive
>> too.  But there's no CI results available for either patch.  (Even the
>> one committed to trunk doesn't have pre-commit CI results available.)
>> I think CI results attached to the ticket are a must – I'm working on
>> adding them.
>>
>>
>>
>>

Re: Merging compaction improvements to 5.0

Reply via email to