+1 On Sun, Apr 13, 2025 at 12:32 PM Jordan West <jw...@apache.org> wrote:
> Hi Folks, > > A bit delayed but I have the backport for 20092 ready. The branch can be > found here: > https://github.com/apache/cassandra/compare/cassandra-5.0...jrwest:cassandra:jwest/20092-5.0-backport. > I've run tests and all looked good. I plan to do one more run post a recent > rebase. Links are in CASSANDRA-20092. Before merging I wanted to sunshine > and get community input on the following: > > * I've also included CASSANDRA-20396 in the backport. This looks to be a > fix on top of CASSANDRA-20092 that we would want in the backport. I plan to > leave these as two separate commits on cassandra-5.0, not one merged > commit. If you feel differently please let me know. > * CHANGES.txt: i've included both patches in 5.0.5 section of CHANGES.txt. > They also appear under 5.1 in trunk. Since trunk isn't released I don't see > any issue here but if folks have strong opinions on this lmk. I don't and > can make any requested changes. > > * Along the way I discovered CASSANDRA-20538, a minor 1-line change needed > to actually take advantage of 20092+15452 when BTI is enabled. I have > included the patch for this in the backport as its largely a ninja fix. If > folks feel strongly about addressing both 5.0 and trunk with 20538 instead > of one coming via backport and one coming via the JIRA I can remove that > commit from my backport (I am also not sure the best way to commit this in > the backport, so I sort of lean to immediately addressing it in 20538. its > largely moot since both would be included in the release either way). > > * For the backport I had to adjust one test ( > https://github.com/apache/cassandra/commit/0039ebf915b66a88234325a74ef6edd18bde6da0). > From what I can tell this is just because the compression code is less > likely to find an aligned key on 5.0 (I haven't dug too much into why now > that the test reliably passes with this minor tweak). Plan to squash this > into the backport commit for CASSANDRA-20092 unless folks object. > > Jordan > > On Wed, Feb 19, 2025 at 1:35 PM Ariel Weisberg <ar...@weisberg.ws> wrote: > >> Whoops, 5 months ago, not six months ago. Much more reasonable to be >> making this kind of fix. >> >> On Wed, Feb 19, 2025, at 3:56 PM, Ariel Weisberg wrote: >> >> Hi, >> >> This does not constitute a review, but I looked at both of them to >> convince myself how they go about solving their respective problems is a >> good idea. I am weakly +1. The risk reward is there, but 13 months since >> 5.0 was released feels a little late to trying to improve node density >> instead of just saying it is what it is. >> >> The +1 stems from the fact that if you really operate this at scale and >> you are committed to node density then it's less of a nice to have and more >> of a "some cluster falls over periodically now". >> >> Ariel >> >> On Fri, Feb 14, 2025, at 5:23 PM, Jordan West wrote: >> >> Thanks for the write up Mick. I think its is a great evaluation of 15452. >> A few notes below: >> >> * CI links for 15452 might be burried and I may need to link the most >> recent run (I’ve been using CircleCI since it’s what I’m familiar with — >> happy to have runs on ASf hardware as well). >> >> * 15452 is configurable by YAML property as you mentioned. I’m traveling >> but I’m 99% sure I made it a hot property as well (personally I believe you >> shouldn’t have to restart a node to disable something like this, so if I >> forgot to make it hot I will before merge). I’m traveling without my laptop >> so I will double check next week (if someone has time to before then it’s >> much appreciated) >> >> * I think “safe-ish” is a fair description of 15452. The new code is >> fairly isolated but it’s in a very critical path with several call paths >> leading to it. I think the testing we’ve done (both manual and automated) >> as well as finding and fixing several bugs with those tests is what gives >> me the confidence. And as you mentioned we have the fail safe of turning it >> off otherwise. >> >> Jordan >> >> On Fri, Feb 14, 2025 at 12:42 Mick Semb Wever <m...@apache.org> wrote: >> >> Solid write up Jon! >> >> Hoping the committers and PMC members are keeping in mind this (very) >> recent thread: >> https://lists.apache.org/thread/h38g6q9d8h1q92h6qzs5tqdxpn2vmnyy >> >> This thread needs to also be about evaluating the risk these commits >> are to a patch version. >> I'm +1 and here's my thinking over it… >> >> Are the performance benefits clear ? >> Yes (thank you Jon for doing a solid job at demonstrating just how >> awesome this will be). >> >> Do we want this in 5.0 ? >> That's the dumb question :) of course. >> >> Does it fix a bug or a regression, or is an operational improvement ? >> Kinda, Branimir did mention a performance regression when the chunk >> cache was disabled. >> >> How are the patches with unforeseen risks ? >> These patches are medium sized and low-level, though isolated to (and >> creating a better isolation of) the compaction (scan) reader. 15452 >> looks safe-ish. I can't speak to 20092, it would be nice to hear >> Branimir and Sylvain chime in. Are we confident that if any bugs do >> arise in user's 5.0 production deployments we will be quick to provide >> fixes and releases? I was thinking it's worth putting in a system >> property that can disable the scan reader, so if anything happens >> folks can just restart the node with the system property to return to >> pre-patch behaviour, but that's already there! :) >> >> How well tested have they been ? >> We have one (or a few) reports of it running in production, that's >> super positive. Reports of lots of manual testing, super positive >> too. But there's no CI results available for either patch. (Even the >> one committed to trunk doesn't have pre-commit CI results available.) >> I think CI results attached to the ticket are a must – I'm working on >> adding them. >> >> >> >>