Hi, This does not constitute a review, but I looked at both of them to convince myself how they go about solving their respective problems is a good idea. I am weakly +1. The risk reward is there, but 13 months since 5.0 was released feels a little late to trying to improve node density instead of just saying it is what it is.
The +1 stems from the fact that if you really operate this at scale and you are committed to node density then it's less of a nice to have and more of a "some cluster falls over periodically now". Ariel On Fri, Feb 14, 2025, at 5:23 PM, Jordan West wrote: > Thanks for the write up Mick. I think its is a great evaluation of 15452. A > few notes below: > > * CI links for 15452 might be burried and I may need to link the most recent > run (I’ve been using CircleCI since it’s what I’m familiar with — happy to > have runs on ASf hardware as well). > > * 15452 is configurable by YAML property as you mentioned. I’m traveling but > I’m 99% sure I made it a hot property as well (personally I believe you > shouldn’t have to restart a node to disable something like this, so if I > forgot to make it hot I will before merge). I’m traveling without my laptop > so I will double check next week (if someone has time to before then it’s > much appreciated) > > * I think “safe-ish” is a fair description of 15452. The new code is fairly > isolated but it’s in a very critical path with several call paths leading to > it. I think the testing we’ve done (both manual and automated) as well as > finding and fixing several bugs with those tests is what gives me the > confidence. And as you mentioned we have the fail safe of turning it off > otherwise. > > Jordan > > On Fri, Feb 14, 2025 at 12:42 Mick Semb Wever <m...@apache.org> wrote: >> Solid write up Jon! >> >> Hoping the committers and PMC members are keeping in mind this (very) >> recent thread: >> https://lists.apache.org/thread/h38g6q9d8h1q92h6qzs5tqdxpn2vmnyy >> >> This thread needs to also be about evaluating the risk these commits >> are to a patch version. >> I'm +1 and here's my thinking over it… >> >> Are the performance benefits clear ? >> Yes (thank you Jon for doing a solid job at demonstrating just how >> awesome this will be). >> >> Do we want this in 5.0 ? >> That's the dumb question :) of course. >> >> Does it fix a bug or a regression, or is an operational improvement ? >> Kinda, Branimir did mention a performance regression when the chunk >> cache was disabled. >> >> How are the patches with unforeseen risks ? >> These patches are medium sized and low-level, though isolated to (and >> creating a better isolation of) the compaction (scan) reader. 15452 >> looks safe-ish. I can't speak to 20092, it would be nice to hear >> Branimir and Sylvain chime in. Are we confident that if any bugs do >> arise in user's 5.0 production deployments we will be quick to provide >> fixes and releases? I was thinking it's worth putting in a system >> property that can disable the scan reader, so if anything happens >> folks can just restart the node with the system property to return to >> pre-patch behaviour, but that's already there! :) >> >> How well tested have they been ? >> We have one (or a few) reports of it running in production, that's >> super positive. Reports of lots of manual testing, super positive >> too. But there's no CI results available for either patch. (Even the >> one committed to trunk doesn't have pre-commit CI results available.) >> I think CI results attached to the ticket are a must – I'm working on >> adding them.