Re: Should we change 4.1 to G1 and offheap_objects ?

C. Scott Andreas Thu, 17 Nov 2022 07:35:45 -0800
Jon, thanks for flagging that I didn't get a reply to your question on the thread.

My main point in this thread is that I don't think post-beta is an appropriate time for a major prop change like this in
the release cycle. Ideally at this point in the release cycle, major contributors and large users of Cassandra are running
the build at minimum in pre-production environments, and hopefully in production environments too. Prop changes reset much
of what's been learned by exercising the beta shortly before RC.

Adding some detail on your question re: G1 - which mostly boils down to some experience to the contrary. I don't have
data from past tests easily accessible to me, so I'm writing from memory and deductive reasoning here.

The problem of garbage collection is minimizing a function of "memory overhead required to safely operate, program pause
time, and CPU time burnt." ParNew+CMS are throughput-oriented collectors that commonly have higher throughput, lower CPU
usage, and higher pause times than newer collectors like G1 and Shenandoah. This is a poor tradeoff for most applications.

Cassandra is unique
here: internode requests speculate, masking latency within the cluster that can be incurred by the pause phase of a
collection. The Java Driver is also great at speculating, masking latency of a coordinator that may be pausing for a
collection as well. While ParNew+CMS are an objectively poor choice for many systems, Cassandra's architecture as a
majority-quorum database that can speculate both at the client and coordinator level avoids the worst of those pitfalls.

In cases where I and my colleagues have evaluated other collectors like G1 and Shenandoah, we've found lower pause times,
~unchanged or slightly higher client latency, and lower throughput. G1 testing may predate me, so I'll offer a more recent
Shenandoah example. In a ~12-instance cluster that runs hot - averaging about 80% CPU - enabling Shenandoah resulted in
about 5-10% lower request throughput after a couple days and a roughly equal increase in latency. While its micro-pause
behavior was nice relative to ParNew's ~100-200ms pauses, it didn't make much of a difference due to internode and client
speculation around it.

Again, my point in this thread is that I
wouldn't alter defaults on the eve of an RC in a release cycle. We do know this will need to change soon. CMS is gone in
JDK17, so consider this email an elegy :). As part of JDK17 readiness, our collector defaults must change. If someone is
interested in picking up the work, I think now would be a great time to perform that measurement and propose new defaults
for the project based on it - and I don't even have an objection to those landing in a patchlevel release if the
measurements look really good.

But I wouldn't change the defaults on the eve of RC.

- Scott

On Nov 17, 2022, at 7:26 AM, Joseph Lynch
<joe.e.ly...@gmail.com> wrote:

I'm surprised we released 4.0 without changing the default to G1 given that many Cassandra deployments have changed the
project's default because it is incorrect. I know that 7486 broke a user 7 years ago, but I think we have had a ton of
testing since then in the community to build our confidence. Not to mention that Java 9+ (released 2017) made G1 the
default and Java 14 (2020) removes CMS entirely.

I have personally done targeted AB testing of G1GC vs CMS in a controlled fashion using NDBench and our team had enough
confidence in ~2019 to roll it to Netflix's entire fleet of O(1k) clusters and O(10k) instances running Java 8. We found
it vastly superior to CMS in practically every way (no more 10s+ compacting STW phases after heap fragmentation, better
tail latency at a coordinator/replica level, better average throughput, etc ...), and only identified a single very minor
p99 regression on one cluster (~5%) which we didn't consider severe enough to roll back.

Right now our project defaults are hurting 99 users to help 1; let that one user change the defaults? 4.1 seems like a
great place to fix the bug, absent being able to do that let's at least fix it in trunk?

-Joey

On Thu, Nov 17, 2022 at 8:27 AM Jon Haddad
<rustyrazorbl...@apache.org> wrote:

I noticed nobody answered my actual question - what would it take for you to be comfortable?

It seems that the need to do a release is now more important than the best interests of the new user's experience -
despite having plenty of *production* experience showing that what we ship isn't even remotely close to usable.

I tried to offer a compromise, and it's not cool with me that it was ignored by everyone objecting.

Jon

On 2022/11/17 08:34:53 Mick Semb Wever wrote:
> Ok, wrt G1 default, this won't go ahead for 4.1-rc1
>
> We can revisit it for 4.1.x
>
> We have a lot of voices here adamantly positive for it, and those of us
> that have done the performance testing over the years know why. But being
> called to prove it is totally valid, if you have data to any such tests
> please add them to the ticket 18027