Re: Should we change 4.1 to G1 and offheap_objects ?

Josh McKenzie Thu, 17 Nov 2022 13:00:10 -0800

> -1 on providing a bunch of choices and forcing users to pick one. We should 
> have a default and it should be “good enough” for most people.
These are 2 different things (providing choices and whether we provide a 
default).


Sounds like you're against both not having a default *and* providing choices 
independently; I assume you're not in favor of having something "good enough" 
as the default but also providing other tuning options should operators be 
interested in testing them out?

I could see there being potentially 3 tiers of operator expertise / interest in 
this space:
1) No interest. Give me a good enough default; I don't want to think about this.
2) Moderate expertise. Give me a one line config change where I can bounce 3 
nodes in a replica set to 3 different pre-configured profiles and see how it 
works for my workloads and pick one.
3) Expert. Leave me alone. I tune my own GC.

So the above is possibly moot if we don't have the resources on the project to 
*test and provide* alternative GC profiles, but it sounds to me like we're not 
actually short on differently tuned GC config but are instead butting up 
against timing relative to release + view on what the right default should be.
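
The tier 2 option above, a one-line config change that selects a pre-configured profile, could look roughly like the sketch below. It is only an illustration under stated assumptions: the `gc_profile` key, the `resolve_gc_options` helper, the `conf` directory, and the placeholder default are all hypothetical and not existing Cassandra settings. Only the `.options` file names come from the list proposed earlier in the thread.

```python
from pathlib import Path

# File names are the ones proposed earlier in the thread; the short
# profile names on the left are hypothetical.
GC_PROFILES = {
    "cms-write": "jvm11-CMS-write.options",
    "cms-mixed": "jvm11-CMS-mixed.options",
    "cms-read": "jvm11-CMS-read.options",
    "g1": "jvm11-G1.options",
    "zgc": "jvm11-ZGC.options",
    "shen": "jvm11-Shen.options",
}

# Placeholder only: the thread has not settled on what the default should be.
DEFAULT_PROFILE = "g1"


def resolve_gc_options(config: dict, conf_dir: str = "conf") -> Path:
    """Map the (hypothetical) one-line `gc_profile` setting to an options file.

    Tier 1 operators omit the key and get the default; tier 2 operators
    change one line; an unknown name fails fast instead of silently
    falling back.
    """
    name = str(config.get("gc_profile", DEFAULT_PROFILE)).lower()
    try:
        filename = GC_PROFILES[name]
    except KeyError:
        known = ", ".join(sorted(GC_PROFILES))
        raise ValueError(
            f"unknown gc_profile {name!r}; expected one of: {known}"
        ) from None
    return Path(conf_dir) / filename
```

With this shape, bouncing three nodes onto three profiles is a matter of setting `gc_profile` to a different name on each node, and tier 3 operators can ignore the mechanism and edit an options file directly.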

On Thu, Nov 17, 2022, at 3:47 PM, J. D. Jordan wrote:
> -1 on providing a bunch of choices and forcing users to pick one. We should 
> have a default and it should be “good enough” for most people. The people who 
> want to dig in and try other gc settings can still do it, and we could 
> provide them some profiles to start from, but there needs to be a default.  
> We need to be asking new operators fewer questions on install, not more.
> 
> Re: experience with Shenandoah under high load, I have in the past seen the 
> exact same thing for both Shenandoah and ZGC. Both of them have issues at 
> high loads while performing great at moderate loads. I have not seen G1 ever 
> have such issues. So I would not be fine with a switch to Shenandoah or ZGC 
> as the default without extensive testing on current JVM versions that have 
> hopefully improved the behavior under load.
> 
> > On Nov 17, 2022, at 9:39 AM, Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> > It seems like this is a choice most users might not know how to make?
> > 
> > On Thu, Nov 17, 2022 at 7:06 AM Josh McKenzie <jmcken...@apache.org> wrote:
> >> 
> >> Have we ever discussed including multiple profiles that are simple to swap 
> >> between and documented for their tested / intended use cases?
> >> 
> >> Then the burden of having a “sane” default for the wild variance of 
> >> workloads people use it for would be somewhat mitigated. Sure, there’s 
> >> always going to be folks that run the default and never think to change it 
> >> but the UX could be as simple as a one line config change to swap between 
> >> GC profiles and we could add and deprecate / remove over time.
> >> 
> >> Concretely, having config files such as:
> >> 
> >> jvm11-CMS-write.options
> >> jvm11-CMS-mixed.options
> >> jvm11-CMS-read.options
> >> jvm11-G1.options
> >> jvm11-ZGC.options
> >> jvm11-Shen.options
> >> 
> >> 
> >> Arguably we could take it a step further and not actually allow a C* node 
> >> to startup without pointing to one of the config files from your primary 
> >> config, and provide a clean mechanism to integrate that selection on 
> >> headless installs.
> >> 
> >> Notably, this could be a terrible idea. But it does seem like we keep 
> >> butting up against the complexity and mixed pressures of having the One 
> >> True Way to GC via the default config and the lift to change that.
> >> 
> >> On Wed, Nov 16, 2022, at 9:49 PM, Derek Chen-Becker wrote:
> >> 
> >> I'm fine with not including G1 in 4.1, but would we consider inclusion
> >> for 4.1.X down the road once validation has been done?
> >> 
> >> Derek
> >> 
> >> 
> >> On Wed, Nov 16, 2022 at 4:39 PM David Capwell <dcapw...@apple.com> wrote:
> >>> Getting poked in Slack to be more explicit in this thread…
> >>> Switching to G1 on trunk, +1
> >>> Switching to G1 on 4.1, -1.  4.1 is about to be released and this isn’t a
> >>> bug fix but a perf improvement ticket, and as such it should go through
> >>> validation that the perf improvements are seen. There is not enough time
> >>> left for that added performance work, so I strongly feel it should be
> >>> pushed to 4.2/5.0, where there is plenty of time to validate it.
> >>> The ticket even asks to avoid validating the claims, saying 'Hoping we
> >>> can skip due diligence on this ticket because the data is "in the past"
> >>> already'.  Others have attempted both Shenandoah and ZGC and found mixed
> >>> results, so nothing leads me to believe that won’t be true here either.
> >>>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <jeremiah.jor...@gmail.com> 
> >>>> wrote:
> >>>> Heap -
> >>>> +1 for G1 in trunk
> >>>> +0 for G1 in 4.1 - I think it’s worthwhile and fairly well tested but I 
> >>>> understand pushback against changing this so late in the game.
> >>>> Memtable -
> >>>> -1 for off heap in 4.1. I think this needs more testing and isn’t 
> >>>> something to change at the last minute.
> >>>> +1 for running performance/fuzz tests against the alternate memtable 
> >>>> choices in trunk and switching if they don’t show regressions.
> >>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jmcken...@apache.org> 
> >>>>> wrote:
> >>>>> 
> >>>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to 
> >>>>> prioritize digging into G1's behavior on small heaps vs. CMS w/our 
> >>>>> default tuning sooner rather than later. With that info I'd likely be a 
> >>>>> strong +1 on the shift.
> >>>>> -1 on switching to offheap_objects for 4.1 RC; again, think this is 
> >>>>> just a small step away from being a +1 w/some more rigor around seeing 
> >>>>> the current state of the technology's intersections.
> >>>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
> >>>>>> All right. I’ll clarify then.
> >>>>>> -0 on switching the default to G1 *this late* just before RC1.
> >>>>>> -1 on switching the default offheap_objects *for 4.1 RC1*, but all for 
> >>>>>> it in principle, for 4.2, after we run some more tests and resolve the 
> >>>>>> concerns raised by Jeff.
> >>>>>> Let’s please try to avoid this kind of super late defaults switch 
> >>>>>> going forward?
> >>>>>> —
> >>>>>> AY
> >>>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> 
> >>>>>>> wrote:
> >>>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
> >>>>>>> salt you think appropriate for a relative newcomer to the list, but
> >>>>>>> I've spent my last 7-8 years dealing with the intersection of
> >>>>>>> high-throughput, low latency systems and their interaction with GC and
> >>>>>>> in my personal experience G1 outperforms CMS in all cases and with
> >>>>>>> significantly less work (zero work, in many cases). The only things
> >>>>>>> I've seen perform better *with a similar heap footprint* are GenShen
> >>>>>>> (currently experimental) and Rust (beyond the scope of this topic).
> >>>>>>> Derek
> >>>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad 
> >>>>>>> <rustyrazorbl...@apache.org> wrote:
> >>>>>>>> I'm curious what it would take for folks to be OK with merging this 
> >>>>>>>> into 4.1?  How much additional time would you want to feel 
> >>>>>>>> comfortable?
> >>>>>>>> I should probably have been a little more vigorous in my +1 of 
> >>>>>>>> Mick's PR.  For a little background - I worked on several hundred 
> >>>>>>>> clusters while at TLP, mostly dealing with stability and performance 
> >>>>>>>> issues.  A lot of them stemmed partially or wholly from the GC 
> >>>>>>>> settings we ship in the project. ParNew with CMS and a small new gen 
> >>>>>>>> results in a lot of premature promotion leading to high pause times 
> >>>>>>>> into the hundreds of ms which pushes p99 latency through the roof.
> >>>>>>>> I'm a big +1 in favor of G1 because it's not just better for most 
> >>>>>>>> people but it's better for _every_ new Cassandra user.  The first 
> >>>>>>>> experience that people have with the project is important, and our 
> >>>>>>>> current GC settings are quite bad - so bad they lead to problems 
> >>>>>>>> with stability in production.  The G1 settings are mostly hands off, 
> >>>>>>>> result in shorter pause times and are a big improvement over the 
> >>>>>>>> status quo.
> >>>>>>>> Most folks don't do GC tuning, they use what we supply, and what we 
> >>>>>>>> currently supply leads to a poor initial experience with the 
> >>>>>>>> database.  I think we owe the community our best effort even if it 
> >>>>>>>> means pushing the release back a little bit.
> >>>>>>>> Just for some additional context, we're (Netflix) running 25K nodes 
> >>>>>>>> on G1 across a variety of hardware in AWS with wildly varying 
> >>>>>>>> workloads, and I haven't seen G1 be the root cause of a problem even 
> >>>>>>>> once.  The settings that Mick is proposing are almost identical to 
> >>>>>>>> what we use (we use half of heap up to 30GB).
> >>>>>>>> I'd really appreciate it if we took a second to consider the 
> >>>>>>>> community effect of another release that ships settings that cause 
> >>>>>>>> significant pain for our users.
> >>>>>>>> Jon
> >>>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
> >>>>>>>>>> In the case of GC, reasonably extensive performance testing should be
> >>>>>>>>>> the expectation. Potentially revisiting some of the G1 params for the
> >>>>>>>>>> 4.1 reality - quite a lot has changed since those optional defaults
> >>>>>>>>>> were picked.
> >>>>>>>>> I've put our battle-tested g1 opts (from consultants at TLP and 
> >>>>>>>>> DataStax)
> >>>>>>>>> in the patch for CASSANDRA-18027
> >>>>>>>>> In reality it is really not much of a change, g1 does make it 
> >>>>>>>>> simple.
> >>>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads and the floor to
> >>>>>>>>> the new heap (-XX:NewSize) is still required, though we could do a much
> >>>>>>>>> better job of dynamic defaults for them.
> >>>>>>>>> Alex Dejanovski's blog is a starting point:
> >>>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
> >>>>>>>>> where this gc opt set was used (though it doesn't prove why those 
> >>>>>>>>> options
> >>>>>>>>> are chosen)
> >>>>>>>>> The bar for objection to sneaking these into 4.1 was intended to be 
> >>>>>>>>> low,
> >>>>>>>>> and I stand by those that raise concerns.
> >>>>>>> --
> >>>>>>> +---------------------------------------------------------------+
> >>>>>>> | Derek Chen-Becker                                             |
> >>>>>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> >>>>>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>>>>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
> >>>>>>> +---------------------------------------------------------------+
> >> 
> >> 
>
