Re: Should we change 4.1 to G1 and offheap_objects ?

C. Scott Andreas Wed, 16 Nov 2022 17:40:31 -0800

I share David and Aleksey’s views on this.

We shouldn’t make major defaults changes right before RC. It might be worth
adding a release note recommending that users try them, though, and noting that
they may become the defaults in a future release.


— Scott

> On Nov 16, 2022, at 3:38 PM, David Capwell <dcapw...@apple.com> wrote:
> 
> Getting poked in Slack to be more explicit in this thread… 
> 
> Switching to G1 on trunk, +1
> Switching to G1 on 4.1, -1. 4.1 is about to be released, and this isn’t a bug
> fix but a perf improvement ticket; as such, it should go through validation
> that the perf improvements are actually seen. There is not enough time left
> for that added performance work, so I strongly feel it should be pushed to
> 4.2/5.0, where there is plenty of time to validate it. The ticket even asks to
> avoid validating the claims, saying 'Hoping we can skip due diligence on this
> ticket because the data is "in the past" already'. Others have tried both
> Shenandoah and ZGC and found mixed results, so nothing leads me to believe
> that won’t be true here either.
> 
>> On Nov 16, 2022, at 9:15 AM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>> 
>> Heap -
>> +1 for G1 in trunk
>> +0 for G1 in 4.1: I think it’s worthwhile and fairly well tested, but I
>> understand the pushback against changing this so late in the game.
>> 
>> Memtable -
>> -1 for off heap in 4.1. I think this needs more testing and isn’t something 
>> to change at the last minute.
>> +1 for running performance/fuzz tests against the alternate memtable choices 
>> in trunk and switching if they don’t show regressions.
>> 
>>>> On Nov 16, 2022, at 10:48 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>> 
>>> 
>>> To clarify: -0 here on G1 as default for 4.1 as well; I'd like us to
>>> prioritize digging into G1's behavior on small heaps vs. CMS with our
>>> default tuning sooner rather than later. With that info I'd likely be a
>>> strong +1 on the shift.
>>> 
>>> -1 on switching to offheap_objects for 4.1 RC; again, I think this is just
>>> a small step away from being a +1 with some more rigor around seeing the
>>> current state of the technology's intersections.
>>> 
>>> On Wed, Nov 16, 2022, at 7:47 AM, Aleksey Yeshchenko wrote:
>>>> All right. I’ll clarify then.
>>>> 
>>>> -0 on switching the default to G1 *this late* just before RC1.
>>>> -1 on switching the default to offheap_objects *for 4.1 RC1*, but all for
>>>> it in principle, for 4.2, after we run some more tests and resolve the
>>>> concerns raised by Jeff.
>>>> 
>>>> Let’s please try to avoid this kind of super late defaults switch going 
>>>> forward?
>>>> 
>>>> —
>>>> AY
>>>> 
>>>>> On 16 Nov 2022, at 03:27, Derek Chen-Becker <de...@chen-becker.org> wrote:
>>>>> 
>>>>> For the record, I'm +100 on G1. Take it with whatever sized grain of
>>>>> salt you think appropriate for a relative newcomer to the list, but
>>>>> I've spent the last 7-8 years dealing with high-throughput, low-latency
>>>>> systems and their interaction with GC, and in my personal experience G1
>>>>> outperforms CMS in all cases, with significantly less work (zero work, in
>>>>> many cases). The only things I've seen perform better *with a similar
>>>>> heap footprint* are GenShen (currently experimental) and Rust (beyond the
>>>>> scope of this topic).
>>>>> 
>>>>> Derek
>>>>> 
>>>>> On Tue, Nov 15, 2022 at 4:51 PM Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>> 
>>>>>> What would it take for folks to be OK with merging this into 4.1? How
>>>>>> much additional time would you need to feel comfortable?
>>>>>> 
>>>>>> I should probably have been a little more vigorous in my +1 of Mick's
>>>>>> PR. For a little background: I worked on several hundred clusters
>>>>>> while at TLP, mostly dealing with stability and performance issues. A
>>>>>> lot of them stemmed partially or wholly from the GC settings we ship in
>>>>>> the project. ParNew with CMS and a small new gen results in a lot of
>>>>>> premature promotion, leading to pause times in the hundreds of ms,
>>>>>> which pushes p99 latency through the roof.
>>>>>> 
>>>>>> I'm a big +1 in favor of G1 because it's not just better for most
>>>>>> people; it's better for _every_ new Cassandra user. The first experience
>>>>>> that people have with the project is important, and our current GC
>>>>>> settings are quite bad, so bad that they lead to stability problems in
>>>>>> production. The G1 settings are mostly hands-off, result in shorter
>>>>>> pause times, and are a big improvement over the status quo.
>>>>>> 
>>>>>> Most folks don't do GC tuning; they use what we supply, and what we
>>>>>> currently supply leads to a poor initial experience with the database.
>>>>>> I think we owe the community our best effort, even if it means pushing
>>>>>> the release back a little bit.
>>>>>> 
>>>>>> Just for some additional context, we're (Netflix) running 25K nodes on 
>>>>>> G1 across a variety of hardware in AWS with wildly varying workloads, 
>>>>>> and I haven't seen G1 be the root cause of a problem even once.  The 
>>>>>> settings that Mick is proposing are almost identical to what we use (we 
>>>>>> use half of heap up to 30GB).
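A minimal sketch of the heap-sizing rule Jon mentions, assuming "half of heap up to 30GB" means a max heap of half the machine's physical RAM, capped at 30 GiB. The class and method names are hypothetical, and this is not Cassandra's actual heap-sizing logic.

```java
// Hypothetical helper illustrating "half of RAM for heap, capped at 30 GiB".
// Not Cassandra's actual heap-sizing code; names are made up for this sketch.
public final class HeapSizing {
    private static final long GIB = 1024L * 1024L * 1024L;
    // Assumed cap, reading the thread's "up to 30GB" as 30 GiB.
    private static final long CAP_BYTES = 30L * GIB;

    private HeapSizing() {}

    /** Returns half of the given physical RAM, never more than the 30 GiB cap. */
    static long maxHeapBytes(long physicalRamBytes) {
        if (physicalRamBytes <= 0) {
            throw new IllegalArgumentException("physical RAM must be positive");
        }
        return Math.min(physicalRamBytes / 2, CAP_BYTES);
    }

    public static void main(String[] args) {
        long[] ramGiB = {16, 32, 64, 128};
        for (long ram : ramGiB) {
            long heapGiB = maxHeapBytes(ram * GIB) / GIB;
            // Print the equivalent -Xmx flag for each machine size.
            System.out.printf("RAM %d GiB -> -Xmx%dg%n", ram, heapGiB);
        }
    }
}
```

For 16, 32, 64 and 128 GiB of RAM this yields heaps of 8, 16, 30 and 30 GiB. A cap below 32 GB is a common choice because HotSpot can only use compressed object pointers for heaps under roughly 32 GB; the thread does not say whether that is the reason for this particular cap.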
>>>>>> 
>>>>>> I'd really appreciate it if we took a second to consider the community 
>>>>>> effect of another release that ships settings that cause significant 
>>>>>> pain for our users.
>>>>>> 
>>>>>> Jon
>>>>>> 
>>>>>> On 2022/11/10 21:49:36 Mick Semb Wever wrote:
>>>>>>>> 
>>>>>>>> In the case of GC, reasonably extensive performance testing should be
>>>>>>>> the expectation, potentially revisiting some of the G1 params for the
>>>>>>>> 4.1 reality; quite a lot has changed since those optional defaults
>>>>>>>> were picked.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I've put our battle-tested G1 opts (from consultants at TLP and DataStax)
>>>>>>> in the patch for CASSANDRA-18027.
>>>>>>> 
>>>>>>> In reality it is not much of a change; G1 does make it simple.
>>>>>>> Picking the correct ParallelGCThreads and ConcGCThreads, and a floor for
>>>>>>> the new gen size (-XX:NewSize), is still required, though we could do a
>>>>>>> much better job of providing dynamic defaults for them.
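"Dynamic defaults" for the GC thread counts could look roughly like the sketch below, which derives them from the CPUs visible to the JVM. It reuses, as far as I know, HotSpot's own built-in heuristics on common platforms (every CPU up to 8, then 5/8 of each additional CPU for ParallelGCThreads; about a quarter of that, minimum one, for G1's ConcGCThreads) purely as a baseline. The class and method names are hypothetical, this is not the CASSANDRA-18027 patch, and it does not attempt the -XX:NewSize floor.

```java
// Hypothetical "dynamic defaults" for G1 GC thread counts (not from the CASSANDRA-18027 patch).
// The formulas mirror HotSpot's built-in defaults and serve only as a starting point.
public final class GcThreadDefaults {
    private GcThreadDefaults() {}

    /** All CPUs up to 8, then 5/8 of each CPU beyond 8 (integer arithmetic). */
    static int parallelGcThreads(int cpus) {
        if (cpus < 1) {
            throw new IllegalArgumentException("cpus must be >= 1");
        }
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    /** Roughly a quarter of the parallel workers, but at least one thread. */
    static int concGcThreads(int parallelThreads) {
        return Math.max((parallelThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        int parallel = parallelGcThreads(cpus);
        int conc = concGcThreads(parallel);
        // Print the flags in the form they would appear in a JVM options file.
        System.out.println("-XX:ParallelGCThreads=" + parallel);
        System.out.println("-XX:ConcGCThreads=" + conc);
    }
}
```

On a 16-CPU host this gives ParallelGCThreads=13 and ConcGCThreads=3. Computing the values at startup rather than hard-coding them is what would let a single default fit both small and large instances.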
>>>>>>> 
>>>>>>> Alex Dejanovski's blog post is a starting point:
>>>>>>> https://thelastpickle.com/blog/2020/06/29/cassandra_4-0_garbage_collectors_performance_benchmarks.html
>>>>>>> This GC option set was used there, though the post doesn't explain why
>>>>>>> those options were chosen.
>>>>>>> 
>>>>>>> The bar for objecting to sneaking these into 4.1 was intended to be low,
>>>>>>> and I stand by those who raise concerns.
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>
