Yes, offheap_objects is the way to go… you can tell if memtables are your problem because you'll see promotion failures of objects sized 131074 dwords.
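For reference, the switch itself is just a cassandra.yaml change on 2.1.x; a minimal sketch (knob names as in the stock 2.1 yaml, sizes purely illustrative, and both space settings default to 1/4 of the heap if left out):

  memtable_allocation_type: offheap_objects
  # optional caps; each defaults to 1/4 of the heap
  memtable_heap_space_in_mb: 2048
  memtable_offheap_space_in_mb: 2048

And if you want to confirm in the GC log that it really is the memtable slabs failing to promote, something like this in cassandra-env.sh should make ParNew report the size of the allocation that failed (assumes the stock CMS/ParNew setup; -XX:+PrintPromotionFailure is a standard HotSpot flag, not a Cassandra one):

  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
  JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"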
If your h/w is fast enough, make your young gen as big as possible - we can collect 8G in sub-second always, and this gives you the best chance that transient objects (especially if you still have thrift clients) die young instead of leaking into the old gen (a rough cassandra-env.sh sketch of what this ends up looking like is at the bottom of this mail). Moving from 2.0.x to 2.1.x (and off-heap memtables), we have reduced our old gen from 16G down to 12G and will keep shrinking it, but have had no promotion failures yet, and it's been several months. Note that we are running a patched 2.1.3, but 2.1.5 has the equivalent important bugs fixed (bugs that might have been giving you memory issues).

> On Jun 1, 2015, at 3:00 PM, Carl Hu <m...@carlhu.com> wrote:
>
> Thank you for the suggestion. After analysis of your settings, the basic hypothesis here is that promotion to Old Gen happens very quickly because of a rapid accumulation of heap usage due to memtables. We happen to be running on 2.1, and I thought a more conservative approach than your (quite aggressive) gc settings is to try the new memtable_allocation_type with offheap_objects and see if the memtable pressure is relieved sufficiently that the standard gc settings can keep up.
>
> The experiment is in progress and I will report back with the results.
>
> On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
> We have a write-heavy workload and used to face promotion failures / long gc pauses with Cassandra 2.0.x. I am not into the code yet, but I think that memtable and compaction related objects have a mid-life, and a write-heavy workload is not well suited to generational collection by default. So we tuned the JVM to make sure that as few objects as possible are promoted to Old Gen, and achieved great success with that:
>
> MAX_HEAP_SIZE="12G"
> HEAP_NEWSIZE="3G"
> -XX:SurvivorRatio=2
> -XX:MaxTenuringThreshold=20
> -XX:CMSInitiatingOccupancyFraction=70
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000"
> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>
> We also think that the default total_memtable_space_in_mb = 1/4 heap is too much for write-heavy loads. By default, young gen is also 1/4 heap. We reduced it to 1000mb in order to make sure that memtable related objects don't stay in memory for too long. Combining this with SurvivorRatio=2 and MaxTenuringThreshold=20 did the job well. GC was very consistent. No Full GC observed.
>
> Environment: 3 node cluster, each node with 24 cores, 64G RAM and SSDs in RAID5. We are making around 12k writes/sec into 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of the 3 node cluster. 2 CFs have wide rows with max data of around 100mb per row.
>
> Yes. A node being marked down has a cascading effect. Within seconds all nodes in our cluster are marked down.
>
> Thanks
> Anuj Wadehra
>
> On Monday, 1 June 2015 7:12 PM, Carl Hu <m...@carlhu.com> wrote:
>
> We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a problem where the entire cluster slows down for 2.5 minutes when one node experiences a 17 second stop-the-world gc.
> These gc's happen once every 2 hours. I did find a ticket that seems related to this: https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis has resolved that ticket.
>
> We are running standard gc settings, but this ticket is not so much concerned with the 17 second gc on a single node (after all, we have 14 others) as with the cascading performance problem.
>
> We are running standard values of dynamic_snitch_badness_threshold (0.1) and phi_convict_threshold (8). (These values are relevant for the dynamic snitch routing requests away from the frozen node, or the failure detector marking the node as 'down'.)
>
> We use the python client in default round robin mode, so all clients hit the coordinators on all nodes in round robin. One theory is that since the coordinator on every node must hit the frozen node at some point during the 17 seconds, each node's request queue fills up, and the entire cluster thus freezes up. That would explain a 17 second freeze but would not explain the 2.5 minute slowdown (10x increase in request latency @ P50).
>
> I'd love your thoughts. I've provided the GC chart here.
>
> Carl
>
> <d2c95dce-0848-11e5-91f7-6b223349fc14.png>
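P.S. In case it helps anyone trying the big-new-gen route described above, here is the rough cassandra-env.sh sketch I mentioned. The generation sizes are only illustrative of the ratios discussed in this thread - roughly a 12G old gen plus a new gen large enough to hold the transient garbage - and the survivor/tenuring flags are the ones from Anuj's list, not something I am prescribing:

  MAX_HEAP_SIZE="20G"   # leaves roughly a 12G old gen once the new gen is carved out
  HEAP_NEWSIZE="8G"     # big new gen; on fast h/w ParNew still collects this sub-second
  JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"            # large survivor spaces
  JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20"    # age objects longer before promotion
  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
  JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"   # young gc before remark to shorten that pause

The point of the sketch is simply the shape: a new gen big enough that short-lived request/memtable garbage dies there, with enough survivor space and tenuring that the little which survives a collection still rarely reaches the old gen.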