Another reason for memtables to be kept in memory is wide rows.
Maybe someone can chime in and confirm or not, but I believe wide rows (in
the Thrift sense) need to be synced entirely across nodes. So from the
numbers you gave, a node can send ~100 MB over the network for a single row.
With compaction and other stuff, that may be an issue, as these objects can
stay long enough in the heap to survive a collection.
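
To get a feel for how big those rows actually are on disk, something along
these lines should do (2.0-era nodetool; exact argument forms vary a bit by
version):

   nodetool cfhistograms <keyspace> <column_family>   # row size percentiles
   nodetool cfstats      # grep for the CF and "Compacted row maximum size"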

Think about the row cache too: with wide rows, Cassandra will hold the rows
on heap a bit longer while it serializes the data into the off-heap row cache
(in 2.0.x, not sure about other versions). See this page:
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_configuring_caches_c.html
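
If the row cache is enabled, the knobs look roughly like this (2.0.x syntax
from memory; keyspace/table names below are placeholders):

   # cassandra.yaml
   row_cache_size_in_mb: 200       # 0, the default, disables the row cache

   -- per table, in CQL
   ALTER TABLE my_ks.my_wide_cf WITH caching = 'keys_only';
   -- 'rows_only' or 'all' would drag whole wide rows through the heap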

-- Brice

On Wed, Apr 22, 2015 at 2:47 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
wrote:

> Any other suggestions on the JVM tuning and Cassandra config we did to
> solve the promotion failures during GC?
>
> I would appreciate it if someone could try to answer the queries mentioned
> in the initial mail.
>
> Thanks
> Anuj Wadehra
>
> ------------------------------
>   *From*:"Anuj Wadehra" <anujw_2...@yahoo.co.in>
> *Date*:Wed, 22 Apr, 2015 at 6:12 pm
>
> *Subject*:Re: Handle Write Heavy Loads in Cassandra 2.0.3
>
> Thanks Brice for all the comments.
>
> We analyzed GC logs and a heap dump before tuning the JVM and GC. With the
> new JVM config I specified, we were able to remove the promotion failures
> seen with the default config. From the heap dump I got the idea that
> memtables and compaction are the biggest culprits.
>
> CASSANDRA-6142 talks about multithreaded_compaction but we are using
> concurrent_compactors. I think they are different. On nodes with many cores
> it is usually recommended to run cores/2 concurrent compactors. I don't
> think 10 vs 12 would make a big difference.
>
> For now, we have kept compaction throughput at 24 as we already have
> scenarios which create heap pressure due to heavy read/write load. Yes, we
> can think of increasing it on SSDs.
>
> We have already enabled trickle fsync.
>
> The justification behind increasing MaxTenuringThreshold and the young gen
> size, and creating a large survivor space, is to GC most memtables in the
> young gen itself. To make sure that memtables are smaller and not kept too
> long in the heap, we have reduced memtable_total_space_in_mb to 1 GB from
> the default of heap size/4. We flush a memtable to disk approximately every
> 15 sec and our minor collections run every 3-7 secs, so it's highly
> probable that most memtables will be collected in the young gen. The idea
> is that most short-lived and medium-lifetime objects should not reach the
> old gen, otherwise CMS old gen collections would be very frequent and more
> expensive, as they may not collect memtables, and fragmentation would be
> higher.
>
> I think wide rows less than 100 MB shouldn't be a problem. Cassandra in
> fact provides a very good wide row format suitable for time series and
> other scenarios. The question is: when my in_memory_compaction_limit_in_mb
> is 125 MB, why is Cassandra printing "Compacting large row" when the row is
> less than 100 MB?
>
>
>
> Thanks
> Anuj Wadehra
>
> ------------------------------
>   *From*:"Brice Dutheil" <brice.duth...@gmail.com>
> *Date*:Wed, 22 Apr, 2015 at 3:52 am
> *Subject*:Re: Handle Write Heavy Loads in Cassandra 2.0.3
>
> Hi, I cannot really answer your question with rock-solid truth.
>
> When we had problems, we did mainly two things:
>
>    - Analyzed the GC logs (with censum from jClarity; this tool IS really
>    awesome, it's a good investment, even more so if production is running
>    other Java applications)
>    - Heap-dumped Cassandra when there was a GC; this helped in narrowing
>    down the actual issue
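>
> If GC logging isn't already on, the usual HotSpot flags for producing the
> kind of log censum digests look like this (the log path is just an
> example):
>
>    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
>    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
>    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
>    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
>    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"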
>
> I don't know precisely how to answer, but:
>
>    - concurrent_compactors could be lowered to 10; it seems from another
>    thread here that it can be harmful, see
>    https://issues.apache.org/jira/browse/CASSANDRA-6142
>    - memtable_flush_writers: we set it to 2
>    - compaction_throughput_mb_per_sec could probably be increased; on
>    SSDs that should help
>    - trickle_fsync: don't forget this one too if you're on SSDs
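>
> In cassandra.yaml terms that would look roughly like this (numbers are
> illustrative only, not a recommendation):
>
>    concurrent_compactors: 10
>    memtable_flush_writers: 2
>    compaction_throughput_mb_per_sec: 32       # raise further on good SSDs
>    trickle_fsync: true
>    trickle_fsync_interval_in_kb: 10240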
>
> Touching JVM heap parameters can be hazardous: increasing the heap may seem
> like a nice thing, but it can increase GC time in the worst-case scenario.
>
> Also, increasing MaxTenuringThreshold is probably wrong too. As you
> probably know, it means objects will be copied from Eden to Survivor 0/1
> and then to the other survivor on each subsequent collection until the
> threshold is reached, and only then copied into the old generation. That
> also applies to memtables, so it *may* mean several copies on each GC, and
> memtables are not small objects, so those copies can take a while for a
> system that is meant to stay *available*. Another fact to take into account
> is that upon each collection the active survivor (S0/S1) has to be big
> enough for the memtables to fit, plus all the other objects.
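>
> A quick back-of-the-envelope with the numbers from your mails (assuming the
> usual HotSpot meaning of SurvivorRatio, i.e. Eden = SurvivorRatio x one
> survivor):
>
>    young gen (HEAP_NEWSIZE) = 3072 MB
>    SurvivorRatio=2          => Eden : S0 : S1 = 2 : 1 : 1
>    each survivor            = 3072 / 4 = 768 MB
>    Eden                     = 3072 / 2 = 1536 MB
>
> With memtable_total_space_in_mb at 1000, the live memtables alone can
> outgrow a 768 MB survivor, so a high MaxTenuringThreshold may just add
> copying before they end up promoted anyway.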
>
> So I would rather work on the real cause than on GC. One thing caught my
> attention:
>
> Though still getting logs saying "compacting large row".
>
> Could it be that the model is based on wide rows? That could be a problem,
> for several reasons not limited to compaction. If that is so, I'd advise
> revising the data model.
>
> -- Brice
>
> On Tue, Apr 21, 2015 at 7:53 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
> wrote:
>
>> Thanks Brice!!
>>
>> We are using Red Hat Linux 6.4, 24 cores, 64 GB RAM, SSDs in RAID 5. CPUs
>> are not overloaded even at peak load. I don't think IO is an issue, as
>> iostat shows await < 17 at all times; the util attribute in iostat usually
>> jumps from 0 to 100 and comes back immediately. I'm not an expert at
>> analyzing IO, but things look OK. We are using STCS and not using logged
>> batches. We are making around 12k writes/sec across 5 CFs (one with 4
>> secondary indexes) and 2300 reads/sec on each node of a 3-node cluster.
>> 2 CFs have wide rows with max data of around 100 MB per row. We have
>> further reduced in_memory_compaction_limit_in_mb to 125, though we are
>> still getting logs saying "compacting large row".
>>
>> We are planning to upgrade to 2.0.14 as 2.1 is not yet production ready.
>>
>> I would appreciate it if you could answer the queries posted in the
>> initial mail.
>>
>> Thanks
>> Anuj Wadehra
>>
>> ------------------------------
>> *From*:"Brice Dutheil" <brice.duth...@gmail.com>
>> *Date*:Tue, 21 Apr, 2015 at 10:22 pm
>>
>> *Subject*:Re: Handle Write Heavy Loads in Cassandra 2.0.3
>>
>> This is an intricate matter; I cannot say for sure which parameters are
>> good and which are wrong, as too many things changed at once.
>>
>> However, there are many things to consider:
>>
>>    - What is your OS?
>>    - Do your nodes have SSDs or mechanical drives? How many cores do
>>    you have?
>>    - Is it the CPUs or the IO that is overloaded?
>>    - What is the write request rate per node and cluster-wide?
>>    - What is the compaction strategy of the tables you are writing into?
>>    - Are you using LOGGED BATCH statements?
>>
>> With heavy writes, it is *NOT* recommended to use LOGGED BATCH statements.
>>
>> In our 2.0.14 cluster we have experienced node unavailability due to
>> long Full GC pauses. We discovered bogus legacy data: a single outlier was
>> so wrong that it updated the same CQL rows hundreds of thousands of times
>> with duplicate data. Given that the tables we were writing to were
>> configured to use LCS, this resulted in keeping memtables in memory long
>> enough to promote them into the old generation (the MaxTenuringThreshold
>> default is 1). Handling this data proved to be the thing to fix; with
>> default GC settings the cluster (10 nodes) handles 39 write requests/s.
>>
>> Note that memtables are allocated on heap with 2.0.x. With 2.1.x they can
>> be allocated off-heap.
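>>
>> If I remember the 2.1 knobs correctly, the off-heap switch looks something
>> like this in cassandra.yaml (worth double-checking against the 2.1 docs):
>>
>>    memtable_allocation_type: offheap_objects   # default is heap_buffers
>>    memtable_offheap_space_in_mb: 2048          # cap on off-heap memtables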
>>
>> -- Brice
>>
>> On Tue, Apr 21, 2015 at 5:12 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
>> wrote:
>>
>>> Any suggestions or comments on this one??
>>>
>>> Thanks
>>> Anuj Wadehra
>>>
>>> ------------------------------
>>> *From*:"Anuj Wadehra" <anujw_2...@yahoo.co.in>
>>> *Date*:Mon, 20 Apr, 2015 at 11:51 pm
>>> *Subject*:Re: Handle Write Heavy Loads in Cassandra 2.0.3
>>>
>>> Small correction: we are making writes to 5 CFs and reading from one at
>>> high speed.
>>>
>>>
>>>
>>> Thanks
>>> Anuj Wadehra
>>>
>>> ------------------------------
>>> *From*:"Anuj Wadehra" <anujw_2...@yahoo.co.in>
>>> *Date*:Mon, 20 Apr, 2015 at 7:53 pm
>>> *Subject*:Handle Write Heavy Loads in Cassandra 2.0.3
>>>
>>> Hi,
>>>
>>> Recently, we discovered that millions of mutations were getting dropped
>>> on our cluster. Eventually, we solved this problem by increasing the value
>>> of memtable_flush_writers from 1 to 3. We usually write 3 CFs
>>> simultaneously and one of them has 4 secondary indexes.
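>>>
>>> For reference, this kind of flush bottleneck is visible in the output of
>>>
>>>    nodetool tpstats
>>>
>>> (the FlushWriter pool's "All time blocked" column and the dropped
>>> MUTATION counter at the bottom).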
>>>
>>> New changes also include:
>>> concurrent_compactors: 12 (earlier it was default)
>>> compaction_throughput_mb_per_sec: 32(earlier it was default)
>>> in_memory_compaction_limit_in_mb: 400 (earlier it was the default of 64)
>>> memtable_flush_writers: 3 (earlier 1)
>>>
>>> After making the above changes, our write-heavy workload scenarios started
>>> giving "promotion failed" exceptions in GC logs.
>>>
>>> We have done JVM tuning and Cassandra config changes to solve this:
>>>
>>> MAX_HEAP_SIZE="12G" (increased heap from 8G to reduce fragmentation)
>>> HEAP_NEWSIZE="3G"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2" (We observed that even at
>>> SurvivorRatio=4, our survivor space was getting 100% utilized under heavy
>>> write load, and we thought that minor collections were directly promoting
>>> objects to the Tenured generation)
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20" (Lots of objects were
>>> moving from Eden to Tenured on each minor collection; maybe related to
>>> medium-lifetime objects from memtables and compactions, as suggested by
>>> the heap dump)
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000" // though it's the default
>>> value
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
>>> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" (we reduced the
>>> value to avoid concurrent mode failures)
>>>
>>> Cassandra config:
>>> compaction_throughput_mb_per_sec: 24
>>> memtable_total_space_in_mb: 1000 (to make memtable flushes more frequent;
>>> the default is 1/4 of the heap, which creates more long-lived objects)
>>>
>>> Questions:
>>> 1. Why did increasing memtable_flush_writers and
>>> in_memory_compaction_limit_in_mb cause promotion failures in the JVM? Do
>>> more memtable_flush_writers mean more memtables in memory?
>>>
>>> 2. Still, objects are getting promoted at high speed to Tenured space.
>>> CMS is running on the old gen every 4-5 minutes under heavy write load.
>>> Around 750+ minor collections of up to 300ms happened in 45 mins. Do you
>>> see any problems with the new JVM tuning and Cassandra config? Does the
>>> justification given for those changes sound logical? Any suggestions?
>>> 3. What is the best practice for reducing heap fragmentation/promotion
>>> failure when allocation and promotion rates are high?
>>>
>>> Thanks
>>> Anuj
>>>
>>>
>>>
>>>
>>>
>>
>
