Another reason for memtables to be kept in memory is wide rows. Maybe someone can chime in and confirm, but I believe wide rows (in the Thrift sense) need to be synced entirely across nodes. So from the number you gave, a node can send ~100 MB over the network for a single row. With compaction and other work going on, that may be an issue, as these objects can stay in the heap long enough to survive a collection.

Think about the row cache too: with wide rows, Cassandra will hold the tables a bit longer to serialize the data into the off-heap row cache (in 2.0.x, not sure about other versions). See this page: http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_configuring_caches_c.html
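For illustration, a minimal sketch of the cache knobs that page describes (assuming 2.0.x; the keyspace/table names and values below are placeholders, not anything from this cluster):

    # cassandra.yaml (2.0.x)
    row_cache_size_in_mb: 0        # 0 keeps the row cache disabled; enabling it caches whole
                                   # rows, which is painful when a single row can reach ~100 MB
    key_cache_size_in_mb:          # blank = automatic, min(5% of heap, 100 MB)

    -- per table, from cqlsh (2.0 syntax; 2.1 switched to a map)
    ALTER TABLE my_keyspace.my_wide_cf WITH caching = 'keys_only';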
-- Brice

On Wed, Apr 22, 2015 at 2:47 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

> Any other suggestions on the JVM tuning and Cassandra config we did to solve the promotion failures during GC?
>
> I would appreciate it if someone could try to answer the queries mentioned in my initial mail.
>
> Thanks
> Anuj Wadehra
>
> ------------------------------
> *From*: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
> *Date*: Wed, 22 Apr, 2015 at 6:12 pm
> *Subject*: Re: Handle Write Heavy Loads in Cassandra 2.0.3
>
> Thanks Brice for all the comments.
>
> We analyzed GC logs and a heap dump before tuning the JVM and GC. With the new JVM config I specified, we were able to remove the promotion failures seen with the default config. The heap dump gave me the idea that memtables and compaction are the biggest culprits.
>
> CASSANDRA-6142 talks about multithreaded_compaction, but we are using concurrent_compactors; I think they are different. On nodes with many cores it is usually recommended to run cores/2 concurrent compactors. I don't think 10 vs 12 would make a big difference.
>
> For now, we have kept compaction throughput at 24, as we already have scenarios which create heap pressure under heavy read/write load. Yes, we can think of increasing it on SSDs.
>
> We have already enabled trickle_fsync.
>
> The justification for increasing MaxTenuringThreshold and the young gen size and creating a large survivor space is to GC most memtables in the young gen itself. To make sure that memtables are smaller and not kept too long in the heap, we have reduced memtable_total_space_in_mb to 1 GB from the default of heap size/4. We flush a memtable to disk approximately every 15 sec and our minor collections run every 3-7 sec, so it is highly probable that most memtables will be collected in the young gen. The idea is that most short-lived and medium-lifetime objects should not reach the old gen; otherwise CMS old gen collections would be very frequent and more expensive, as they may not collect memtables, and fragmentation would be higher.
>
> I think wide rows of less than 100 MB shouldn't be a problem; Cassandra in fact provides a very good wide-row format suitable for time series and other scenarios. The problem is that when my in_memory_compaction_limit_in_mb is 125 MB, why is Cassandra printing "compacting large row" when the row is less than 100 MB?
>
> Thanks
> Anuj Wadehra
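One way to check that reasoning empirically is to have the collector report tenuring ages. A minimal sketch, assuming these lines go into cassandra-env.sh next to the existing JVM_OPTS entries (standard HotSpot flags; the log path is a placeholder):

    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"    # per-age survivor occupancy after each minor GC
    JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"        # size of the allocation that failed to promote
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log" # placeholder path

If memtables really die in the young gen, the tenuring histogram should drop towards zero well before age 20; if the byte counts stay flat across ages, the same objects are being copied between survivors on every minor GC and will end up promoted anyway.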
> ------------------------------
> *From*: "Brice Dutheil" <brice.duth...@gmail.com>
> *Date*: Wed, 22 Apr, 2015 at 3:52 am
> *Subject*: Re: Handle Write Heavy Loads in Cassandra 2.0.3
>
> Hi, I cannot really answer your question with some rock-solid truth.
>
> When we had problems, we did mainly two things:
>
> - Analyzed the GC logs (with censum from jClarity; this tool IS really awesome, it's a good investment, even more so if production is running other Java applications)
> - Heap dumped Cassandra when there was a GC; this helped in narrowing down the actual issue
>
> I don't know precisely how to answer, but:
>
> - concurrent_compactors could be lowered to 10; it seems from another thread here that it can be harmful, see https://issues.apache.org/jira/browse/CASSANDRA-6142
> - memtable_flush_writers: we set it to 2
> - compaction_throughput_mb_per_sec could probably be increased; on SSDs that should help
> - trickle_fsync: don't forget this one too if you're on SSDs
>
> Touching JVM heap parameters can be hazardous. Increasing the heap may seem like a nice thing, but it can increase GC time in the worst-case scenario.
>
> Also, increasing the MaxTenuringThreshold is probably wrong too. As you probably know, it means objects will be copied from Eden to Survivor 0/1, and then to the other survivor on each subsequent collection until that threshold is reached, at which point they are copied into the old generation. That applies to memtables as well, so it *may* mean several copies on each GC, and memtables are not small objects, so those copies can take a while for a system that is supposed to stay *available*. Another fact to take into account is that upon each collection the active survivor (S0/S1) has to be big enough for the memtables to fit there, and there are other objects too.
>
> So I would rather work on the real cause rather than on GC. One thing caught my attention:
>
> Though still getting logs saying "compacting large row".
>
> Could it be that the model is based on wide rows? That could be a problem, for several reasons not limited to compaction. If that is so, I'd advise revising the data model.
>
> -- Brice
>
> On Tue, Apr 21, 2015 at 7:53 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
>
>> Thanks Brice!!
>>
>> We are using Red Hat Linux 6.4, 24 cores, 64 GB RAM, and SSDs in RAID 5. CPUs are not overloaded even at peak load. I don't think IO is an issue, as iostat shows await < 17 at all times; the util attribute in iostat usually jumps from 0 to 100 and comes back immediately. I'm not an expert at analyzing IO, but things look OK. We are using STCS and not using logged batches. We are making around 12k writes/sec to 5 CFs (one with 4 secondary indexes) and 2300 reads/sec on each node of a 3-node cluster. 2 CFs have wide rows with a maximum of around 100 MB of data per row. We have further reduced in_memory_compaction_limit_in_mb to 125, though we are still getting logs saying "compacting large row".
>>
>> We are planning to upgrade to 2.0.14, as 2.1 is not yet production ready.
>>
>> I would appreciate it if you could answer the queries posted in the initial mail.
>>
>> Thanks
>> Anuj Wadehra
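On the "compacting large row" messages, it may help to see how big the compacted rows actually are on disk. A rough sketch using standard nodetool commands (keyspace/CF names are placeholders; the metric label changed wording between versions, and older nodetool builds may need the unfiltered output plus grep):

    nodetool cfstats my_keyspace.my_wide_cf
    # 2.0.x reports "Compacted row maximum size" (bytes) per column family;
    # later versions call it "Compacted partition maximum bytes"

    nodetool cfhistograms my_keyspace my_wide_cf
    # shows the row size distribution, so outliers near or above
    # in_memory_compaction_limit_in_mb stand out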
>> ------------------------------
>> *From*: "Brice Dutheil" <brice.duth...@gmail.com>
>> *Date*: Tue, 21 Apr, 2015 at 10:22 pm
>> *Subject*: Re: Handle Write Heavy Loads in Cassandra 2.0.3
>>
>> This is an intricate matter; I cannot say for sure which parameters are good and which are wrong, as too many things changed at once.
>>
>> However, there are many things to consider:
>>
>> - What is your OS?
>> - Do your nodes have SSDs or mechanical drives? How many cores do you have?
>> - Is it the CPUs or the IOs that are overloaded?
>> - What is the write request rate per second, per node and cluster wide?
>> - What is the compaction strategy of the tables you are writing into?
>> - Are you using LOGGED BATCH statements?
>>
>> With heavy writes, it is *NOT* recommended to use LOGGED BATCH statements.
>>
>> In our 2.0.14 cluster we have experienced node unavailability due to long full GC pauses. We discovered bogus legacy data: a single outlier was so wrong that it updated the same CQL rows a hundred thousand times with duplicate data. Given that the tables we were writing to were configured to use LCS, this resulted in keeping memtables in memory long enough to promote them to the old generation (the MaxTenuringThreshold default is 1). Handling this data proved to be the thing to fix; with default GC settings the cluster (10 nodes) handles 39 write requests/s.
>>
>> Note memtables are allocated on heap with 2.0.x. With 2.1.x they will be allocated off-heap.
>>
>> -- Brice
>>
>> On Tue, Apr 21, 2015 at 5:12 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:
>>
>>> Any suggestions or comments on this one?
>>>
>>> Thanks
>>> Anuj Wadehra
>>>
>>> ------------------------------
>>> *From*: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
>>> *Date*: Mon, 20 Apr, 2015 at 11:51 pm
>>> *Subject*: Re: Handle Write Heavy Loads in Cassandra 2.0.3
>>>
>>> Small correction: we are making writes to 5 CFs and reading from one at high speed.
>>>
>>> Thanks
>>> Anuj Wadehra
>>>
>>> ------------------------------
>>> *From*: "Anuj Wadehra" <anujw_2...@yahoo.co.in>
>>> *Date*: Mon, 20 Apr, 2015 at 7:53 pm
>>> *Subject*: Handle Write Heavy Loads in Cassandra 2.0.3
>>>
>>> Hi,
>>>
>>> Recently, we discovered that millions of mutations were getting dropped on our cluster. Eventually, we solved this problem by increasing the value of memtable_flush_writers from 1 to 3. We usually write 3 CFs simultaneously, and one of them has 4 secondary indexes.
>>>
>>> New changes also include:
>>> concurrent_compactors: 12 (earlier it was the default)
>>> compaction_throughput_mb_per_sec: 32 (earlier it was the default)
>>> in_memory_compaction_limit_in_mb: 400 (earlier it was the default, 64)
>>> memtable_flush_writers: 3 (earlier 1)
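As a side note on the dropped mutations: that symptom shows up directly in nodetool, which can help confirm whether the extra flush writers removed the bottleneck. A rough sketch with standard commands (nothing here is specific to this cluster):

    nodetool tpstats
    # the "Dropped" section at the bottom counts dropped MUTATION messages;
    # the FlushWriter pool's "All time blocked" column shows whether flushes
    # were queuing up behind too few writers

    watch -n 60 nodetool tpstats    # optional: watch the counters move under load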
>>> After making the above changes, our write-heavy workload scenarios started giving "promotion failed" exceptions in the GC logs.
>>>
>>> We have done JVM tuning and Cassandra config changes to solve this:
>>>
>>> MAX_HEAP_SIZE="12G" (increased heap from 8G to reduce fragmentation)
>>> HEAP_NEWSIZE="3G"
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2" (We observed that even at SurvivorRatio=4 our survivor space was getting 100% utilized under heavy write load, and we thought that minor collections were directly promoting objects to the tenured generation)
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=20" (Lots of objects were moving from Eden to tenured on each minor collection; maybe related to medium-lifetime objects tied to memtables and compaction, as suggested by the heap dump)
>>>
>>> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=20"
>>> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>>> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>>> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>>> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=30000"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=2000" (though it's the default value)
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"
>>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
>>> JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
>>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70" (we reduced the value to avoid concurrent mode failures)
>>>
>>> Cassandra config:
>>> compaction_throughput_mb_per_sec: 24
>>> memtable_total_space_in_mb: 1000 (to make memtable flushes frequent; the default is 1/4 of the heap, which creates more long-lived objects)
>>>
>>> Questions:
>>> 1. Why did increasing memtable_flush_writers and in_memory_compaction_limit_in_mb cause promotion failures in the JVM? Does more memtable_flush_writers mean more memtables in memory?
>>> 2. Objects are still getting promoted to the tenured space at a high rate. CMS is running on the old gen every 4-5 minutes under heavy write load. Around 750+ minor collections of up to 300 ms happened in 45 minutes. Do you see any problems with the new JVM tuning and Cassandra config? Does the justification given for those changes sound logical? Any suggestions?
>>> 3. What is the best practice for reducing heap fragmentation/promotion failures when allocation and promotion rates are high?
>>>
>>> Thanks
>>> Anuj
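Picking up the earlier point that the active survivor has to be big enough for the memtables to fit: a quick back-of-the-envelope check against the settings above (HotSpot splits the young gen into Eden plus two survivors, with SurvivorRatio = Eden size / one survivor size):

    HEAP_NEWSIZE = 3072 MB, SurvivorRatio = 2
      each survivor = 3072 / (2 + 1 + 1) = 768 MB
      Eden          = 2 * 768            = 1536 MB
    (with the earlier SurvivorRatio=4: survivors of 512 MB, Eden of 2048 MB)

With memtable_total_space_in_mb at 1000, the live memtables alone can exceed a single 768 MB survivor; whatever does not fit during a minor GC is promoted early regardless of MaxTenuringThreshold, and an old gen too fragmented to take those large promotions is exactly what surfaces as "promotion failed".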