I just did a wide set of selects and ran across no tombstones. But while on the subject of gc_grace_seconds, is there any reason, on a small cluster, not to set it to something low like a single day? It seems like 10 days is only needed for large clusters undergoing long partition splits, or am I misunderstanding gc_grace_seconds?
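For what it's worth, gc_grace_seconds is a per-table setting, so trying a lower value on one table is a one-line schema change. A minimal sketch in cqlsh, assuming the media.media_tracks_raw table that shows up in the compactionstats output further down and a one-day grace period; the main caveat is that gc_grace_seconds has to stay comfortably longer than the interval at which every node gets repaired, or deleted data can resurface:

    cqlsh> -- 86400 seconds = 1 day
    cqlsh> ALTER TABLE media.media_tracks_raw WITH gc_grace_seconds = 86400;
    cqlsh> -- confirm the new value in the table definition
    cqlsh> DESCRIBE TABLE media.media_tracks_raw;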
Now, given all that, does any of this explain a high load when the cluster is idle? Is it compaction catching up, and would a manual forced compaction alleviate that?

thanks,
arne

On Dec 16, 2014, at 3:28 PM, Ryan Svihla <rsvi...@datastax.com> wrote:

> So a delete is really another write for gc_grace_seconds (default 10 days); if you get enough tombstones it can make managing your cluster a challenge as is. Open up cqlsh, turn on tracing and try a few queries... how many tombstones are scanned for a given query? It's possible the heap problems you're seeing are actually happening on the query side and not on the ingest side. The severity of this depends on the driver and cassandra version, but older drivers and versions of cassandra could easily overload the heap with expensive selects; when layered over tombstones, it certainly becomes a possibility that this is your root cause.
>
> Now this will primarily create more load on compaction, and depending on your cassandra version there may be some other issue at work, but something I can tell you is that every time I see 1 dropped mutation I see a cluster that was overloaded enough that it had to shed load. If I see 200k, I see a cluster/configuration/hardware that is badly overloaded.
>
> I suggest the following:
> trace some of the queries used in prod
> monitor your ingest rate, and see at what levels you run into issues (GCInspector log messages, dropped mutations, etc.)
> heap configuration we mentioned earlier... go ahead and monitor heap usage; if it hits 75% repeatedly, this is an indication of heavy load
> monitor dropped mutations... any dropped mutation is evidence of an overloaded server. Again, the root cause can be many other problems that are solvable with current hardware, and LOTS of people run with nodes with similar configuration.
>
> On Tue, Dec 16, 2014 at 5:08 PM, Arne Claassen <a...@emotient.com> wrote:
> Not using any secondary indices, and memtable_flush_queue_size is the default 4.
>
> But let me tell you how data is "mutated" right now; maybe that will give you an insight into how this is happening.
>
> Basically the frame data table has the following primary key: PRIMARY KEY ((id), trackid, "timestamp")
>
> Generally data is inserted once, so day-to-day writes are all new rows. However, when our process for generating analytics for these rows changes, we run the media back through again, causing overwrites.
>
> Up until last night, this was just a new insert, because the PK never changed, so it was always a 1-to-1 overwrite of every row.
>
> Last night was the first time that a change went in where the PK could actually change, so now the process is always: DELETE by partition key, insert all rows for the partition key, repeat.
>
> We have two tables that have similar frame data projections, and some other aggregates with a much smaller row count per partition key.
>
> hope that helps,
> arne
>
> On Dec 16, 2014, at 2:46 PM, Ryan Svihla <rsvi...@datastax.com> wrote:
>
>> So you've got some blocked flush writers, but you have an incredibly large number of dropped mutations. Are you using secondary indexes, and if so, how many? What is your flush queue set to?
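For reference, the tracing exercise Ryan describes above is just a cqlsh session; each traced query is followed by a per-step trace that includes lines along the lines of "Read N live and M tombstoned cells" for the partitions it touched. A rough sketch, assuming the table in question is the media_tracks_raw table from the compactionstats output below and using the PRIMARY KEY ((id), trackid, "timestamp") shape described above; the id literal is just a placeholder:

    cqlsh> USE media;
    cqlsh:media> TRACING ON;
    cqlsh:media> SELECT * FROM media_tracks_raw WHERE id = 'some-media-id' LIMIT 100;
    cqlsh:media> -- scan the trace output for the live vs. tombstoned cell counts
    cqlsh:media> TRACING OFF;

If a query that returns a few hundred rows reports tens of thousands of tombstoned cells, the DELETE-by-partition-key-then-reinsert pattern described above is the likely source of read-side heap pressure.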
>>
>> On Tue, Dec 16, 2014 at 4:43 PM, Arne Claassen <a...@emotient.com> wrote:
>> Of course QA decided to start a test batch (still relatively low traffic), so I hope it doesn't throw the tpstats off too much.
>>
>> Node 1:
>> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
>> MutationStage                     0         0    13804928         0                  0
>> ReadStage                         0         0       10975         0                  0
>> RequestResponseStage              0         0     7725378         0                  0
>> ReadRepairStage                   0         0        1247         0                  0
>> ReplicateOnWriteStage             0         0           0         0                  0
>> MiscStage                         0         0           0         0                  0
>> HintedHandoff                     1         1          50         0                  0
>> FlushWriter                       0         0         306         0                 31
>> MemoryMeter                       0         0         719         0                  0
>> GossipStage                       0         0      286505         0                  0
>> CacheCleanupExecutor              0         0           0         0                  0
>> InternalResponseStage             0         0           0         0                  0
>> CompactionExecutor                4        14         159         0                  0
>> ValidationExecutor                0         0           0         0                  0
>> MigrationStage                    0         0           0         0                  0
>> commitlog_archiver                0         0           0         0                  0
>> AntiEntropyStage                  0         0           0         0                  0
>> PendingRangeCalculator            0         0          11         0                  0
>> MemtablePostFlusher               0         0        1781         0                  0
>>
>> Message type           Dropped
>> READ                         0
>> RANGE_SLICE                  0
>> _TRACE                       0
>> MUTATION                391041
>> COUNTER_MUTATION             0
>> BINARY                       0
>> REQUEST_RESPONSE             0
>> PAGED_RANGE                  0
>> READ_REPAIR                  0
>>
>> Node 2:
>> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
>> MutationStage                     0         0      997042         0                  0
>> ReadStage                         0         0        2623         0                  0
>> RequestResponseStage              0         0      706650         0                  0
>> ReadRepairStage                   0         0         275         0                  0
>> ReplicateOnWriteStage             0         0           0         0                  0
>> MiscStage                         0         0           0         0                  0
>> HintedHandoff                     2         2          12         0                  0
>> FlushWriter                       0         0          37         0                  4
>> MemoryMeter                       0         0          70         0                  0
>> GossipStage                       0         0       14927         0                  0
>> CacheCleanupExecutor              0         0           0         0                  0
>> InternalResponseStage             0         0           0         0                  0
>> CompactionExecutor                4         7          94         0                  0
>> ValidationExecutor                0         0           0         0                  0
>> MigrationStage                    0         0           0         0                  0
>> commitlog_archiver                0         0           0         0                  0
>> AntiEntropyStage                  0         0           0         0                  0
>> PendingRangeCalculator            0         0           3         0                  0
>> MemtablePostFlusher               0         0         114         0                  0
>>
>> Message type           Dropped
>> READ                         0
>> RANGE_SLICE                  0
>> _TRACE                       0
>> MUTATION                     0
>> COUNTER_MUTATION             0
>> BINARY                       0
>> REQUEST_RESPONSE             0
>> PAGED_RANGE                  0
>> READ_REPAIR                  0
>>
>> Node 3:
>> Pool Name                    Active   Pending   Completed   Blocked   All time blocked
>> MutationStage                     0         0     1539324         0                  0
>> ReadStage                         0         0        2571         0                  0
>> RequestResponseStage              0         0      373300         0                  0
>> ReadRepairStage                   0         0         325         0                  0
>> ReplicateOnWriteStage             0         0           0         0                  0
>> MiscStage                         0         0           0         0                  0
>> HintedHandoff                     1         1          21         0                  0
>> FlushWriter                       0         0          38         0                  5
>> MemoryMeter                       0         0          59         0                  0
>> GossipStage                       0         0       21491         0                  0
>> CacheCleanupExecutor              0         0           0         0                  0
>> InternalResponseStage             0         0           0         0                  0
>> CompactionExecutor                4         9          85         0                  0
>> ValidationExecutor                0         0           0         0                  0
>> MigrationStage                    0         0           0         0                  0
>> commitlog_archiver                0         0           0         0                  0
>> AntiEntropyStage                  0         0           0         0                  0
>> PendingRangeCalculator            0         0           6         0                  0
>> MemtablePostFlusher               0         0         164         0                  0
>>
>> Message type           Dropped
>> READ                         0
>> RANGE_SLICE                  0
>> _TRACE                       0
>> MUTATION                205259
>> COUNTER_MUTATION             0
>> BINARY                       0
>> REQUEST_RESPONSE             0
>> PAGED_RANGE                  0
>> READ_REPAIR                 18
>>
>> Compaction seems like the only thing consistently active and pending.
>>
>> On Tue, Dec 16, 2014 at 2:18 PM, Ryan Svihla <rsvi...@datastax.com> wrote:
>> Ok, based on those numbers I have a theory...
>>
>> Can you show me nodetool tpstats for all 3 nodes?
>>
>> On Tue, Dec 16, 2014 at 4:04 PM, Arne Claassen <a...@emotient.com> wrote:
>> No problem with the follow-up questions. I'm on a crash course here trying to understand what makes C* tick, so I appreciate all feedback.
>>
>> We reprocessed all media (1200 partition keys) last night, where partition keys had somewhere between 4k and 200k "rows".
>> After that completed, no traffic went to the cluster at all for ~8 hours, and throughout today we may get a couple (less than 10) queries per second and maybe 3-4 write batches per hour.
>>
>> I assume the last value in the Partition Size histogram is the largest row:
>>
>> 20924300 bytes: 79
>> 25109160 bytes: 57
>>
>> The majority seems clustered around 200000 bytes.
>>
>> I will look at switching my inserts to unlogged batches since they are always for one partition key.
>>
>> On Tue, Dec 16, 2014 at 1:47 PM, Ryan Svihla <rsvi...@datastax.com> wrote:
>> Can you define what "virtually no traffic" is? Sorry to be repetitive about that, but I've worked on a lot of clusters in the past year and people have wildly different ideas of what that means.
>>
>> Unlogged batches of the same partition key are definitely a performance optimization. Typically async is much faster and easier on the cluster when you're using multi-partition-key batches.
>>
>> nodetool cfhistograms <keyspace> <tablename>
>>
>> On Tue, Dec 16, 2014 at 3:42 PM, Arne Claassen <a...@emotient.com> wrote:
>> Actually, not sure why the machine was originally configured at 6GB, since we even started it on an r3.large with 15GB.
>>
>> Re: Batches
>>
>> Not using batches. I actually have that as a separate question on the list. Currently I fan out async single inserts, and I'm wondering if batches are better since my data is inherently inserted in blocks of ordered rows for a single partition key.
>>
>> Re: Traffic
>>
>> There isn't all that much traffic. Inserts come in as blocks per partition key, but those can be 5k-200k rows for that partition key. Each of these rows is less than 100k. It's small, lots of ordered rows. It's frame and sub-frame information for media, and the rows for one piece of media (the partition key) are inserted at once.
>>
>> For the last 12 hours, where the load on all these machines has been stuck, there's been virtually no traffic at all. This is the nodes basically sitting idle, except that they had a load of 4 each.
>>
>> BTW, how do you determine the widest row, or for that matter the number of tombstones in a row?
>>
>> thanks,
>> arne
>>
>> On Tue, Dec 16, 2014 at 1:24 PM, Ryan Svihla <rsvi...@datastax.com> wrote:
>> So 1024 is still a good 2.5 times what I'm suggesting, and 6GB is hardly enough to run Cassandra well in, especially if you're going full bore on loads. However, you may just flat out be CPU bound on your write throughput; how many TPS and what size writes do you have? Also, what is your widest row?
>>
>> Final question: what is compaction throughput at?
>>
>> On Tue, Dec 16, 2014 at 3:20 PM, Arne Claassen <a...@emotient.com> wrote:
>> The starting configuration I had, which is still running on two of the nodes, was a 6GB heap and 1024MB ParNew, which is close to what you are suggesting, and those have been pegged at load 4 for over 12 hours with hardly any read or write traffic. I will set one to 8GB/400MB and see if its load changes.
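For anyone following along, the 8GB/400MB change Arne mentions is normally made in conf/cassandra-env.sh and needs a node restart to take effect. A minimal sketch, assuming the stock 2.0-era cassandra-env.sh where the auto-calculated defaults are overridden by setting these two variables:

    # conf/cassandra-env.sh
    # Override the auto-sized heap; values match the 8GB heap / 400MB new-gen
    # suggestion from earlier in the thread.
    MAX_HEAP_SIZE="8G"
    HEAP_NEWSIZE="400M"

The stock script expects these two to be set (or left unset) as a pair; setting only one of them makes startup bail out with a warning to that effect.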
>>
>> On Tue, Dec 16, 2014 at 1:12 PM, Ryan Svihla <rsvi...@datastax.com> wrote:
>> So a heap of that size without some tuning will create a number of problems (high CPU usage being one of them). I suggest either an 8GB heap and 400MB ParNew (and I'd only set it that low for that low a CPU count), or attempting the tunings indicated in https://issues.apache.org/jira/browse/CASSANDRA-8150
>>
>> On Tue, Dec 16, 2014 at 3:06 PM, Arne Claassen <a...@emotient.com> wrote:
>> Changed the 15GB node to a 25GB heap and the nice CPU is down to ~20% now. Checked my dev cluster to see if the ParNew log entries are just par for the course, but I'm not seeing them there. However, both have the following every 30 seconds:
>>
>> DEBUG [BatchlogTasks:1] 2014-12-16 21:00:44,898 BatchlogManager.java (line 165) Started replayAllFailedBatches
>> DEBUG [MemtablePostFlusher:1] 2014-12-16 21:00:44,899 ColumnFamilyStore.java (line 866) forceFlush requested but everything is clean in batchlog
>> DEBUG [BatchlogTasks:1] 2014-12-16 21:00:44,899 BatchlogManager.java (line 200) Finished replayAllFailedBatches
>>
>> Is that just routine scheduled housekeeping or a sign of something else?
>>
>> On Tue, Dec 16, 2014 at 12:52 PM, Arne Claassen <a...@emotient.com> wrote:
>> Sorry, I meant a 15GB heap on the one machine that has less nice CPU% now. The others are 6GB.
>>
>> On Tue, Dec 16, 2014 at 12:50 PM, Arne Claassen <a...@emotient.com> wrote:
>> AWS r3.xlarge, 30GB RAM, but only using a heap of 10GB and a 2GB new gen, because we might go c3.2xlarge instead if CPU is more important than RAM. Storage is EBS-optimized SSD (but iostat shows no real I/O going on). Each node only has about 10GB of data, with ownership of 67%, 64.7% & 68.3%.
>>
>> On the node where I set the heap to 10GB from 6GB, the utilization has dropped to 46% nice now, but the ParNew log messages still continue at the same pace. I'm gonna up the heap to 20GB for a bit, see if that brings that nice CPU further down.
>>
>> No TombstoneOverflowingExceptions.
>>
>> On Tue, Dec 16, 2014 at 11:50 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>> What's CPU, RAM, storage layer, and data density per node? Exact heap settings would be nice. In the logs, look for TombstoneOverflowingException.
>>
>> On Tue, Dec 16, 2014 at 1:36 PM, Arne Claassen <a...@emotient.com> wrote:
>> I'm running 2.0.10.
>>
>> The data is all time series data, and as we change our pipeline, we've periodically been reprocessing the data sources, which causes each time series to be overwritten, i.e. every row per partition key is deleted and re-written, so I assume I've been collecting a bunch of tombstones.
>>
>> Also, the ever-present and never-completing compactions I assumed were an artifact of tombstoning, but I fully admit that's conjecture based on the ~20 blog posts and stackoverflow questions I've surveyed.
>>
>> I doubled the heap on one node and it changed nothing regarding the load or the ParNew log statements. New generation usage is 50%, Eden itself is 56%.
>>
>> Anything else I should look at and report, let me know.
>>
>> On Tue, Dec 16, 2014 at 11:14 AM, Jonathan Lacefield <jlacefi...@datastax.com> wrote:
>> Hello,
>>
>> What version of Cassandra are you running?
>>
>> If it's 2.0, we recently experienced something similar with 8447 [1], which 8485 [2] should hopefully resolve.
>>
>> Please note that 8447 is not related to tombstones. Tombstone processing can put a lot of pressure on the heap as well.
>> Why do you think you have a lot of tombstones in that one particular table?
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8447
>> [2] https://issues.apache.org/jira/browse/CASSANDRA-8485
>>
>> Jonathan
>>
>> Jonathan Lacefield
>> Solution Architect | (404) 822 3487 | jlacefi...@datastax.com
>>
>> On Tue, Dec 16, 2014 at 2:04 PM, Arne Claassen <a...@emotient.com> wrote:
>> I have a three node cluster that has been sitting at a load of 4 (for each node), 100% CPU utilization (although 92% nice), for the last 12 hours, ever since some significant writes finished. I'm trying to determine what tuning I should be doing to get it out of this state. The debug log is just an endless series of:
>>
>> DEBUG [ScheduledTasks:1] 2014-12-16 19:03:35,042 GCInspector.java (line 118) GC for ParNew: 166 ms for 10 collections, 4400928736 used; max is 8000634880
>> DEBUG [ScheduledTasks:1] 2014-12-16 19:03:36,043 GCInspector.java (line 118) GC for ParNew: 165 ms for 10 collections, 4440011176 used; max is 8000634880
>> DEBUG [ScheduledTasks:1] 2014-12-16 19:03:37,043 GCInspector.java (line 118) GC for ParNew: 135 ms for 8 collections, 4402220568 used; max is 8000634880
>>
>> iostat shows virtually no I/O.
>>
>> Compaction may enter into this, but I don't really know what to make of the compaction stats since they never change:
>>
>> [root@cassandra-37919c3a ~]# nodetool compactionstats
>> pending tasks: 10
>>    compaction type   keyspace              table      completed         total   unit   progress
>>         Compaction      media   media_tracks_raw      271651482     563615497  bytes     48.20%
>>         Compaction      media   media_tracks_raw       30308910   21676695677  bytes      0.14%
>>         Compaction      media   media_tracks_raw     1198384080    1815603161  bytes     66.00%
>> Active compaction remaining time :   0h22m24s
>>
>> 5 minutes later:
>>
>> [root@cassandra-37919c3a ~]# nodetool compactionstats
>> pending tasks: 9
>>    compaction type   keyspace              table      completed         total   unit   progress
>>         Compaction      media   media_tracks_raw      271651482     563615497  bytes     48.20%
>>         Compaction      media   media_tracks_raw       30308910   21676695677  bytes      0.14%
>>         Compaction      media   media_tracks_raw     1198384080    1815603161  bytes     66.00%
>> Active compaction remaining time :   0h22m24s
>>
>> Sure, the pending tasks went down by one, but the rest is identical. media_tracks_raw likely has a bunch of tombstones (I can't figure out how to get stats on that).
>>
>> Is this behavior something that indicates that I need more heap or a larger new generation? Should I be manually running compaction on tables with lots of tombstones?
>>
>> Any suggestions or places to educate myself better on performance tuning would be appreciated.
>>
>> arne
>>
>> --
>>
>> Ryan Svihla
>> Solution Architect
>>
>> DataStax is the fastest, most scalable distributed database technology, delivering Apache Cassandra to the world's most innovative enterprises. DataStax is built to be agile, always-on, and predictably scalable to any size. With more than 500 customers in 45 countries, DataStax is the database technology and transactional backbone of choice for the world's most innovative companies such as Netflix, Adobe, Intuit, and eBay.
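On the two questions at the end of Arne's original message (how to get row-size/tombstone stats, and whether to compact manually), both have nodetool answers. A rough sketch, reusing the keyspace/table names from the compactionstats output above, with the caveat that a forced major compaction on a size-tiered table leaves one very large sstable behind, so it is usually a last resort rather than routine maintenance:

    # partition size / cell count histograms for the table; the top of the
    # Partition Size column answers the "widest row" question
    nodetool cfhistograms media media_tracks_raw

    # compaction throttle is compaction_throughput_mb_per_sec in cassandra.yaml
    # (default 16) and can be changed at runtime, e.g. raised while catching up
    nodetool setcompactionthroughput 32

    # force a compaction of just that table to purge droppable tombstones
    # (only those older than gc_grace_seconds are actually dropped)
    nodetool compact media media_tracks_raw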
>
> --
>
> Ryan Svihla
> Solution Architect, DataStax
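One footnote on the batching question from the middle of the thread (async single inserts vs. batches): an unlogged batch whose statements all target the same partition is applied as a single mutation on the replicas, which is the one case where batching is a clear win. A rough sketch using the PRIMARY KEY ((id), trackid, "timestamp") layout described above; the table name, the extra payload column, and the literal values/types are all made up for illustration:

    cqlsh:media> BEGIN UNLOGGED BATCH
             ... INSERT INTO frame_data (id, trackid, "timestamp", payload) VALUES ('media-1', 1, 0, '...');
             ... INSERT INTO frame_data (id, trackid, "timestamp", payload) VALUES ('media-1', 1, 33, '...');
             ... INSERT INTO frame_data (id, trackid, "timestamp", payload) VALUES ('media-1', 1, 66, '...');
             ... APPLY BATCH;

Because every statement shares the partition key, the coordinator writes the whole batch as one mutation; as soon as a batch mixes partition keys, Ryan's earlier advice applies and fanned-out async inserts are usually the better choice.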