Hi list,

We'd appreciate some insight into some irregular read performance we're seeing.
We have a column family that has become problematic recently. We've noticed that a few queries take enormous amounts of time and seem to clog up read resources on the machine (read pending tasks pile up, then are immediately relieved). I've included the cfhistograms output for this column family[1]. The sampled latencies don't include a hit on one of the problematic partitions, but they show two things: 1) the vast majority of queries to this table appear healthy, and 2) the maximum partition size is absurd (4139110981 bytes, roughly 3.9 GB).

This particular cf is not expected to be updated beyond an initial set of writes, but it can be read many times. The data model includes several hashes that amount to a few KB at most, a set<ascii> that can hit ~30-40 entries, and three list<int> columns that reach a hundred or so entries at most (a rough sketch of the schema is at [3]). There doesn't appear to be any material difference in the size or character of the data stored between "good" and "bad" partitions. Often, the same extremely slow partition queried with consistency ONE returns cleanly and very quickly from other replicas.

I've included a trace of one of these slow reads[2], which I find very strange: the vast majority of operations are very quick, but the final step is extremely slow. Nothing exceeds 2ms until the final "Read 1 live and 0 tombstone cells", which takes a whopping 69 seconds [!!]. We've checked garbage collection during this period and haven't seen any significant collections. As far as I can tell, the trace doesn't raise any red flags beyond that final step, and we're largely stumped.

We've got two main questions:

1) What's up with the megapartition, and what's the best way to debug it? Our data model is largely write-once; we don't do any UPDATEs. We do DELETE, but the partitions that are giving us issues haven't been deleted. We had some suspicions about https://issues.apache.org/jira/browse/CASSANDRA-10547, but that seems to be triggered largely by UPDATE operations.

2) What could cause the read to take such an absurd amount of time when only a pair of sstables and the memtable are being examined, and it's just a single cell being read? We originally suspected memory pressure from huge sstables, but without a corresponding GC that seems unlikely?

Any ideas? Thanks in advance!
--Bryan

[1]
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             35.00             72.00              1109                14
75%             1.00             50.00            149.00              1331                17
95%             1.00             72.00            924.00              4768                35
98%             2.00            103.00           1597.00              9887                72
99%             2.00            149.00           1597.00             14237               103
Min             0.00             15.00             25.00                43                 0
Max             2.00            258.00           6866.00        4139110981             20501

[2]
[
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f524890-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Parsing select * from pooltx where hash = 0x5f805c68d66e7d271361e7774a7eeec0591eb5197d4f420126cea83171f0a8ff;", "Source": "172.31.54.46", "SourceElapsed": 26 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f526fa0-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Preparing statement", "Source": "172.31.54.46", "SourceElapsed": 79 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f52bdc0-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Executing single-partition query on pooltx", "Source": "172.31.54.46", "SourceElapsed": 1014 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f52e4d0-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Acquiring sstable references", "Source": "172.31.54.46", "SourceElapsed": 1016 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f530be0-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Merging memtable tombstones", "Source": "172.31.54.46", "SourceElapsed": 1029 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f5332f0-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Bloom filter allows skipping sstable 387133", "Source": "172.31.54.46", "SourceElapsed": 1040 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f535a00-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Key cache hit for sstable 386331", "Source": "172.31.54.46", "SourceElapsed": 1046 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f538110-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Seeking to partition beginning in data file", "Source": "172.31.54.46", "SourceElapsed": 1047 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f53a820-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Key cache hit for sstable 379893", "Source": "172.31.54.46", "SourceElapsed": 1256 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f53cf30-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Seeking to partition beginning in data file", "Source": "172.31.54.46", "SourceElapsed": 1257 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f53f640-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones", "Source": "172.31.54.46", "SourceElapsed": 1357 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "4f541d50-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Merging data from memtables and 2 sstables", "Source": "172.31.54.46", "SourceElapsed": 1359 },
  { "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d", "Eventid": "78d82a90-ba2f-11e5-8729-e1d125cb9b2d", "Activity": "Read 1 live and 0 tombstone cells", "Source": "172.31.54.46", "SourceElapsed": 69666648 }
]
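
[3] In case it helps, here's a rough sketch of the table's shape. Column names are illustrative placeholders rather than our exact DDL; the key type is taken from the traced query and the collection types/sizes match the description above:

    CREATE TABLE pooltx (
        hash blob PRIMARY KEY,   -- the hash we query by (see the trace in [2])
        -- several hash-like blob columns, a few KB in total at most (names made up)
        header_hash blob,
        payload_hash blob,
        tags set<ascii>,         -- can hit ~30-40 entries
        ids_a list<int>,         -- each of the three lists tops out around 100 entries
        ids_b list<int>,
        ids_c list<int>
    );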