Very large partitions create a lot of garbage during reads 
(see https://issues.apache.org/jira/browse/CASSANDRA-9754); with a partition 
that large, you will see significant GC pauses when reading from it. 
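
To confirm how big the partition actually is on a given replica, cfstats 
reports the largest compacted partition per table; a quick check looks 
something like this (keyspace/table names are placeholders):

    nodetool cfstats <keyspace>.<table> | grep 'Compacted partition'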

I suspect GC, though it’s odd that you’re unable to see it. 
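
Long pauses normally show up in system.log via GCInspector; if nothing is 
there, you can watch the heap directly, e.g.

    grep GCInspector /var/log/cassandra/system.log | tail

    # or live, sampling every second (pid is your Cassandra process):
    jstat -gcutil <cassandra-pid> 1000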



From:  Bryan Cheng
Reply-To:  "user@cassandra.apache.org"
Date:  Wednesday, January 13, 2016 at 12:40 PM
To:  "user@cassandra.apache.org"
Subject:  Help debugging a very slow query

Hi list, 

Would appreciate some insight into some irregular performance we're seeing.

We have a column family that has become problematic recently. We've noticed a 
few queries take enormous amounts of time and seem to clog up read resources 
on the machine (read pending tasks pile up, then drain all at once). 
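
(For anyone reproducing this: the backlog is visible in the thread pool 
stats while the slow query is in flight, e.g. something like

    # watch ReadStage active/pending once per second
    while true; do nodetool tpstats | grep -E 'Pool Name|ReadStage'; sleep 1; done
)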

I've included the output of cfhistograms on this keyspace[1]. The latencies 
sampled do not include one of these problematic partitions, but show two 
things: 1) the vast majority of queries to this table seem to be healthy, and 
2) the maximum partition size is absurd (4139110981 bytes, roughly 4 GB).
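
(For reference, that output came from nodetool; keyspace and table names are 
elided here:

    nodetool cfhistograms <keyspace> <table>
)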

This particular cf is not expected to be updated beyond an initial set of 
writes, but can be read many times. The data model includes several hashes that 
amount to a few KB at most, a set<ascii> that can hit ~30-40 entries, and three 
list<int> that reach a hundred or so entries at most. There doesn't appear to 
be any material difference in the size or character of the data saved between 
"good" and "bad" partitions. Often, a partition that is extremely slow on one 
replica returns cleanly and very quickly when queried at consistency ONE 
against the other replicas.
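
To make the shape concrete, the table looks roughly like the sketch below; 
the non-key column names are illustrative rather than our real ones, but the 
types match the description above, and the partition key is the hash queried 
in the trace:

    CREATE TABLE pooltx (
        hash       blob PRIMARY KEY,  -- partition key, queried by 0x... literal
        meta_hash  blob,              -- several hashes, a few KB at most in total
        tags       set<ascii>,        -- can hit ~30-40 entries
        list_a     list<int>,         -- three list<int> columns,
        list_b     list<int>,         -- each a hundred or so entries at most
        list_c     list<int>
    );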

I've included a trace of one of these slow returns[2], which I find very 
strange: The vast majority of operations are very quick, but the final step is 
extremely slow. Nothing exceeds 2 ms until the final "Read 1 live and 0 
tombstone cells" step, which takes a whopping 69 seconds [!!]. We've checked our 
garbage collection in this time period and have not noticed any significant 
collections.

As far as I can tell, the trace doesn't raise any red flags, and we're largely 
stumped.
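
(In case it helps anyone reproduce: the trace was captured the usual way, 
roughly

    cqlsh> TRACING ON;
    cqlsh> SELECT * FROM pooltx WHERE hash = 0x5f805c68d66e7d271361e7774a7eeec0591eb5197d4f420126cea83171f0a8ff;

and the events read back out of the trace tables using the session id cqlsh 
reports:

    SELECT event_id, activity, source, source_elapsed
    FROM system_traces.events
    WHERE session_id = 4f51fa70-ba2f-11e5-8729-e1d125cb9b2d;
)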

We've got two main questions:

1) What's up with the megapartition? What's the best way to debug this? Our 
data model is largely write-once; we don't do any updates. We do DELETE, but 
the partitions that are giving us issues haven't been removed. We had some 
suspicions on https://issues.apache.org/jira/browse/CASSANDRA-10547, but that 
seems to largely be triggered by UPDATE operations.
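
One thing we may try, assuming we can get onto 2.2+ (where compaction warns 
about oversized partitions): lower the warning threshold and grep the logs 
for the offending key, e.g.

    # cassandra.yaml (2.2+): warn when compaction writes a partition bigger than this
    compaction_large_partition_warning_threshold_mb: 100

    grep -i 'large partition' /var/log/cassandra/system.log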

2) What could cause the read to take such an absurd amount of time when only a 
pair of sstables and the memtable are being examined, and it's just a single 
cell being read? We originally suspected memory pressure from the huge 
sstables, but without a corresponding GC pause that seems unlikely?
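
If it isn't GC, the next thing we plan to rule out is disk, e.g. by watching 
I/O on the data volume while reproducing the slow read:

    iostat -x 1    # extended per-device stats, once per second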

Any ideas?

Thanks in advance!

--Bryan


[1] 
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             35.00             72.00              1109                14
75%             1.00             50.00            149.00              1331                17
95%             1.00             72.00            924.00              4768                35
98%             2.00            103.00           1597.00              9887                72
99%             2.00            149.00           1597.00             14237               103
Min             0.00             15.00             25.00                43                 0
Max             2.00            258.00           6866.00        4139110981             20501

[2]
[
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f524890-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Parsing select * from pooltx where hash = 0x5f805c68d66e7d271361e7774a7eeec0591eb5197d4f420126cea83171f0a8ff;",
    "Source": "172.31.54.46",
    "SourceElapsed": 26
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f526fa0-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Preparing statement",
    "Source": "172.31.54.46",
    "SourceElapsed": 79
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f52bdc0-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Executing single-partition query on pooltx",
    "Source": "172.31.54.46",
    "SourceElapsed": 1014
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f52e4d0-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Acquiring sstable references",
    "Source": "172.31.54.46",
    "SourceElapsed": 1016
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f530be0-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Merging memtable tombstones",
    "Source": "172.31.54.46",
    "SourceElapsed": 1029
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f5332f0-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Bloom filter allows skipping sstable 387133",
    "Source": "172.31.54.46",
    "SourceElapsed": 1040
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f535a00-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Key cache hit for sstable 386331",
    "Source": "172.31.54.46",
    "SourceElapsed": 1046
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f538110-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Seeking to partition beginning in data file",
    "Source": "172.31.54.46",
    "SourceElapsed": 1047
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f53a820-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Key cache hit for sstable 379893",
    "Source": "172.31.54.46",
    "SourceElapsed": 1256
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f53cf30-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Seeking to partition beginning in data file",
    "Source": "172.31.54.46",
    "SourceElapsed": 1257
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f53f640-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Skipped 0/4 non-slice-intersecting sstables, included 0 due to tombstones",
    "Source": "172.31.54.46",
    "SourceElapsed": 1357
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "4f541d50-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Merging data from memtables and 2 sstables",
    "Source": "172.31.54.46",
    "SourceElapsed": 1359
  },
  {
    "Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
    "Eventid": "78d82a90-ba2f-11e5-8729-e1d125cb9b2d",
    "Activity": "Read 1 live and 0 tombstone cells",
    "Source": "172.31.54.46",
    "SourceElapsed": 69666648
  }
]
