[ https://issues.apache.org/jira/browse/CASSANDRA-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884260#comment-13884260 ]
sankalp kohli commented on CASSANDRA-6474: ------------------------------------------ I like the first approach since it is simpler in which we will have HLL using 1K per sstable. Regarding you second approach, what do you mean by "For LCS it would only be useful for prioritization"? So you will calculate the stable set based on description in CASSANDRA-6216 but I am not sure how you will use HLL for prioritization. > Compaction strategy based on MinHash > ------------------------------------ > > Key: CASSANDRA-6474 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6474 > Project: Cassandra > Issue Type: New Feature > Components: Core > Reporter: Yuki Morishita > Assignee: sankalp kohli > Labels: compaction > Fix For: 3.0 > > > We can consider an SSTable as a set of partition keys, and 'compaction' as > de-duplication of those partition keys. > We want to find compaction candidates from SSTables that have as many same > keys as possible. If we can group similar SSTables based on some measurement, > we can achieve more efficient compaction. > One such measurement is [Jaccard > Distance|http://en.wikipedia.org/wiki/Jaccard_index], > !http://upload.wikimedia.org/math/1/8/6/186c7f4e83da32e889d606140fae25a0.png! > which we can estimate using technique called > [MinHash|http://en.wikipedia.org/wiki/MinHash]. > In Cassandra, we can calculate and store MinHash signature when writing > SSTable. New compaction strategy uses the signature to find the group of > similar SSTable for compaction candidates. We can always fall back to STCS > when such candidates are not exists. > This is just an idea floating around my head, but before I forget, I dump it > here. For introduction to this technique, [Chapter 3 of 'Mining of Massive > Datasets'|http://infolab.stanford.edu/~ullman/mmds/ch3.pdf] is a good start. -- This message was sent by Atlassian JIRA (v6.1.5#6160)