Thanks for the answer. It means that if we use RandomPartitioner, it will be very difficult to find an sstable without any overlap.
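To make the concern concrete, here is a toy model in Java. It is not Cassandra's code: the class `OverlapSketch`, the method `estimateRemainingKeys`, and the use of plain `long` values as tokens are all assumptions made for illustration. It estimates the non-overlapping keys of an sstable purely from min/max token spans, assuming keys are spread evenly across the sstable's own span. With randomly distributed tokens every sstable spans almost the whole ring, so the estimate drops to 0 even when no key is actually shared.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy model, NOT Cassandra code: tokens are plain longs (an assumption). */
final class OverlapSketch {

    /**
     * Estimates how many of an sstable's keys fall outside the token spans
     * {min, max} of the overlapping sstables, assuming keys are spread
     * evenly across the sstable's own span [first, last].
     */
    static long estimateRemainingKeys(long keys, long first, long last, long[][] overlaps) {
        // Sort a shallow copy by span start so the spans can be merged in one pass.
        long[][] sorted = overlaps.clone();
        Arrays.sort(sorted, Comparator.comparingLong((long[] r) -> r[0]));

        long covered = 0;    // tokens of [first, last] covered by at least one overlap
        long cursor = first; // everything below cursor is already accounted for
        for (long[] r : sorted) {
            long lo = Math.max(r[0], cursor);
            long hi = Math.min(r[1], last);
            if (hi >= lo) {
                covered += hi - lo + 1;
                cursor = hi + 1;
            }
        }

        long span = last - first + 1;
        long overlappingKeys = Math.round(keys * ((double) covered / span));
        return keys - overlappingKeys;
    }

    public static void main(String[] args) {
        long keys = 5_000_000L;

        // Random tokens: the other sstable spans nearly the whole ring.
        long[][] nearlyFullRing = { {5L, 995L} };
        System.out.println(estimateRemainingKeys(keys, 10L, 990L, nearlyFullRing)); // 0

        // The other sstable covers only half of this sstable's span.
        long[][] halfSpan = { {0L, 499L} };
        System.out.println(estimateRemainingKeys(keys, 0L, 999L, halfSpan)); // 2500000

        // Disjoint spans: nothing is counted as overlapping.
        long[][] disjoint = { {2000L, 3000L} };
        System.out.println(estimateRemainingKeys(keys, 0L, 999L, disjoint)); // 5000000
    }
}
```

The first case is the one I am worried about: the estimate only sees the spans, not the individual keys, so a single wide sstable is enough to mark every key as overlapping.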
Let me give you an example from my test. I have ~50 sstables in total and one sstable with a droppable ratio of 0.9. I use a GUID for each key and only insert (no updates or deletes), so I don't expect a key to appear in more than one sstable. I added extra logging to AbstractCompactionStrategy to see overlaps.size(), keys, and remainingKeys: overlaps.size() is around 30, the number of keys for that sstable is around 5 M, and remainingKeys is always 0.

Are you sure that it is a good idea to estimate remainingKeys like that?

Best Regards,
Cem

On Wed, May 22, 2013 at 5:58 PM, Yuki Morishita <mor.y...@gmail.com> wrote:
> > Can method calculate non-overlapping keys as overlapping?
>
> Yes.
> And randomized keys don't matter here since sstables are sorted by
> "token" calculated from key by your partitioner, and the method uses
> sstable's min/max token to estimate overlap.
>
> On Tue, May 21, 2013 at 4:43 PM, cem <cayiro...@gmail.com> wrote:
> > Thank you very much for the swift answer.
> >
> > I have one more question about the second part. Can method calculate
> > non-overlapping keys as overlapping? I mean it uses max and min tokens and
> > column count. They can be very close to each other if random keys are used.
> >
> > In my use case I generate a GUID for each key and send a single write
> > request.
> >
> > Cem
> >
> > On Tue, May 21, 2013 at 11:13 PM, Yuki Morishita <mor.y...@gmail.com> wrote:
> >>
> >> > Why does Cassandra single table compaction skip the keys that are in
> >> > the other sstables?
> >>
> >> because we don't want to resurrect deleted columns. Say, sstable A has
> >> the column with timestamp 1, and sstable B has the same column which
> >> was deleted at timestamp 2. Then if we purge that column only from sstable
> >> B, we would see the column with timestamp 1 again.
> >>
> >> > I also don't understand why we have this line in worthDroppingTombstones
> >> > method
> >>
> >> What the method is trying to do is to "guess" how many columns are
> >> in the rows that don't overlap, without actually going through
> >> every row in the sstable. We have statistics like column count
> >> histogram, min and max row token for every sstable, and we use those in
> >> the method to estimate how many columns the two sstables overlap.
> >> You may have remainingColumnsRatio of 0 when the two sstables overlap
> >> almost entirely.
> >>
> >>
> >> On Tue, May 21, 2013 at 3:43 PM, cem <cayiro...@gmail.com> wrote:
> >> > Hi all,
> >> >
> >> > I have a question about ticket
> >> > https://issues.apache.org/jira/browse/CASSANDRA-3442
> >> >
> >> > Why does Cassandra single table compaction skip the keys that are in
> >> > the other sstables? Please correct me if I am wrong.
> >> >
> >> > I also don't understand why we have this line in worthDroppingTombstones
> >> > method:
> >> >
> >> > double remainingColumnsRatio = ((double) columns) /
> >> > (sstable.getEstimatedColumnCount().count() *
> >> > sstable.getEstimatedColumnCount().mean());
> >> >
> >> > remainingColumnsRatio is always 0 in my case and the droppableRatio is
> >> > 0.9. Cassandra skips all sstables which are already expired.
> >> >
> >> > This line was introduced by
> >> > https://issues.apache.org/jira/browse/CASSANDRA-4022.
> >> >
> >> > Best Regards,
> >> > Cem
> >>
> >>
> >>
> >> --
> >> Yuki Morishita
> >> t:yukim (http://twitter.com/yukim)
> >
> >
> >
>
> --
> Yuki Morishita
> t:yukim (http://twitter.com/yukim)
>