Thanks for the answer.

It means that if we use RandomPartitioner it will be very difficult to find
an sstable without any overlap.
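
As a rough illustration of why I expect this (a toy simulation, not Cassandra
code: the class name TokenSpanSketch, the token space of 1,000,000 and the
10,000 keys per sstable are made-up values), uniformly random tokens make the
min..max span of every sstable cover almost the whole token space:

```java
import java.util.Random;

final class TokenSpanSketch {
    /**
     * Draws `keys` uniformly random tokens in [0, tokenSpace) and returns the
     * fraction of the token space covered by the resulting min..max interval.
     */
    static double spanFraction(int keys, long tokenSpace, Random rng) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < keys; i++) {
            // nextDouble() is in [0, 1), so the token is in [0, tokenSpace - 1]
            long token = (long) (rng.nextDouble() * tokenSpace);
            min = Math.min(min, token);
            max = Math.max(max, token);
        }
        return (double) (max - min + 1) / tokenSpace;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Each iteration stands in for one sstable written with random keys.
        for (int sstable = 0; sstable < 5; sstable++) {
            System.out.printf("sstable %d spans %.4f of the token space%n",
                    sstable, spanFraction(10_000, 1_000_000L, rng));
        }
    }
}
```

If every sstable spans nearly the full range, any two of them overlap by
min/max token even when they share no key.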

Let me give you an example from my test.

I have ~50 sstables in total and an sstable with droppable ratio 0.9. I use
a GUID for each key and only insert (no updates or deletes), so I don't expect
a key to appear in more than one sstable.

I added extra logging to AbstractCompactionStrategy to see
overlaps.size(), keys and remainingKeys:

overlaps.size() is around 30, number of keys for that sstable is around 5 M
and remainingKeys is always 0.
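
To make the concern concrete, here is a minimal sketch of how I understand the
estimate. It is a simplification and not the actual Cassandra code: the class
OverlapSketch, the TokenRange type and the tiny token space 0..999 are
hypothetical, and keys are assumed to be spread uniformly over the sstable's
token span.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OverlapSketch {
    /** Closed token interval [min, max] covered by one sstable (hypothetical type). */
    static final class TokenRange {
        final long min;
        final long max;
        TokenRange(long min, long max) { this.min = min; this.max = max; }
    }

    /**
     * Estimates how many of `keys` lie outside every overlapping range, using
     * only min/max tokens and assuming keys are uniform over target's span.
     */
    static long estimateRemainingKeys(TokenRange target, long keys, List<TokenRange> overlaps) {
        // Clip each overlapping range to the target's own token span.
        List<TokenRange> clipped = new ArrayList<>();
        for (TokenRange r : overlaps) {
            long lo = Math.max(r.min, target.min);
            long hi = Math.min(r.max, target.max);
            if (lo <= hi) clipped.add(new TokenRange(lo, hi));
        }
        clipped.sort(Comparator.comparingLong(r -> r.min));

        // Count tokens in the union of the clipped ranges without double counting.
        long covered = 0;
        long cursor = target.min; // first token not yet counted
        for (TokenRange r : clipped) {
            long start = Math.max(r.min, cursor);
            if (start <= r.max) {
                covered += r.max - start + 1;
                cursor = r.max + 1;
            }
        }

        long span = target.max - target.min + 1;
        double coveredFraction = (double) covered / span;
        return Math.round(keys * (1.0 - coveredFraction));
    }

    public static void main(String[] args) {
        // Three sstables with disjoint random keys but near-identical token spans.
        TokenRange target = new TokenRange(3, 997);
        List<TokenRange> overlaps = Arrays.asList(new TokenRange(1, 999), new TokenRange(0, 998));
        System.out.println(estimateRemainingKeys(target, 5_000_000L, overlaps));
    }
}
```

With these ranges the sketch prints 0: the other sstables cover the whole
token span of the target, so no keys are estimated to remain, whether or not
any key is actually shared.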

Are you sure that it is a good idea to estimate remainingKeys like that?

Best Regards,
Cem



On Wed, May 22, 2013 at 5:58 PM, Yuki Morishita <mor.y...@gmail.com> wrote:

> > Can the method calculate non-overlapping keys as overlapping?
>
> Yes.
> And randomized keys don't matter here since sstables are sorted by
> "token" calculated from key by your partitioner, and the method uses
> sstable's min/max token to estimate overlap.
>
> On Tue, May 21, 2013 at 4:43 PM, cem <cayiro...@gmail.com> wrote:
> > Thank you very much for the swift answer.
> >
> > I have one more question about the second part. Can the method calculate
> > non-overlapping keys as overlapping? I mean it uses max and min tokens and
> > column count. They can be very close to each other if random keys are used.
> >
> > In my use case I generate a GUID for each key and send a single write
> > request.
> >
> > Cem
> >
> > On Tue, May 21, 2013 at 11:13 PM, Yuki Morishita <mor.y...@gmail.com> wrote:
> >>
> >> > Why does Cassandra single table compaction skip the keys that are in
> >> > the other sstables?
> >>
> >> Because we don't want to resurrect deleted columns. Say sstable A has
> >> the column with timestamp 1, and sstable B has the same column, which
> >> was deleted at timestamp 2. Then if we purge that column only from
> >> sstable B, we would see the column with timestamp 1 again.
> >>
> >> > I also don't understand why we have this line in the
> >> > worthDroppingTombstones method
> >>
> >> What the method is trying to do is to "guess" how many columns are in
> >> the rows that don't overlap, without actually going through every row
> >> in the sstable. We have statistics like the column count histogram and
> >> the min and max row token for every sstable, and we use those in the
> >> method to estimate how many columns fall in the range where the two
> >> sstables overlap. You may have a remainingColumnsRatio of 0 when the
> >> two sstables overlap almost entirely.
> >>
> >>
> >> On Tue, May 21, 2013 at 3:43 PM, cem <cayiro...@gmail.com> wrote:
> >> > Hi all,
> >> >
> >> > I have a question about ticket
> >> > https://issues.apache.org/jira/browse/CASSANDRA-3442
> >> >
> >> > Why does Cassandra single table compaction skip the keys that are in
> >> > the other sstables? Please correct me if I am wrong.
> >> >
> >> > I also don't understand why we have this line in the
> >> > worthDroppingTombstones method:
> >> >
> >> > double remainingColumnsRatio = ((double) columns) /
> >> > (sstable.getEstimatedColumnCount().count() *
> >> > sstable.getEstimatedColumnCount().mean());
> >> >
> >> > remainingColumnsRatio is always 0 in my case and the droppableRatio is
> >> > 0.9. Cassandra skips all sstables which are already expired.
> >> >
> >> > This line was introduced by
> >> > https://issues.apache.org/jira/browse/CASSANDRA-4022.
> >> >
> >> > Best Regards,
> >> > Cem
> >>
> >>
> >>
> >> --
> >> Yuki Morishita
> >>  t:yukim (http://twitter.com/yukim)
> >
> >
>
>
>
> --
> Yuki Morishita
>  t:yukim (http://twitter.com/yukim)
>
