CIL

From: Jeff Jirsa [mailto:[email protected]]
Sent: Thursday, March 17, 2016 11:01 AM
To: [email protected]
Subject: Re: DTCS bucketing Question

> I am trying to concretely understand how DTCS makes buckets and I am looking
> at the DateTieredCompactionStrategyTest.testGetBuckets method and played with
> some of the parameters to GetBuckets method call (Cassandra 2.1.12). I don’t
> think I fully understand something there.

Don’t feel bad, you’re not alone.

> In this case, the buckets should look like [0-4000] [4000-]. Is this
> correct ? The buckets that I get back are different (“a” lives in its bucket
> and everyone else in another). What I am missing here ?

The latest/newest window never gets combined, it’s ALWAYS the base size. Only subsequent windows get merged. First window will always be 0-1000.

https://spotifylabscom.files.wordpress.com/2014/12/dtcs3.png

[Anubhav Kale] This doesn’t seem correct. In the original test (look at comments), the first window is pretty big and in many cases, the first window is big.

> Note, that if I keep the base to original (100L) or increase it and play with
> min_threshold the results are exactly what I would expect.

Because the original base is lower than the lowest timestamp, which means you’re never looking in the first window (0-base).

> I am afraid that the math in Target class is somewhat hard to follow so I am
> thinking about it this way.

The Target class is too clever for its own good. I couldn’t follow it. You’re having trouble following it. Other smart people I’ve talked to couldn’t follow it. Last June I proposed an alternative (CASSANDRA-9666 / https://github.com/jeffjirsa/twcs ). It was never taken upstream, but it does get a fair bit of use by people with large time series clusters (we use it on one of our petabyte-scale clusters here). Significantly easier to reason about.

- Jeff

From: Anubhav Kale
Reply-To: "[email protected]"
Date: Thursday, March 17, 2016 at 10:24 AM
To: "[email protected]"
Subject: DTCS bucketing Question

<Not sure if this is the right alias or Dev, so asking in both places>

Hello,

I am trying to concretely understand how DTCS makes buckets and I am looking at the DateTieredCompactionStrategyTest.testGetBuckets method and played with some of the parameters to GetBuckets method call (Cassandra 2.1.12). I don’t think I fully understand something there. Let me try to explain.

Consider the second test there. I changed the pairs a bit for easier explanation and changed base (initial window size)=1000L and Min_Threshold=2

    pairs = Lists.newArrayList(
        Pair.create("a", 200L),
        Pair.create("b", 2000L),
        Pair.create("c", 3600L),
        Pair.create("d", 3899L),
        Pair.create("e", 3900L),
        Pair.create("f", 3950L),
        Pair.create("too new", 4125L)
    );
    buckets = getBuckets(pairs, 1000L, 2, 4050L, Long.MAX_VALUE);

In this case, the buckets should look like [0-4000] [4000-]. Is this correct ? The buckets that I get back are different (“a” lives in its bucket and everyone else in another). What I am missing here ?

Another case,

    pairs = Lists.newArrayList(
        Pair.create("a", 200L),
        Pair.create("b", 2000L),
        Pair.create("c", 3600L),
        Pair.create("d", 3899L),
        Pair.create("e", 3900L),
        Pair.create("f", 3950L),
        Pair.create("too new", 4125L)
    );
    buckets = getBuckets(pairs, 50L, 4, 4050L, Long.MAX_VALUE);

Here, the buckets should be [0-3200] [3200-4000] [4000-4050] [4050-]. Is this correct ? Again, the buckets that come back are quite different.

Note, that if I keep the base to original (100L) or increase it and play with min_threshold the results are exactly what I would expect.

The way I think about DTCS is, try to make buckets of maximum possible sizes from 0, and once you can’t do that, make smaller buckets (similar to what the comment suggests). Is this mental model wrong ?

I am afraid that the math in Target class is somewhat hard to follow so I am thinking about it this way.

Thanks a lot in advance.

-Anubhav
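
For readers who want to trace the two getBuckets calls in this thread by hand, here is a minimal, self-contained sketch of the walk the Target class appears to perform. It is a reconstruction written for illustration, not a copy of the Cassandra 2.1.12 source: the class name DtcsBucketSketch, the buckets method, and the use of a Map in place of Guava Pair lists are all assumptions made for this example. The model is: start with a window of size timeUnit that contains "now", walk the files from newest to oldest, and whenever a file is older than the current window, step the window back, growing it by a factor of base once its position is aligned to a multiple of base.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; names and structure are assumptions, not Cassandra's API.
final class DtcsBucketSketch {

    /** A candidate window covering [divPosition * size, (divPosition + 1) * size). */
    static final class Target {
        final long size;
        final long divPosition;
        final long maxWindowSize;

        Target(long size, long divPosition, long maxWindowSize) {
            this.size = size;
            this.divPosition = divPosition;
            this.maxWindowSize = maxWindowSize;
        }

        /** Negative: timestamp is newer than the window; positive: older; zero: inside it. */
        int compareToTimestamp(long timestamp) {
            return Long.compare(divPosition, timestamp / size);
        }

        /** Step back one window; grow by a factor of base only when aligned to base. */
        Target next(int base) {
            if (divPosition % base > 0 || size * base > maxWindowSize) {
                return new Target(size, divPosition - 1, maxWindowSize);
            }
            return new Target(size * base, divPosition / base - 1, maxWindowSize);
        }
    }

    static List<List<String>> buckets(Map<String, Long> files, long timeUnit, int base,
                                      long now, long maxWindowSize) {
        // Sort newest first.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(files.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        List<List<String>> result = new ArrayList<>();
        Target target = new Target(timeUnit, now / timeUnit, maxWindowSize);
        int i = 0;
        while (i < sorted.size()) {
            int cmp = target.compareToTimestamp(sorted.get(i).getValue());
            if (cmp < 0) {            // file is newer than the current window: skip it
                i++;
            } else if (cmp > 0) {     // file is older: move to the next (older) window
                target = target.next(base);
            } else {                  // collect every file that falls in this window
                List<String> bucket = new ArrayList<>();
                while (i < sorted.size()
                        && target.compareToTimestamp(sorted.get(i).getValue()) == 0) {
                    bucket.add(sorted.get(i++).getKey());
                }
                result.add(bucket);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a", 200L);
        files.put("b", 2000L);
        files.put("c", 3600L);
        files.put("d", 3899L);
        files.put("e", 3900L);
        files.put("f", 3950L);
        files.put("too new", 4125L);

        // prints [[too new], [f, e, d, c, b], [a]]
        System.out.println(buckets(files, 1000L, 2, 4050L, Long.MAX_VALUE));
        // prints [[f, e, d], [c], [b], [a]]
        System.out.println(buckets(files, 50L, 4, 4050L, Long.MAX_VALUE));
    }
}
```

Under this model, the first call walks the windows [4000-5000), [2000-4000), [0-2000), which is consistent with the observation above that "a" ends up alone while b through f share a bucket. The windows are anchored at "now" and grow going backwards in time, rather than being built as large as possible starting from 0. Whether the real implementation behaves identically in every edge case should be checked against the Cassandra source.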
