It's probably quite rare for queries against extremely large time series
data to touch the whole set of data.  Instead there's almost always a
"between X and Y dates" aspect to nearly every real time query you might
run against a table like this (with the exception of "most recent N
events").

Because of this, time bucketing can be an effective strategy, though until
you understand your data better, it's hard to know how large (or small) to
make your buckets.  Because of *that*, I recommend using a timestamp data
type for your bucket column - this gives you the advantage of being able
to shrink your bucket size later while keeping your existing at-rest data
mostly still quite accessible.
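
As a rough sketch (table and column names here are just illustrative, not
from your schema), such a bucketed table might look something like:

CREATE TABLE clicks (
    user_id    text,
    bucket     timestamp,   -- event time floored to the bucket size (day, hour, ...)
    event_time timeuuid,
    url        text,
    PRIMARY KEY ((user_id, bucket), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);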

What I mean is that if you change your bucketing strategy from day to
hour, then when you query across that changed time period you can iterate
over the finer granularity (hour) buckets, and you'll pick up the coarser
granularity (day) buckets automatically for all but the earliest bucket
(which is easy to correct for when you're flooring your start bucket).  In
the coarser time period, most reads are partition key misses, which are
extremely inexpensive in Cassandra.
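
For example (a sketch assuming the illustrative table above, a switch from
day to hour buckets, and a query range starting mid-day inside the old
day-bucketed period):

-- earliest bucket: floor all the way down to the old (day) granularity,
-- then restrict by the clustering column to the real start of the range
SELECT * FROM clicks
 WHERE user_id = 'u1'
   AND bucket = '2015-03-30 00:00:00+0000'
   AND event_time >= minTimeuuid('2015-03-30 10:30:00+0000');

-- then iterate hour buckets; within the old day-bucketed period the
-- midnight bucket of each day hits the old day-sized partition and the
-- other 23 hours are cheap partition key misses
SELECT * FROM clicks WHERE user_id = 'u1' AND bucket = '2015-03-31 00:00:00+0000';
SELECT * FROM clicks WHERE user_id = 'u1' AND bucket = '2015-03-31 01:00:00+0000';
-- ... and so on into the new hour-bucketed data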

If you do need most-recent-N queries for broad ranges and you expect to
have some users whose click rate is dramatically less frequent than your
bucket interval (making iterating over buckets inefficient), you can keep a
separate counter table with PK of ((user_id), bucket) in which you count
new events.  Now you can identify the exact set of buckets you need to read
to satisfy the query no matter what the user's click volume is (so very low
volume users have at most N partition keys queried, and higher volume users
query fewer partition keys).
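
A sketch of that counter table (again, names are just illustrative):

CREATE TABLE clicks_per_bucket (
    user_id text,
    bucket  timestamp,
    clicks  counter,
    PRIMARY KEY ((user_id), bucket)
) WITH CLUSTERING ORDER BY (bucket DESC);

-- bump once per new event
UPDATE clicks_per_bucket SET clicks = clicks + 1
 WHERE user_id = 'u1' AND bucket = '2015-03-06 00:00:00+0000';

-- one cheap partition read tells you which buckets you must hit
-- (walk the counts newest-first until they sum to N)
SELECT bucket, clicks FROM clicks_per_bucket WHERE user_id = 'u1';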

On Fri, Mar 6, 2015 at 4:06 PM, graham sanderson <gra...@vast.com> wrote:

> Note that using static column(s) for the “head” value, and trailing TTLed
> values behind is something we’re considering. Note this is especially nice
> if your head state includes say a map which is updated by small deltas
> (individual keys)
>
> We have not yet studied the effect of static columns on say DTCS
>
>
> On Mar 6, 2015, at 4:42 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
> Hi all,
>
> Thanks for the responses, this was very helpful.
>
> I don't know yet what the distribution of clicks and users will be, but I
> expect to see a few users with an enormous amount of interactions and most
> users having very few.  The idea of doing some additional manual
> partitioning, and then maintaining another table that contains the "head"
> partition for each user makes sense, although it would add additional
> latency when we want to get say the most recent 1000 interactions for a
> given user (which is something that we have to do sometimes for
> applications with tight SLAs).
>
> FWIW I doubt that any users will have so many interactions that they
> exceed what we could reasonably put in a row, but I wanted to have a
> strategy to deal with this.
>
> Having a nice design pattern in Cassandra for maintaining a row with the
> N-most-recent interactions would also solve this reasonably well, but I
> don't know of any way to implement that without running batch jobs that
> periodically clean out data (which might be okay).
>
> Best regards,
> Clint
>
>
>
>
> On Tue, Mar 3, 2015 at 8:10 AM, mck <m...@apache.org> wrote:
>
>>
>> > Here "partition" is a random digit from 0 to (N*M)
>> > where N=nodes in cluster, and M=arbitrary number.
>>
>>
>> Hopefully it was obvious, but here (unless you've got hot partitions),
>> you don't need N.
>> ~mck
>>
>
>
>
