With extremely large time series data, it's probably quite rare to query the
whole data set. Instead, there's almost always a "between X and Y dates"
aspect to nearly every realtime query you might run against a table like this
(with the exception of "most recent N events").
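To make that concrete (this is just an illustrative sketch - the table and
column names are made up rather than taken from earlier in the thread),
something shaped like this, with the bucket in the partition key, keeps every
date-bounded read to a known set of partitions:

    CREATE TABLE user_clicks (
        user_id    text,
        bucket     timestamp,   -- event time floored to the bucket size (day, hour, ...)
        event_time timeuuid,
        payload    text,
        PRIMARY KEY ((user_id, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

    -- one bounded read per (user_id, bucket) partition that falls in [X, Y)
    SELECT payload FROM user_clicks
     WHERE user_id    = ?
       AND bucket     = ?
       AND event_time >= minTimeuuid(?)   -- X
       AND event_time <  maxTimeuuid(?);  -- Y

To satisfy a date range you compute the buckets that fall between X and Y and
issue one such query per bucket.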
Because of this, time bucketing can be an effective strategy, though until you
understand your data better it's hard to know how large (or small) to make
your buckets. Because of *that*, I recommend using a timestamp data type for
your bucket - this gives you the advantage of being able to reduce your bucket
sizes later while keeping your at-rest data mostly still quite accessible.
What I mean is that if you change your bucketing strategy from day to hour,
then when you query across that changed time period you can iterate at the
finer granularity (hour), and you'll pick up the coarser granularity (day)
buckets automatically for all but the earliest bucket (which is easy to
correct for when you're flooring your start bucket). In the coarser time
period, most reads are partition key misses, which are extremely inexpensive
in Cassandra.

If you do need most-recent-N queries over broad ranges and you expect some
users whose click rate is dramatically less frequent than your bucket interval
(making iterating over buckets inefficient), you can keep a separate counter
table with a PK of ((user_id), bucket) in which you count new events (a rough
sketch is at the bottom of this mail, below the quoted thread). Now you can
identify the exact set of buckets you need to read to satisfy the query
regardless of the user's click volume (very low volume users query at most N
partition keys, higher volume users query fewer).

On Fri, Mar 6, 2015 at 4:06 PM, graham sanderson <gra...@vast.com> wrote:

> Note that using static column(s) for the “head” value, and trailing TTLed
> values behind is something we’re considering. Note this is especially nice
> if your head state includes say a map which is updated by small deltas
> (individual keys)
>
> We have not yet studied the effect of static columns on say DTCS
>
>
> On Mar 6, 2015, at 4:42 PM, Clint Kelly <clint.ke...@gmail.com> wrote:
>
> Hi all,
>
> Thanks for the responses, this was very helpful.
>
> I don't know yet what the distribution of clicks and users will be, but I
> expect to see a few users with an enormous amount of interactions and most
> users having very few. The idea of doing some additional manual
> partitioning, and then maintaining another table that contains the "head"
> partition for each user makes sense, although it would add additional
> latency when we want to get say the most recent 1000 interactions for a
> given user (which is something that we have to do sometimes for
> applications with tight SLAs).
>
> FWIW I doubt that any users will have so many interactions that they
> exceed what we could reasonably put in a row, but I wanted to have a
> strategy to deal with this.
>
> Having a nice design pattern in Cassandra for maintaining a row with the
> N-most-recent interactions would also solve this reasonably well, but I
> don't know of any way to implement that without running batch jobs that
> periodically clean out data (which might be okay).
>
> Best regards,
> Clint
>
>
> On Tue, Mar 3, 2015 at 8:10 AM, mck <m...@apache.org> wrote:
>
>> > Here "partition" is a random digit from 0 to (N*M)
>> > where N=nodes in cluster, and M=arbitrary number.
>>
>> Hopefully it was obvious, but here (unless you've got hot partitions),
>> you don't need N.
>> ~mck
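P.S. Here's the rough counter table sketch I mentioned above - table and
column names are just placeholders, not something from earlier in the thread:

    CREATE TABLE user_click_counts (
        user_id     text,
        bucket      timestamp,   -- same bucket granularity as the events table
        click_count counter,
        PRIMARY KEY ((user_id), bucket)
    ) WITH CLUSTERING ORDER BY (bucket DESC);

    -- bump the counter whenever an event is written
    UPDATE user_click_counts
       SET click_count = click_count + 1
     WHERE user_id = ? AND bucket = ?;

    -- read the per-bucket counts newest-first, stop once they sum to N,
    -- then read only those buckets from the events table
    SELECT bucket, click_count FROM user_click_counts WHERE user_id = ?;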