I'd recommend using 100K rows and 10 MB as rough guidelines for the maximum
size of a single partition. Sure, Cassandra can technically handle a lot more
than that, but very large partitions can make your life more difficult. You
will of course need a POC to validate the sweet spot for your particular app,
data model, actual data values, hardware, access patterns, and latency
requirements. Your actual numbers may turn out to be half or twice this
guidance, but it is a starting point.

Back to your starting point: you really need to characterize the number of
"records" per user. For example, will you have a large number of users with
few records each? In other words, what are the expected distributions for
user count and records per user? Give some specific numbers. Even if you
don't know what the real numbers will be, you need at least a model for the
counts before modeling the partition keys.
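
For example (purely made-up numbers, just to illustrate the kind of model I
mean): if a heavy user logs 1,000 events per day at roughly 200 bytes each,
that is about 1K rows and 200 KB per day, so a per-day bucket sits comfortably
inside the guidelines above, while a single unbucketed per-user partition
would pass 10 MB in about 50 days and 100K rows in about 100 days.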

-- Jack Krupansky

On Mon, Mar 2, 2015 at 2:47 PM, Clint Kelly <clint.ke...@gmail.com> wrote:

> Hi all,
>
> I am designing an application that will capture time series data where we
> expect the number of records per user to potentially be extremely high.  I
> am not sure whether we will hit the 2-billion-cell limit on a single
> partition, but I assume we would not want our application to approach that
> size anyway.
>
> If we wanted to put all of the interactions in a single row, then I would
> make a data model that looks like:
>
> CREATE TABLE events (
>   id text,
>   event_time timestamp,
>   event blob,
>   PRIMARY KEY (id, event_time))
> WITH CLUSTERING ORDER BY (event_time DESC);
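>
> With that schema, fetching a user's most recent events is a single-partition
> query (a sketch; the id value and LIMIT are illustrative):
>
> SELECT event_time, event
> FROM events
> WHERE id = 'user123'
> LIMIT 100;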
>
> The best practice for breaking up large rows of time series data is, as I
> understand it, to put part of the time into the partitioning key (
> http://planetcassandra.org/getting-started-with-time-series-data-modeling/
> ):
>
> CREATE TABLE events (
>   id text,
>   date text,   // Could also use year+month here, year+week, or something else
>   event_time timestamp,
>   event blob,
>   PRIMARY KEY ((id, date), event_time))
> WITH CLUSTERING ORDER BY (event_time DESC);
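>
> With this layout a read also has to name the bucket, e.g. (a sketch; the id
> and date values are illustrative):
>
> SELECT event_time, event
> FROM events
> WHERE id = 'user123' AND date = '2015-03-02'
> LIMIT 1;  -- newest event in that day's bucket, given the DESC clustering order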
>
> The downside of this approach is that we can no longer do a simple
> continuous scan to get all of the events for a given user.  Some users may
> log lots and lots of interactions every day, while others may interact with
> our application infrequently, so I'd like a quick way to get the most
> recent interaction for a given user.
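>
> Concretely, for a user who hasn't been active lately, the read side would
> have to probe bucket by bucket until it finds something (dates here are just
> illustrative):
>
> SELECT event_time, event
> FROM events
> WHERE id = 'user456' AND date = '2015-03-02'
> LIMIT 1;
> -- if empty, retry with date = '2015-03-01', then '2015-02-28', and so on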
>
> Has anyone used different approaches for this problem?
>
> The only thing I can think of is to use the second table schema described
> above, but switch to an order-preserving hashing function, and then
> manually hash the "id" field.  This is essentially what we would do in
> HBase.
>
> Curious if anyone else has any thoughts.
>
> Best regards,
> Clint
>
