Hi all, I am designing an application that will capture time series data where we expect the number of records per user to potentially be extremely high. I am not sure whether we will hit the maximum row size of 2 billion columns, but I assume we would not want our application to approach that size anyway.
If we wanted to put all of the interactions in a single row, then I would make a data model that looks like:

CREATE TABLE events (
    id text,
    event_time timestamp,
    event blob,
    PRIMARY KEY (id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

The best practice for breaking up large rows of time series data is, as I understand it, to put part of the time into the partition key ( http://planetcassandra.org/getting-started-with-time-series-data-modeling/ ):

CREATE TABLE events (
    id text,
    date text,  // Could also use year+month here, or year+week, or something else
    event_time timestamp,
    event blob,
    PRIMARY KEY ((id, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

The downside of this approach is that we can no longer do a simple continuous scan to get all of the events for a given user. Some users may log lots and lots of interactions every day, while others may interact with our application only infrequently, so I'd like a quick way to get the most recent interaction for a given user.

Has anyone used a different approach to this problem? The only thing I can think of is to use the second table schema described above, but switch to an order-preserving hashing function and then manually hash the "id" field. This is essentially what we would do in HBase.

Curious if anyone else has any thoughts.

Best regards,
Clint
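For context on the "most recent interaction" part, here is a rough sketch of what the read looks like against the second (bucketed) schema. The 'YYYY-MM-DD' date string and the user id 'u1' are just made-up values for illustration. Because the clustering order is already event_time DESC, a LIMIT 1 on the newest bucket returns the latest event in that bucket; if the bucket is empty, the application has to step back one bucket at a time and retry, which is the awkwardness I'm trying to avoid:

    -- Most recent event for user 'u1' in today's bucket (hypothetical values).
    -- If this returns no rows, the application retries with the previous
    -- day's date string, and so on, until a row is found.
    SELECT event_time, event
    FROM events
    WHERE id = 'u1' AND date = '2014-05-28'
    LIMIT 1;

Reading a full day's history is the same query without the LIMIT, but a scan across a user's entire history means issuing one such query per bucket (or an IN on the date column), rather than one continuous range scan as in the first schema.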