We use PlayOrm for 60,000 different streams, which are all time series, and use PlayOrm's virtual column families so they all live in one column family. We then partition by time as well. From what I can tell, I don't believe we really have any hotspots.

Dean
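(For comparison, here is a rough CQL3 sketch of the same pattern: one physical table holding many logical streams, partitioned by stream plus a coarse time bucket. The table name, payload column, and one-day bucket granularity are assumptions for illustration only; PlayOrm manages this mapping itself.)

CREATE TABLE stream_events (
    stream_id   INT,        -- which of the many logical streams
    time_bucket TIMESTAMP,  -- assumed: event_time truncated to the day, set by the writer
    event_time  TIMESTAMP,
    event_id    UUID,
    value       DOUBLE,     -- placeholder payload column
    PRIMARY KEY ((stream_id, time_bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);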
From: Keith Wright <kwri...@nanigans.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 7, 2013 2:02 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: CQL3 Data Model Question

Hi all,

I was hoping you could provide some assistance with a data modeling question (my apologies if a similar question has already been posed). I have time-based data that I need to store on a per-customer (aka app id) basis so that I can easily return it in sorted order by event time. The data in question is being written at high volume (~50K / sec), and I am concerned about the cardinality of using either app id or event time as the row key, as either will likely result in hot spots. Here is the table definition I am considering:

create table organic_events (
    event_id UUID,
    app_id INT,
    event_time TIMESTAMP,
    user_id INT,
    ….
    PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);

This lets me query as follows, which naturally sorts the results by time descending:

select * from organic_events where app_id = 1234 and event_time <= '2012-01-01' and event_time > '2012-01-01';

Anyone have an idea of the best way to accomplish this? I was considering the following:

* Making the row key a concatenation of app id and 0-100, using a mod on event id to get the value (sketched below). When getting data I would just fetch all keys given the mods (app_id in (1234_0, 1234_1, 1234_2, etc.)). This would alleviate the "hot" key issue but still seems expensive and a little hacky.
* I tried removing app_id from the primary key altogether (using a primary key of user_id, event_time, event_id) and making app_id a secondary index. I would need to sort by time on the client. The above query is valid, however running it is VERY slow, as I believe it needs to fetch every row key that matches the index, which is quite expensive (I get a timeout in cqlsh).
* Creating a different column family for each app id (e.g. 1234_organic_events). Note that we could easily have 1000s of application ids.

Thanks!
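(For what it's worth, a minimal sketch of the first option above, using a composite partition key instead of string concatenation and an assumed bucket count of 8 for brevity; Keith's proposal uses roughly 100. The bucket value would be computed client-side, e.g. from a hash of the event id, and the dates in the read query are only illustrative.)

create table organic_events (
    app_id INT,
    bucket INT,             -- assumed: client sets this to hash(event_id) % 8 on write
    event_time TIMESTAMP,
    event_id UUID,
    user_id INT,
    PRIMARY KEY ((app_id, bucket), event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);

-- read side: hit every bucket for the app plus the time range
select * from organic_events
 where app_id = 1234
   and bucket in (0, 1, 2, 3, 4, 5, 6, 7)
   and event_time > '2012-01-01' and event_time <= '2012-02-01';

Each (app_id, bucket) partition comes back already sorted by event_time descending, so the client merges 8 sorted streams rather than sorting from scratch; the trade-off is that writes for a hot app are spread over 8 partitions instead of one.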