We use PlayOrm to do 60,000 different streams which are all time series and use 
the virtual column families of PlayOrm so they are all in one column family.  
We then partition by time as well.  I don't believe that we really have any 
hotspots from what I can tell.

Dean

From: Keith Wright <kwri...@nanigans.com<mailto:kwri...@nanigans.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Tuesday, May 7, 2013 2:02 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
<user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: CQL3 Data Model Question

Hi all,

    I was hoping you could provide some assistance with a data modeling 
question (my apologies if a similar question has already been posed).  I have 
time based data that I need to store on a per customer (aka app id ) basis so 
that I can easily return it in sorted order by event time.  The data in 
question is being written at high volume (~50K / sec) and I am concerned about 
the cardinality of using either app id or event time as the row key as either 
will likely result in hot spots.  Here are is the table definition I am 
considering:

create table organic_events (
event_id UUID,
app_id INT,
event_time TIMESTAMP,
user_id INT,
….
PRIMARY KEY (app_id, event_time, event_id)
)  WITH CLUSTERING ORDER BY (app_id asc,event_time desc);

So that I can be able to query as follows which will naturally sort the results 
by time descending:

select * from organic_events where app_id = 1234 and event_time <= '2012-01-01' 
and event_time > '2012-01-01';

Anyone have an idea of the best way to accomplish this?  I was considering the 
following:

 *   Making the row key a concatenation of app id and 0-100 using a mod on 
event id to get the value.  When getting data I would just fetch all keys given 
the mods (app_id in (1234_0,1234_1,1234_2, etc).  This would alleviate the 
"hot" key issue but still seems expensive and a little hacky
 *   I tried removing app_id from the primary key all together (using primary 
key of user_id, event_time, event_id) and making app_id a secondary index.  I 
would need to sort by time on the client.  The above query is valid however 
running a query is VERY slow as I believe it needs to fetch every row key that 
matches the index which is quite expensive (I get a timeout in cqlsh).
 *   Create a different column family for each app id (I.e. 
1234_organic_events).  Note that we could easily have 1000s of application ids.

Thanks!

Reply via email to