Hi all,

    I was hoping you could provide some assistance with a data modeling 
question (my apologies if a similar question has already been posed).  I have 
time-based data that I need to store on a per-customer (aka app id) basis so 
that I can easily return it in sorted order by event time.  The data in 
question is being written at high volume (~50K events/sec), and I am concerned 
about the cardinality of using either app id or event time as the row key, as 
either will likely result in hot spots.  Here is the table definition I am 
considering:

CREATE TABLE organic_events (
    event_id UUID,
    app_id INT,
    event_time TIMESTAMP,
    user_id INT,
    ...
    PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);

This way I can query as follows, and the results will naturally come back 
sorted by time descending:

SELECT * FROM organic_events WHERE app_id = 1234 
AND event_time > '2012-01-01' AND event_time <= '2012-02-01';

Does anyone have an idea of the best way to accomplish this?  I was considering 
the following:

 *   Make the row key a concatenation of app id and a bucket number (e.g. 
0-99, computed by taking event id mod 100).  When reading, I would just fetch 
all buckets at once (app_id IN ('1234_0', '1234_1', '1234_2', etc.)).  This 
would alleviate the "hot" key issue but still seems expensive and a little 
hacky (see the sketch after this list).
 *   I tried removing app_id from the primary key altogether (using a primary 
key of (user_id, event_time, event_id)) and making app_id a secondary index.  I 
would need to sort by time on the client.  The above query is valid, but 
running it is VERY slow; I believe it has to fetch every row key that matches 
the index, which is quite expensive (I get a timeout in cqlsh).
 *   Create a different column family for each app id (e.g. 
1234_organic_events).  Note that we could easily have thousands of application 
ids.
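
For reference, here is a minimal CQL sketch of the bucketing idea from the 
first bullet.  The table name, the TEXT app_bucket key, and the 100-way split 
are illustrative assumptions on my part; the bucket suffix would be computed 
client-side (e.g. event id hashed mod 100):

-- Hypothetical bucketed variant of the table above.
CREATE TABLE organic_events_sharded (
    app_bucket TEXT,      -- e.g. '1234_0' .. '1234_99', computed client-side
    event_time TIMESTAMP,
    event_id UUID,
    user_id INT,
    PRIMARY KEY (app_bucket, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC, event_id ASC);

-- Read path: fan out across all buckets for one app id.
SELECT * FROM organic_events_sharded
WHERE app_bucket IN ('1234_0', '1234_1', '1234_2')  -- ... through '1234_99'
  AND event_time > '2012-01-01' AND event_time <= '2012-02-01';

Note that rows within each bucket come back time-descending, but the IN query 
does not merge ordering across partitions, so the client would still have to 
merge the per-bucket streams into one time-ordered result.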

Thanks!
