Okay, so bucketing by day/week/month is a capacity planning stuff and actual questions I want to ask. As as a conclusion: I have a table events
CREATE TABLE user_plans ( id timeuuid, user_id timeuuid, event_ts timestamp, event_type int, some_other_attr text PRIMARY KEY (user_id, ends) ); which fits tactic queries: select smth from user_plans where user_id='xxx' and end_ts > now() Then I create second table user_plans_daily (or weekly, monthy) with DDL: CREATE TABLE user_plans_daily/weekly/monthly ( ymd int, user_id timeuuid, event_ts timestamp, event_type int, some_other_attr text ) PRIMARY KEY ((ymd, user_id), event_ts ) WITH CLUSTERING ORDER BY (event_ts DESC); And this table is good for answering strategic questions: select * from user_plans_daily/weekly/monthly where ymd in (....) And I should avoid long condition inside IN clause, that is why you suggest me to create bigger bucket, correct? 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: > It sounds like your time bucket should be a month, but it depends on the > amount of data per user per day and your main query range. Within the > partition you can then query for a range of days. > > Yes, all of the rows within a partition are stored on one physical node as > well as the replica nodes. > > -- Jack Krupansky > > On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> >non-equal relation on a partition key is not supported >> Ok, can I generate select query: >> select some_attributes >> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or >> 20150331 >> >> > The partition key determines which node can satisfy the query >> So you mean that all rows with the same *(ymd, user_id)* would be on one >> physical node? >> >> >> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >> >>> Unfortunately, a non-equal relation on a partition key is not supported. >>> You would need to bucket by some larger unit, like a month, and then use >>> the date/time as a clustering column for the row key. Then you could query >>> within the partition. The partition key determines which node can satisfy >>> the query. Designing your partition key judiciously is the key (haha!) to >>> performant Cassandra applications. >>> >>> -- Jack Krupansky >>> >>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <serega.shey...@gmail.com >>> > wrote: >>> >>>> Hi, we plan to have 10^8 users and each user could generate 10 events >>>> per day. >>>> So we have: >>>> 10^8 records per day >>>> 10^8*30 records per month. >>>> Our timewindow analysis could be from 1 to 6 months. >>>> >>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of >>>> event. >>>> >>>> So you suggest this approach: >>>> *PRIMARY KEY ((ymd, user_id), event_ts ) * >>>> *WITH CLUSTERING ORDER BY (**event_ts* >>>> * DESC);* >>>> >>>> where ymd=20150102 (the Second of January)? >>>> >>>> *What happens to writes:* >>>> SSTable with past days (ymd < current_day) stay untouched and don't >>>> take part in Compaction process since there are o changes to them? >>>> >>>> What happens to read: >>>> I issue query: >>>> select some_attributes >>>> from events where ymd >= 20150101 and ymd < 20150301 >>>> Does Cassandra skip SSTables which don't have ymd in specified range >>>> and give me a kind of partition elimination, like in traditional DBs? >>>> >>>> >>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >>>> >>>>> It depends on the actual number of events per user, but simply >>>>> bucketing the partition key can give you the same effect - clustering rows >>>>> by time range. A composite partition key could be comprised of the user >>>>> name and the date. >>>>> >>>>> It also depends on the data rate - is it many events per day or just a >>>>> few events per week, or over what time period. You need to be careful - >>>>> you >>>>> don't want your Cassandra partitions to be too big (millions of rows) or >>>>> too small (just a few or even one row per partition.) >>>>> >>>>> -- Jack Krupansky >>>>> >>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak < >>>>> serega.shey...@gmail.com> wrote: >>>>> >>>>>> Hi, I switched from HBase to Cassandra and try to find problem >>>>>> solution for timeseries analysis on top Cassandra. >>>>>> I have a entity named "Event". >>>>>> "Event" has attributes: >>>>>> user_id - a guy who triggered event >>>>>> event_ts - when even happened >>>>>> event_type - type of event >>>>>> some_other_attr - some other attrs we don't care about right now. >>>>>> >>>>>> The DDL for entity event looks this way: >>>>>> >>>>>> CREATE TABLE user_plans ( >>>>>> >>>>>> id timeuuid, >>>>>> user_id timeuuid, >>>>>> event_ts timestamp, >>>>>> event_type int, >>>>>> some_other_attr text >>>>>> >>>>>> PRIMARY KEY (user_id, ends) >>>>>> ); >>>>>> >>>>>> Table is "infinite", It would grow continuously during application >>>>>> lifetime. >>>>>> I want to ask question: >>>>>> Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy. >>>>>> >>>>>> Right now it would lead to full table scan. >>>>>> >>>>>> There is a trick in HBase. HBase has table abstraction and HBase has >>>>>> Column Family abstraction. >>>>>> Column family should be declared in advance. >>>>>> Column family - physically is a pack of HFiles ("SSTables in C*"). >>>>>> So I can easily add partitioning for my HBase table: >>>>>> alter table hbase_events add column familiy '2015_01' >>>>>> and store all 2015 January data to Column familiy named '2015_01'. >>>>>> >>>>>> When I want to get January data, I would directly access column >>>>>> family named '2015_01' and I won't massage all data in table, just this >>>>>> piece. >>>>>> >>>>>> What is approach in C* in this case? >>>>>> I have an idea create several tables: event_2015_01, event_2015_02, >>>>>> e.t.c. but it looks rather ugly from my current understanding how it >>>>>> works. >>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >> >