Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak Sat, 04 Apr 2015 11:32:30 -0700

Okay, so bucketing by day/week/month is a capacity planning stuff and
actual questions I want to ask.
As as a conclusion:
I have a table events


CREATE TABLE user_plans (
  id timeuuid,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text

PRIMARY KEY (user_id, ends)
);
which fits tactic queries:
select smth from user_plans where user_id='xxx' and end_ts > now()

Then I create second table user_plans_daily (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly/monthly (
  ymd int,
  user_id timeuuid,
  event_ts timestamp,
  event_type int,
  some_other_attr text
)
PRIMARY KEY ((ymd, user_id), event_ts )
WITH CLUSTERING ORDER BY (event_ts DESC);

And this table is good for answering strategic questions:
select * from
user_plans_daily/weekly/monthly
where ymd in (....)
And I should avoid long condition inside IN clause, that is why you suggest
me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:

> It sounds like your time bucket should be a month, but it depends on the
> amount of data per user per day and your main query range. Within the
> partition you can then query for a range of days.
>
> Yes, all of the rows within a partition are stored on one physical node as
> well as the replica nodes.
>
> -- Jack Krupansky
>
> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> >non-equal relation on a partition key is not supported
>> Ok, can I generate select query:
>> select some_attributes
>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
>> 20150331
>>
>> > The partition key determines which node can satisfy the query
>> So you mean that all rows with the same *(ymd, user_id)* would be on one
>> physical node?
>>
>>
>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>
>>> Unfortunately, a non-equal relation on a partition key is not supported.
>>> You would need to bucket by some larger unit, like a month, and then use
>>> the date/time as a clustering column for the row key. Then you could query
>>> within the partition. The partition key determines which node can satisfy
>>> the query. Designing your partition key judiciously is the key (haha!) to
>>> performant Cassandra applications.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <serega.shey...@gmail.com
>>> > wrote:
>>>
>>>> Hi, we plan to have 10^8 users and each user could generate 10 events
>>>> per day.
>>>> So we have:
>>>> 10^8 records per day
>>>> 10^8*30 records per month.
>>>> Our timewindow analysis could be from 1 to 6 months.
>>>>
>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts of
>>>> event.
>>>>
>>>> So you suggest this approach:
>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) *
>>>> *WITH CLUSTERING ORDER BY (**event_ts*
>>>> * DESC);*
>>>>
>>>> where ymd=20150102 (the Second of January)?
>>>>
>>>> *What happens to writes:*
>>>> SSTable with past days (ymd < current_day) stay untouched and don't
>>>> take part in Compaction process since there are o changes to them?
>>>>
>>>> What happens to read:
>>>> I issue query:
>>>> select some_attributes
>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>> Does Cassandra skip SSTables which don't have ymd in specified range
>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>
>>>>
>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>
>>>>> It depends on the actual number of events per user, but simply
>>>>> bucketing the partition key can give you the same effect - clustering rows
>>>>> by time range. A composite partition key could be comprised of the user
>>>>> name and the date.
>>>>>
>>>>> It also depends on the data rate - is it many events per day or just a
>>>>> few events per week, or over what time period. You need to be careful - 
>>>>> you
>>>>> don't want your Cassandra partitions to be too big (millions of rows) or
>>>>> too small (just a few or even one row per partition.)
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>> serega.shey...@gmail.com> wrote:
>>>>>
>>>>>> Hi, I switched from HBase to Cassandra and try to find problem
>>>>>> solution for timeseries analysis on top Cassandra.
>>>>>> I have a entity named "Event".
>>>>>> "Event" has attributes:
>>>>>> user_id - a guy who triggered event
>>>>>> event_ts - when even happened
>>>>>> event_type - type of event
>>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>>
>>>>>> The DDL for entity event looks this way:
>>>>>>
>>>>>> CREATE TABLE user_plans (
>>>>>>
>>>>>>   id timeuuid,
>>>>>>   user_id timeuuid,
>>>>>>   event_ts timestamp,
>>>>>>   event_type int,
>>>>>>   some_other_attr text
>>>>>>
>>>>>> PRIMARY KEY (user_id, ends)
>>>>>> );
>>>>>>
>>>>>> Table is "infinite", It would grow continuously during application
>>>>>> lifetime.
>>>>>> I want to ask question:
>>>>>> Cassandra, give me all event where event_ts >= xxx and event_ts <=yyy.
>>>>>>
>>>>>> Right now it would lead to full table scan.
>>>>>>
>>>>>> There is a trick in HBase. HBase has table abstraction and HBase has
>>>>>> Column Family abstraction.
>>>>>> Column family should be declared in advance.
>>>>>> Column family - physically is a pack of HFiles ("SSTables in C*").
>>>>>> So I can easily add partitioning for my HBase table:
>>>>>> alter table hbase_events add column familiy '2015_01'
>>>>>> and store all 2015 January data to Column familiy named '2015_01'.
>>>>>>
>>>>>> When I want to get January data, I would directly access column
>>>>>> family named '2015_01' and I won't massage all data in table, just this
>>>>>> piece.
>>>>>>
>>>>>> What is approach in C* in this case?
>>>>>> I have an idea create several tables: event_2015_01, event_2015_02,
>>>>>> e.t.c. but it looks rather ugly from my current understanding how it 
>>>>>> works.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Timeseries analysis using Cassandra and partition by date period

Reply via email to