Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak Sat, 04 Apr 2015 06:36:01 -0700

Hi, we plan to have 10^8 users and each user could generate 10 events per
day.
So we have:
10^9 records per day
10^9*30 records per month.
Our analysis time window could be from 1 to 6 months.
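
Multiplying the stated inputs out (10^8 users, up to 10 events per user per
day), a rough Python sketch of the volumes looks like this. The constant and
function names are only for illustration, and the 30-day month is the same
approximation used above:

```python
# Back-of-envelope volume estimate from the figures stated in this message.
USERS = 10**8                 # planned number of users
EVENTS_PER_USER_PER_DAY = 10  # upper bound per user
DAYS_PER_MONTH = 30           # approximation: every month counted as 30 days


def records_per_day() -> int:
    """Total events written per day across all users."""
    return USERS * EVENTS_PER_USER_PER_DAY


def records_in_window(months: int) -> int:
    """Total events covered by an analysis window of the given length."""
    return records_per_day() * DAYS_PER_MONTH * months


print("per day:", records_per_day())
for months in (1, 6):
    print(months, "month window:", records_in_window(months))
```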


Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is the
exact timestamp of the event.

So you suggest this approach:
PRIMARY KEY ((ymd, user_id), event_ts)
WITH CLUSTERING ORDER BY (event_ts DESC);

where ymd=20150102 (the Second of January)?
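
For illustration, the ymd value could be derived from event_ts on the client
at write time. A minimal Python sketch; ymd_bucket is a hypothetical helper,
and bucketing by UTC day is an assumption, not something decided in this
thread:

```python
from datetime import datetime, timezone


def ymd_bucket(event_ts: datetime) -> int:
    """Return the day bucket as an integer such as 20150102.

    Assumption: buckets follow UTC days. Naive datetimes are treated as
    already being in UTC; aware datetimes are converted to UTC first.
    """
    if event_ts.tzinfo is not None:
        event_ts = event_ts.astimezone(timezone.utc)
    return event_ts.year * 10000 + event_ts.month * 100 + event_ts.day


# Example: an event on the second of January 2015.
bucket = ymd_bucket(datetime(2015, 1, 2, 13, 30))
```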

What happens to writes:
SSTables with past days (ymd < current_day) stay untouched and don't take
part in the compaction process, since there are no changes to them?

What happens to reads:
I issue a query:
select some_attributes
from events where ymd >= 20150101 and ymd < 20150301
Does Cassandra skip SSTables which don't have ymd in the specified range and
give me a kind of partition elimination, like in traditional DBs?
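
As far as I know, CQL accepts only = and IN on partition key columns (ranges
need token()), so a range over ymd would probably have to be expanded into
explicit day buckets on the client, each queried with equality on the full
partition key (ymd, user_id). A minimal Python sketch; day_buckets and the
QUERY text are hypothetical and only illustrate the idea:

```python
from datetime import date, timedelta
from typing import List


def day_buckets(start: date, end: date) -> List[int]:
    """List ymd buckets for every day with start <= day < end."""
    buckets = []
    day = start
    while day < end:
        buckets.append(day.year * 10000 + day.month * 100 + day.day)
        day += timedelta(days=1)
    return buckets


# Hypothetical per-bucket statement; table and column names follow this thread.
QUERY = "SELECT some_attributes FROM events WHERE ymd = ? AND user_id = ?"

# Same range as the query above: ymd >= 20150101 and ymd < 20150301.
buckets = day_buckets(date(2015, 1, 1), date(2015, 3, 1))
```

Each bucket would then be bound to QUERY together with a user_id, one request
per (ymd, user_id) pair.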


2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:

> It depends on the actual number of events per user, but simply bucketing
> the partition key can give you the same effect - clustering rows by time
> range. A composite partition key could be comprised of the user name and
> the date.
>
> It also depends on the data rate - is it many events per day or just a few
> events per week, or over what time period. You need to be careful - you
> don't want your Cassandra partitions to be too big (millions of rows) or
> too small (just a few or even one row per partition.)
>
> -- Jack Krupansky
>
> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> Hi, I switched from HBase to Cassandra and am trying to find a solution
>> for timeseries analysis on top of Cassandra.
>> I have an entity named "Event".
>> "Event" has attributes:
>> user_id - the user who triggered the event
>> event_ts - when the event happened
>> event_type - type of event
>> some_other_attr - some other attrs we don't care about right now.
>>
>> The DDL for the event entity looks like this:
>>
>> CREATE TABLE user_plans (
>>   id timeuuid,
>>   user_id timeuuid,
>>   event_ts timestamp,
>>   event_type int,
>>   some_other_attr text,
>>   PRIMARY KEY (user_id, event_ts)
>> );
>>
>> The table is "infinite": it would grow continuously during the
>> application lifetime.
>> I want to ask the question:
>> Cassandra, give me all events where event_ts >= xxx and event_ts <= yyy.
>>
>> Right now it would lead to a full table scan.
>>
>> There is a trick in HBase. HBase has a table abstraction and a
>> column family abstraction.
>> A column family should be declared in advance.
>> A column family is physically a pack of HFiles ("SSTables" in C*).
>> So I can easily add partitioning for my HBase table:
>> alter table hbase_events add column family '2015_01'
>> and store all January 2015 data in the column family named '2015_01'.
>>
>> When I want to get January data, I would directly access the column family
>> named '2015_01' and I won't touch all the data in the table, just this piece.
>>
>> What is the approach in C* in this case?
>> I have an idea to create several tables: event_2015_01, event_2015_02,
>> etc., but it looks rather ugly from my current understanding of how it works.
>>
>>
>>
>
