Re: Timeseries analysis using Cassandra and partition by date period

Jack Krupansky Sat, 04 Apr 2015 11:03:12 -0700

It sounds like your time bucket should be a month, but it depends on the
amount of data per user per day and your main query range. Within the
partition you can then query for a range of days.
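
As a rough sketch of the month-bucket idea, the client can work out which monthly partitions a date range touches and issue one query per partition, with the day range expressed on the clustering column. This assumes a hypothetical table events_by_month with PRIMARY KEY ((user_id, ym), event_ts). The table name, the ym column, the helper names, and the %s placeholder style are illustrative assumptions, not something taken from this thread.

```python
from datetime import date, datetime, time, timedelta

def month_buckets(start: date, end: date) -> list:
    """Return the yyyymm buckets covering the inclusive range [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets

# Hypothetical schema: PRIMARY KEY ((user_id, ym), event_ts).
# Both partition key columns are restricted by equality; the range
# restriction is only on the clustering column event_ts.
QUERY = (
    "SELECT event_ts, event_type FROM events_by_month "
    "WHERE user_id = %s AND ym = %s "
    "AND event_ts >= %s AND event_ts < %s"
)

def queries_for_range(user_id, start: date, end: date) -> list:
    """Build one (statement, parameters) pair per monthly partition."""
    lower = datetime.combine(start, time.min)
    upper = datetime.combine(end + timedelta(days=1), time.min)  # exclusive
    return [(QUERY, (user_id, ym, lower, upper))
            for ym in month_buckets(start, end)]

if __name__ == "__main__":
    for stmt, params in queries_for_range("user-42", date(2015, 1, 1), date(2015, 3, 31)):
        print(params[1], stmt)
```

A 1 to 6 month analysis window then becomes at most 6 or 7 single-partition queries per user, which the client can run concurrently and merge.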


Yes, all of the rows within a partition are stored on one physical node as
well as the replica nodes.
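
To illustrate the principle only: placement is a function of the partition key alone, so every row that shares the key ends up on the same set of nodes. The toy model below is not Cassandra's real partitioner or replication strategy; the node names, the MD5-based token, and the function name are assumptions made for this sketch.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical 4-node cluster

def owner_and_replicas(partition_key: tuple, replication_factor: int = 3) -> list:
    """Toy placement: hash the whole partition key to pick a first node,
    then take the following nodes in ring order as replicas."""
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    first = token % len(NODES)
    return [NODES[(first + i) % len(NODES)] for i in range(replication_factor)]

# Two events with the same (ymd, user_id) share a partition key, so they
# map to the same nodes no matter what their event_ts values are.
a = owner_and_replicas((20150101, "user-42"))
b = owner_and_replicas((20150101, "user-42"))
print(a == b)
```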

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> >non-equal relation on a partition key is not supported
> OK, can I generate a select query like this:
> select some_attributes
> from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ... or
> ymd = 20150331
>
> > The partition key determines which node can satisfy the query
> So you mean that all rows with the same *(ymd, user_id)* would be on one
> physical node?
>
>
> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
>> Unfortunately, a non-equal relation on a partition key is not supported.
>> You would need to bucket by some larger unit, like a month, and then use
>> the date/time as a clustering column for the row key. Then you could query
>> within the partition. The partition key determines which node can satisfy
>> the query. Designing your partition key judiciously is the key (haha!) to
>> performant Cassandra applications.
>>
>> -- Jack Krupansky
>>
>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>>
>>> Hi, we plan to have 10^8 users and each user could generate 10 events
>>> per day.
>>> So we have:
>>> 10^9 records per day
>>> 10^9*30 records per month.
>>> Our timewindow analysis could be from 1 to 6 months.
>>>
>>> Right now the PK is PRIMARY KEY (user_id, event_ts) where event_ts is the
>>> exact ts of the event.
>>>
>>> So you suggest this approach:
>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>
>>> where ymd=20150102 (the Second of January)?
>>>
>>> What happens to writes:
>>> SSTables with past days (ymd < current_day) stay untouched and don't take
>>> part in the compaction process since there are no changes to them?
>>>
>>> What happens to reads:
>>> I issue a query:
>>> select some_attributes
>>> from events where ymd >= 20150101 and ymd < 20150301
>>> Does Cassandra skip SSTables which don't have ymd in specified range and
>>> give me a kind of partition elimination, like in traditional DBs?
>>>
>>>
>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>
>>>> It depends on the actual number of events per user, but simply
>>>> bucketing the partition key can give you the same effect - clustering rows
>>>> by time range. A composite partition key could be comprised of the user
>>>> name and the date.
>>>>
>>>> It also depends on the data rate - is it many events per day or just a
>>>> few events per week, or over what time period. You need to be careful - you
>>>> don't want your Cassandra partitions to be too big (millions of rows) or
>>>> too small (just a few or even one row per partition.)
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>> serega.shey...@gmail.com> wrote:
>>>>
>>>>> Hi, I switched from HBase to Cassandra and am trying to find a
>>>>> solution for timeseries analysis on top of Cassandra.
>>>>> I have an entity named "Event".
>>>>> "Event" has attributes:
>>>>> user_id - the user who triggered the event
>>>>> event_ts - when the event happened
>>>>> event_type - type of event
>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>
>>>>> The DDL for entity event looks this way:
>>>>>
>>>>> CREATE TABLE user_plans (
>>>>>   id timeuuid,
>>>>>   user_id timeuuid,
>>>>>   event_ts timestamp,
>>>>>   event_type int,
>>>>>   some_other_attr text,
>>>>>   PRIMARY KEY (user_id, event_ts)
>>>>> );
>>>>>
>>>>> The table is "infinite": it would grow continuously during the
>>>>> application lifetime.
>>>>> I want to ask this question:
>>>>> Cassandra, give me all events where event_ts >= xxx and event_ts <= yyy.
>>>>>
>>>>> Right now it would lead to a full table scan.
>>>>>
>>>>> There is a trick in HBase. HBase has table abstraction and HBase has
>>>>> Column Family abstraction.
>>>>> Column family should be declared in advance.
>>>>> Column family - physically is a pack of HFiles ("SSTables in C*").
>>>>> So I can easily add partitioning for my HBase table:
>>>>> alter table hbase_events add column family '2015_01'
>>>>> and store all January 2015 data in the column family named '2015_01'.
>>>>>
>>>>> When I want to get January data, I would directly access the column
>>>>> family named '2015_01' and I wouldn't touch all the data in the table,
>>>>> just this piece.
>>>>>
>>>>> What is the approach in C* in this case?
>>>>> I have an idea to create several tables: event_2015_01, event_2015_02,
>>>>> etc., but it looks rather ugly from my current understanding of how it
>>>>> works.
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
