Re: Timeseries analysis using Cassandra and partition by date period

Srinivasa T N Mon, 06 Apr 2015 03:31:27 -0700

 Comparison to OpenTSDB HBase

For one we do not use id’s for strings. The string data (metric names and
tags) are written to row keys and the appropriate indexes. Because
Cassandra has much wider rows there are far fewer keys written to the
database. The space saved by using id’s is minor and by not using id’s we
avoid having to use any kind of locks across the cluster.


As mentioned the Cassandra has wider rows. The default row size in OpenTSDB
HBase is 1 hour. Cassandra is set to 3 weeks.
http://kairosdb.github.io/kairosdocs/CassandraSchema.html

On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <[email protected]>
wrote:

> Thanks, is it a kind of opentsdb?
>
> 2015-04-05 18:28 GMT+02:00 Kevin Burton <[email protected]>:
>
>> > Hi, I switched from HBase to Cassandra and try to find problem solution
>> for timeseries analysis on top Cassandra.
>>
>> Depending on what you’re looking for, you might want to check out
>> KairosDB.
>>
>> 0.95 beta2 just shipped yesterday as well so you have good timing.
>>
>> https://github.com/kairosdb/kairosdb
>>
>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <[email protected]
>> > wrote:
>>
>>> Okay, so bucketing by day/week/month is a capacity planning stuff and
>>> actual questions I want to ask.
>>> As as a conclusion:
>>> I have a table events
>>>
>>> CREATE TABLE user_plans (
>>>   id timeuuid,
>>>   user_id timeuuid,
>>>   event_ts timestamp,
>>>   event_type int,
>>>   some_other_attr text
>>>
>>> PRIMARY KEY (user_id, ends)
>>> );
>>> which fits tactic queries:
>>> select smth from user_plans where user_id='xxx' and end_ts > now()
>>>
>>> Then I create second table user_plans_daily (or weekly, monthy)
>>>
>>> with DDL:
>>> CREATE TABLE user_plans_daily/weekly/monthly (
>>>   ymd int,
>>>   user_id timeuuid,
>>>   event_ts timestamp,
>>>   event_type int,
>>>   some_other_attr text
>>> )
>>> PRIMARY KEY ((ymd, user_id), event_ts )
>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>
>>> And this table is good for answering strategic questions:
>>> select * from
>>> user_plans_daily/weekly/monthly
>>> where ymd in (....)
>>> And I should avoid long condition inside IN clause, that is why you
>>> suggest me to create bigger bucket, correct?
>>>
>>>
>>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <[email protected]>:
>>>
>>>> It sounds like your time bucket should be a month, but it depends on
>>>> the amount of data per user per day and your main query range. Within the
>>>> partition you can then query for a range of days.
>>>>
>>>> Yes, all of the rows within a partition are stored on one physical node
>>>> as well as the replica nodes.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <
>>>> [email protected]> wrote:
>>>>
>>>>> >non-equal relation on a partition key is not supported
>>>>> Ok, can I generate select query:
>>>>> select some_attributes
>>>>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
>>>>> 20150331
>>>>>
>>>>> > The partition key determines which node can satisfy the query
>>>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>>>> one physical node?
>>>>>
>>>>>
>>>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <[email protected]>:
>>>>>
>>>>>> Unfortunately, a non-equal relation on a partition key is not
>>>>>> supported. You would need to bucket by some larger unit, like a month, 
>>>>>> and
>>>>>> then use the date/time as a clustering column for the row key. Then you
>>>>>> could query within the partition. The partition key determines which node
>>>>>> can satisfy the query. Designing your partition key judiciously is the 
>>>>>> key
>>>>>> (haha!) to performant Cassandra applications.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi, we plan to have 10^8 users and each user could generate 10
>>>>>>> events per day.
>>>>>>> So we have:
>>>>>>> 10^8 records per day
>>>>>>> 10^8*30 records per month.
>>>>>>> Our timewindow analysis could be from 1 to 6 months.
>>>>>>>
>>>>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts
>>>>>>> of event.
>>>>>>>
>>>>>>> So you suggest this approach:
>>>>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) *
>>>>>>> *WITH CLUSTERING ORDER BY (**event_ts*
>>>>>>> * DESC);*
>>>>>>>
>>>>>>> where ymd=20150102 (the Second of January)?
>>>>>>>
>>>>>>> *What happens to writes:*
>>>>>>> SSTable with past days (ymd < current_day) stay untouched and don't
>>>>>>> take part in Compaction process since there are o changes to them?
>>>>>>>
>>>>>>> What happens to read:
>>>>>>> I issue query:
>>>>>>> select some_attributes
>>>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>>>> Does Cassandra skip SSTables which don't have ymd in specified range
>>>>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>>>>
>>>>>>>
>>>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <[email protected]>
>>>>>>> :
>>>>>>>
>>>>>>>> It depends on the actual number of events per user, but simply
>>>>>>>> bucketing the partition key can give you the same effect - clustering 
>>>>>>>> rows
>>>>>>>> by time range. A composite partition key could be comprised of the user
>>>>>>>> name and the date.
>>>>>>>>
>>>>>>>> It also depends on the data rate - is it many events per day or
>>>>>>>> just a few events per week, or over what time period. You need to be
>>>>>>>> careful - you don't want your Cassandra partitions to be too big 
>>>>>>>> (millions
>>>>>>>> of rows) or too small (just a few or even one row per partition.)
>>>>>>>>
>>>>>>>> -- Jack Krupansky
>>>>>>>>
>>>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi, I switched from HBase to Cassandra and try to find problem
>>>>>>>>> solution for timeseries analysis on top Cassandra.
>>>>>>>>> I have a entity named "Event".
>>>>>>>>> "Event" has attributes:
>>>>>>>>> user_id - a guy who triggered event
>>>>>>>>> event_ts - when even happened
>>>>>>>>> event_type - type of event
>>>>>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>>>>>
>>>>>>>>> The DDL for entity event looks this way:
>>>>>>>>>
>>>>>>>>> CREATE TABLE user_plans (
>>>>>>>>>
>>>>>>>>>   id timeuuid,
>>>>>>>>>   user_id timeuuid,
>>>>>>>>>   event_ts timestamp,
>>>>>>>>>   event_type int,
>>>>>>>>>   some_other_attr text
>>>>>>>>>
>>>>>>>>> PRIMARY KEY (user_id, ends)
>>>>>>>>> );
>>>>>>>>>
>>>>>>>>> Table is "infinite", It would grow continuously during application
>>>>>>>>> lifetime.
>>>>>>>>> I want to ask question:
>>>>>>>>> Cassandra, give me all event where event_ts >= xxx
>>>>>>>>> and event_ts <=yyy.
>>>>>>>>>
>>>>>>>>> Right now it would lead to full table scan.
>>>>>>>>>
>>>>>>>>> There is a trick in HBase. HBase has table abstraction and HBase
>>>>>>>>> has Column Family abstraction.
>>>>>>>>> Column family should be declared in advance.
>>>>>>>>> Column family - physically is a pack of HFiles ("SSTables in C*").
>>>>>>>>> So I can easily add partitioning for my HBase table:
>>>>>>>>> alter table hbase_events add column familiy '2015_01'
>>>>>>>>> and store all 2015 January data to Column familiy named '2015_01'.
>>>>>>>>>
>>>>>>>>> When I want to get January data, I would directly access column
>>>>>>>>> family named '2015_01' and I won't massage all data in table, just 
>>>>>>>>> this
>>>>>>>>> piece.
>>>>>>>>>
>>>>>>>>> What is approach in C* in this case?
>>>>>>>>> I have an idea create several tables: event_2015_01,
>>>>>>>>> event_2015_02, e.t.c. but it looks rather ugly from my current
>>>>>>>>> understanding how it works.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>> <http://spinn3r.com>
>>
>>
>

Re: Timeseries analysis using Cassandra and partition by date period

Reply via email to