Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak Mon, 06 Apr 2015 03:39:37 -0700

Thank you, we'll take a look at that tool.

2015-04-06 12:30 GMT+02:00 Srinivasa T N <seen...@gmail.com>:


> Comparison to OpenTSDB HBase
>
> For one we do not use id’s for strings. The string data (metric names and
> tags) are written to row keys and the appropriate indexes. Because
> Cassandra has much wider rows there are far fewer keys written to the
> database. The space saved by using id’s is minor and by not using id’s we
> avoid having to use any kind of locks across the cluster.
>
> As mentioned, Cassandra has wider rows. The default row width in
> OpenTSDB HBase is 1 hour. In Cassandra it is set to 3 weeks.
> http://kairosdb.github.io/kairosdocs/CassandraSchema.html
>
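To make the row-width comparison above concrete, here is a minimal sketch of fixed-width time bucketing. It is not taken from KairosDB or OpenTSDB code: the helper names `row_start` and `rows_touched` are made up for illustration, and the only facts used from the quoted documentation are the two widths (1 hour and 3 weeks).

```python
# Illustrative only: helper names are hypothetical, widths come from the quote above.
ONE_HOUR_MS = 60 * 60 * 1000
THREE_WEEKS_MS = 3 * 7 * 24 * ONE_HOUR_MS


def row_start(timestamp_ms: int, row_width_ms: int) -> int:
    """Floor a millisecond timestamp to the start of the row that holds it."""
    return timestamp_ms - (timestamp_ms % row_width_ms)


def rows_touched(start_ms: int, end_ms: int, row_width_ms: int) -> int:
    """Number of distinct rows a query over [start_ms, end_ms] has to read."""
    first = row_start(start_ms, row_width_ms)
    last = row_start(end_ms, row_width_ms)
    return (last - first) // row_width_ms + 1


if __name__ == "__main__":
    six_weeks_end = 2 * THREE_WEEKS_MS - 1
    print(rows_touched(0, six_weeks_end, THREE_WEEKS_MS))
    print(rows_touched(0, six_weeks_end, ONE_HOUR_MS))
```

Under these assumptions a wider row means far fewer row keys per metric for the same time range, which is the point the quoted paragraph makes.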
> On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> Thanks, is it something like OpenTSDB?
>>
>> 2015-04-05 18:28 GMT+02:00 Kevin Burton <bur...@spinn3r.com>:
>>
>>> > Hi, I switched from HBase to Cassandra and am trying to find a
>>> solution for timeseries analysis on top of Cassandra.
>>>
>>> Depending on what you’re looking for, you might want to check out
>>> KairosDB.
>>>
>>> 0.95 beta2 just shipped yesterday as well so you have good timing.
>>>
>>> https://github.com/kairosdb/kairosdb
>>>
>>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <
>>> serega.shey...@gmail.com> wrote:
>>>
>>>> Okay, so the choice between bucketing by day/week/month is a matter of
>>>> capacity planning and of the actual questions I want to ask.
>>>> To summarize:
>>>> I have an events table:
>>>>
>>>> CREATE TABLE user_plans (
>>>>   id timeuuid,
>>>>   user_id timeuuid,
>>>>   event_ts timestamp,
>>>>   event_type int,
>>>>   some_other_attr text,
>>>>   PRIMARY KEY (user_id, event_ts)
>>>> );
>>>> which fits tactical queries:
>>>> select smth from user_plans where user_id='xxx' and event_ts > now()
>>>>
>>>> Then I create a second table, user_plans_daily (or weekly, monthly),
>>>> with DDL:
>>>> -- or user_plans_weekly / user_plans_monthly
>>>> CREATE TABLE user_plans_daily (
>>>>   ymd int,
>>>>   user_id timeuuid,
>>>>   event_ts timestamp,
>>>>   event_type int,
>>>>   some_other_attr text,
>>>>   PRIMARY KEY ((ymd, user_id), event_ts)
>>>> )
>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>
>>>> And this table is good for answering strategic questions:
>>>> select * from
>>>> user_plans_daily/weekly/monthly
>>>> where ymd in (....)
>>>> And I should avoid a long list of values inside the IN clause, which is
>>>> why you suggest that I create a bigger bucket, correct?
>>>>
>>>>
>>>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>
>>>>> It sounds like your time bucket should be a month, but it depends on
>>>>> the amount of data per user per day and your main query range. Within the
>>>>> partition you can then query for a range of days.
>>>>>
>>>>> Yes, all of the rows within a partition are stored on one physical
>>>>> node as well as the replica nodes.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <
>>>>> serega.shey...@gmail.com> wrote:
>>>>>
>>>>>> >non-equal relation on a partition key is not supported
>>>>>> OK, can I generate a select query like this:
>>>>>> select some_attributes
>>>>>> from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ...
>>>>>> or ymd = 20150331
>>>>>>
>>>>>> > The partition key determines which node can satisfy the query
>>>>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>>>>> one physical node?
>>>>>>
>>>>>>
>>>>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>>>
>>>>>>> Unfortunately, a non-equal relation on a partition key is not
>>>>>>> supported. You would need to bucket by some larger unit, like a month,
>>>>>>> and then use the date/time as a clustering column for the row key. Then
>>>>>>> you could query within the partition. The partition key determines
>>>>>>> which node can satisfy the query. Designing your partition key
>>>>>>> judiciously is the key (haha!) to performant Cassandra applications.
>>>>>>>
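As a rough illustration of the advice above (bucket by month, keep the timestamp as a clustering column, then query inside each partition), here is a small client-side sketch. The table name `user_plans_monthly`, the column name `ym`, and the helper functions are assumptions made for this example; they do not appear in the thread. No driver is used, the code only prepares the parameters for one equality query per month bucket.

```python
from datetime import datetime

# Hypothetical monthly table (names are assumptions, not from the thread):
#   PRIMARY KEY ((ym, user_id), event_ts)
QUERY = (
    "SELECT event_ts, event_type FROM user_plans_monthly "
    "WHERE ym = ? AND user_id = ? AND event_ts >= ? AND event_ts < ?"
)


def month_bucket(ts: datetime) -> int:
    """Encode the month of a timestamp as an int, e.g. 201501."""
    return ts.year * 100 + ts.month


def buckets_for_range(start: datetime, end: datetime) -> list:
    """Month buckets from start's month through end's month, oldest first."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def plan_queries(user_id, start: datetime, end: datetime) -> list:
    """One (ym, user_id, start, end) parameter tuple per month bucket.

    If end falls exactly on a month boundary, the last query returns no rows.
    """
    return [(ym, user_id, start, end) for ym in buckets_for_range(start, end)]


if __name__ == "__main__":
    for params in plan_queries("some-user", datetime(2015, 1, 1), datetime(2015, 3, 31)):
        print(QUERY, params)
```

Each tuple uses only equality on the partition key columns and a range on the clustering column, so a three-month window becomes three single-partition reads instead of a range over the partition key.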
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>>>>> serega.shey...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, we plan to have 10^8 users, and each user could generate 10
>>>>>>>> events per day.
>>>>>>>> So we have:
>>>>>>>> 10^9 records per day
>>>>>>>> 10^9*30 records per month.
>>>>>>>> Our time window for analysis could be from 1 to 6 months.
>>>>>>>>
>>>>>>>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts
>>>>>>>> is the exact timestamp of the event.
>>>>>>>>
>>>>>>>> So you suggest this approach:
>>>>>>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>>>>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>>>>>
>>>>>>>> where ymd=20150102 (the second of January)?
>>>>>>>>
>>>>>>>> What happens to writes:
>>>>>>>> Do SSTables with past days (ymd < current_day) stay untouched and not
>>>>>>>> take part in the compaction process, since there are no changes to them?
>>>>>>>>
>>>>>>>> What happens to reads:
>>>>>>>> I issue the query:
>>>>>>>> select some_attributes
>>>>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>>>>> Does Cassandra skip SSTables which don't have ymd in the specified
>>>>>>>> range and give me a kind of partition elimination, like in
>>>>>>>> traditional DBs?
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>>>>>
>>>>>>>>> It depends on the actual number of events per user, but simply
>>>>>>>>> bucketing the partition key can give you the same effect - clustering
>>>>>>>>> rows by time range. A composite partition key could be comprised of
>>>>>>>>> the user name and the date.
>>>>>>>>>
>>>>>>>>> It also depends on the data rate - is it many events per day or
>>>>>>>>> just a few events per week, or over what time period. You need to be
>>>>>>>>> careful - you don't want your Cassandra partitions to be too big
>>>>>>>>> (millions of rows) or too small (just a few or even one row per
>>>>>>>>> partition.)
>>>>>>>>>
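A back-of-the-envelope check of the "too big or too small" point, as a sketch only. It assumes the rate mentioned elsewhere in this thread (about 10 events per user per day), a composite partition key of (time bucket, user), and a month approximated as 30 days. The function names are hypothetical.

```python
import math

EVENTS_PER_USER_PER_DAY = 10  # rate mentioned in the thread
BUCKET_DAYS = {"day": 1, "week": 7, "month": 30}  # month = 30 days is an approximation


def rows_per_partition(bucket: str) -> int:
    """Rows in one (bucket, user) partition at the assumed event rate."""
    return EVENTS_PER_USER_PER_DAY * BUCKET_DAYS[bucket]


def partitions_per_user(bucket: str, window_days: int) -> int:
    """Partitions one user's query has to read for a window of window_days."""
    return math.ceil(window_days / BUCKET_DAYS[bucket])


if __name__ == "__main__":
    for bucket in BUCKET_DAYS:
        print(bucket, rows_per_partition(bucket), partitions_per_user(bucket, 180))
```

With these numbers a daily bucket holds only about 10 rows per partition and a six-month window touches 180 partitions per user, while a monthly bucket holds about 300 rows and the same window touches 6. Both are far below the "millions of rows" danger zone, so at this rate the larger bucket looks like the safer choice.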
>>>>>>>>> -- Jack Krupansky
>>>>>>>>>
>>>>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>>>>>> serega.shey...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, I switched from HBase to Cassandra and am trying to find a
>>>>>>>>>> solution for timeseries analysis on top of Cassandra.
>>>>>>>>>> I have an entity named "Event".
>>>>>>>>>> "Event" has attributes:
>>>>>>>>>> user_id - the user who triggered the event
>>>>>>>>>> event_ts - when the event happened
>>>>>>>>>> event_type - type of event
>>>>>>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>>>>>>
>>>>>>>>>> The DDL for the event entity looks like this:
>>>>>>>>>>
>>>>>>>>>> CREATE TABLE user_plans (
>>>>>>>>>>   id timeuuid,
>>>>>>>>>>   user_id timeuuid,
>>>>>>>>>>   event_ts timestamp,
>>>>>>>>>>   event_type int,
>>>>>>>>>>   some_other_attr text,
>>>>>>>>>>   PRIMARY KEY (user_id, event_ts)
>>>>>>>>>> );
>>>>>>>>>>
>>>>>>>>>> The table is "infinite": it would grow continuously during the
>>>>>>>>>> application lifetime.
>>>>>>>>>> I want to ask this question:
>>>>>>>>>> Cassandra, give me all events where event_ts >= xxx
>>>>>>>>>> and event_ts <= yyy.
>>>>>>>>>>
>>>>>>>>>> Right now it would lead to a full table scan.
>>>>>>>>>>
>>>>>>>>>> There is a trick in HBase. HBase has a table abstraction and a
>>>>>>>>>> column family abstraction.
>>>>>>>>>> A column family must be declared in advance.
>>>>>>>>>> Physically, a column family is a pack of HFiles ("SSTables" in C*).
>>>>>>>>>> So I can easily add partitioning to my HBase table:
>>>>>>>>>> alter table hbase_events add column family '2015_01'
>>>>>>>>>> and store all January 2015 data in the column family named '2015_01'.
>>>>>>>>>>
>>>>>>>>>> When I want to get the January data, I access the column family
>>>>>>>>>> named '2015_01' directly and don't touch all the data in the table,
>>>>>>>>>> just this piece.
>>>>>>>>>>
>>>>>>>>>> What is the approach in C* in this case?
>>>>>>>>>> I have an idea to create several tables: event_2015_01,
>>>>>>>>>> event_2015_02, etc., but it looks rather ugly given my current
>>>>>>>>>> understanding of how it works.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> <https://plus.google.com/102718274791889610666/posts>
>>> <http://spinn3r.com>
>>>
>>>
>>
>
