Thank you, we'll see that instrument, 2015-04-06 12:30 GMT+02:00 Srinivasa T N <seen...@gmail.com>:
> Comparison to OpenTSDB HBase > > For one we do not use id’s for strings. The string data (metric names and > tags) are written to row keys and the appropriate indexes. Because > Cassandra has much wider rows there are far fewer keys written to the > database. The space saved by using id’s is minor and by not using id’s we > avoid having to use any kind of locks across the cluster. > > As mentioned the Cassandra has wider rows. The default row size in > OpenTSDB HBase is 1 hour. Cassandra is set to 3 weeks. > http://kairosdb.github.io/kairosdocs/CassandraSchema.html > > On Mon, Apr 6, 2015 at 3:27 PM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Thanks, is it a kind of opentsdb? >> >> 2015-04-05 18:28 GMT+02:00 Kevin Burton <bur...@spinn3r.com>: >> >>> > Hi, I switched from HBase to Cassandra and try to find problem >>> solution for timeseries analysis on top Cassandra. >>> >>> Depending on what you’re looking for, you might want to check out >>> KairosDB. >>> >>> 0.95 beta2 just shipped yesterday as well so you have good timing. >>> >>> https://github.com/kairosdb/kairosdb >>> >>> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak < >>> serega.shey...@gmail.com> wrote: >>> >>>> Okay, so bucketing by day/week/month is a capacity planning stuff and >>>> actual questions I want to ask. >>>> As as a conclusion: >>>> I have a table events >>>> >>>> CREATE TABLE user_plans ( >>>> id timeuuid, >>>> user_id timeuuid, >>>> event_ts timestamp, >>>> event_type int, >>>> some_other_attr text >>>> >>>> PRIMARY KEY (user_id, ends) >>>> ); >>>> which fits tactic queries: >>>> select smth from user_plans where user_id='xxx' and end_ts > now() >>>> >>>> Then I create second table user_plans_daily (or weekly, monthy) >>>> >>>> with DDL: >>>> CREATE TABLE user_plans_daily/weekly/monthly ( >>>> ymd int, >>>> user_id timeuuid, >>>> event_ts timestamp, >>>> event_type int, >>>> some_other_attr text >>>> ) >>>> PRIMARY KEY ((ymd, user_id), event_ts ) >>>> WITH CLUSTERING ORDER BY (event_ts DESC); >>>> >>>> And this table is good for answering strategic questions: >>>> select * from >>>> user_plans_daily/weekly/monthly >>>> where ymd in (....) >>>> And I should avoid long condition inside IN clause, that is why you >>>> suggest me to create bigger bucket, correct? >>>> >>>> >>>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >>>> >>>>> It sounds like your time bucket should be a month, but it depends on >>>>> the amount of data per user per day and your main query range. Within the >>>>> partition you can then query for a range of days. >>>>> >>>>> Yes, all of the rows within a partition are stored on one physical >>>>> node as well as the replica nodes. >>>>> >>>>> -- Jack Krupansky >>>>> >>>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak < >>>>> serega.shey...@gmail.com> wrote: >>>>> >>>>>> >non-equal relation on a partition key is not supported >>>>>> Ok, can I generate select query: >>>>>> select some_attributes >>>>>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... >>>>>> or 20150331 >>>>>> >>>>>> > The partition key determines which node can satisfy the query >>>>>> So you mean that all rows with the same *(ymd, user_id)* would be on >>>>>> one physical node? >>>>>> >>>>>> >>>>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >>>>>> >>>>>>> Unfortunately, a non-equal relation on a partition key is not >>>>>>> supported. You would need to bucket by some larger unit, like a month, >>>>>>> and >>>>>>> then use the date/time as a clustering column for the row key. Then you >>>>>>> could query within the partition. The partition key determines which >>>>>>> node >>>>>>> can satisfy the query. Designing your partition key judiciously is the >>>>>>> key >>>>>>> (haha!) to performant Cassandra applications. >>>>>>> >>>>>>> -- Jack Krupansky >>>>>>> >>>>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak < >>>>>>> serega.shey...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi, we plan to have 10^8 users and each user could generate 10 >>>>>>>> events per day. >>>>>>>> So we have: >>>>>>>> 10^8 records per day >>>>>>>> 10^8*30 records per month. >>>>>>>> Our timewindow analysis could be from 1 to 6 months. >>>>>>>> >>>>>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact >>>>>>>> ts of event. >>>>>>>> >>>>>>>> So you suggest this approach: >>>>>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) * >>>>>>>> *WITH CLUSTERING ORDER BY (**event_ts* >>>>>>>> * DESC);* >>>>>>>> >>>>>>>> where ymd=20150102 (the Second of January)? >>>>>>>> >>>>>>>> *What happens to writes:* >>>>>>>> SSTable with past days (ymd < current_day) stay untouched and don't >>>>>>>> take part in Compaction process since there are o changes to them? >>>>>>>> >>>>>>>> What happens to read: >>>>>>>> I issue query: >>>>>>>> select some_attributes >>>>>>>> from events where ymd >= 20150101 and ymd < 20150301 >>>>>>>> Does Cassandra skip SSTables which don't have ymd in specified >>>>>>>> range and give me a kind of partition elimination, like in traditional >>>>>>>> DBs? >>>>>>>> >>>>>>>> >>>>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com >>>>>>>> >: >>>>>>>> >>>>>>>>> It depends on the actual number of events per user, but simply >>>>>>>>> bucketing the partition key can give you the same effect - clustering >>>>>>>>> rows >>>>>>>>> by time range. A composite partition key could be comprised of the >>>>>>>>> user >>>>>>>>> name and the date. >>>>>>>>> >>>>>>>>> It also depends on the data rate - is it many events per day or >>>>>>>>> just a few events per week, or over what time period. You need to be >>>>>>>>> careful - you don't want your Cassandra partitions to be too big >>>>>>>>> (millions >>>>>>>>> of rows) or too small (just a few or even one row per partition.) >>>>>>>>> >>>>>>>>> -- Jack Krupansky >>>>>>>>> >>>>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak < >>>>>>>>> serega.shey...@gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, I switched from HBase to Cassandra and try to find problem >>>>>>>>>> solution for timeseries analysis on top Cassandra. >>>>>>>>>> I have a entity named "Event". >>>>>>>>>> "Event" has attributes: >>>>>>>>>> user_id - a guy who triggered event >>>>>>>>>> event_ts - when even happened >>>>>>>>>> event_type - type of event >>>>>>>>>> some_other_attr - some other attrs we don't care about right now. >>>>>>>>>> >>>>>>>>>> The DDL for entity event looks this way: >>>>>>>>>> >>>>>>>>>> CREATE TABLE user_plans ( >>>>>>>>>> >>>>>>>>>> id timeuuid, >>>>>>>>>> user_id timeuuid, >>>>>>>>>> event_ts timestamp, >>>>>>>>>> event_type int, >>>>>>>>>> some_other_attr text >>>>>>>>>> >>>>>>>>>> PRIMARY KEY (user_id, ends) >>>>>>>>>> ); >>>>>>>>>> >>>>>>>>>> Table is "infinite", It would grow continuously during >>>>>>>>>> application lifetime. >>>>>>>>>> I want to ask question: >>>>>>>>>> Cassandra, give me all event where event_ts >= xxx >>>>>>>>>> and event_ts <=yyy. >>>>>>>>>> >>>>>>>>>> Right now it would lead to full table scan. >>>>>>>>>> >>>>>>>>>> There is a trick in HBase. HBase has table abstraction and HBase >>>>>>>>>> has Column Family abstraction. >>>>>>>>>> Column family should be declared in advance. >>>>>>>>>> Column family - physically is a pack of HFiles ("SSTables in C*"). >>>>>>>>>> So I can easily add partitioning for my HBase table: >>>>>>>>>> alter table hbase_events add column familiy '2015_01' >>>>>>>>>> and store all 2015 January data to Column familiy named '2015_01'. >>>>>>>>>> >>>>>>>>>> When I want to get January data, I would directly access column >>>>>>>>>> family named '2015_01' and I won't massage all data in table, just >>>>>>>>>> this >>>>>>>>>> piece. >>>>>>>>>> >>>>>>>>>> What is approach in C* in this case? >>>>>>>>>> I have an idea create several tables: event_2015_01, >>>>>>>>>> event_2015_02, e.t.c. but it looks rather ugly from my current >>>>>>>>>> understanding how it works. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> >>> -- >>> >>> Founder/CEO Spinn3r.com >>> Location: *San Francisco, CA* >>> blog: http://burtonator.wordpress.com >>> … or check out my Google+ profile >>> <https://plus.google.com/102718274791889610666/posts> >>> <http://spinn3r.com> >>> >>> >> >