Re: Storage of time-series data

Daniel Einspanjer Tue, 18 May 2010 22:51:54 -0700

I do a lot of temporal aggregate statistics in the Mozilla Socorro project using HBase. The problem is made much easier there because you can have a rowkey that uses the timestamp as a prefix making it easy to do a range query, and then HBase also has an atomic increment function that can be used to easily accumulate and store the aggregates.

Thinking about this problem from what I've learned so far about Riak (which I confess I am still learning), it seems to me that the hardest part would be querying for a particular subset of the bucket objects for which you wish to aggregate statistics. If you don't expect to be storing so many documents that it would be unreasonable to map reduce over the entire bucket set and filter for only the time range you are interested in, then you shouldn't have a problem. If you were expecting massive quantities of documents, then maybe you could partition the data into a bucket for each day or week or whatever interval gives you a small enough collection size that you can map over them.
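
A rough sketch of the per-day partitioning idea, in Python. The "visits" prefix and the naming scheme are made up for illustration and are not a Riak convention; the function only works out which bucket names cover a date range, and you would then map over each of those buckets.

```python
from datetime import date, timedelta

def daily_buckets(prefix, start, end):
    """Return one bucket name per day from start to end, inclusive."""
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).isoformat()}"
        for i in range(days + 1)
    ]

# Hypothetical prefix; three days of data means three buckets to map over.
buckets = daily_buckets("visits", date(2010, 5, 17), date(2010, 5, 19))
```

Switching to weekly or hourly buckets would only change how the names are generated; the trade-off is fewer, larger buckets against more, smaller ones.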

Once the problem of the input data set is resolved, I suspect you could have the reduce phase build a json object containing all the relevant aggregate statistics for that time period, then store that object in a "metrics" bucket with the key being the time period. I'm thinking something along the lines of this (based on https://wiki.mozilla.org/Socorro:HBase#special_records):


bucket: "metrics"
key: "2010-05-19T00
value: {
  widgets_sold: 15000
  website_visits: 2
  sum_page_views: 46
  average_page_views_per_visit: 23
  sum_visit_duration_seconds: 1216
  average_visit_duration_seconds: 608
}
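
To make the reduce step concrete, here is a minimal Python sketch of folding individual visit records into an object like the one above. The input field names (page_views, duration_seconds) and the two sample visits are assumptions chosen so the totals match the example; widgets_sold is left out because it would come from a different kind of record.

```python
def summarize_visits(visits):
    """Fold a list of visit records into one aggregate object."""
    count = len(visits)
    total_views = sum(v["page_views"] for v in visits)
    total_duration = sum(v["duration_seconds"] for v in visits)
    return {
        "website_visits": count,
        "sum_page_views": total_views,
        # Guard against an empty time period to avoid dividing by zero.
        "average_page_views_per_visit": total_views / count if count else 0,
        "sum_visit_duration_seconds": total_duration,
        "average_visit_duration_seconds": total_duration / count if count else 0,
    }

# Hypothetical input: the two visits recorded in one time period.
visits = [
    {"page_views": 20, "duration_seconds": 600},
    {"page_views": 26, "duration_seconds": 616},
]
metrics = summarize_visits(visits)
```

Storing the sums alongside the averages is what lets you combine several periods later: you add the sums and counts and recompute the averages, rather than trying to average the averages.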


On 5/18/10 11:01 PM, Sean Cribbs wrote:

Buckets are essentially free if you are not changing their properties from the 
defaults (which you can set globally in app.config).  Keep in mind the options 
I presented are not the only ones, just points of departure for your own schema 
design.

Sean Cribbs<s...@basho.com>
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On May 18, 2010, at 8:03 PM, Joel Pitt wrote:

Thanks Sean. Looks like 3 might be the best plan.

And, pre/post-commit hooks... cool! I didn't see those - that's
something I've been looking for (since I'd prefer to keep that kind of
stuff happening on the data nodes rather than in the client/app
itself).

One further question, is there any limitation to how the number of
buckets can scale? If you're recommending using them to box data by
minute I'm guessing that # buckets can increase without worry, but is
this still the case if say I started binning into buckets by second?

J

On Wed, May 19, 2010 at 1:53 AM, Sean Cribbs<s...@basho.com>  wrote:

Joel,

Riak's only query mechanism aside from simple key retrieval is map-reduce.  
However, there are a number of strategies you could take, depending on what you 
want to query. I don't know the requirements of your application, but here are 
some options:

1) Store the data either keyed on the timestamp, or as separate objects linked 
from a timestamp object.
2) Create buckets for each time-window you want to track.  For example, if I 
wanted to box data by minute, I'd make bucket names that look like: 
2010-05-18T09.46.  Then if I want all the data from that minute, I'd run a 
map-reduce query with that bucket name as the inputs.
3) Create your own secondary indexes with a post-commit hook or code in your 
application for year, month, day, etc.  The secondary index would be, like #1, 
keys that only contain links to the actual data.
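
As an illustration of options 2 and 3, a small Python sketch of how the names could be derived from a timestamp. The bucket-name format follows the 2010-05-18T09.46 example above; the year/month/day index-key layout is only an assumption about what such a secondary index might look like.

```python
from datetime import datetime

def minute_bucket(ts):
    """Bucket name for the minute that contains ts (option 2)."""
    return ts.strftime("%Y-%m-%dT%H.%M")

def index_keys(ts):
    """Hypothetical index keys (year, month, day) that would each
    hold a link to the object written at ts (option 3)."""
    return [ts.strftime("%Y"), ts.strftime("%Y-%m"), ts.strftime("%Y-%m-%d")]

ts = datetime(2010, 5, 18, 9, 46, 30)
bucket = minute_bucket(ts)
keys = index_keys(ts)
```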

With any of these options (which are by no means exhaustive), your map-reduce 
query will need to sort the data in a reduce phase if you require chronological 
ordering. Also, if you're building your own indexes in separate buckets, 
depending on the write throughput of your application, you might want to build 
in some sort of conflict resolution and turn on allow_mult so that concurrent 
updates are not lost.
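
A sketch in plain Python of the two points above, not tied to any Riak client API. It assumes an index object is a dict with a "links" list of data keys and that each data record carries a sortable "timestamp" field; both shapes are made up for the example.

```python
def merge_index_siblings(siblings):
    """Resolve conflicting versions of an index object by taking the
    union of their links, so no concurrent addition is dropped."""
    merged = set()
    for sibling in siblings:
        merged.update(sibling["links"])
    return {"links": sorted(merged)}

def sort_chronologically(records):
    """What a final reduce step would do when ordering matters."""
    return sorted(records, key=lambda r: r["timestamp"])

# Two writers each added a different link to the same index object.
resolved = merge_index_siblings([
    {"links": ["evt-1", "evt-2"]},
    {"links": ["evt-1", "evt-3"]},
])
ordered = sort_chronologically([
    {"timestamp": "2010-05-18T09:46:30", "id": "b"},
    {"timestamp": "2010-05-18T09:46:05", "id": "a"},
])
```

A set union works here because the index only ever grows; if links could also be removed, the merge would need extra bookkeeping to tell a deletion apart from a link the other writer never saw.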

Sean Cribbs<s...@basho.com>
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On May 17, 2010, at 8:31 PM, Joel Pitt wrote:

Hi,

I'm trying to work out the best way of storing temporal data in Riak.

I've been investigating several NoSQL solutions and originally started
out using CouchDB, however I want to move to a db that scales more
gradually (CouchDB scales, but you really have to set up the
architecture before-hand and I'd prefer to be able to build a cluster
a node at a time)

In CouchDB, I use a multi-level key in a map-reduce view to create an
index by time. Each reduce level corresponds to year, month, day,
time... so I can easily get aggregate data for say a month.
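
For readers who have not used this pattern, a minimal Python emulation of grouping by a multi-level key. It is not CouchDB code; the rows, values and the group_by_level helper are hypothetical and only show how truncating the key to a given depth yields per-year, per-month or per-day totals.

```python
from collections import defaultdict

def group_by_level(rows, level):
    """Sum values over rows that share the first `level` key parts."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[tuple(key[:level])] += value
    return dict(totals)

# Hypothetical rows keyed by [year, month, day, time].
rows = [
    ([2010, 5, 17, "20:31"], 3),
    ([2010, 5, 18, "09:46"], 4),
    ([2010, 6, 1, "00:00"], 5),
]
per_month = group_by_level(rows, 2)
per_day = group_by_level(rows, 3)
```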

In addition to Riak I'm investigating Cassandra. In Cassandra the way
to store time series is by making the column keys timestamps and
sorting columns by TimeUUID. This allows one to do slices across a
range of time. This isn't exactly the same as what I have in CouchDB,
but by consensus it seems to be the way to store a time index.

Any suggestions for working with or creating time indexes in Riak?

Ideally I'd be able to query documents with a time range to either get
the documents, or to calculate aggregate statistics using a map-reduce
task.

Any information appreciated :-)

Joel Pitt, PhD | http://ferrouswheel.me | +64 21 101 7308
NetEmpathy Co-founder | http://netempathy.com
OpenCog Developer | http://opencog.org
Board member, Humanity+ | http://humanityplus.org

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


