This seems like the bread and butter of most Hadoop/MapReduce implementations. If you need some data closer to real time, you could add HBase on top. If you need ad hoc querying, use Hive or Pig. If you want to do clustering and more advanced analyses, use Mahout.
We love Riak and use it extensively, but I wouldn't suggest using it for the use case you're describing.

Jacques

On Sun, Feb 12, 2012 at 3:27 AM, Marco Monteiro <ma...@textovirtual.com> wrote:
> Hello!
>
> I'm considering Riak for the statistics of a site that is approaching a
> billion page views per month. The plan is to log a little information
> about each page view and then to query that data.
>
> I'm very new to Riak. I've gone over the documentation on the wiki, and I
> know about map-reduce, secondary indexes, and Riak search. I've installed
> Riak on a single node and run a test with the default configuration. The
> results were a little below what I expected. For the test I used the
> following requirement.
>
> We want the page view count by day for registered and unregistered users.
> We are storing session documents. Each document has a session identifier
> as its key and a list of page views as the value (and a few additional
> properties we can ignore). This document structure comes from CouchDB,
> where I organised things like this to be able to query the database more
> easily. I've written a basic JavaScript map-reduce query for this. I map
> over each session (every k/v pair in the bucket), returning the length of
> the page views array in either the registered or the unregistered field
> (the other is zero), along with the day of the request. In the reduce I
> collect the results by hashing on the day and summing the two page view
> counts. A second reduce then sorts the list by day.
>
> This is very slow on a single-machine setup with the default Riak
> configuration: 1,000 sessions take 6 seconds, and 10,000 sessions take
> more than 2 minutes (timeout). We want to handle at least 10,000,000
> sessions. Is there a way, maybe with secondary indexes, to make this go
> faster using only Riak? Or must I use some kind of persistent cache to
> store this information as time goes by? Or can I make Riak run 100 times
> faster by tweaking the config?
> I don't want to have 1,000 machines to make this work.
>
> Also, will updating the session documents be a problem for Riak? Would it
> be better to store each page hit under a new key, so as not to update the
> session document? Because of the "multilevel" map-reduce, this can work
> on Riak, where it didn't work on CouchDB because of its view system's
> limitations. Unfortunately, with the updates to the documents, the
> CouchDB database was growing far too fast for it to be a feasible
> solution.
>
> Any advice on making Riak work for this problem is greatly appreciated.
>
> Thanks,
> Marco
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
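For what it's worth, the map/two-reduce pipeline Marco describes can be sketched in plain JavaScript. This is only a sketch of the logic, not Riak's actual phase API; the session document shape and field names (`registered`, `pageViews`, `day`) are hypothetical, since the real document structure isn't shown in the thread:

```javascript
// Hypothetical session document shape:
//   { registered: true|false, pageViews: ["/a", "/b", ...], day: "2012-02-01" }

// Map: emit one record per session with the page-view count placed in
// either the registered or the unregistered slot (the other is zero).
function mapSession(session) {
  var n = session.pageViews.length;
  return [{
    day: session.day,
    registered: session.registered ? n : 0,
    unregistered: session.registered ? 0 : n
  }];
}

// First reduce: hash by day and sum the two counters.
function reduceByDay(values) {
  var byDay = {};
  values.forEach(function (v) {
    var acc = byDay[v.day] || { day: v.day, registered: 0, unregistered: 0 };
    acc.registered += v.registered;
    acc.unregistered += v.unregistered;
    byDay[v.day] = acc;
  });
  return Object.keys(byDay).map(function (d) { return byDay[d]; });
}

// Second reduce: sort the per-day records chronologically.
function sortByDay(values) {
  return values.slice().sort(function (a, b) {
    return a.day < b.day ? -1 : a.day > b.day ? 1 : 0;
  });
}
```

Running every session through the map and funnelling all mapped values into one reduce is exactly why this gets slow at scale: the work is O(total sessions) per query, which is the full-bucket-scan cost the thread is warning about.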