Marco,

Thanks for stopping by to take a look at Riak. Here are a few thoughts for you to consider and try:
> We want the page view count by day for registered and unregistered users. We
> are storing session documents. Each document has a session identifier as its
> key and a list of page views as the value (and a few additional properties we
> can ignore). This document structure comes from CouchDB,

The "few additional properties we can ignore" actually can't be ignored. This is because the SpiderMonkey JS VMs running on each node still have to parse the JSON data for each document. If your document is 1K of JSON text, but you only care about a fraction of that data, the VM still has to parse all of it just to give you what essentially boils down to a length value. This is going to end up being a time sink (you would need to profile to know how much). So...

> Would it be better to store each page hit under a new key,

Yes.

> Also, will updating the session documents be a problem for Riak?

You may want to consider using "links" in your documents to help speed this up: http://wiki.basho.com/Links.html. The data you are talking about storing is (from what little I know) largely immutable: the list of pages a user visited during their time on the site isn't going to change once the session is done. So you should be able to make a really small object that is nothing more than a key of <session>_page_count and a value that is the count of the pages visited, and then make a link from the original session object to that cached count object. Once that is in place, you can use a post-commit hook (http://wiki.basho.com/Commit-Hooks.html#Post-Commit-Hooks) so that whenever the session log is updated, the cached page count object is also updated, keeping the data in sync. (There is a rough sketch of such a hook below, after the notes on secondary indexes.) This will seriously cut down on the time spent in the JS VMs.

Next, without knowing what else is in the database (is it 100% logging data, or do your logs make up only 10% of the total data?), it's worth taking a moment to point out secondary indexes: http://wiki.basho.com/Secondary-Indexes.html. Right now, your map phase is traversing all the objects in the database. Buckets are *not* physical namespaces (like directories in a file system); the bucket name is merely a quick "early out" test for the map phase before your Javascript code is even executed. It's still fast, but maybe not fast enough. CouchDB views make a tradeoff that gains performance at the cost of disk space, and you can use secondary indexes the same way. If you give all your page count objects a 2i index field and pass that index as the input to your map/reduce query, you instantly limit the objects being scanned to only those with the 2i field. This has the added benefit of allowing you to range query (e.g. if your field was a UTC timestamp, you could look at only the page hits for sessions over the last week, month, day, minute, …).
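For example, a map/reduce job whose input is a 2i range query might look something like the following. This is only a sketch: the "page_counts" bucket and the "day_int" index field are names I made up, and it assumes each count object was written with that index field attached.

    {
      "inputs": {
        "bucket": "page_counts",
        "index": "day_int",
        "start": 20120201,
        "end": 20120207
      },
      "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [parseInt(v.values[0].data)]; }"}},
        {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}}
      ]
    }

POST that to /mapred and only the objects indexed in that date range ever reach the map phase. That particular job gives one total for the whole range; per-day numbers could be one small range query per day, which stays cheap because the index does the narrowing for you.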
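And here is the rough sketch of a post-commit hook mentioned above. Post-commit hooks are written in Erlang, not Javascript, and everything below is hypothetical: the module and bucket names are made up, and it assumes the session value is a JSON object with a "page_views" array. Treat it as a starting point, not a drop-in solution.

    -module(page_count_hook).
    -export([update_count/1]).

    %% Called by Riak after each successful put of a session object.
    update_count(SessionObj) ->
        SessionKey = riak_object:key(SessionObj),
        %% Parse the session JSON (mochijson2 ships with Riak) and
        %% pull out the list of page views.
        {struct, Props} = mochijson2:decode(riak_object:get_value(SessionObj)),
        PageViews = proplists:get_value(<<"page_views">>, Props, []),
        Count = length(PageViews),
        %% Write the tiny cached count object: <session>_page_count
        CountKey = <<SessionKey/binary, "_page_count">>,
        CountObj = riak_object:new(<<"page_counts">>, CountKey,
                                   list_to_binary(integer_to_list(Count))),
        {ok, C} = riak:local_client(),
        C:put(CountObj, 1).

You would compile that, put the beam on every node's code path, and enable it by setting the bucket's postcommit property, e.g. {"props":{"postcommit":[{"mod":"page_count_hook","fun":"update_count"}]}}.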
Hope this helps. If you have the time/ability to try the above and give feedback on the results, I'd be very interested in hearing them and helping further.

--
Jeffrey Massung
j...@basho.com

On Feb 12, 2012, at 4:27 AM, Marco Monteiro wrote:

> Hello!
>
> I'm considering Riak for the statistics of a site that is approaching a billion page views per month.
> The plan is to log a little information about each page view and then to query that data.
>
> I'm very new to Riak. I've gone over the documentation on the wiki, and I know about map-reduce,
> secondary indexes and Riak search. I've installed Riak on a single node and made a test with the
> default configuration. The results were a little below what I expected. For the test I used the following
> requirement.
>
> We want the page view count by day for registered and unregistered users. We are storing session
> documents. Each document has a session identifier as its key and a list of page views as the value
> (and a few additional properties we can ignore). This document structure comes from CouchDB,
> where I organised things like this to be able to more easily query the database. I've done a basic
> javascript map-reduce query for this. I just map over each session (every k/v in a bucket), returning
> the length of the page views array for either the registered or unregistered field (the other is zero), and
> the day of the request. In the reduce I collect them by hashing the day and summing the two numbers
> of page views. Then I have a second reduce to sort the list by day.
>
> This is very slow on a single-machine setup with the default Riak configuration. 1,000 sessions takes
> 6 seconds. 10,000 sessions takes more than 2 minutes (timeout). We want to handle 10,000,000
> sessions, at least. Is there a way, maybe with secondary indexes, to make this go faster using only Riak?
> Or must I use some kind of persistent cache to store this info as time goes by? Or can I make Riak
> run 100 times faster by tweaking the config? I don't want to have 1000 machines to make this work.
>
> Also, will updating the session documents be a problem for Riak? Would it be better to store each
> page hit under a new key, to not update the session document? Because of the "multilevel" map
> reduce this can work on Riak, where it didn't work on CouchDB because of its view system limitations.
> Unfortunately, with the update of documents the CouchDB database was growing way too fast for it
> to be a feasible solution.
>
> Any advice to make Riak work for this problem is greatly appreciated.
>
> Thanks,
> Marco
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com