Marco,

Thanks for stopping by to take a look at Riak. Here are a few thoughts for you to consider and try:
> We want the page view count by day for registered and unregistered users. We
> are storing session documents. Each document has a session identifier as its
> key and a list of page views as the value (and a few additional properties we
> can ignore). This document structure comes from CouchDB,

The "few additional properties we can ignore" actually can't be ignored. This is because the SpiderMonkey JS VMs running on each node still have to parse the JSON data for each document. If your document is 1K of JSON text, but you only care about a fraction of that data, the VM still has to parse all of it just to give you what essentially boils down to a length value. This is going to end up being a time sink (you would need to profile to know how much). So...

> Would it be better to store each page hit under a new key,

Yes.

> Also, will updating the session documents be a problem for Riak?

You may want to consider using "links" in your documents to help speed this up: http://wiki.basho.com/Links.html. The data you are talking about storing is (from what little I know) largely immutable: the list of pages a user visited during their time on the site isn't going to change once the session is done. So you should be able to make a really small object that is nothing more than a key of <session>_page_count and a value that is the count of the pages visited, and then make a link from the original session object to that cached count object. Once that is in place, you can use a post-commit hook (http://wiki.basho.com/Commit-Hooks.html#Post-Commit-Hooks) so that whenever the session log is updated, the cached page count object is also updated, keeping the data in sync. (There is a rough sketch of such a hook below, after the notes on secondary indexes.) This will seriously cut down on the time spent in the JS VMs.

Next, without knowing what else is in the database (is it 100% logging data, or do your logs make up only 10% of the total data?), it's worth taking a moment to point out secondary indexes: http://wiki.basho.com/Secondary-Indexes.html. Right now, your map phase is traversing all the objects in the database. Buckets are *not* physical namespaces (like directories in a file system); the bucket name is merely a quick "early out" test for the map phase before your Javascript code is even executed. It's still fast, but maybe not fast enough. CouchDB views make a tradeoff that gains performance at the cost of disk space, and you can use secondary indexes the same way. If you give all your page count objects a 2i index field and pass that index as the input to your map/reduce query, you instantly limit the objects being scanned to only those with the 2i field. This has the added benefit of allowing you to range query (e.g. if your field was a UTC timestamp, you could look at only the page hits for sessions over the last week, month, day, minute, …).
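For example, a map/reduce job whose input is a 2i range query might look something like the following. This is only a sketch: the "page_counts" bucket and the "day_int" index field are names I made up, and it assumes each count object was written with that index field attached.

    {
      "inputs": {
        "bucket": "page_counts",
        "index": "day_int",
        "start": 20120201,
        "end": 20120207
      },
      "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [parseInt(v.values[0].data)]; }"}},
        {"reduce": {"language": "javascript", "name": "Riak.reduceSum"}}
      ]
    }

POST that to /mapred and only the objects indexed in that date range ever reach the map phase. That particular job gives one total for the whole range; per-day numbers could be one small range query per day, which stays cheap because the index does the narrowing for you.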
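And here is the rough sketch of a post-commit hook mentioned above. Post-commit hooks are written in Erlang, not Javascript, and everything below is hypothetical: the module and bucket names are made up, and it assumes the session value is a JSON object with a "page_views" array. Treat it as a starting point, not a drop-in solution.

    -module(page_count_hook).
    -export([update_count/1]).

    %% Called by Riak after each successful put of a session object.
    update_count(SessionObj) ->
        SessionKey = riak_object:key(SessionObj),
        %% Parse the session JSON (mochijson2 ships with Riak) and
        %% pull out the list of page views.
        {struct, Props} = mochijson2:decode(riak_object:get_value(SessionObj)),
        PageViews = proplists:get_value(<<"page_views">>, Props, []),
        Count = length(PageViews),
        %% Write the tiny cached count object: <session>_page_count
        CountKey = <<SessionKey/binary, "_page_count">>,
        CountObj = riak_object:new(<<"page_counts">>, CountKey,
                                   list_to_binary(integer_to_list(Count))),
        {ok, C} = riak:local_client(),
        C:put(CountObj, 1).

You would compile that, put the beam on every node's code path, and enable it by setting the bucket's postcommit property, e.g. {"props":{"postcommit":[{"mod":"page_count_hook","fun":"update_count"}]}}.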
Hope this helps. If you have the time/ability to try the above and give feedback on the results, I'd be very interested in hearing them and helping further.

--
Jeffrey Massung
j...@basho.com

On Feb 12, 2012, at 4:27 AM, Marco Monteiro wrote:

> Hello!
>
> I'm considering Riak for the statistics of a site that is approaching a billion page views per month.
> The plan is to log a little information about each page view and then to query that data.
>
> I'm very new to Riak. I've gone over the documentation on the wiki, and I know about map-reduce,
> secondary indexes and Riak search. I've installed Riak on a single node and made a test with the
> default configuration. The results were a little below what I expected. For the test I used the following
> requirement.
>
> We want the page view count by day for registered and unregistered users. We are storing session
> documents. Each document has a session identifier as its key and a list of page views as the value
> (and a few additional properties we can ignore). This document structure comes from CouchDB,
> where I organised things like this to be able to more easily query the database. I've done a basic
> javascript map-reduce query for this. I just map over each session (every k/v in a bucket), returning
> the length of the page views array for either the registered or unregistered field (the other is zero), and
> the day of the request. In the reduce I collect them by hashing the day and summing the two numbers
> of page views. Then I have a second reduce to sort the list by day.
>
> This is very slow on a single-machine setup with the default Riak configuration. 1,000 sessions takes
> 6 seconds. 10,000 sessions takes more than 2 minutes (timeout). We want to handle 10,000,000
> sessions, at least. Is there a way, maybe with secondary indexes, to make this go faster using only Riak?
> Or must I use some kind of persistent cache to store this info as time goes by? Or can I make Riak
> run 100 times faster by tweaking the config? I don't want to have 1000 machines to make this work.
>
> Also, will updating the session documents be a problem for Riak? Would it be better to store each
> page hit under a new key, to not update the session document? Because of the "multilevel" map
> reduce this can work on Riak, where it didn't work on CouchDB because of its view system limitations.
> Unfortunately, with the update of documents the CouchDB database was growing way too fast for it
> to be a feasible solution.
>
> Any advice to make Riak work for this problem is greatly appreciated.
>
> Thanks,
> Marco
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com