Thank you to everyone who answered my first message. I got ideas from every answer; most helpful.
We decided to go ahead with a few more tests. We are storing each page hit as a new object. Some of the queries we need to run visit many objects, and doing that in real time would not work, so we have implemented a system that, every hour, goes through all the objects from the previous hour and creates summary objects in another bucket. We then map-reduce this new set of objects to query in real time.

We started with about 1 request every 2 seconds and all was working fine. Then we went to 50 requests per second and it stopped working. When running the hourly map-reduce I get error messages like the following for every single map-reduce query, even ones that use secondary indexes and should only touch a few dozen objects:

{ [Error: HTTP error 500: {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"whs2\">>,<<\"E4cWJZs2mZtnZqtv2xXz957lnY4\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\"Links\">>]],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,50,121,57,75,51,109,84,103,90,74,87,108,53,67,83,89,86,88,97,82,65,52]],[[<<\"index\">>,{<<\"htime_int\">>,1329642026123},{<<\"hua_bin\">>,<<\"IE\">>},{<<\"pid_bin\">>,<<\"56a710af68f46d1a\">>},{<<\"sid_bin\">>,<<\"95...\">>}]],...}}},...}],...},...}","type":"forward_preflist","stack":"[]"}] statusCode: 500 }

We are using 3 large instances on EC2; I'm sure that all instances are running. We are using eleveldb (because of the secondary indexes), and Riak 1.1.0rc2-1 is running on Ubuntu.

Thanks for the help.

Cheers,
Marco

On 13 February 2012 23:59, Jeffrey Massung <j...@basho.com> wrote:

> Marco,
>
> Thanks for stopping to take a look at Riak. Here are a few thoughts for you to consider and try:
>
> We want the page view count by day for registered and unregistered users. We are storing session documents. Each document has a session identifier as its key and a list of page views as the value (and a few additional properties we can ignore). This document structure comes from CouchDB,
>
> The "few additional properties we can ignore" actually can't be ignored. The reason is that the SpiderMonkey JS VMs running on each node still have to parse the JSON data for every document. If your document is 1K of JSON text but you only care about a fraction of that data, the VM still has to parse all of it to give you what essentially boils down to a length value. This is going to end up being a time sink (you would need to profile to know how much). So, ...
>
> Would it be better to store each page hit under a new key,
>
> Yes.
>
> Also, will updating the session documents be a problem for Riak?
>
> You may want to consider using "links" in your documents to help speed this up: http://wiki.basho.com/Links.html. The data you are talking about storing is (from what little I know) largely immutable. The list of pages a user visited during their time on the site isn't going to change once the session is done. So you should be able to make a really small object that is nothing more than a key of <session>_page_count and a value that is the count of the pages visited. You can then link the original session object to this cached count object.
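For what it's worth, a minimal sketch of that cached-count-plus-link idea against Riak's HTTP interface, using Node's built-in http module. The bucket names ("sessions", "session_counts"), the key "abc123" and the count value are made up for illustration, not taken from this thread:

var http = require('http');

// Small helper: PUT a value to Riak's HTTP interface.
function put(path, headers, body, done) {
  var req = http.request(
    { host: '127.0.0.1', port: 8098, method: 'PUT', path: path, headers: headers },
    function (res) { done(null, res.statusCode); });
  req.on('error', done);
  req.end(body);
}

// 1. Store the tiny cached page-count object.
put('/riak/session_counts/abc123_page_count',
    { 'Content-Type': 'application/json' },
    JSON.stringify({ pages: 17 }),
    function (err) {
      if (err) return console.error(err);
      // 2. Save the session object with a link pointing at the cached count,
      //    so a link walk (or a link phase in map/reduce) can reach it later.
      put('/riak/sessions/abc123',
          { 'Content-Type': 'application/json',
            'Link': '</riak/session_counts/abc123_page_count>; riaktag="page_count"' },
          JSON.stringify({ sid: 'abc123', pages_visited: ['/a', '/b'] }),
          function (err2, code) { console.log('stored session + link:', err2 || code); });
    });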
> Once in place, you can then use a post-commit hook (http://wiki.basho.com/Commit-Hooks.html#Post-Commit-Hooks) so that whenever the session log is updated, the cached page count object is also updated, keeping the data in sync. This will seriously cut down on the time spent in the JS VMs.
>
> Next, without knowing what else is in the database (is it 100% logging data, or do your logs make up only 10% of the total data?), it's worth taking a moment to point out secondary indexes: http://wiki.basho.com/Secondary-Indexes.html. Right now, your map is traversing all the objects in the database. Buckets are *not* physical namespaces (like directories in a file system). The bucket name is merely a quick "early out" for the map phase before your Javascript code is even executed. It's still fast, but maybe not fast enough. CouchDB views make a tradeoff to gain performance at the cost of disk space, and you can use secondary indexes the same way. If you give all your page count objects a 2i index field, you can then pass that index as the input to a map/reduce query; you are now limiting which objects get scanned to only those with the 2i field. This has the added benefit of allowing you to range query (e.g. if your field is a UTC timestamp, you could look at only the page hits for sessions over the last week, month, day, minute, ...).
>
> Hope this helps. If you have the time/ability to try the above and give feedback on the results, I'd be very interested in hearing them and helping further.
>
> --
> Jeffrey Massung
> j...@basho.com
>
> On Feb 12, 2012, at 4:27 AM, Marco Monteiro wrote:
>
> Hello!
>
> I'm considering Riak for the statistics of a site that is approaching a billion page views per month. The plan is to log a little information about each page view and then to query that data.
>
> I'm very new to Riak. I've gone over the documentation on the wiki, and I know about map-reduce, secondary indexes and Riak search. I've installed Riak on a single node and made a test with the default configuration. The results were a little below what I expected. For the test I used the following requirement.
>
> We want the page view count by day for registered and unregistered users. We are storing session documents. Each document has a session identifier as its key and a list of page views as the value (and a few additional properties we can ignore). This document structure comes from CouchDB, where I organised things like this to be able to query the database more easily. I've done a basic javascript map-reduce query for this: I map over each session (every k/v in a bucket), returning the length of the page views array for either the registered or unregistered field (the other is zero), and the day of the request. In the reduce I collect them by hashing on the day and summing the two numbers of page views. Then I have a second reduce to sort the list by day.
>
> This is very slow on a single-machine setup with the default Riak configuration. 1,000 sessions take 6 seconds; 10,000 sessions take more than 2 minutes (timeout). We want to handle at least 10,000,000 sessions. Is there a way, maybe with secondary indexes, to make this go faster using only Riak? Or must I use some kind of persistent cache to store this info as time goes by? Or can I make Riak run 100 times faster by tweaking the config? I don't want to have 1000 machines to make this work.
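For concreteness, here is a rough sketch of the kind of Javascript phase functions the query described above could use; these run inside Riak's SpiderMonkey VM. The field names "day", "registered" and "unregistered" are assumptions about the session JSON, not taken from this thread:

// Map: one session object in, one { day: {registered, unregistered} } record out.
function mapPageViews(value, keyData, arg) {
  var doc = Riak.mapValuesJson(value)[0];
  var out = {};
  out[doc.day] = {
    registered: doc.registered ? doc.registered.length : 0,
    unregistered: doc.unregistered ? doc.unregistered.length : 0
  };
  return [out];
}

// Reduce: merge the per-session records, summing the counts per day.
function reducePageViews(values, arg) {
  var acc = {};
  values.forEach(function (v) {
    for (var day in v) {
      if (!acc[day]) acc[day] = { registered: 0, unregistered: 0 };
      acc[day].registered += v[day].registered;
      acc[day].unregistered += v[day].unregistered;
    }
  });
  return [acc];
}

Note that the reduce output has the same shape as its input, which matters because Riak may re-run a reduce phase over its own partial results; the sort by day can then be a second reduce phase, as described above.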
> Also, will updating the session documents be a problem for Riak? Would it be better to store each page hit under a new key, so as not to update the session document? Because of the "multilevel" map/reduce this can work on Riak, where it didn't work on CouchDB because of its view system's limitations. Unfortunately, with the updating of documents, the CouchDB database was growing far too fast for it to be a feasible solution.
>
> Any advice to make Riak work for this problem is greatly appreciated.
>
> Thanks,
> Marco
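As an aside on the "one object per page hit, with an index" approach: below is a rough sketch (not code from this thread) of writing a page-hit object with secondary index entries and then feeding a 2i range into a map/reduce job as its input, via Riak's HTTP interface from Node.js. The bucket "whs2" and the index names "htime_int" and "sid_bin" appear in the error output above; the key, session id and time window are hypothetical:

var http = require('http');

// Write one object per page hit, tagged with secondary index entries.
var put = http.request({
  host: '127.0.0.1', port: 8098, method: 'PUT',
  path: '/riak/whs2/hit-0001',                      // hypothetical key
  headers: {
    'Content-Type': 'application/json',
    'x-riak-index-htime_int': String(Date.now()),   // hit timestamp in ms
    'x-riak-index-sid_bin': 'example-session-id'    // hypothetical session id
  }
}, function (res) { console.log('PUT hit:', res.statusCode); });
put.end(JSON.stringify({ url: '/some/page', registered: true }));

// Count only the hits in one hour by using a 2i range query as the
// map/reduce input, instead of scanning the whole bucket.
var job = JSON.stringify({
  inputs: { bucket: 'whs2', index: 'htime_int',
            start: 1329642000000, end: 1329645599999 },  // hypothetical hour window
  query: [
    { map:    { language: 'javascript', source: 'function(v) { return [1]; }' } },
    { reduce: { language: 'javascript', name: 'Riak.reduceSum' } }
  ]
});
var mr = http.request({
  host: '127.0.0.1', port: 8098, method: 'POST', path: '/mapred',
  headers: { 'Content-Type': 'application/json' }
}, function (res) {
  var out = '';
  res.on('data', function (chunk) { out += chunk; });
  res.on('end', function () { console.log('hits in window:', out); });
});
mr.end(job);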
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com