Thank you to everyone who answered my first message. I got ideas from every answer; most helpful.
We decided to go ahead with a few more tests. We are storing each page hit as a new object. Some of the queries we need to run visit many objects, and doing that in real time would not work, so we have implemented a system that, every hour, goes through all the objects from the previous hour and creates summary objects in another bucket. We then map-reduce this new set of objects to query in real time.

We started with about 1 request every 2 seconds and all was working fine. Then we went to 50 requests per second and it stopped working. When running the hourly map-reduce I get error messages like the following for every single map-reduce query, even ones that use secondary indexes and should only touch a few dozen objects:

{ [Error: HTTP error 500: {"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\"whs2\">>,<<\"E4cWJZs2mZtnZqtv2xXz957lnY4\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\"Links\">>]],[],[],[],[],[],[],[],[[<<\"content-type\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\"X-Riak-VTag\">>,50,121,57,75,51,109,84,103,90,74,87,108,53,67,83,89,86,88,97,82,65,52]],[[<<\"index\">>,{<<\"htime_int\">>,1329642026123},{<<\"hua_bin\">>,<<\"IE\">>},{<<\"pid_bin\">>,<<\"56a710af68f46d1a\">>},{<<\"sid_bin\">>,<<\"95...\">>}]],...}}},...}],...},...}","type":"forward_preflist","stack":"[]"}] statusCode: 500 }

We are using 3 large instances on EC2; I'm sure that all instances are running. We are using eleveldb (because of the secondary indexes), and Riak 1.1.0rc2-1 is running on Ubuntu.

Thanks for the help.

Cheers,
Marco

On 13 February 2012 23:59, Jeffrey Massung <j...@basho.com> wrote:

> Marco,
>
> Thanks for stopping to take a look at Riak. Here are a few thoughts for you to consider and try:
>
> We want the page view count by day for registered and unregistered users. We are storing session documents. Each document has a session identifier as its key and a list of page views as the value (and a few additional properties we can ignore). This document structure comes from CouchDB,
>
> The "few additional properties we can ignore" actually can't be ignored. The reason is that the SpiderMonkey JS VMs running on each node still have to parse the JSON data for every document. If your document is 1K of JSON text but you only care about a fraction of that data, the VM still has to parse all of it to give you what essentially boils down to a length value. This is going to end up being a time sink (you would need to profile to know how much). So, ...
>
> Would it be better to store each page hit under a new key,
>
> Yes.
>
> Also, will updating the session documents be a problem for Riak?
>
> You may want to consider using "links" in your documents to help speed this up: http://wiki.basho.com/Links.html. The data you are talking about storing is (from what little I know) largely immutable. The list of pages a user visited during their time on the site isn't going to change once the session is done. So you should be able to make a really small object that is nothing more than a key of <session>_page_count and a value that is the count of the pages visited. You can then link the original session object to this cached count object.
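For what it's worth, a minimal sketch of that cached-count-plus-link idea against Riak's HTTP interface, using Node's built-in http module. The bucket names ("sessions", "session_counts"), the key "abc123" and the count value are made up for illustration, not taken from this thread:

var http = require('http');

// Small helper: PUT a value to Riak's HTTP interface.
function put(path, headers, body, done) {
  var req = http.request(
    { host: '127.0.0.1', port: 8098, method: 'PUT', path: path, headers: headers },
    function (res) { done(null, res.statusCode); });
  req.on('error', done);
  req.end(body);
}

// 1. Store the tiny cached page-count object.
put('/riak/session_counts/abc123_page_count',
    { 'Content-Type': 'application/json' },
    JSON.stringify({ pages: 17 }),
    function (err) {
      if (err) return console.error(err);
      // 2. Save the session object with a link pointing at the cached count,
      //    so a link walk (or a link phase in map/reduce) can reach it later.
      put('/riak/sessions/abc123',
          { 'Content-Type': 'application/json',
            'Link': '</riak/session_counts/abc123_page_count>; riaktag="page_count"' },
          JSON.stringify({ sid: 'abc123', pages_visited: ['/a', '/b'] }),
          function (err2, code) { console.log('stored session + link:', err2 || code); });
    });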
> Once in place, you can then use a post-commit hook (http://wiki.basho.com/Commit-Hooks.html#Post-Commit-Hooks) so that whenever the session log is updated, the cached page count object is also updated, keeping the data in sync. This will seriously cut down on the time spent in the JS VMs.
>
> Next, without knowing what else is in the database (is it 100% logging data, or do your logs make up only 10% of the total data?), it's worth taking a moment to point out secondary indexes: http://wiki.basho.com/Secondary-Indexes.html. Right now, your map is traversing all the objects in the database. Buckets are *not* physical namespaces (like directories in a file system). The bucket name is merely a quick "early out" for the map phase before your Javascript code is even executed. It's still fast, but maybe not fast enough. CouchDB views make a tradeoff to gain performance at the cost of disk space, and you can use secondary indexes the same way. If you give all your page count objects a 2i index field, you can then pass that index as the input to a map/reduce query; you are now limiting which objects get scanned to only those with the 2i field. This has the added benefit of allowing you to range query (e.g. if your field is a UTC timestamp, you could look at only the page hits for sessions over the last week, month, day, minute, ...).
>
> Hope this helps. If you have the time/ability to try the above and give feedback on the results, I'd be very interested in hearing them and helping further.
>
> --
> Jeffrey Massung
> j...@basho.com
>
> On Feb 12, 2012, at 4:27 AM, Marco Monteiro wrote:
>
> Hello!
>
> I'm considering Riak for the statistics of a site that is approaching a billion page views per month. The plan is to log a little information about each page view and then to query that data.
>
> I'm very new to Riak. I've gone over the documentation on the wiki, and I know about map-reduce, secondary indexes and Riak search. I've installed Riak on a single node and made a test with the default configuration. The results were a little below what I expected. For the test I used the following requirement.
>
> We want the page view count by day for registered and unregistered users. We are storing session documents. Each document has a session identifier as its key and a list of page views as the value (and a few additional properties we can ignore). This document structure comes from CouchDB, where I organised things like this to be able to query the database more easily. I've done a basic javascript map-reduce query for this: I map over each session (every k/v in a bucket), returning the length of the page views array for either the registered or unregistered field (the other is zero), and the day of the request. In the reduce I collect them by hashing on the day and summing the two numbers of page views. Then I have a second reduce to sort the list by day.
>
> This is very slow on a single-machine setup with the default Riak configuration. 1,000 sessions take 6 seconds; 10,000 sessions take more than 2 minutes (timeout). We want to handle at least 10,000,000 sessions. Is there a way, maybe with secondary indexes, to make this go faster using only Riak? Or must I use some kind of persistent cache to store this info as time goes by? Or can I make Riak run 100 times faster by tweaking the config? I don't want to have 1000 machines to make this work.
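For concreteness, here is a rough sketch of the kind of Javascript phase functions the query described above could use; these run inside Riak's SpiderMonkey VM. The field names "day", "registered" and "unregistered" are assumptions about the session JSON, not taken from this thread:

// Map: one session object in, one { day: {registered, unregistered} } record out.
function mapPageViews(value, keyData, arg) {
  var doc = Riak.mapValuesJson(value)[0];
  var out = {};
  out[doc.day] = {
    registered: doc.registered ? doc.registered.length : 0,
    unregistered: doc.unregistered ? doc.unregistered.length : 0
  };
  return [out];
}

// Reduce: merge the per-session records, summing the counts per day.
function reducePageViews(values, arg) {
  var acc = {};
  values.forEach(function (v) {
    for (var day in v) {
      if (!acc[day]) acc[day] = { registered: 0, unregistered: 0 };
      acc[day].registered += v[day].registered;
      acc[day].unregistered += v[day].unregistered;
    }
  });
  return [acc];
}

Note that the reduce output has the same shape as its input, which matters because Riak may re-run a reduce phase over its own partial results; the sort by day can then be a second reduce phase, as described above.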
> Also, will updating the session documents be a problem for Riak? Would it be better to store each page hit under a new key, so as not to update the session document? Because of the "multilevel" map/reduce this can work on Riak, where it didn't work on CouchDB because of its view system's limitations. Unfortunately, with the updating of documents, the CouchDB database was growing far too fast for it to be a feasible solution.
>
> Any advice to make Riak work for this problem is greatly appreciated.
>
> Thanks,
> Marco
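As an aside on the "one object per page hit, with an index" approach: below is a rough sketch (not code from this thread) of writing a page-hit object with secondary index entries and then feeding a 2i range into a map/reduce job as its input, via Riak's HTTP interface from Node.js. The bucket "whs2" and the index names "htime_int" and "sid_bin" appear in the error output above; the key, session id and time window are hypothetical:

var http = require('http');

// Write one object per page hit, tagged with secondary index entries.
var put = http.request({
  host: '127.0.0.1', port: 8098, method: 'PUT',
  path: '/riak/whs2/hit-0001',                      // hypothetical key
  headers: {
    'Content-Type': 'application/json',
    'x-riak-index-htime_int': String(Date.now()),   // hit timestamp in ms
    'x-riak-index-sid_bin': 'example-session-id'    // hypothetical session id
  }
}, function (res) { console.log('PUT hit:', res.statusCode); });
put.end(JSON.stringify({ url: '/some/page', registered: true }));

// Count only the hits in one hour by using a 2i range query as the
// map/reduce input, instead of scanning the whole bucket.
var job = JSON.stringify({
  inputs: { bucket: 'whs2', index: 'htime_int',
            start: 1329642000000, end: 1329645599999 },  // hypothetical hour window
  query: [
    { map:    { language: 'javascript', source: 'function(v) { return [1]; }' } },
    { reduce: { language: 'javascript', name: 'Riak.reduceSum' } }
  ]
});
var mr = http.request({
  host: '127.0.0.1', port: 8098, method: 'POST', path: '/mapred',
  headers: { 'Content-Type': 'application/json' }
}, function (res) {
  var out = '';
  res.on('data', function (chunk) { out += chunk; });
  res.on('end', function () { console.log('hits in window:', out); });
});
mr.end(job);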
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com