Is Riak a good solution for this problem?

2012-02-12 Thread Marco Monteiro
Hello!

I'm considering Riak for the statistics of a site that is approaching a
billion page views per month. The plan is to log a little information about
each page view and then to query that data.

I'm very new to Riak. I've gone over the documentation on the wiki, and I
know about map-reduce, secondary indexes and Riak Search. I've installed
Riak on a single node and made a test with the default configuration. The
results were a little below what I expected. For the test I used the
following requirement.

We want the page view count by day for registered and unregistered users. We
are storing session documents. Each document has a session identifier as its
key and a list of page views as the value (plus a few additional properties
we can ignore). This document structure comes from CouchDB, where I organised
things like this so I could query the database more easily. I've written a
basic JavaScript map-reduce query for this. I map over each session (every
k/v in a bucket), returning the day of the request and the length of the page
views array in either the registered or the unregistered field (the other is
zero). In the reduce I collect the results by hashing on the day and summing
the two page view counts. Then I have a second reduce to sort the list by
day.
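
Roughly, the map and reduce phases look like this (a sketch only; the
session field names and the date format are simplified stand-ins for the
real document structure):

    // Map phase: one {day, registered, unregistered} entry per session.
    var mapSessionViews = function (value, keyData, arg) {
      var session = Riak.mapValuesJson(value)[0];
      var n = (session.page_views || []).length;
      var isRegistered = Boolean(session.user_id);  // stand-in for the real flag
      return [{
        day: session.day,                           // e.g. "2012-02-12"
        registered: isRegistered ? n : 0,
        unregistered: isRegistered ? 0 : n
      }];
    };

    // First reduce: collect by day and sum the two counters. The output has
    // the same shape as the map output, so partial results can be re-reduced.
    var reduceByDay = function (values, arg) {
      var byDay = {}, out = [], i, d;
      for (i = 0; i < values.length; i++) {
        d = values[i].day;
        if (!byDay[d]) { byDay[d] = { day: d, registered: 0, unregistered: 0 }; }
        byDay[d].registered += values[i].registered;
        byDay[d].unregistered += values[i].unregistered;
      }
      for (d in byDay) { out.push(byDay[d]); }
      return out;
    };

    // Second reduce: sort the collected entries by day.
    var sortByDay = function (values, arg) {
      return values.sort(function (a, b) {
        return a.day === b.day ? 0 : (a.day < b.day ? -1 : 1);
      });
    };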

This is very slow on a single-machine setup with the default Riak
configuration: 1,000 sessions take 6 seconds, and 10,000 sessions take more
than 2 minutes (timeout). We want to handle at least 10,000,000 sessions. Is
there a way, maybe with secondary indexes, to make this faster using only
Riak? Or must I use some kind of persistent cache to store this information
as time goes by? Or can I make Riak run 100 times faster by tweaking the
config? I don't want to need 1,000 machines to make this work.
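
As a back-of-the-envelope check, assuming the cost scales roughly linearly
with the number of sessions:

    6 s / 1,000 sessions          ~  6 ms per session
    10,000,000 sessions x 6 ms    ~  60,000 s, roughly 17 hours per query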

Also, will updating the session documents be a problem for Riak? Would it be
better to store each page hit under a new key, so that the session document
is never updated? Thanks to the "multilevel" map-reduce this can work on
Riak, where it didn't work on CouchDB because of its view system limitations.
Unfortunately, with the document updates the CouchDB database was growing far
too fast for it to be a feasible solution.


Any advice to make Riak work for this problem is greatly appreciated.

Thanks,
Marco


Re: Is Riak a good solution for this problem?

2012-02-20 Thread Marco Monteiro
> [...]ces the same way. If you give all your page count objects a 2i index
> field, you can then pass it as an input to a map/reduce query; you are now
> instantly limiting which objects get scanned to only those with the 2i
> field. This has the added benefit of allowing you to range query (e.g. if
> your field was a UTC timestamp, you could look at only the page hits for
> sessions over the last week, month, day, minute, …).
>
> Hope this helps. If you have the time/ability to try the above and give
> feedback on the results, I'd be very interested in learning them and
> helping further.
>
> --
> Jeffrey Massung
> j...@basho.com
>


Problems writing objects to a half-full bucket

2012-03-05 Thread Marco Monteiro
Hello!

I have a Riak cluster and I'm seeing a write failure rate of 10% to 30%
(it varies by node). At the moment I am writing about 300 new objects per
second to the same bucket. If I direct the writes to a new (empty) bucket,
the problem goes away and I don't see any failures.

The non-empty bucket has between 2 and 3 million objects. Each object
has between 4 and 8 secondary indexes (most have 4).

When we started the system yesterday, it handled a peak of about 1,000
writes per second without problems, on the same hardware.

The cluster has 6 nodes, all running Debian with Riak 1.0.3. We tried Riak
1.1 at first, but hit the known map-reduce problem and reverted.

I asked for help on the IRC channel, and pharkmillups suggested that Riak is
simply trying to write too many things to disk, given the secondary indexes.

This is primarily an issue report, but if anyone has an idea of how a
configuration change could fix it, please do tell. I would also like to know
what the problem is (why this happens) and whether it could be fixed in the
next few days, perhaps in a new release of Riak 1.1 along with the fixes for
the map-reduce problems.

Thanks,
Marco


Re: Problems writing objects to a half-full bucket

2012-03-05 Thread Marco Monteiro
Hi, David!

On 6 March 2012 04:37, David Smith  wrote:

> 1. What sort of error are you getting when a write fails?
>

I'm using riak-js and the error I get is:

{ [Error: socket hang up] code: 'ECONNRESET' }


> 2. What backend are you using? (I'm guessing LevelDB)
>

LevelDB. The documentation says it is the only backend that supports 2i.


> 3. What do your keys look like? For example, are they date-based (and
> thus naturally increasing) or are they UUIDs? :)
>

UUIDs, created by Riak. All my queries use 2i. The 2i fields are integers
(representing seconds) and random strings (16 characters) used as
identifiers for user sessions and the like.
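
For reference, this is roughly what the raw HTTP traffic looks like; the
bucket, field names and values here are placeholders. The 2i fields ride
along as headers on the write, and range queries go through the index
endpoint:

    # store (Riak picks the key); 2i fields are sent as headers
    POST /buckets/sessions/keys
    Content-Type: application/json
    x-riak-index-created_at_int: 1330992000
    x-riak-index-session_bin: 0123456789abcdef

    # range query over the integer index
    GET /buckets/sessions/index/created_at_int/1330992000/1330995600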

Thanks,
Marco


Re: Problems writing objects to a half-full bucket

2012-03-06 Thread Marco Monteiro
It makes sense, David. I'm going to give it a try.
Hopefully this will make it usable for the next month
until the issue is addressed.

I'll let you know how it goes.

Thanks,
Marco

On 6 March 2012 15:19, David Smith  wrote:

> On Mon, Mar 5, 2012 at 9:55 PM, Marco Monteiro 
> wrote:
>
> > I'm using riak-js and the error I get is:
> >
> > { [Error: socket hang up] code: 'ECONNRESET' }
>
> That is a strange error -- are there any corresponding errors in
> server logs? I would have expected a timeout or some such...
>
> >
> > UUIDs. They are created by Riak. All my queries use 2i. The 2i are
> integers
> > (representing seconds) and random strings (length 16) used as identifiers
> > for user sessions and similar.
>
> So, this explains why the problem goes away when you switch to an
> empty bucket. A bit of background...
>
> If you're using the functionality in Riak that automatically generates
> a UUID on PUT, you're going to get a uniformly distributed 160-bit
> number (since the implementation SHA-1 hashes the input). This sort of
> distribution is great for uniqueness, since there is a 1 in 2^160
> chance (roughly) that you will encounter another similar ID. It can be
> very bad from a caching perspective, however, if you have a cache that
> uses pages of information for locality purposes. In a scheme such as
> this (which is what LevelDB uses), the system will wind up churning
> the cache constantly since the odds are quite low that the next UUID
> to be accessed will be already in memory (remember, uniform
> distribution of keys).
>
> LevelDB also makes this pathological case a bit worse by not having
> bloom filters -- when inserting a new UUID, you will potentially have
> to do 7 disk seeks just to determine if the UUID is not present. The
> Google team is working to address this problem, but I'm guessing it'll
> be a month or so before that's done and then we have to integrate with
> Riak -- so we can't count on that just yet.
>
> Now, all is not lost. :)
>
> If you craft your keys so that there is some temporal locality _and_
> the access pattern of your keys has some sort of exponential-ish
> decay, you can still get very good performance out of LevelDB. One
> simple way to do this is to prefix the current date-time in front of
> the UUID, like so:
>
> 201203060806- (YMDhm-UUID)
>
> You could also use seconds since the epoch, etc. This has the effect
> of keeping recently accessed/hot UUIDs on (close to) the same cache
> page, and lets you avoid a lot of cache churn and typically
> dramatically improves LevelDB performance.
>
> Does this help/make sense?
>
> D.
> --
> Dave Smith
> VP, Engineering
> Basho Technologies, Inc.
> diz...@basho.com
>


Re: Problems writing objects to a half-full bucket

2012-03-07 Thread Marco Monteiro
Having the keys prefixed with the seconds since epoch solved the problem.
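
In case it helps anyone else, a key like that can be built on the client
along these lines (a sketch; the riak-js call in the comment is only
indicative):

    var crypto = require('crypto');

    // Seconds since the epoch, then a random suffix,
    // e.g. "1331078400-9f2c66d1a4b03e57".
    function makeKey() {
      var seconds = Math.floor(Date.now() / 1000);
      return seconds + '-' + crypto.randomBytes(8).toString('hex');
    }

    // db.save('sessions', makeKey(), session, callback);  // call shape assumed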

Thanks,
Marco
