I agree with the suggestion of gathering a bulk of analytics and then
flushing to Riak, especially as each record is so small: the overhead per
record vs. the size of the actual record seems excessive. I'd consider
grouping your analytics into daily blobs, if not hourly. Writing each
record directly to Riak would risk data inconsistencies, loss, or
conflicts, but if you cached records and then flushed them all at once,
accumulated into your pre-defined groupings (hour, day, whatever), you
would reduce the number of Riak writes, reduce the raw record count, and
you could even accumulate some interesting summary data per record
grouping.
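To illustrate, here's a minimal sketch of that buffer-and-flush idea in
Python, assuming the standard redis and riak client libraries; the bucket
name, key scheme, and function names are hypothetical, just for
illustration:

    import json
    import redis
    import riak

    r = redis.Redis()
    bucket = riak.RiakClient().bucket('metrics_hourly')

    def record_event(event):
        # Buffer each raw event in a Redis list keyed by its hour,
        # e.g. "events:2011-04-08T17", instead of writing it to Riak.
        hour = event['created_at'][:13].replace('/', '-').replace(' ', 'T')
        r.rpush('events:' + hour, json.dumps(event))

    def flush_hour(hour):
        # One Riak write per hour: the raw events plus a small summary,
        # rather than thousands of tiny objects.
        key = 'events:' + hour
        events = [json.loads(e) for e in r.lrange(key, 0, -1)]
        if not events:
            return
        blob = {'events': events,
                'summary': {'count': len(events),
                            'unique_users': len(set(e['user_id']
                                                    for e in events))}}
        bucket.new(hour, data=blob).store()
        r.delete(key)  # events pushed between lrange and delete would be lost

The same idea works for daily blobs; you'd just group on a wider slice of
created_at.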

We're currently working on an analytics application ourselves, primarily
analyzing activity from external systems, and we're taking a similar
approach (gathering data into grouped records, summarizing where desired,
storing into Riak). Our app is still incomplete though, so this is more of
a suggestion than the result of production-level experience.
<http://www.loomlearning.com/>
Jonathan Langevin
Manager, Information Technology
Loom Inc.
Wilmington, NC: (910) 241-0433 - jlange...@loomlearning.com -
www.loomlearning.com - Skype: intel352


On Mon, Nov 28, 2011 at 5:59 PM, Michael Dungan <m...@stippleit.com> wrote:

> Thank you for getting back to me. It does look like we'll be needing to go
> big, as we're already at 5m new records/month, so just dealing with monthly
> numbers is already beyond the few hundred thousand keys you mentioned,
> unless I'm thinking about this wrong.
>
> I would love to hear more about Mecha if you're willing to share. Feel
> free to contact me off-list.
>
> thanks again,
>
> -mike
>
>
>
> On 11/28/11 2:24 PM, Aphyr wrote:
>
>> For limited mapreduce (where you know the keys in advance) riak would be
>> a fine choice. 500 million keys at n_val 3 is readily achievable on
>> commodity hardware; say four nodes with 128GB SSDs.
>>
>> If large-scale mapreduce (more than a few hundred thousand keys) is
>> important, or listing keys is critical, you might consider HBase.
>>
>> If you start hitting latency/write bottlenecks, it may be worth
>> accumulating metrics in Redis before flushing them to disk.
>>
>> At Showyou, we're also building a custom backend called Mecha which
>> integrates Riak and SOLR, specifically for this kind of analytics over
>> billions of keys. We haven't packaged it for open-source release yet,
>> but it might be worth talking about off-list.
>>
>> --Kyle
>>
>> On 11/28/2011 02:07 PM, Michael Dungan wrote:
>>
>>> Hi,
>>>
>>> Sorry if this has been asked before - I couldn't find a searchable
>>> archive of this list.
>>>
>>> I was told to ask this list whether or not Riak would be appropriate for
>>> tracking our site's metrics. We are currently using Redis for this but
>>> are at the point where we need both clustering and m/r capability, and
>>> on the surface, Riak looks to fit this bill (we already use Erlang
>>> elsewhere in our app, so that's an additional plus).
>>>
>>> The records are pretty small and can be represented easily in JSON. An
>>> example:
>>>
>>> {
>>> "id": "**c4473dc5cfc5da53831d47c4c016d1**c7de0a31e4fd94229e47ade569ef01*
>>> *1a7b"
>>> "type": "Photo::Click",
>>> "user_id": 2640,
>>> "photo_id": 255,
>>> "ip": "100.101.102.103",
>>> "created_at": "2011/04/08 17:09:40 -0700"
>>> }
>>>
>>> We currently have around 25 million records similar to this one, and are
>>> adding 4-5 million more each month.
>>>
>>> Is Riak appropriate for this use case? Are there any gotchas I need to
>>> be aware of?
>>>
>>> thank you,
>>>
>>> -mike
>>>
>>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
