That's interesting. Unfortunately I think a smart key would simply shift the problem elsewhere, as we also generate a UUID so clicks can be matched when a conversion event occurs. So we either use the UUID as the key (and can quickly look up a click to see whether it's valid, then fetch the associated data if it is), or we use a smart key (and can easily look up the associated data). Either way, the posting I linked to suggests that unless I supply long lists of keys, I have to scan *all* keys in *all* buckets, no matter how much I segment my data into separate (i.e. smaller) "sub-" buckets.
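To make the trade-off concrete, here's roughly what I think Jeremiah's "same data in multiple buckets" suggestion would mean for us: writing each click twice. This is an untested sketch and the bucket names and fields are invented:

require 'riak'

client = Riak::Client.new

click_uuid    = 'f47ac10b-58cc-4372-a567-0e02b2c3d479'  # example values
placement_key = 'placement-42'
campaign_key  = 'campaign-7'
customer_key  = 'customer-99'

# Canonical record keyed by UUID, so a conversion event can look
# the click up directly.
click = client['clicks'].new(click_uuid)
click.data = { 'placement' => placement_key,
               'campaign'  => campaign_key,
               'customer'  => customer_key,
               'at'        => Time.now.to_i }
click.store

# Second copy keyed by a smart key, so day-bounded jobs only touch
# this bucket and can filter on the timestamp segment of the key.
smart_key = "#{Time.now.to_i}:#{placement_key}:#{campaign_key}:#{customer_key}"
by_time = client['clicks_by_time'].new(smart_key)
by_time.data = { 'uuid' => click_uuid }
by_time.store

The obvious cost is two writes per click, plus keeping the two copies consistent.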
The auto-expire feature will definitely be handy. We'll look at that.

Does anyone have:

(a) opinions on addressing this issue with Ripple's "many" and "one" associations? It seems I could end up with a placement record with millions of clicks attached (see the sketch at the bottom of this message);

(b) any good examples of Ripple's associations being used in real code (as opposed to the examples here: http://seancribbs.github.com/ripple/Ripple/Associations.html), and/or a discussion of Ripple::Document vs. Ripple::EmbeddedDocument?

M.

On Feb 10, 2011, at 9:52 AM, Jeremiah Peschka wrote:

> Riak 0.14 brings key filters - it's still going to take time to filter the
> keys in memory, but it's an in-memory operation. Using 'smart keys' along the
> lines of UNIXTIMESTAMP:placement:campaign:customer you can rapidly filter
> your keys using meaningful criteria and perform MapReduce jobs on the results.
>
> Nothing says you can't also store the same data in multiple buckets in
> multiple formats to make querying easier.
>
> In response to number 2 - there's a way to set Riak to auto-expire data from
> a bucket. It'll only be removed when compactions occur, but if you're storing
> clickstream data that should happen often enough.
>
> --
> Jeremiah Peschka
> Microsoft SQL Server MVP
> MCITP: Database Developer, DBA
>
> On Thursday, February 10, 2011 at 9:35 AM, Mat Ellis wrote:
>
>> We are converting a MySQL-based schema to Riak using Ripple. We're tracking
>> a lot of clicks, and each click belongs to a cascade of other objects:
>>
>> click -> placement -> campaign -> customer
>>
>> i.e. we do a lot of operations on these clicks grouped by placement or sets
>> of placements.
>>
>> Reading this
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-July/001591.html
>> gave me pause for thought. I was hoping the time needed to crunch each
>> day's data would be proportional to the volume of clicks on that day, but it
>> seems it would be proportional to the total number of clicks ever.
>>
>> What's the best approach here? I can see a number of 'solutions', each of
>> them complicated:
>>
>> (1) Maintain an index of clicks by day so that we can focus our operations
>> on a time-bound set of clicks
>>
>> (2) Delete or archive clicks once they have been processed or after a
>> certain number of days
>>
>> (3) Add many links to each placement, one per click (millions potentially)
>>
>> On a related noob note, what would be the best way to create the set of
>> clicks for a given placement? MapReduce or Riak Search or some other method?
>>
>> Thanks in advance.
>>
>> M.
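To make question (a) concrete, here is the shape I'm worried about. Untested, and I may be misreading how Ripple's linked associations are stored:

require 'ripple'

class Placement
  include Ripple::Document
  many :clicks   # if this is a linked association, each click adds a
                 # Riak link to this placement object's metadata
end

class Click
  include Ripple::Document
  property :converted_at, Time
  one :placement
end

If each of millions of clicks really does add a link to the one placement object, that object has to be read and rewritten whole every time it changes, which sounds worse than the key-scanning problem.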
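And Jeremiah, to check I've understood the key-filter idea: with smart keys like yours, would pulling one day's clicks look something like this? I'm going from the riak-client docs, so the filter DSL details may be off:

require 'riak'

client = Riak::Client.new

# All clicks for Feb 10 2011 (UTC), assuming keys shaped
# UNIXTIMESTAMP:placement:campaign:customer in a 'clicks_by_time' bucket
mr = Riak::MapReduce.new(client)
mr.filter('clicks_by_time') do
  tokenize ':', 1                   # take the UNIXTIMESTAMP segment
  string_to_int
  between 1297296000, 1297382399    # start and end of the day
end
mr.map('function(v) { return [v.key]; }', :keep => true)
day_of_keys = mr.run

That would at least bound each day's crunching to the keys in one bucket rather than every key we've ever written.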