Hi David,  

The word everywhere is to avoid key filters: a key-filter query effectively 
does a whole-bucket key listing, and that gets seriously slow out past 100k 
items. Since you say test queries work, I'll presume you've debugged your map 
and reduce phases on queries where you manually add a set of keys. (Right?)

Since you're on LevelDB, it means you can use secondary indices ("2i") to drive 
these queries.

I don't have access to your filter_map, so I can't see how you construct your 
keys, but if you have 2i turned on, you get the first key field "for free" 
via the built-in '$key' index.

Let's say, hypothetically, that your keys are constructed as:
 keyprefix:<date>:<country>:<campaign_id>

Well, you can then rewrite the query input as:

def main():
    client = riak.RiakClient(host=riak_host, port=8087,
                             transport_class=riak.transports.pbc.RiakPbcTransport)
    query = client.index(bucket,
                         '$key',
                         'keyprefix:201210',
                         'keyprefix:201210~')
    query.map('''function(value, keyData, arg) { ... }''')
    …
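
One detail worth spelling out: the trailing '~' in the end key works because 
2i range queries compare keys lexicographically, and '~' (ASCII 0x7E) sorts 
after the digits, letters, and ':' that appear in these keys. A quick sanity 
check in plain Python, using my hypothetical key layout from above:

```python
# '~' (0x7E) sorts after ':' and all alphanumerics, so the range
# 'keyprefix:201210' .. 'keyprefix:201210~' brackets every key that
# begins with 'keyprefix:201210' -- i.e. everything in October 2012.
start, end = 'keyprefix:201210', 'keyprefix:201210~'
keys = [
    'keyprefix:20121001:US:t1',
    'keyprefix:20121014:DE:t2',
    'keyprefix:20121101:US:t1',  # November -- falls outside the range
]
in_range = [k for k in keys if start <= k <= end]
```

The same trick works for any prefix, so narrowing to a single day is just a 
matter of extending the start and end keys.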



That's fine as far as it goes, but it doesn't solve the problem of querying 
country or campaign id, right?

As a temporary measure, I'd suggest trying your key filters with the timeout 
cranked up to something on the order of hours (the 5 minutes I gave was 
conservative and arbitrary), and letting the query run for however long it 
takes.


If those queries do give good results, I'd suggest going ahead and re-indexing 
your existing entries with 'country_bin' and 'campaign_bin' indexes. Whether 
you treat dates as int or bin is a matter of personal style.
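
To make that concrete, here's a minimal sketch of what re-indexing could look 
like. The key layout is my hypothetical one from above, and I'm assuming the 
1.x Python client API (bucket.get, RiakObject.add_index, store) -- adjust for 
your client version. The key-parsing helper is pure Python, so you can 
sanity-check it on its own:

```python
def indexes_for_key(key):
    # Assumes the hypothetical layout keyprefix:<date>:<country>:<campaign_id>
    _prefix, date, country, campaign = key.split(':')
    return [('date_int', int(date)),
            ('country_bin', country),
            ('campaign_bin', campaign)]

def reindex(bucket, keys):
    # Fetch each existing entry, attach the 2i entries derived from its
    # own key, and re-store it so LevelDB writes the index rows.
    for key in keys:
        obj = bucket.get(key)
        for field, value in indexes_for_key(key):
            obj.add_index(field, value)
        obj.store()
```

Once the entries carry those indexes, the country query becomes 
client.index(bucket, 'country_bin', 'US') instead of a key filter.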

There are lots of tricks and further discussion on how best to get at every 
corner of your data, but how does this strike you so far?
--  
Adam Lindsay


On Sunday, 14 October 2012 at 12:57, David Montgomery wrote:

> Hi,
>  
> Below is my code for running a map/reduce in Python. I have a six-node
> cluster, 2 cores each with 4 GB of RAM. I'm under no load, with about
> 3 million keys, using LevelDB with Riak 1.2. Running the code below
> takes a terribly long time: it never finishes, and I don't even know
> how to check whether it is running, other than that the Python script
> has not timed out. The number of executed mappers in stats is
> flat-lined when I look at Graphite. On test queries the code below
> works.
>  
> So... how do I debug what is going on?
>  
>  
> def main():
>     client = riak.RiakClient(host=riak_host, port=8087,
>                              transport_class=riak.transports.pbc.RiakPbcTransport)
>     query = client.add(bucket)
>     filters = (key_filter.tokenize(":", filter_map['date']) +
>                key_filter.starts_with('201210'))
>     # & key_filter.tokenize(":", filter_map['country']).eq("US") \
>     # & key_filter.tokenize(":", filter_map['campaign_id']).eq("t1") \
>     query.add_key_filters(filters)
>  
>     query.map('''
>     function(value, keyData, arg) {
>         var data = Riak.mapValuesJson(value)[0];
>  
>         if (data['adx'] == 'gdn') {
>             var alt_key = data['hw'];
>             var obj = {};
>             obj[alt_key] = 1;
>             return [ obj ];
>         } else {
>             return [];
>         }
>     }''')
>  
>     query.reduce('''
>     function(values, arg) {
>         return [ values.reduce(function(acc, item) {
>             for (var state in item) {
>                 if (acc[state])
>                     acc[state] += item[state];
>                 else
>                     acc[state] = item[state];
>             }
>             return acc;
>         }) ];
>     }
>     ''')
>  
>     for result in query.run(timeout=300000):
>         print result
>  
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>  
>  

