Hi Kurt,

A Riak cluster can handle very large amounts of data, and 500 000 000 keys 
should not be a problem. Riak's MapReduce implementation is, however, not 
designed for this type of large bulk processing, so inserting all the data 
and then periodically running MapReduce over the entire set will not work. 
You will therefore need to change how you process and query your data in 
order to make it work with Riak.

I would recommend looking closely at how you need to query your data, and 
then consider performing periodic aggregations in ways that support those 
query patterns. This would let you access data directly through keys or 
secondary indexes, or run MapReduce on a much smaller set of keys, which 
would most likely scale and perform far better. Depending on your data and 
requirements, there may also be other ways to tackle the problem.
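
To make the aggregation idea concrete, here is a minimal sketch, assuming 
the official riak Python client (the same client used further down this 
thread). The bucket names, the 'hour_bin' secondary index and the key scheme 
are made up for illustration:

import riak

client = riak.RiakClient()  # assumes a node on the default host/port

raw = client.bucket('raw_logs')         # hypothetical bucket of raw events
hourly = client.bucket('hourly_stats')  # hypothetical bucket of aggregates

def record_event(event_id, event, hour):
    # Store the raw event and tag it with a secondary index on its hour,
    # so a later aggregation pass only touches one hour's worth of keys.
    obj = raw.new(event_id, data=event)
    obj.add_index('hour_bin', hour)
    obj.store()

def aggregate_hour(hour):
    # Roll one hour's events up into a single object whose key is derived
    # from the hour, so reads become direct key lookups instead of
    # MapReduce over the whole bucket.
    count = sum(1 for _ in raw.get_index('hour_bin', hour))
    hourly.new('count:' + hour, data={'hour': hour, 'count': count}).store()

# Reading the aggregate back is then a plain key lookup:
# hourly.get('count:2013-05-14T12').data

The expensive work happens once per period over a bounded set of keys, and 
the read path never has to walk the whole keyspace.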

If you could provide some example data, a description of how you need to 
query it, and the type of information you are looking to get out of it, I am 
sure we can help you design a suitable data model and an efficient way to 
process it.

If you are not comfortable sharing this type of information on the mailing 
list, feel free to email me directly.

Best regards,

Christian


On 17 May 2013, at 09:25, kurt campher <campherku...@gmail.com> wrote:

> So just to provide a bit of context.
> 
> We want a datastore that can hold over 500 000 000 keys, and those keys 
> will be map-reduced routinely.
> 
> I would love to use Riak for this but the question is can it handle this 
> amount of data (and possibly more) and can it be done cheaply?
> 
> What sort of hosting would be needed? RAM? CPU? etc...
> 
> Thanks for the help 
> 
> On Wed, May 15, 2013 at 5:33 PM, Dmitri Zagidulin <dzagidu...@basho.com> 
> wrote:
> Kurt,
> 
> I'm not sure about the cause of the MapReduce crash (I suspect it's running 
> out of resources of some kind, even with the increased VM count and memory). 
> One word of advice about the list-keys timeout, though:
> Be sure to use streaming list keys.
> 
> In Python, this would look something like:
> for keylist in bucket.stream_keys():
>     for key in keylist:
>         pass  # do something with the key
> 
> 
> This will at least avoid the timeout problem (though you may want to 
> consider your use case here, and maybe use secondary index queries or 
> search queries instead of listing all the keys in a bucket, since even a 
> streaming list-keys operation has to iterate over _all_ keys in the 
> cluster).
> 
> Dmitri
> 
> 
> On Wed, May 15, 2013 at 7:02 AM, kurt campher <campherku...@gmail.com> wrote:
> Hi People
> 
> I'm running MapReduce on a bucket with more than 100 000 items.
> 
> The MR runs for 10 seconds, then stops with this error in the logs:
> @riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup
> 
> And this error in the Python shell:
> Error running MapReduce operation. Headers: {'date': 'Tue, 14 May 2013 15:07:27 GMT', 'content-length': '623', 'content-type': 'application/json', 'http_code': 500, 'server': 'MochiWeb/1.1 WebMachine/1.9.0 (someone had painted it blue)'} Body: 
> '{"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,<<\\"real_raw_logs\\">>,<<\\"8a4986cc235ec8690123677460ac05e6:2013-05-14 12:11:08.178628:0.184912287858\\">>,[{r_content,{dict,6,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\\"Links\\">>]],[],[],[],[],[],[],[],[[<<\\"content-type\\">>,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110],[<<\\"X-Riak-VTag\\">>,49,89,48,122,98,99,66,53,120,86,120,50,90,67,101,51,115,120,79,85,65,79]],[[<<\\"index\\">>]],[],[[<<\\"X-Riak-Last-Modified\\">>|{1368,533468,242947}]],[],[...]}}},...}],...},...}","type":"forward_preflist","stack":"[]"}'
> 
> Also, I can't list the keys in the bucket; a timeout error occurs.
> 
> 
> I have Riak running on 2 nodes with 7 GB of RAM each.
> MapReduce runs fine over 2000 items.
> I have increased the js_vm count multiple times.
> I have also increased js_max_vm_mem to 2048.
> I have also increased the MapReduce query's timeout, but it never runs for 
> longer than 10 seconds.
> 
> Thanks to anyone who looks at this
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
