Hi Kurt,

A Riak cluster can handle very large amounts of data, and 500 000 000 keys should not be a problem. Riak's MapReduce implementation, however, is not designed or meant to be used for this type of large bulk processing, so inserting all the data and then periodically running MapReduce over the entire set will not work. You will therefore need to change how you process and query your data in order to make it work with Riak.
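To make this concrete, here is a minimal plain-Python sketch (no Riak calls; the day-granularity rollup and the key scheme are illustrative assumptions, not anything from your setup) of the kind of periodic aggregation I mean. Each resulting (day, count) pair would be stored as its own Riak object keyed by the day string, so later reads become direct key lookups instead of a full-bucket MapReduce:

```python
from collections import defaultdict

def day_key(timestamp):
    """Derive a per-day aggregate key ('YYYY-MM-DD') from a raw log timestamp."""
    return timestamp.split(" ")[0]

def rollup(raw_timestamps):
    """Roll raw events up into per-day counts.

    In Riak, each (day, count) pair would be written as its own object in
    an aggregates bucket, keyed by the day string, so it can be fetched
    directly by key later instead of scanned via MapReduce.
    """
    counts = defaultdict(int)
    for ts in raw_timestamps:
        counts[day_key(ts)] += 1
    return dict(counts)

events = ["2013-05-14 12:11:08", "2013-05-14 15:07:27", "2013-05-15 07:02:00"]
print(rollup(events))  # {'2013-05-14': 2, '2013-05-15': 1}
```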
I would recommend looking closely at how you need to query your data, and then consider performing periodic aggregations in different ways to support those query patterns. This would allow you to access data directly through keys or secondary indexes, or to run MapReduce on a much smaller set of keys, which would most likely scale and perform much better. Depending on your data and requirements, there may also be other ways to tackle the problem.

If you could provide me with some example data, a description of how you need to query it, and the type of information you are looking to get out of it, I am sure we can help you design a suitable data model and an efficient approach to processing it. If you are not comfortable sharing this type of information on the mailing list, feel free to email me directly.

Best regards,
Christian

On 17 May 2013, at 09:25, kurt campher <campherku...@gmail.com> wrote:

> So just to provide a bit of context.
>
> We want a datastore that can hold over 500 000 000 keys, and those keys
> will be MapReduced routinely.
>
> I would love to use Riak for this, but the question is: can it handle this
> amount of data (and possibly more), and can it be done cheaply?
>
> What sort of hosting would be needed? RAM? CPU? etc...
>
> Thanks for the help
>
> On Wed, May 15, 2013 at 5:33 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>
>> Kurt,
>>
>> I'm not sure about the cause of the MapReduce crash (I suspect it's running
>> out of resources of some kind, even with the increase of vm count and mem).
>> One word of advice about the list-keys timeout, though: be sure to use
>> streaming list keys.
>> In Python, this would look something like:
>>
>>     for keylist in bucket.stream_keys():
>>         for key in keylist:
>>             # Do something with the key
>>
>> This will at least avoid the timeout problem (though you may want to consider
>> your use case here, and maybe use secondary index queries or search queries
>> instead of listing all the keys in a bucket, since even a streaming list keys
>> has to iterate over _all_ keys in a cluster).
>>
>> Dmitri
>>
>> On Wed, May 15, 2013 at 7:02 AM, kurt campher <campherku...@gmail.com> wrote:
>>
>>> Hi People
>>>
>>> I'm running MapReduce on a bucket with more than 100 000 items.
>>>
>>> The MR runs for 10 seconds, then stops with this error in the logs:
>>>
>>>     @riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone
>>>     before startup
>>>
>>> And this error in the Python shell:
>>>
>>>     Error running MapReduce operation. Headers: {'date': 'Tue, 14 May 2013
>>>     15:07:27 GMT', 'content-length': '623', 'content-type': 'application/json',
>>>     'http_code': 500, 'server': 'MochiWeb/1.1 WebMachine/1.9.0 (someone had
>>>     painted it blue)'} Body:
>>>     '{"phase":0,"error":"[preflist_exhausted]","input":"{ok,{r_object,
>>>     <<\\"real_raw_logs\\">>,<<\\"8a4986cc235ec8690123677460ac05e6:2013-05-14
>>>     12:11:08.178628:0.184912287858\\">>,[{r_content,{dict,6,16,16,8,80,48,
>>>     {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[[<<\\"Links\\">>]],
>>>     [],[],[],[],[],[],[],[[<<\\"content-type\\">>,97,112,112,108,105,99,97,116,
>>>     105,111,110,47,106,115,111,110],[<<\\"X-Riak-VTag\\">>,49,89,48,122,98,99,
>>>     66,53,120,86,120,50,90,67,101,51,115,120,79,85,65,79]],[[<<\\"index\\">>]],
>>>     [],[[<<\\"X-Riak-Last-Modified\\">>|{1368,533468,242947}]],[],[...]}}},
>>>     ...}],...},...}","type":"forward_preflist","stack":"[]"}'
>>>
>>> Also, I can't list the keys on the bucket; a timeout error occurs.
>>>
>>> I have Riak running on 2 nodes with 7 GB of RAM each.
>>> MapReduce runs fine over 2000 items.
>>> I have increased the js_vm count multiple times.
>>> Also increased js_max_vm_mem to 2048.
>>> Also increased the MapReduce query's timeout, but it never lasts longer
>>> than 10 seconds.
>>>
>>> Thanks to anyone who looks at this
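On the point about running MapReduce over a smaller set of keys rather than a whole bucket: Riak's MapReduce API accepts an explicit list of bucket/key input pairs, which keeps the job bounded. A hypothetical sketch of such a job specification (the bucket and key names are invented for illustration):

```python
import json

# Hypothetical illustration: feed MapReduce an explicit, bounded list of
# [bucket, key] input pairs instead of an entire bucket. The bucket and
# key names below are made up.
job = {
    "inputs": [
        ["daily_aggregates", "2013-05-13"],
        ["daily_aggregates", "2013-05-14"],
    ],
    "query": [
        # Riak.mapValuesJson is a built-in JavaScript map function shipped
        # with Riak that decodes each object's JSON value.
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
    ],
}

# This JSON body would be POSTed to Riak's /mapred HTTP endpoint.
print(json.dumps(job, indent=2))
```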
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com