Re: confused
On Sep 15, 2010, at 2:40 PM, Nils Petersohn wrote:

> hello,
>
> i was setting up 9 riak instances: three on my mac with the appropriate app config, and six with two virtual machines on a different computer.
>
> all 8 joined d...@192.168.1.20 and the join request was sent.
>
> after setting this up: i wanted to put data with the java client on d...@192.168.1.20, then i got a timeout ?!?

I am curious if you started this node and then changed its name in the config file? Errors like this can happen if you don't riak-admin reip the node; the ring file would also be wrong, and this could lead to some of the other errors you saw below. One thing you may want to look at is the state of your ring from the Riak console using riak_core_ring_manager:get_my_ring(). That might show any problems with the ring; feel free to send it along so we can take a look at it.

> when i put data on one of the other machines, only that machine was using cpu time and none of the others ... if consistent hashing works as expected, all the machines should show up in "top"
>
> when i did a mapreduce job, only that machine was using cpu time and none of the others ...
>
> i had "top" running on all of them.
>
> ---
>
> the other problem is: when i have half a million entries in one bucket, with less than 100 chars per entry, and i do a really simple mapreduce job, it takes forever (15 minutes ...) while sql takes .005 seconds.
>
> i know that a mapreduce over a complete bucket takes very long if i don't specify keys. but how should i know which keys to use ...

What version of Riak are you using? There has been a fair amount of improvement to the map reduce system as well as list keys. Are the map reduce jobs you are running javascript?

> --
>
> if i put stuff in one bucket and add a machine with the join request, how can i rebalance the bucket so that the other machine takes some values too?

This happens automatically. When the new node joins the cluster you should see handoff messages in the erlang.log.X log file. Rebalancing is handled by the cluster and shouldn't be done manually.

Grant Schofield
Developer Advocate
Basho Technologies, Inc.

> --
>
> i don't understand these issues/behaviors (timeout, 15 min., rebalancing); maybe i set one of the three params incorrectly? i left everything at the default settings.
>
> thx in advance for any hints...
>
> nils
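For reference, the rename-then-reip sequence mentioned above looks roughly like this (node names are placeholders, and the exact ordering may vary by version; consult riak-admin usage):

    riak stop
    # edit the node name in the node's config (vm.args / dev vars file)
    riak-admin reip d...@old-ip d...@new-ip
    riak start

reip rewrites the node's name inside the saved ring file, which is why editing the config alone leaves the ring pointing at the old name.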
Limit on number of buckets
Is there a practical (or hard) limit to the number of buckets a Riak cluster can handle? One possible data model we could use for one application could result in ~80,000 buckets. Is that a reasonable number?

Thanks,
Scott
Re: Limit on number of buckets
Scott,

There is no limit on the number of buckets unless you are changing the bucket properties, like the replication factor, allow_mult, or the pre- and post-commit hooks. Buckets that have properties other than the defaults consume space in the ring state. Other than that, they are essentially free unless you're using a backend that segregates data by bucket - the only one that does at this time is innostore.

Is there a reason you need so many buckets?

Sean Cribbs
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On Sep 16, 2010, at 2:17 PM, SKester wrote:

> Is there a practical (or hard) limit to the number of buckets a Riak cluster can handle? One possible data model we could use for one application could result in ~80,000 buckets. Is that a reasonable number?
>
> Thanks,
> Scott
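For context, a bucket acquires non-default properties via a PUT of a JSON "props" document to the bucket's URL over the HTTP interface; a hypothetical example (bucket name and values invented for illustration):

    {"props": {"n_val": 2, "allow_mult": true}}

PUT to http://127.0.0.1:8098/riak/mybucket with Content-Type: application/json. Only buckets that have received such a PUT occupy space in the gossiped ring state; buckets used with all-default settings do not.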
Re: Limit on number of buckets
There is no limit to the number of buckets a cluster can handle. The only consideration I know of is when using non-default bucket properties (like bucket-specific N vals), because non-default values are gossiped around the cluster.

-Alexander

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Sep 16, 2010, at 14:17, SKester wrote:

Is there a practical (or hard) limit to the number of buckets a Riak cluster can handle? One possible data model we could use for one application could result in ~80,000 buckets. Is that a reasonable number?

Thanks,
Scott
Re: Limit on number of buckets
Thanks for the quick replies Sean and Alexander. One of our current products allows users to sign up for weather alerts based on their zip code. When we receive a weather alert for a set of locations, we need to quickly find all users in the zip codes affected. We currently do this with a simple sql query against a relational db. Being new at this key/value store thing, we are not sure of the best way to tackle this with Riak.

Some zip codes have over 20,000 users, so storing the users in a json array with the zip code as the key would get ugly fast. One thought was to store the user profiles in one bucket, and then add a key per user in the correct zip code bucket, perhaps with a link back to the user's record in the profile bucket. We could then fetch the keys for the affected zip codes using map reduce. I am open to all suggestions on how to best model this type of data in Riak.

Thanks,
Scott

Sean Cribbs wrote:
> Scott,
>
> There is no limit on the number of buckets unless you are changing the bucket properties, like the replication factor, allow_mult, or the pre- and post-commit hooks. Buckets that have properties other than the defaults consume space in the ring state. Other than that, they are essentially free unless you're using a backend that segregates data by bucket - the only one that does at this time is innostore.
>
> Is there a reason you need so many buckets?
>
> Sean Cribbs
> Developer Advocate
> Basho Technologies, Inc.
> http://basho.com/
>
> [snip original question]
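A minimal sketch of the bucket-per-zip layout Scott describes (bucket names, keys, and zip code are hypothetical):

    users/1234      -> {"name": "...", "zip": "98101", ...}   (full profile)
    zip_98101/1234  -> {}   (tiny marker object, stored with Link: </riak/users/1234>; riaktag="profile")

Finding everyone in an affected zip code then means enumerating the keys of that zip's bucket; each key doubles as the user id, so the link (or the derived key itself) leads back to the profile.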
Re: Limit on number of buckets
Listing keys in a bucket is not necessarily going to be faster than storing the list in an object. You might want to measure this to be sure - be aware that list-keys is bound by the total number of keys in the cluster, not by the number in the bucket.

Sean Cribbs
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander. One of our current products allows users to sign up for weather alerts based on their zip code. When we receive a weather alert for a set of locations, we need to quickly find all users in the zip codes affected. We currently do this with a simple sql query against a relational db. Being new at this key/value store thing, we are not sure of the best way to tackle this with Riak.
>
> Some zip codes have over 20,000 users, so storing the users in a json array with the zip code as the key would get ugly fast. One thought was to store the user profiles in one bucket, and then add a key per user in the correct zip code bucket, perhaps with a link back to the user's record in the profile bucket. We could then fetch the keys for the affected zip codes using map reduce. I am open to all suggestions on how to best model this type of data in Riak.
>
> Thanks,
> Scott
>
> [snip earlier replies]
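For reference, the list-keys operation Sean refers to is exposed over HTTP roughly as (host and bucket hypothetical):

    GET http://127.0.0.1:8098/riak/zip_98101?keys=true

It returns a JSON document with a "keys" array, but on the backends of this era producing it requires folding over every key stored in the cluster, not just those in zip_98101 - hence the advice to measure.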
riak not starting properly
Over the last few weeks I've been finding it harder and harder to start riak, which, given that it's running on an auto-provisioned ec2 instance, is a bit of an issue! I can generally restart it by running /etc/init.d/riak restart, but it's got to the stage where I have to run it four or five times. I should clarify here that when I say "harder to start" it does start, but as soon as I try to do anything it fails.

The contents of /var/log/riak are here: http://stuff.roughage.com.au/riak-failure-2.log.tar.gz

rgh

--
Richard Heycock
http://topikality.com
+61 (0) 410 646 369
[e]: r...@topikality.com
[im]: r...@topikality.com
Re: confused
ok, my ring seems ok now. what i did was to change the rel/vars/dev[1,2,3]_vars.config file and just replace the ips in there... this reip thing did not really work out ...

here is my riak ring now:

(d...@192.168.0.100)1> riak_core_ring_manager:get_my_ring().
{ok,{chstate,'d...@192.168.0.100',
     [{'d...@192.168.0.107',{65,63451889794}},
      {'d...@192.168.0.105',{13,63451889512}},
      {'d...@192.168.0.100',{104,63451889512}},
      {'d...@192.168.0.105',{49,63451889512}},
      {'d...@192.168.0.100',{32,63451889009}},
      {'d...@192.168.0.105',{94,63451889253}},
      {'d...@192.168.0.107',{9,63451889769}},
      {'d...@192.168.0.100',{97,63451889494}}],
     {64,
      [{0,'d...@192.168.0.100'},
       {22835963083295358096932575511191922182123945984,'d...@192.168.0.105'},
       {45671926166590716193865151022383844364247891968,'d...@192.168.0.107'},
       {68507889249886074290797726533575766546371837952,'d...@192.168.0.100'},
       {91343852333181432387730302044767688728495783936,'d...@192.168.0.105'},
       {114179815416476790484662877555959610910619729920,'d...@192.168.0.107'},
       {137015778499772148581595453067151533092743675904,'d...@192.168.0.100'},
       {159851741583067506678528028578343455274867621888,'d...@192.168.0.105'},
       {182687704666362864775460604089535377456991567872,'d...@192.168.0.100'},
       {205523667749658222872393179600727299639115513856,'d...@192.168.0.105'},
       {228359630832953580969325755111919221821239459840,'d...@192.168.0.107'},
       {251195593916248939066258330623111144003363405824,'d...@192.168.0.100'},
       {274031556999544297163190906134303066185487351808,'d...@192.168.0.105'},
       {296867520082839655260123481645494988367611297792,'d...@192.168.0.107'},
       {319703483166135013357056057156686910549735243776,'d...@192.168.0.100'},
       {342539446249430371453988632667878832731859189760,'d...@192.168.0.105'},
       {365375409332725729550921208179070754913983135744,'d...@192.168.0.100'},
       {388211372416021087647853783690262677096107081728,'d...@192.168.0.105'},
       {411047335499316445744786359201454599278231027712,'d...@192.168.0.107'},
       {433883298582611803841718934712646521460354973696,...},
       {...}|...]},
     {dict,0,16,16,8,80,48,
      {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
      {{[],[],[],[],[],[],[],[],[],[],[],[],...}
(d...@192.168.0.100)2>

i am using 0.12.1 on my mac and 0.12 on both vms. i now have a set of 100,000 entries like this (just for testing):

{"id":"42164", "actionTime":"2007-05-11 17:08:55", "action":"some action", "res":"7024", "user":"5", "client":"2787"}

and my mr job looks like this (just for testing):

{"inputs":"actionbucket",
 "query":[
   {"map":{"language":"javascript",
           "source":"function(values, keyData, arg) {
               var value = Riak.mapValuesJson(values)[0];
               if (value.reservation == '4084') {
                 return [value];
               }
               return [];
             }",
           "keep":true}}
 ],
 "timeout": 90}

the beam instances are all showing in "top" now, and there is some traffic going back and forth (~200kb/s), but this job takes about 1:30 min.

i know that this is not really comparable with a mysql query, because you can do more calculations in the mr job to produce much more specialized results, and the mr job has a ~linear "worktime"... but ~1:30 min is still pretty bad.

is there any way to do much better?

best regards
nils

On Sep 16, 2010, at 7:08 PM, Grant Schofield wrote:

> On Sep 15, 2010, at 2:40 PM, Nils Petersohn wrote:
>
>> hello,
>>
>> i was setting up 9 riak instances: three on my mac with the appropriate app config, and six with two virtual machines on a different computer.
>>
>> all 8 joined d...@192.168.1.20 and the join request was sent.
>>
>> after setting this up: i wanted to put data with the java client on d...@192.168.1.20, then i got a timeout ?!?
>
> I am curious if you started this node and then changed its name in the config file? Errors like this can happen if you don't riak-admin reip the node; the ring file would also be wrong, and this could lead to some of the other errors you saw below. One thing you may want to look at is the state of your ring from the Riak console using riak_core_ring_manager:get_my_ring(). That might show any problems with the ring; feel free to send it along so we can take a look at it.
>
> [remainder of quote snipped]
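One way around the full-bucket traversal in the job above: Riak's map reduce also accepts explicit bucket/key pairs as inputs instead of a bare bucket name. A hypothetical variant of Nils's job (the keys shown are placeholders, not real keys from his data set):

{"inputs": [["actionbucket","42164"], ["actionbucket","42165"], ["actionbucket","42166"]],
 "query": [
   {"map": {"language": "javascript",
            "source": "function(values, keyData, arg) { var value = Riak.mapValuesJson(values)[0]; if (value.reservation == '4084') { return [value]; } return []; }",
            "keep": true}}
 ]}

With explicit inputs the job only touches the listed objects, so its cost scales with the input list rather than with every key in the cluster.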
Re: confused
I think the slowness is coming from the older list keys implementation in 0.12.1; list keys has been changed in the tip version of Riak and is quite a bit faster now. In addition, there have been a lot of improvements to the Javascript map reduce implementation that should help the speed of your query. For the time being you will need to run Riak tip to get access to these enhancements.

Grant Schofield
Developer Advocate
Basho Technologies, Inc.

On Sep 16, 2010, at 5:17 PM, Nils Petersohn wrote:

> ok, my ring seems ok now. what i did was to change the rel/vars/dev[1,2,3]_vars.config file and just replace the ips in there... this reip thing did not really work out ...
>
> here is my riak ring now:
> [snip ring output; see previous message]
>
> i am using 0.12.1 on my mac and 0.12 on both vms. i now have a set of 100,000 entries like this (just for testing):
> {"id":"42164", "actionTime":"2007-05-11 17:08:55", "action":"some action", "res":"7024", "user":"5", "client":"2787"}
>
> and my mr job looks like this (just for testing):
> [snip map reduce job; see previous message]
>
> the beam instances are all showing in "top" now, and there is some traffic going back and forth (~200kb/s), but this job takes about 1:30 min.
>
> i know that this is not really comparable with a mysql query, because you can do more calculations in the mr job to produce much more specialized results, and the mr job has a ~linear "worktime"... but ~1:30 min is still pretty bad.
>
> is there any way to do much better?
>
> best regards
> nils
>
> [remainder of quote snipped]
badarg ets delete
Hey guys, I have an application using Riak 0.12 that does puts, gets, and updates. It works fine, but I get these random error reports in my logs. Any ideas?

ERROR <0.149.0> ** Generic server <0.149.0> terminating
** Last message in was stop
** When Server state == {state,139315}
** Reason for termination ==
** {badarg,[{ets,delete,[139315]},
            {riak_kv_ets_backend,srv_stop,1},
            {riak_kv_ets_backend,handle_call,3},
            {gen_server,handle_msg,5},
            {proc_lib,init_p_do_apply,3}]}

ERROR <0.149.0> crash_report [[{initial_call,{riak_kv_ets_backend,init,['Argument__1']}},
                               {pid,<0.149.0>},
                               {registered_name,[]},
                               {error_info,
                                {exit,
                                 {badarg,
                                  [{ets,delete,[139315]},
                                   {riak_kv_ets_backend,srv_stop,1},
                                   {riak_kv_ets_backend,handle_call,3},
                                   {gen_server,handle_msg,5},
                                   {proc_lib,init_p_do_apply,3}]},
                                 [{gen_server,terminate,6},
                                  {proc_lib,init_p_do_apply,3}]}},
                               {ancestors,[<0.148.0>,riak_core_vnode_sup,riak_core_sup,<0.58.0>]},
                               {messages,[]},
                               {links,[<0.148.0>]},
                               {dictionary,[]},
                               {trap_exit,false},
                               {status,running},
                               {heap_size,377},
                               {stack_size,24},
                               {reductions,243}],
                              []]

ERROR <0.148.0> ** State machine <0.148.0> terminating
** Last event in was timeout
** When State == active
** Data == {state,159851741583067506678528028578343455274867621888,
                  riak_kv_vnode,
                  {state,159851741583067506678528028578343455274867621888,
                         riak_kv_ets_backend,<0.149.0>,
                         {kv_lru,100,147509,143412,151606},
                         {dict,0,16,16,8,80,48,
                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                               {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
                         true},
                  undefined,none}
** Reason for termination =
** {{badarg,[{ets,delete,[139315]},
             {riak_kv_ets_backend,srv_stop,1},
             {riak_kv_ets_backend,handle_call,3},
             {gen_server,handle_msg,5},
             {proc_lib,init_p_do_apply,3}]},
    {gen_server,call,[<0.149.0>,stop]}}

ERROR <0.148.0> crash_report [[{initial_call,{riak_core_vnode,init,['Argument__1']}},
                               {pid,<0.148.0>},
                               {registered_name,[]},
                               {error_info,
                                {exit,
                                 {{badarg,
                                   [{ets,delete,[139315]},
                                    {riak_kv_ets_backend,srv_stop,1},
                                    {riak_kv_ets_backend,handle_call,3},
                                    {gen_server,handle_msg,5},
                                    {proc_lib,init_p_do_apply,3}]},
                                  {gen_server,call,[<0.149.0>,stop]}},
                                 [{gen_fsm,terminate,7},
                                  {proc_lib,init_p_do_apply,3}]}},
                               {ancestors,[riak_core_vnode_sup,riak_core_sup,<0.58.0>]},
                               {messages,
                                [{'EXIT',<0.149.0>,
                                  {badarg,
                                   [{ets,delete,[139315]},
                                    {riak_kv_ets_backend,srv_stop,1},
                                    {riak_kv_ets_backend,handle_call,3},
                                    {gen_server,handle_msg,5},
                                    {proc_lib,init_p_do_apply,3}]}}]},
                               {links,[<0.60.0>]},
                               {dictionary,[]},
                               {trap_exit,true},
                               {status,running},
                               {heap_size,377},
                               {stack_size,24},
                               {reductions,952}],
                              []]
Re: badarg ets delete
Hi Michael,

These errors are almost certainly harmless; they are thrown when empty, non-owned vnodes get shut down. It appears that in some cases, the underlying ets table might already be deleted/GC'd by the time BackendModule:stop tries to explicitly delete it. I've opened this bug to track the issue: http://issues.basho.com/show_bug.cgi?id=723

- Andy

--
Andy Gross
VP, Engineering
Basho Technologies, Inc.
http://basho.com

On Thu, Sep 16, 2010 at 3:59 PM, Michael Colussi wrote:

> Hey guys, I have an application using Riak 0.12 that does puts, gets, and updates. It works fine, but I get these random error reports in my logs. Any ideas?
>
> [snip crash reports; see previous message]
Re: Limit on number of buckets
Hi Scott,

Until Riak gains the ability to constrain list traversals by bucket, this will continue to be a point of friction. This issue has been broached before, and there are tickets open on the issues tracking site. As I understand it, one solution would potentially modify bitcask to open a 'cask' per bucket. However, nothing comes for free, and this would come at the expense of file descriptors at the os level, thereby introducing a constraint on the number of buckets in a cluster. This is similar to how the inno backend currently operates, as Sean pointed out. Recognizing this constraint and how you can mitigate it really depends on your use case.

I hate to sound like a broken record but, recent improvements to key traversal notwithstanding, I have been using redis as an intermediary key list manager. Augmenting that further, I will pull key lists out of redis and write them to riak, either by cron or explicitly by user action. Admittedly my volume is not at a level where this is a considerable problem at the moment. Then again, I don't think it ever will be (for my use case) - I'm not trying to crawl the world or build the next twitter or facebook. That said, what I do is pull this key list out of redis (or riak), generate an appropriate inputs array, and feed that to the mapreduce function. I should note that at the moment I do this in javascript for ease of development.

Another big win in my book for using redis instead of riak for list management is that redis understands certain data primitives whereas riak is data agnostic. What this means practically is that you can push/pull/pop/slice data in redis (among other things). You just can not do that in riak: data must be written atomically, as in, if you have a meg you write a meg. There are no diff updates in riak.

Performance wise, the first thing you are going to want to look at, if and when optimization becomes a concern, is moving from the http interface to the protobuf interface. After that I would look into rewriting your mapreduce in erlang. Marshaling complex data between the native erlang internals and the javascript interpreter has a non-zero cost associated with it; forgoing this step is a big win. Again, I view all this as a growth path within the riak environment and "a good thing" (tm).

Assuming your most populous zip codes have on the order of ~200k subscribers, you could encode your user keys in radix 62 and fit those keys in a 3 character space; move up to 4 characters for way more leg room. At 3 characters (standard 8-bit encoding) your ~200k key list is under 1 MB (something to consider based on how riak allocates ram for this portion of the mapreduce in erlang and/or js). Also, I'm a big fan of fixed length keys for unrolled loops. Either way, feeding keys explicitly to a mapreduce will only get better as your input list shrinks in relation to the total keys in your system.

Data modeling wise, I would have a user bucket, a zip codes bucket, a zip_users bucket and, conversely, a users_zip bucket - the latter two having the keys of the former as members. I'm also a big fan of explicitly derived keys/paths. I would not recommend links here, simply because of the unbounded, potentially large nature of your problem.

Do keep us posted,
Alexander

On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander. One of our current products allows users to sign up for weather alerts based on their zip code. When we receive a weather alert for a set of locations, we need to quickly find all users in the zip codes affected. We currently do this with a simple sql query against a relational db. Being new at this key/value store thing, we are not sure of the best way to tackle this with Riak.
>
> Some zip codes have over 20,000 users, so storing the users in a json array with the zip code as the key would get ugly fast. One thought was to store the user profiles in one bucket, and then add a key per user in the correct zip code bucket, perhaps with a link back to the user's record in the profile bucket. We could then fetch the keys for the affected zip codes using map reduce. I am open to all suggestions on how to best model this type of data in Riak.
>
> Thanks,
> Scott
>
> [remainder of quote snipped]
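A minimal sketch of the radix-62 key encoding Alexander describes, in javascript (function and alphabet are illustrative, not from the thread). Since 62^3 = 238,328, three base-62 characters cover any numeric user id below that, which fits the ~200k-subscriber case:

var CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Encode a non-negative integer as a fixed-width base-62 string.
function encode62(n, width) {
  var s = "";
  while (n > 0) {
    s = CHARS.charAt(n % 62) + s;   // take the lowest base-62 digit
    n = Math.floor(n / 62);
  }
  while (s.length < width) { s = "0" + s; }  // zero-pad to the fixed width
  return s;
}

encode62(199999, 3);  // => "q1n" - any id < 238328 fits in 3 characters

The fixed width keeps keys uniform (Alexander's "fixed length keys"), and a 3-byte key times ~200k users stays comfortably under 1 MB for the inputs list.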