Re: confused

2010-09-16 Thread Grant Schofield

On Sep 15, 2010, at 2:40 PM, Nils Petersohn wrote:

> hello,
> 
> i was setting up 9 riak instances:
> 
> three on my mac with the appropriate app config
> and six with two virtual machines on a different computer.
> 
> all 8 joined the d...@192.168.1.20
> and the join request was sent.
> 
> after setting this up:
> i wanted to put data with the java client on d...@192.168.1.20, then i got a 
> timeout ?!?
> 

I am curious: did you start this node and then change its name in the config 
file? Errors like this can happen if you don't riak-admin reip the node; the 
ring file would also be wrong, which could lead to some of the other errors 
you saw below. One thing you may want to look at is the state of your ring 
from the Riak console using riak_core_ring_manager:get_my_ring(). That might 
show any problems with the ring; feel free to send that along so we can take a 
look at it.
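
If that is what happened, the fix looks roughly like this. A minimal sketch, 
assuming default paths and made-up node names; substitute your own:

  # with the node stopped, rewrite its name in the ring file
  riak stop
  riak-admin reip dev1@127.0.0.1 dev1@192.168.1.20
  riak start

  # then inspect the ring from the console
  riak attach
  (dev1@192.168.1.20)1> riak_core_ring_manager:get_my_ring().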

> when i put data on one of the other machines, only this machine was using 
> cpu time and none of the others ...
> if consistent hashing works as expected, then all the machines should show 
> up on "top"
> 
> when i did a mapreduce job, only this machine was using cpu time and none 
> of the others ...
> 
> i had "top" running on all of them.
> 
> ---
> the other problem is:
> 
> when i have half a million entries in one bucket, with less than 100 chars 
> for each entry, and i do a really simple mapreduce job, it takes forever (15 
> minutes ...) while sql takes .005 seconds
> 
> i know that doing a mr on a complete bucket takes very long if i don't 
> specify keys in the bucket. but how should i know which keys to use ...

What version of Riak are you using?  There has been a fair amount of 
improvement to the map reduce system as well as to list keys. Are the map 
reduce jobs you are running written in javascript?

> --
> 
> if i put stuff in one bucket and add a machine with the join request, how can 
> i rebalance the bucket so that the other machine is taking some values 
> too.

This happens automatically. When the new node joins the cluster you should see 
handoff messages in the erlang.log.X log file. Rebalancing is handled by the 
cluster and shouldn't be done manually.
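
If you want to watch it happen, something like this works (assuming the log 
location of a packaged install; adjust the path for a source build):

  # watch partitions being handed off as the new node claims them
  tail -f /var/log/riak/erlang.log.1 | grep -i handoff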

Grant Schofield
Developer Advocate
Basho Technologies, Inc.


> 
> --
> 
> i don't understand these issues/behaviors (timeout, 15min. etc., 
> rebalancing), maybe i was setting the one of the three params incorrect ? i 
> left everything to the default settings.
> 
> thx in advance for any hints...
> 
> nils
> ___
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Limit on number of buckets

2010-09-16 Thread SKester
Is there a practical (or hard) limit to the number of buckets a riak cluster
can handle?  One possible data model we could use for one application could
result in ~80,000 buckets.  Is that a reasonable number?

Thanks,
Scott

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Limit on number of buckets

2010-09-16 Thread Sean Cribbs
Scott,

There is no limit on the number of buckets unless you are changing the bucket 
properties, like the replication factor, allow_mult, or the pre- and 
post-commit hooks.  Buckets that have properties other than the defaults 
consume space in the ring state.  Other than that, they are essentially free 
unless you're using a backend that segregates data by bucket - the only one 
that does at this time is innostore.
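
For example, this is the kind of request that makes a bucket non-default and 
therefore adds it to the ring state. A sketch over the HTTP interface; host, 
port, and bucket name are assumptions:

  # setting a custom n_val means this bucket's properties now
  # travel with the ring state
  curl -X PUT -H "Content-Type: application/json" \
    -d '{"props":{"n_val":2}}' \
    http://127.0.0.1:8098/riak/my_bucket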

Is there a reason you need so many buckets? 

Sean Cribbs 
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On Sep 16, 2010, at 2:17 PM, SKester wrote:

> Is there a practical (or hard) limit to the number of buckets a riak cluster 
> can handle?  One possible data model we could use for one application could 
> result in ~80,000 buckets.  Is that a reasonable number?
> 
> Thanks,
> Scott
> ___
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Limit on number of buckets

2010-09-16 Thread Alexander Sicular
There is no limit to the number of buckets a cluster can handle. The 
only consideration I know of is when using non-default bucket 
properties (like bucket-specific N vals), because non-default values are 
gossiped around the cluster.


-Alexander


@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Sep 16, 2010, at 14:17, SKester  wrote:

Is there a practical (or hard) limit to the number of buckets a riak  
cluster can handle?  One possible data model we could use for one  
application could result in ~80,000 buckets.  Is that a reasonable  
number?


Thanks,
Scott
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Limit on number of buckets

2010-09-16 Thread Scott
Thanks for the quick replies Sean and Alexander.  One of our current
products allows users to sign up for weather alerts based on their zip
code.  When we receive a weather alert for a set of locations, we need
to quickly find all users in the zip codes affected. We currently do
this with a simple sql query against a relational db.  Being new at
this key/value store thing, we are not sure of the best way to tackle this
with Riak.

Some zip codes have over 20,000 users, so storing the users in a json
array with the zip code as the key would get ugly fast.  One thought
was to store the user profiles in one bucket, and then add a key per
user in the correct zip code bucket, perhaps with a link back to the
user's record in the profile bucket.  We could then fetch the keys for
the affected zip codes using map reduce.  I am open to all suggestions
on how to best model this type of data in Riak.
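
Something like the following is what I have in mind, sketched over the HTTP 
interface (the bucket, key, and host names are made up):

  # the user profile lives in a profile bucket...
  curl -X PUT -H "Content-Type: application/json" \
    -d '{"name":"Jane","zip":"10001"}' \
    http://127.0.0.1:8098/riak/users/u42

  # ...plus a marker key per user in the zip code bucket,
  # with a link back to the profile
  curl -X PUT -H "Content-Type: application/json" \
    -H 'Link: </riak/users/u42>; riaktag="profile"' \
    -d '{}' \
    http://127.0.0.1:8098/riak/zip_10001/u42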

Thanks,
Scott


Sean Cribbs wrote:
> Scott,
> 
> There is no limit on the number of buckets unless you are changing the 
> bucket properties, like the replication factor, allow_mult, or the pre- and 
> post-commit hooks.  Buckets that have properties other than the defaults 
> consume space in the ring state.  Other than that, they are essentially free 
> unless you're using a backend that segregates data by bucket - the only one 
> that does at this time is innostore.
> 
> Is there a reason you need so many buckets? 
> 
> Sean Cribbs 
> Developer Advocate
> Basho Technologies, Inc.
> http://basho.com/
> 
> On Sep 16, 2010, at 2:17 PM, SKester wrote:
> 
>> Is there a practical (or hard) limit to the number of buckets a riak 
>> cluster can handle?  One possible data model we could use for one 
>> application could result in ~80,000 buckets.  Is that a reasonable number?
>> 
>> Thanks,
>> Scott
>> ___
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Limit on number of buckets

2010-09-16 Thread Sean Cribbs
Listing keys in a bucket is not necessarily going to be faster than storing the 
list in an object.  You might want to measure this to be sure - be aware that 
list-keys is bound by the total number of keys in the cluster, not by the 
number in the bucket.
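
A quick way to get a feel for the cost on your own data, sketched against the 
HTTP interface (default port and a made-up bucket name assumed):

  # time a full key listing for one bucket - the work done is
  # proportional to the total keys in the cluster, not the bucket
  time curl -s 'http://127.0.0.1:8098/riak/zip_10001?keys=true' > /dev/null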

Sean Cribbs 
Developer Advocate
Basho Technologies, Inc.
http://basho.com/

On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander.  One of our current products 
> allows users to sign up for weather alerts based on their zip code.  When we 
> receive a weather alert for a set of locations, we need to quickly find all 
> users in the zip codes affected. We currently do this with a simple sql query 
> against a relational db.  Being new at this key/value store thing, we are not 
> sure of the best way to tackle this with Riak.
> 
> Some zip codes have over 20,000 users, so storing the users in a json array 
> with the zip code as the key would get ugly fast.  One thought was to store 
> the user profiles in one bucket, and then add a key per user in the correct 
> zip code bucket, perhaps with a link back to the user's record in the profile 
> bucket.  We could then fetch the keys for the affected zip codes using map 
> reduce.  I am open to all suggestions on how to best model this type of data 
> in Riak.
> 
> Thanks,
> Scott
> 
> 
> Sean Cribbs wrote:
>> 
>> Scott,
>> 
>> There is no limit on the number of buckets unless you are changing the 
>> bucket properties, like the replication factor, allow_mult, or the pre- and 
>> post-commit hooks.  Buckets that have properties other than the defaults 
>> consume space in the ring state.  Other than that, they are essentially free 
>> unless you're using a backend that segregates data by bucket - the only one 
>> that does at this time is innostore.
>> 
>> Is there a reason you need so many buckets? 
>> 
>> Sean Cribbs 
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>> 
>> On Sep 16, 2010, at 2:17 PM, SKester wrote:
>> 
>>> Is there a practical (or hard) limit to the number of buckets a riak 
>>> cluster can handle?  One possible data model we could use for one 
>>> application could result in ~80,000 buckets.  Is that a reasonable number?
>>> 
>>> Thanks,
>>> Scott
>>> ___
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>> 

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


riak not starting properly

2010-09-16 Thread Richard Heycock
Over the last few weeks I've been finding it harder and harder to start
riak, which, given that it's running on an auto-provisioned ec2 instance, is
a bit of an issue! I can generally restart it by running
/etc/init.d/riak restart, but it's got to the stage where I have to run
it four or five times. I should clarify that when I say "harder to
start" it does start, but as soon as I try to do anything it fails.

The contents of /var/log/riak are here:

http://stuff.roughage.com.au/riak-failure-2.log.tar.gz

rgh
-- 
Richard Heycock

http://topikality.com

+61 (0) 410 646 369
[e]:  r...@topikality.com
[im]: r...@topikality.com

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: confused

2010-09-16 Thread Nils Petersohn
ok, my ring seems ok now.
what i did was to change the rel/vars/dev[1,2,3]_vars.config file.
in there i was just replacing the ips...
this reip thing did not really work out ...

here is my riak ring now:
(d...@192.168.0.100)1> riak_core_ring_manager:get_my_ring().
{ok,{chstate,'d...@192.168.0.100',
 [{'d...@192.168.0.107',{65,63451889794}},
  {'d...@192.168.0.105',{13,63451889512}},
  {'d...@192.168.0.100',{104,63451889512}},
  {'d...@192.168.0.105',{49,63451889512}},
  {'d...@192.168.0.100',{32,63451889009}},
  {'d...@192.168.0.105',{94,63451889253}},
  {'d...@192.168.0.107',{9,63451889769}},
  {'d...@192.168.0.100',{97,63451889494}}],
 {64,
  [{0,'d...@192.168.0.100'},
   {22835963083295358096932575511191922182123945984,
'd...@192.168.0.105'},
   {45671926166590716193865151022383844364247891968,
'd...@192.168.0.107'},
   {68507889249886074290797726533575766546371837952,
'd...@192.168.0.100'},
   {91343852333181432387730302044767688728495783936,
'd...@192.168.0.105'},
   {114179815416476790484662877555959610910619729920,
'd...@192.168.0.107'},
   {137015778499772148581595453067151533092743675904,
'd...@192.168.0.100'},
   {159851741583067506678528028578343455274867621888,
'd...@192.168.0.105'},
   {182687704666362864775460604089535377456991567872,
'd...@192.168.0.100'},
   {205523667749658222872393179600727299639115513856,
'd...@192.168.0.105'},
   {228359630832953580969325755111919221821239459840,
'd...@192.168.0.107'},
   {25119559391624893906625833062344003363405824,
'd...@192.168.0.100'},
   {274031556999544297163190906134303066185487351808,
'd...@192.168.0.105'},
   {296867520082839655260123481645494988367611297792,
'd...@192.168.0.107'},
   {319703483166135013357056057156686910549735243776,
'd...@192.168.0.100'},
   {342539446249430371453988632667878832731859189760,
'd...@192.168.0.105'},
   {365375409332725729550921208179070754913983135744,
'd...@192.168.0.100'},
   {388211372416021087647853783690262677096107081728,
'd...@192.168.0.105'},
   {411047335499316445744786359201454599278231027712,
'd...@192.168.0.107'},
   {433883298582611803841718934712646521460354973696,...},
   {...}|...]},
 {dict,0,16,16,8,80,48,
   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
   {{[],[],[],[],[],[],[],[],[],[],[],[],...}
(d...@192.168.0.100)2> 

i am using 0.12.1 on my mac and 0.12 on both vms. i have now a set of 100,000 
entries like this (just for testing):
{"id":"42164", "actionTime":"2007-05-11 17:08:55", "action":"some action", 
"res":"7024", "user":"5", "client":"2787"}


and my mr job looks like this (just for testing):
{"inputs":"actionbucket",
 "query":[
   {"map":{"language":"javascript", "source":
     "function(values, keyData, arg) {
        var value = Riak.mapValuesJson(values)[0];
        if(value.reservation == '4084'){
          return [value];
        }
        return [];
      }", "keep":true}}
 ],
 "timeout": 90
}


the beam instances are all showing up in "top" now, and there is some traffic 
going back and forth. (~200kb / s)

but this job takes like 1:30 min.

i know that this is not really comparable with a mysql query, because you can do 
more calculations in the mr job to produce much more specialized results, and 
the mr job has a ~linear runtime... but ~1:30 min is still pretty bad  

is there any way to do much better ?

best regards
nils

On Sep 16, 2010, at 7:08 PM, Grant Schofield wrote:

> 
> On Sep 15, 2010, at 2:40 PM, Nils Petersohn wrote:
> 
>> hello,
>> 
>> i was setting up 9 riak instances:
>> 
>> three on my mac with the appropriate app config
>> and six with two virtual machines on a different computer.
>> 
>> all 8 joined the d...@192.168.1.20
>> and the join request was sent.
>> 
>> after setting this up:
>> i wanted to put data with the java client on d...@192.168.1.20, then i got a 
>> timeout ?!?
>> 
> 
> I am curious: did you start this node and then change its name in the config 
> file? Errors like this can happen if you don't riak-admin reip the node; the 
> ring file would also be wrong, which could lead to some of the other errors 
> you saw below. One thing you may want to look at is the state of your ring 
> from the Riak console using riak_core_ring_manager:get_my_ring(). That might 
> show any problems with the ring; feel free to send that along so we ca

Re: confused

2010-09-16 Thread Grant Schofield
I think the slowness is coming from the older list keys implementation in 
0.12.1; list keys has been changed in the tip version of Riak and is quite a 
bit faster now. In addition, there have been a lot of improvements to the 
Javascript map reduce implementation that should help the speed of your query. 
For the time being you will need to run Riak tip to get access to these 
enhancements. 
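
Until then, you can also sidestep the key listing entirely by feeding the job 
explicit bucket/key pairs instead of a bucket name. A minimal sketch over the 
HTTP interface, with made-up keys:

  # map over known keys only - no bucket-wide key listing
  curl -X POST -H "Content-Type: application/json" \
    http://127.0.0.1:8098/mapred \
    -d '{"inputs":[["actionbucket","42164"],["actionbucket","42165"]],
         "query":[{"map":{"language":"javascript",
                          "name":"Riak.mapValuesJson"}}]}'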

Grant Schofield
Developer Advocate
Basho Technologies, Inc.


On Sep 16, 2010, at 5:17 PM, Nils Petersohn wrote:

> ok, my ring seems ok now.
> what i did was to change the rel/vars/dev[1,2,3]_vars.config file.
> in there i was just replacing the ips...
> this reip thing did not really work out ...
> 
> here is my riak ring now:
> (d...@192.168.0.100)1> riak_core_ring_manager:get_my_ring().
> {ok,{chstate,'d...@192.168.0.100',
> [{'d...@192.168.0.107',{65,63451889794}},
>  {'d...@192.168.0.105',{13,63451889512}},
>  {'d...@192.168.0.100',{104,63451889512}},
>  {'d...@192.168.0.105',{49,63451889512}},
>  {'d...@192.168.0.100',{32,63451889009}},
>  {'d...@192.168.0.105',{94,63451889253}},
>  {'d...@192.168.0.107',{9,63451889769}},
>  {'d...@192.168.0.100',{97,63451889494}}],
> {64,
>  [{0,'d...@192.168.0.100'},
>   {22835963083295358096932575511191922182123945984,
>'d...@192.168.0.105'},
>   {45671926166590716193865151022383844364247891968,
>'d...@192.168.0.107'},
>   {68507889249886074290797726533575766546371837952,
>'d...@192.168.0.100'},
>   {91343852333181432387730302044767688728495783936,
>'d...@192.168.0.105'},
>   {114179815416476790484662877555959610910619729920,
>'d...@192.168.0.107'},
>   {137015778499772148581595453067151533092743675904,
>'d...@192.168.0.100'},
>   {159851741583067506678528028578343455274867621888,
>'d...@192.168.0.105'},
>   {182687704666362864775460604089535377456991567872,
>'d...@192.168.0.100'},
>   {205523667749658222872393179600727299639115513856,
>'d...@192.168.0.105'},
>   {228359630832953580969325755111919221821239459840,
>'d...@192.168.0.107'},
>   {25119559391624893906625833062344003363405824,
>'d...@192.168.0.100'},
>   {274031556999544297163190906134303066185487351808,
>'d...@192.168.0.105'},
>   {296867520082839655260123481645494988367611297792,
>'d...@192.168.0.107'},
>   {319703483166135013357056057156686910549735243776,
>'d...@192.168.0.100'},
>   {342539446249430371453988632667878832731859189760,
>'d...@192.168.0.105'},
>   {365375409332725729550921208179070754913983135744,
>'d...@192.168.0.100'},
>   {388211372416021087647853783690262677096107081728,
>'d...@192.168.0.105'},
>   {411047335499316445744786359201454599278231027712,
>'d...@192.168.0.107'},
>   {433883298582611803841718934712646521460354973696,...},
>   {...}|...]},
> {dict,0,16,16,8,80,48,
>   {[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
>   {{[],[],[],[],[],[],[],[],[],[],[],[],...}
> (d...@192.168.0.100)2> 
> 
> i am using 0.12.1 on my mac and 0.12 on both vms. i have now a set of 100,000 
> entries like this (just for testing):
> {"id":"42164", "actionTime":"2007-05-11 17:08:55", "action":"some action", 
> "res":"7024", "user":"5", "client":"2787"}
> 
> 
> and my mr job looks like this (just for testing):
> {"inputs":"actionbucket",
> "query":[
>   {"map":{"language":"javascript", "source":
>   "function(values, keyData, arg) {
>
>   var value = Riak.mapValuesJson(values)[0];
>if(value.reservation == '4084'){
>   return [value];
>   }
>   return [];
>   }","keep":true}}
>   ],"timeout": 90
> }
> 
> 
> the beam instances are all showing up in "top" now, and there is some traffic 
> going back and forth. (~200kb / s)
> 
> but this job takes like 1:30 min.
> 
> i know that this is not really comparable with a mysql query, because you can 
> do more calculations in the mr job to produce much more specialized results, 
> and the mr job has a ~linear runtime... but ~1:30 min is still pretty bad  
> 
> is there any way to do much better ?
> 
> best regards
> nils
> 
> On Sep 16, 2010, at 7:08 PM, Grant Schofield wrote:
> 
>> 
>> On Sep 15, 2010, at 2:40 PM, Nils Petersohn wrote:
>> 
>>> hello,
>>> 
>>> i was setting up 9 riak instances:
>>> 
>>> three on my mac with the appropriate app config
>>> and six with two virtual machines on a different computer.
>>> 
>>> all 8 joined the d...@

badarg ets delete

2010-09-16 Thread Michael Colussi
Hey guys, I have an application using Riak 0.12 that does puts, gets, and
updates.  It works fine, but I get these random error reports in my logs.
Any ideas?

ERROR <0.149.0> ** Generic server <0.149.0> terminating
** Last message in was stop
** When Server state == {state,139315}
** Reason for termination ==
** {badarg,[{ets,delete,[139315]},
{riak_kv_ets_backend,srv_stop,1},
{riak_kv_ets_backend,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
ERROR <0.149.0> crash_report [[{initial_call,
{riak_kv_ets_backend,init,['Argument__1']}},
   {pid,<0.149.0>},
   {registered_name,[]},
   {error_info,
{exit,
 {badarg,
  [{ets,delete,[139315]},
   {riak_kv_ets_backend,srv_stop,1},
   {riak_kv_ets_backend,handle_call,3},
   {gen_server,handle_msg,5},
   {proc_lib,init_p_do_apply,3}]},
 [{gen_server,terminate,6},
  {proc_lib,init_p_do_apply,3}]}},
   {ancestors,

 [<0.148.0>,riak_core_vnode_sup,riak_core_sup,
 <0.58.0>]},
   {messages,[]},
   {links,[<0.148.0>]},
   {dictionary,[]},
   {trap_exit,false},
   {status,running},
   {heap_size,377},
   {stack_size,24},
   {reductions,243}],
  []]
ERROR <0.148.0> ** State machine <0.148.0> terminating
** Last event in was timeout
** When State == active
**  Data  == {state,159851741583067506678528028578343455274867621888,
riak_kv_vnode,

 {state,159851741583067506678528028578343455274867621888,
   riak_kv_ets_backend,<0.149.0>,
   {kv_lru,100,147509,143412,151606},
   {dict,0,16,16,8,80,48,

{[],[],[],[],[],[],[],[],[],[],[],[],[],
  [],[],[]},

{{[],[],[],[],[],[],[],[],[],[],[],[],[],
   [],[],[]}}},
   true},
undefined,none}
** Reason for termination =
** {{badarg,[{ets,delete,[139315]},
 {riak_kv_ets_backend,srv_stop,1},
 {riak_kv_ets_backend,handle_call,3},
 {gen_server,handle_msg,5},
 {proc_lib,init_p_do_apply,3}]},
{gen_server,call,[<0.149.0>,stop]}}
ERROR <0.148.0> crash_report [[{initial_call,
{riak_core_vnode,init,['Argument__1']}},
   {pid,<0.148.0>},
   {registered_name,[]},
   {error_info,
{exit,
 {{badarg,
   [{ets,delete,[139315]},
{riak_kv_ets_backend,srv_stop,1},
{riak_kv_ets_backend,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]},
  {gen_server,call,[<0.149.0>,stop]}},
 [{gen_fsm,terminate,7},
  {proc_lib,init_p_do_apply,3}]}},
   {ancestors,

 [riak_core_vnode_sup,riak_core_sup,<0.58.0>]},
   {messages,
[{'EXIT',<0.149.0>,
  {badarg,
   [{ets,delete,[139315]},
{riak_kv_ets_backend,srv_stop,1},
{riak_kv_ets_backend,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}}]},
   {links,[<0.60.0>]},
   {dictionary,[]},
   {trap_exit,true},
   {status,running},
   {heap_size,377},
   {stack_size,24},
   {reductions,952}],
  []]
___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: badarg ets delete

2010-09-16 Thread Andy Gross
Hi Michael,

These errors are almost certainly harmless; they are thrown when empty,
non-owned vnodes get shut down.

It appears that in some cases the underlying ets table might already be
deleted/GC'd by the time BackendModule:stop tries to explicitly delete it.
I've opened this bug to track the issue:
http://issues.basho.com/show_bug.cgi?id=723

- Andy

--
Andy Gross 
VP, Engineering
Basho Technologies, Inc.
http://basho.com




On Thu, Sep 16, 2010 at 3:59 PM, Michael Colussi  wrote:

>
> Hey guys, I have an application using Riak 0.12 that does puts, gets, and
> updates.  It works fine, but I get these random error reports in my logs.
> Any ideas?
>
> ERROR <0.149.0> ** Generic server <0.149.0> terminating
> ** Last message in was stop
> ** When Server state == {state,139315}
> ** Reason for termination ==
> ** {badarg,[{ets,delete,[139315]},
> {riak_kv_ets_backend,srv_stop,1},
> {riak_kv_ets_backend,handle_call,3},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}
> ERROR <0.149.0> crash_report [[{initial_call,
>
>  {riak_kv_ets_backend,init,['Argument__1']}},
>{pid,<0.149.0>},
>{registered_name,[]},
>{error_info,
> {exit,
>  {badarg,
>   [{ets,delete,[139315]},
>{riak_kv_ets_backend,srv_stop,1},
>{riak_kv_ets_backend,handle_call,3},
>{gen_server,handle_msg,5},
>{proc_lib,init_p_do_apply,3}]},
>  [{gen_server,terminate,6},
>   {proc_lib,init_p_do_apply,3}]}},
>{ancestors,
>
>  [<0.148.0>,riak_core_vnode_sup,riak_core_sup,
>  <0.58.0>]},
>{messages,[]},
>{links,[<0.148.0>]},
>{dictionary,[]},
>{trap_exit,false},
>{status,running},
>{heap_size,377},
>{stack_size,24},
>{reductions,243}],
>   []]
>  ERROR <0.148.0> ** State machine <0.148.0> terminating
> ** Last event in was timeout
> ** When State == active
> **  Data  == {state,159851741583067506678528028578343455274867621888,
> riak_kv_vnode,
>
>  {state,159851741583067506678528028578343455274867621888,
>riak_kv_ets_backend,<0.149.0>,
>{kv_lru,100,147509,143412,151606},
>{dict,0,16,16,8,80,48,
>
> {[],[],[],[],[],[],[],[],[],[],[],[],[],
>   [],[],[]},
>
> {{[],[],[],[],[],[],[],[],[],[],[],[],[],
>[],[],[]}}},
>true},
> undefined,none}
> ** Reason for termination =
> ** {{badarg,[{ets,delete,[139315]},
>  {riak_kv_ets_backend,srv_stop,1},
>  {riak_kv_ets_backend,handle_call,3},
>  {gen_server,handle_msg,5},
>  {proc_lib,init_p_do_apply,3}]},
> {gen_server,call,[<0.149.0>,stop]}}
> ERROR <0.148.0> crash_report [[{initial_call,
> {riak_core_vnode,init,['Argument__1']}},
>{pid,<0.148.0>},
>{registered_name,[]},
>{error_info,
> {exit,
>  {{badarg,
>[{ets,delete,[139315]},
>  {riak_kv_ets_backend,srv_stop,1},
> {riak_kv_ets_backend,handle_call,3},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]},
>   {gen_server,call,[<0.149.0>,stop]}},
>  [{gen_fsm,terminate,7},
>   {proc_lib,init_p_do_apply,3}]}},
>{ancestors,
>
>  [riak_core_vnode_sup,riak_core_sup,<0.58.0>]},
>{messages,
> [{'EXIT',<0.149.0>,
>   {badarg,
>[{ets,delete,[139315]},
> {riak_kv_ets_backend,srv_stop,1},
> {riak_kv_ets_backend,handle_call,3},
> {gen_server,handle_msg,5},
> {proc_lib,init_p_do_apply,3}]}}]},
>

Re: Limit on number of buckets

2010-09-16 Thread Alexander Sicular
Hi Scott,

Until Riak gains the ability to constrain list traversals by bucket, this will 
continue to be a point of friction. This issue has been broached before, and 
there are tickets open on the issue tracker. As I understand it, one 
solution would be to modify bitcask to open a 'cask' per bucket. However, 
nothing comes for free: this would come at the expense of file descriptors 
at the os level, thereby introducing a constraint on the number of buckets in a 
cluster. This is similar to how the inno backend currently operates, as Sean 
pointed out.

Recognizing this constraint, how you can mitigate it really depends on your 
use case. I hate to sound like a broken record, but recent improvements to key 
traversal notwithstanding, I have been using redis as an intermediary key-list 
manager. Augmenting that further, I pull key lists out of redis and write 
them to riak, either by cron or explicitly on user action. Admittedly my volume 
is not at a level where this is a considerable problem at the moment. Then 
again, I don't think it ever will be (for my use case); I'm not trying to crawl 
the world or build the next twitter or facebook. That said, what I do is pull 
a key list out of redis (or riak), generate an appropriate inputs array, and 
feed that to the mapreduce function. I should note that at the moment I do this 
in javascript for ease of development. Another big win, in my book, of using 
redis instead of riak for list management is that redis understands certain 
data primitives, whereas riak is data agnostic. Practically, this means you can 
push/pull/pop/slice data in redis (among other things); you just cannot do that 
in riak. Riak data must be written atomically: if you have a meg, you write a 
meg. There are no diff updates in riak. 
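
In shell terms the flow looks something like this. A sketch only; it assumes a 
redis set per zip code and made-up key and bucket names:

  # keep the key list for a zip code in a redis set...
  redis-cli SADD zip:10001 u42 u43

  # ...then pull it out and hand it to riak as explicit mapreduce inputs
  KEYS=$(redis-cli SMEMBERS zip:10001 | \
    awk '{printf "[\"users\",\"%s\"],", $1}' | sed 's/,$//')
  curl -X POST -H "Content-Type: application/json" \
    http://127.0.0.1:8098/mapred \
    -d "{\"inputs\":[$KEYS],
         \"query\":[{\"map\":{\"language\":\"javascript\",
                              \"name\":\"Riak.mapValuesJson\"}}]}"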

Performance-wise, the first thing you are going to want to look at if and when 
optimization becomes a concern is moving from the http interface to the 
protobuf interface. After that I would look into rewriting your mapreduce in 
erlang. Marshaling complex data between the native erlang internals and the 
javascript interpreter has a non-zero cost associated with it; forgoing this 
step is a big win. Again, I view all this as a growth path within the riak 
environment and "a good thing" (tm). 
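
For instance, riak ships a few erlang built-ins that you can name in the same 
http mapreduce request, so you can drop the javascript vm without writing any 
erlang yourself. A sketch, reusing the bucket name from earlier in the thread:

  # same shape of job, but the map phase runs as a built-in erlang function
  curl -X POST -H "Content-Type: application/json" \
    http://127.0.0.1:8098/mapred \
    -d '{"inputs":"actionbucket",
         "query":[{"map":{"language":"erlang",
                          "module":"riak_kv_mapreduce",
                          "function":"map_object_value"}}]}'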

Assuming your most populous zip codes may have on the order of ~200k 
subscribers, you could encode your user keys in base 62 and fit those keys in 
a 3-character space (62^3 is ~238k); move up to 4 characters for way more leg 
room. At 3 characters (standard 8-bit encoding) your ~200k key list is under 1 
MB (something to consider based on how riak allocates ram for this portion of 
the mapreduce in erlang and/or js). Also, I'm a big fan of fixed-length keys 
for unrolled loops. Either way, feeding keys explicitly to a mapreduce will 
only get better as your input list shrinks in relation to the total keys in 
your system. Data modeling wise, I would have a user bucket, a zip_codes 
bucket, and a zip_users bucket, plus the converse, a users_zip bucket, the 
latter two having the keys of the former as members. I'm also a big fan of 
explicitly derived keys/paths. I would not recommend links here, simply 
because of the unbounded, potentially large nature of your problem.
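
Concretely, the layout I am describing is something like the following (every 
bucket, key, and value here is made up):

  # profile keyed by a fixed-length base-62 id
  curl -X PUT -H "Content-Type: application/json" \
    -d '{"name":"Jane","zip":"10001"}' \
    http://127.0.0.1:8098/riak/users/3Fk

  # membership lists in both directions, keyed by the other side's key
  curl -X PUT -H "Content-Type: application/json" \
    -d '["3Fk","9Qz"]' http://127.0.0.1:8098/riak/zip_users/10001
  curl -X PUT -H "Content-Type: application/json" \
    -d '["10001"]' http://127.0.0.1:8098/riak/users_zip/3Fk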

Do keep us posted,

Alexander


On Sep 16, 2010, at 2:49 PM, Scott wrote:

> Thanks for the quick replies Sean and Alexander.  One of our current products 
> allows users to sign up for weather alerts based on their zip code.  When we 
> receive a weather alert for a set of locations, we need to quickly find all 
> users in the zip codes affected. We currently do this with a simple sql query 
> against a relational db.  Being new at this key/value store thing, we are not 
> sure of the best way to tackle this with Riak.
> 
> Some zip codes have over 20,000 users, so storing the users in a json array 
> with the zip code as the key would get ugly fast.  One thought was to store 
> the user profiles in one bucket, and then add a key per user in the correct 
> zip code bucket, perhaps with a link back to the user's record in the profile 
> bucket.  We could then fetch the keys for the affected zip codes using map 
> reduce.  I am open to all suggestions on how to best model this type of data 
> in Riak.
> 
> Thanks,
> Scott
> 
> 
> Sean Cribbs wrote:
>> Scott,
>> 
>> There is no limit on the number of buckets unless you are changing the 
>> bucket properties, like the replication factor, allow_mult, or the pre- and 
>> post-commit hooks.  Buckets that have properties other than the defaults 
>> consume space in the ring state.  Other than that, they are essentially free 
>> unless you're using a backend that segregates data by bucket - the only one 
>> that does at this time is innostore.
>> 
>> Is there a reason you need so many buckets? 
>> 
>> Sean Cribbs 
>> Developer Advocate
>> Basho Technologies, Inc.
>> http://basho.com/
>> 
>> On Sep 16, 2010, at 2:17 PM, SKester wrote:
>> 
>>> I