Hi Daniel,

"A Little Riak Book" covers the logic behind partition allocation in an
overly simplified way.

Riak distributes partitions (vnodes) across the physical nodes in a
pseudo-random fashion, resulting in allocations like the one you described.
These allocations are less optimal when the number of Riak nodes is small,
which is why we (strongly) recommend 5+ nodes for production use.
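
If you want to double-check the spread on a live node, the console session
you already opened can show it: riak_core_ring:all_owners/1 returns the same
{Index, Node} pairs as in your paste, so counting them per node is a short
fold. This is a quick sketch against internal modules, not a supported API;
"riak-admin member-status" also prints the ring percentage per node:

  (riak@10.173.240.1)1> {ok, Ring} = riak_core_ring_manager:get_my_ring().
  (riak@10.173.240.1)2> Owners = riak_core_ring:all_owners(Ring).
  (riak@10.173.240.1)3> dict:to_list(lists:foldl(
                            fun({_Idx, Node}, Acc) ->
                                dict:update_counter(Node, 1, Acc)
                            end, dict:new(), Owners)).
  %% should print one count per node, i.e. the 44 + 42 + 42 split you described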

Storing 3 copies of the data on 3 different servers sounds trivial to do,
but it is not that easy to scale once the number of servers grows. To cope
with this, Riak introduces an "overlay": data is first placed into
"partitions" (their count is always a power of 2), which are then
distributed across the server nodes. As powers of 2 are not divisible by 3,
this approach has a problem at small scale: some nodes will hold a few
extra partitions (which were not meant to be stored there).
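
To make the first hop concrete, here is a rough sketch of how a key lands in
a partition on a 128-partition ring. It approximates what chash:key_of/1
does with SHA-1; the bucket and key are made up, and this is an
illustration, not Riak's actual code path:

  RingSize = 128,
  %% SHA-1 of the {Bucket, Key} pair gives a 160-bit position on the ring
  Hash = crypto:hash(sha, term_to_binary({<<"bucket">>, <<"key">>})),
  <<HashInt:160/integer>> = Hash,
  %% every partition covers an equal slice of the 2^160 keyspace
  SliceSize = (1 bsl 160) div RingSize,
  %% which slice the key falls into (0..127) ...
  PartitionNo = HashInt div SliceSize,
  %% ... and the partition index you see in the ring dump is that slice's start
  PartitionIdx = PartitionNo * SliceSize.

The second hop (partition to node) is where the unevenness comes from:
128 = 3 * 42 + 2, so with 3 nodes a perfectly even split is impossible and
the best case is 43 + 43 + 42. On top of that, the claim algorithm tries to
keep partitions that are adjacent on the ring on different nodes (that is
what target_n_val below is about), which, as far as I understand, is how you
end up at 44 + 42 + 42 instead.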

If you know you are not going to need an n_val greater than 3 in your
buckets, one way to hint this to Riak and get a better distribution of
partitions across nodes is to set target_n_val [0] to 3.
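
For reference, this lives in the riak_core section of app.config (or
advanced.config on Riak 2.x); a minimal sketch, to be adapted to your file:

  {riak_core, [
      {ring_creation_size, 128},   %% your current ring size
      {target_n_val, 3}            %% default is 4; 3 matches an n_val of 3
  ]}

As far as I know, target_n_val only influences how partitions are claimed,
so it takes effect the next time ownership is recalculated (e.g. when a node
joins or leaves), not retroactively on an idle ring.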


[0]
http://docs.basho.com/riak/latest/ops/advanced/configs/configuration-files/


Regards,
Ciprian


On Fri, Mar 14, 2014 at 12:09 AM, Daniel Iwan <iwan.dan...@gmail.com> wrote:

> Below is the output from my Riak cluster: 3 physical nodes, ring size 128.
> As far as I can tell, when Riak is installed fresh it always places
> partitions in the same way on the ring, as long as the number of vnodes
> and servers is the same.
>
> All presentations, including "A Little Riak Book", show a pretty picture
> of the ring with nodes claiming partitions in a sequential fashion. That's
> clearly not the case.
> The output below shows that node2 is picked as the favourite, which means
> replicas of certain keys will definitely be on the same hardware.
> Partitions are split 44 + 42 + 42. Why not 43 + 43 + 42?
>
> Another thing: why does the algorithm select nodes in a 'random',
> non-sequential fashion? When the cluster gets created and nodes 2 & 3 are
> joined to node 1, it's a clean situation: the partitions are empty, so
> vnodes could be assigned in a way that leaves no consecutive partitions on
> the same hardware.
> My issue is that, in my case, if node2 goes down and I'm storing some data
> with N=2, I will definitely not be able to access certain keys, and more
> surprisingly all 2i queries will stop working for the buckets with N=2 due
> to {error,insufficient_vnodes_available}. That is, all 2i's for those
> buckets.
>
> I understand that when new nodes are attached Riak tries to avoid
> reshuffling everything and just moves certain partitions, and at that point
> you may end up with copies on the same physical nodes. But even then Riak
> should make a best effort and try not to put consecutive partitions on the
> same server. If it has to move a partition anyway, it could just as well
> put it on any machine other than the one holding the partitions with the
> preceding and following indexes.
> I also understand Riak does not guarantee that replicas are on distinct
> servers (why not? It should, at least for N=2 and N=3, if possible).
>
> I appreciate that the minimum recommended setup is 5 nodes and that I
> should be storing with N=3 at minimum.
> But I just find it confusing when presentations show something that is not
> even remotely close to reality.
>
> Just to be clear, I have nothing against Riak; I think it's great, though
> it's a bit disappointing that there are no stronger guarantees about
> replica placement here.
>
> I'm probably missing something and simplifying too much. Any clarification
> appreciated.
>
> Daniel
>
>
> (riak@10.173.240.1)2> {ok, Ring} = riak_core_ring_manager:get_my_ring().
> {ok,
>  {chstate_v2,'riak@10.173.240.1',
>   [{'riak@10.173.240.1',{303,63561952927}},
>    {'riak@10.173.240.2',{31,63561952907}},
>    {'riak@10.173.240.3',{25,63561952907}}],
>   {128,
>    [{0,'riak@10.173.240.1'},
>     {11417981541647679048466287755595961091061972992,
>      'riak@10.173.240.2'},
>     {22835963083295358096932575511191922182123945984,
>      'riak@10.173.240.2'},
>     {34253944624943037145398863266787883273185918976,
>      'riak@10.173.240.3'},
>     {45671926166590716193865151022383844364247891968,
>      'riak@10.173.240.1'},
>     {57089907708238395242331438777979805455309864960,
>      'riak@10.173.240.2'},
>     {68507889249886074290797726533575766546371837952,
>      'riak@10.173.240.2'},
>     {79925870791533753339264014289171727637433810944,
>      'riak@10.173.240.3'},
>     {91343852333181432387730302044767688728495783936,
>      'riak@10.173.240.1'},
>     {102761833874829111436196589800363649819557756928,
>      'riak@10.173.240.2'},
>     {114179815416476790484662877555959610910619729920,
>      'riak@10.173.240.2'},
>     {125597796958124469533129165311555572001681702912,
>      'riak@10.173.240.3'},
>     {137015778499772148581595453067151533092743675904,
>      'riak@10.173.240.1'},
>     {148433760041419827630061740822747494183805648896,
>      'riak@10.173.240.2'},
>     {159851741583067506678528028578343455274867621888,
>      'riak@10.173.240.2'},
>     {171269723124715185726994316333939416365929594880,
>      'riak@10.173.240.3'},
>     {182687704666362864775460604089535377456991567872,
>      'riak@10.173.240.1'},
>     {194105686208010543823926891845131338548053540864,
>      'riak@10.173.240.2'},
>     {205523667749658222872393179600727299639115513856,
>      'riak@10.173.240.2'},
>     {216941649291305901920859467356323260730177486848,
>
> and so on
>
>
>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
