Below is the output from my Riak cluster: 3 physical nodes, ring size 128. As far as I can tell, a fresh Riak install always places partitions on the ring in the same way, as long as the number of vnodes and servers stays the same.
All the presentations, including "A Little Riak Book", show a pretty picture of the ring with nodes claiming partitions in a sequential fashion. That's clearly not the case. The output below shows that node2 is picked as the favourite, which means replicas of certain keys will definitely live on the same hardware. Partitions are split 44 + 42 + 42. Why not 43 + 43 + 42? Another thing: why does the algorithm select nodes in a 'random', non-sequential fashion? When the cluster gets created and nodes 2 and 3 are joined to node 1, it's a clear situation: the partitions are empty, so vnodes could be assigned in a way that leaves no consecutive partitions on the same hardware. (The first sketch below counts both things directly from the ring.)

My issue is that, in my case, if node2 goes down and I'm storing data with N=2, I will definitely not be able to access certain keys, and, more surprisingly, all 2i queries will stop working for the buckets with N=2, failing with {error,insufficient_vnodes_available}. That is all 2i queries for those buckets. (The second sketch shows how to spot such keys in advance.)

I understand that when new nodes are attached Riak tries to avoid reshuffling everything and just moves certain partitions, and at that point you may end up with copies on the same physical node. But even then Riak should make a best effort not to put consecutive partitions on the same server. If it has to move a partition anyway, it could just as well put it on any machine other than the one holding the partitions with the preceding and following indexes. I also understand that Riak does not guarantee that replicas end up on distinct servers (why not? it should, at least for N=2 and N=3, whenever possible).

I appreciate that the minimum recommended setup is 5 nodes and that I should be storing with N=3 at minimum. But I just find it confusing when presentations show something that is not even remotely close to reality. Just to be clear, I have nothing against Riak; I think it's great, though it is a bit disappointing that there are no stronger guarantees about replica placement. I'm probably missing something and simplifying too much. Any clarification appreciated.
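In case anyone wants to reproduce the numbers, here is a rough sketch of the check I'm doing (the module name and layout are mine; riak_core_ring:all_owners/1 is the stock accessor returning the ring as a list of {PartitionIndex, OwnerNode} pairs):

-module(ring_check).
-export([report/0]).

%% Tally partitions per node and count adjacent ring positions that
%% share an owner. Compile and call ring_check:report() from `riak attach`.
report() ->
    {ok, Ring} = riak_core_ring_manager:get_my_ring(),
    Nodes = [Node || {_Idx, Node} <- riak_core_ring:all_owners(Ring)],
    %% Partition count per node -- 44 + 42 + 42 on my cluster.
    PerNode = [{N, length([x || M <- Nodes, M =:= N])}
               || N <- lists:usort(Nodes)],
    %% The ring wraps around, so compare the owner list against itself
    %% rotated by one; every match is a pair of consecutive partitions
    %% on the same box, i.e. an N=2 preflist confined to one server.
    Pairs = lists:zip(Nodes, tl(Nodes) ++ [hd(Nodes)]),
    {PerNode, length([x || {A, B} <- Pairs, A =:= B])}.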
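To see the same problem for a single key, you can ask for its preference list directly. riak_core_util:chash_key/1 and riak_core_apl:get_apl/3 are the standard riak_core helpers; the bucket and key here are made-up placeholders, and the result shown is an illustration using two of the consecutive node2 partitions from the dump below:

(riak@10.173.240.1)3> DocIdx = riak_core_util:chash_key({<<"mybucket">>, <<"somekey">>}).
(riak@10.173.240.1)4> riak_core_apl:get_apl(DocIdx, 2, riak_kv).
[{11417981541647679048466287755595961091061972992,'riak@10.173.240.2'},
 {22835963083295358096932575511191922182123945984,'riak@10.173.240.2'}]

For a key with a preflist like that, losing node2 takes out both primary replicas at once.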
Daniel

(riak@10.173.240.1)2> {ok, Ring} = riak_core_ring_manager:get_my_ring().
{ok, {chstate_v2,'riak@10.173.240.1',
      [{'riak@10.173.240.1',{303,63561952927}},
       {'riak@10.173.240.2',{31,63561952907}},
       {'riak@10.173.240.3',{25,63561952907}}],
      {128,
       [{0,'riak@10.173.240.1'},
        {11417981541647679048466287755595961091061972992,'riak@10.173.240.2'},
        {22835963083295358096932575511191922182123945984,'riak@10.173.240.2'},
        {34253944624943037145398863266787883273185918976,'riak@10.173.240.3'},
        {45671926166590716193865151022383844364247891968,'riak@10.173.240.1'},
        {57089907708238395242331438777979805455309864960,'riak@10.173.240.2'},
        {68507889249886074290797726533575766546371837952,'riak@10.173.240.2'},
        {79925870791533753339264014289171727637433810944,'riak@10.173.240.3'},
        {91343852333181432387730302044767688728495783936,'riak@10.173.240.1'},
        {102761833874829111436196589800363649819557756928,'riak@10.173.240.2'},
        {114179815416476790484662877555959610910619729920,'riak@10.173.240.2'},
        {125597796958124469533129165311555572001681702912,'riak@10.173.240.3'},
        {137015778499772148581595453067151533092743675904,'riak@10.173.240.1'},
        {148433760041419827630061740822747494183805648896,'riak@10.173.240.2'},
        {159851741583067506678528028578343455274867621888,'riak@10.173.240.2'},
        {171269723124715185726994316333939416365929594880,'riak@10.173.240.3'},
        {182687704666362864775460604089535377456991567872,'riak@10.173.240.1'},
        {194105686208010543823926891845131338548053540864,'riak@10.173.240.2'},
        {205523667749658222872393179600727299639115513856,'riak@10.173.240.2'},
        {216941649291305901920859467356323260730177486848, and so on