2010/3/26 Roland Hänel <rol...@haenel.me>

> Jonathan,
>
> I agree with your idea about a tool that could 'propose' good token choices
> for optimal load-balancing.
>
> If I were going to write such a tool: do you think the Thrift API provides
> the necessary information? I think with the RandomPartitioner you cannot
> scan all your rows to actually find out how big certain ranges of rows are.
> And even with the OPP (which is the main target for this kind of tool, for
> sure) you would have to fetch each row's content just to find out how large
> it is, right?
>
>

With the random partitioner there's no need to suggest a token. Keys are
hashed, so they land statistically uniformly across the token space; you can
just split the range into equal-sized segments and get fairly equal storage
load. Your read/write load could still get out of whack if you have hot
spots, but for a large distributed data set I think that's unlikely.
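The even-split idea above can be sketched in a few lines. This is a hypothetical illustration, not a tool that exists: the function name `balanced_tokens` and the choice of 2**127 as the token-space size (Cassandra's MD5-based RandomPartitioner uses tokens in that range) are assumptions for the example.

```python
def balanced_tokens(num_nodes):
    """Return one initial token per node, evenly spaced across the
    token space, assuming a RandomPartitioner-style space of [0, 2**127)."""
    token_space = 2 ** 127
    return [i * token_space // num_nodes for i in range(num_nodes)]

# Example: initial tokens for a 4-node ring.
print(balanced_tokens(4))
```

Because the hash spreads keys uniformly, equal token intervals imply roughly equal storage per node, which is why no key sampling is needed in this case.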

For order-preserving partitioners it's harder. We've been thinking about
this issue at SimpleGeo and were planning to implement an algorithm that
could estimate the median row key statistically without having to inspect
every key. Basically, it would pull a random sample of row keys (maybe from
the Index file?) and then take the median of that sample. Thoughts?

Mike
