Yes, the ZK method seems much more flexible. Adding a new shard would simply be a matter of updating the range assignments in ZK. Where does this currently sit on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction, I'd be willing to work on it when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far, though.
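Since the ZK layout has not been designed yet, the following is only an illustration of the kind of lookup those range assignments would drive (see Mark's description of the before/after split ranges below). The class, the sorted-map representation, the shard names, and the use of String.hashCode() as the routing hash are all placeholders, not anything on the branch:

    import java.util.TreeMap;

    // Illustrative sketch only - models range assignments as a sorted map
    // from the upper bound of each hash range to the shard that owns it.
    // In practice this data would live in ZooKeeper and be read/watched by
    // each node; splitting a shard just replaces one entry with two.
    public class RangeRouting {
        private final TreeMap<Integer, String> upperBoundToShard = new TreeMap<>();

        public void assign(int upperBound, String shard) {
            upperBoundToShard.put(upperBound, shard);
        }

        public String shardFor(int hash) {
            // the first range whose upper bound is >= the doc's hash owns the doc
            return upperBoundToShard.ceilingEntry(hash).getValue();
        }

        public static void main(String[] args) {
            RangeRouting routing = new RangeRouting();
            // before the split: a single shard covers the whole hash range
            routing.assign(Integer.MAX_VALUE, "shard1");
            // after splitting shard1: the lower half stays put, the upper half
            // moves to the new shard - only these entries change
            routing.assign(Integer.MAX_VALUE / 2, "shard1");
            routing.assign(Integer.MAX_VALUE, "shard1_new");
            System.out.println(routing.shardFor("some-doc-id".hashCode()));
        }
    }

The point being that adding or splitting a shard only touches the entries for the range being divided; every other assignment stays as it is.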
On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Right now let's say you have one shard - everything there hashes to range X.
>
> Now you want to split that shard with an Index Splitter.
>
> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare.
>
> Why do you need to do this on every shard?
>
> The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) - I intend to fix that of course).
>
> At the start, ZK would say: for range X, go to this shard. After the split, it would say: for range less than X/2, go to the old node; for range greater than X/2, go to the new node.
>
> - Mark
>
> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>
>> Hmmm... this doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
>>
>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>
>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>
>>>> I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.
>>>
>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>
>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>
>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-halved ranges.
>>>
>>>> Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?
>>>
>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>
>>> - Mark
>>>
>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> Not yet - we don't plan on working on this until a lot of other stuff is working solidly. But someone else could jump in!
>>>>>
>>>>> There are a couple of ways to go about it that I know of:
>>>>>
>>>>> A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.
>>>>>
>>>>> A simpler and likely first step is to use an index splitter. We already have one in Lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.
>>>>>
>>>>> Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself...
>>>>>
>>>>> Right now though, the idea is that you should pick a good number of partitions to start with, given your expected data ;) Adding more replicas is trivial though.
>>>>>
>>>>> - Mark
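A rough sketch of the split decision Mark describes above: rehash each doc id and send it to one of the two new indexes based on the midpoint of the shard's range. This is not the contrib splitter itself; the [lo, hi] range representation and String.hashCode() standing in for the real routing hash are assumptions made for illustration:

    // Sketch of the hash-based split decision, not the actual contrib splitter.
    public class HashSplitDecision {

        static int hash(String docId) {
            return docId.hashCode(); // placeholder for the real routing hash
        }

        /** true -> doc goes to the upper-half index ("index1" above),
            false -> lower-half index ("index2"). */
        static boolean goesToUpperHalf(String docId, int lo, int hi) {
            int mid = (int) (((long) lo + (long) hi) / 2); // the "X/2" midpoint
            return hash(docId) > mid;
        }

        public static void main(String[] args) {
            int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
            for (String id : new String[] { "doc1", "doc2", "doc3" }) {
                System.out.println(id + " -> "
                        + (goesToUpperHalf(id, lo, hi) ? "index1" : "index2"));
            }
        }
    }

The existing splitter compares raw ids against a pivot id; swapping that comparison for a hash-against-midpoint check like the one above is the modification being discussed.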
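For the merge direction Mark mentions (while noting it may not actually be needed here), the IndexWriter.addIndexes call looks roughly like this against the Lucene 3.x API of the time; the paths and analyzer choice are placeholders:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            // target index that will receive the merged segments
            Directory target = FSDirectory.open(new File("/path/to/merged"));
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            IndexWriter writer = new IndexWriter(target, cfg);
            // pull in the two source indexes
            writer.addIndexes(
                    FSDirectory.open(new File("/path/to/part1")),
                    FSDirectory.open(new File("/path/to/part2")));
            writer.close();
        }
    }

The Solr admin command Mark refers to wraps the same operation at the core level.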
>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Another question: is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>> Thanks, I will try this first thing in the morning.
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?
>>>>>>>>
>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files
>>>>>>>>
>>>>>>>> You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> http://www.lucidimagination.com
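To make the three things Mark lists at the bottom of the thread concrete, here is a rough sketch of the corresponding solrconfig.xml pieces. The test file he points to (solr/core/src/test-files/solrconfig-distrib-update.xml on the branch) is the authoritative reference; the chain name, the update-log dir property, and the surrounding processors below are guesses for illustration, and the distributed processor class name is copied exactly as given above:

    <!-- 1. enable the update log -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. an (empty) replication handler definition -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. an update chain containing the distributed update processor -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>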