Yes, the ZK method seems much more flexible. Adding a new shard would simply be a matter of updating the range assignments in ZK. Where does this currently sit on the list of things to accomplish? I don't have time to work on this now, but if you (or anyone) could provide direction, I'd be willing to work on it when I have spare time. I guess a JIRA detailing where/how to do this could help. Not sure if the design has been thought out that far, though.
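Since the ZK layout has not been designed yet, the following is only an illustration of the kind of lookup those range assignments would drive (see Mark's description of the before/after split ranges below). The class, the sorted-map representation, the shard names, and the use of String.hashCode() as the routing hash are all placeholders, not anything on the branch:

    import java.util.TreeMap;

    // Illustrative sketch only - models range assignments as a sorted map
    // from the upper bound of each hash range to the shard that owns it.
    // In practice this data would live in ZooKeeper and be read/watched by
    // each node; splitting a shard just replaces one entry with two.
    public class RangeRouting {
        private final TreeMap<Integer, String> upperBoundToShard = new TreeMap<>();

        public void assign(int upperBound, String shard) {
            upperBoundToShard.put(upperBound, shard);
        }

        public String shardFor(int hash) {
            // the first range whose upper bound is >= the doc's hash owns the doc
            return upperBoundToShard.ceilingEntry(hash).getValue();
        }

        public static void main(String[] args) {
            RangeRouting routing = new RangeRouting();
            // before the split: a single shard covers the whole hash range
            routing.assign(Integer.MAX_VALUE, "shard1");
            // after splitting shard1: the lower half stays put, the upper half
            // moves to the new shard - only these entries change
            routing.assign(Integer.MAX_VALUE / 2, "shard1");
            routing.assign(Integer.MAX_VALUE, "shard1_new");
            System.out.println(routing.shardFor("some-doc-id".hashCode()));
        }
    }

The point being that adding or splitting a shard only touches the entries for the range being divided; every other assignment stays as it is.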
On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Right now let's say you have one shard - everything there hashes to range X.
>
> Now you want to split that shard with an Index Splitter.
>
> You divide range X in two - giving you two ranges - then you start splitting. This is where the current Splitter needs a little modification. You decide which doc should go into which new index by rehashing each doc id in the index you are splitting - if its hash is greater than X/2, it goes into index1; if it's less, index2. I think there are a couple of current Splitter impls, but one of them does something like: give me an id - now if the ids in the index are above that id, go to index1; if below, index2. We need to instead do a quick hash rather than a simple id compare.
>
> Why do you need to do this on every shard?
>
> The other part we need that we don't have is to store hash range assignments in ZooKeeper - we don't do that yet because it's not needed yet. Instead we currently just calculate that on the fly (too often at the moment - on every request :) - I intend to fix that of course).
>
> At the start, ZK would say: for range X, go to this shard. After the split, it would say: for range less than X/2, go to the old node; for range greater than X/2, go to the new node.
>
> - Mark
>
> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>
>> Hmmm... this doesn't sound like the hashing algorithm that's on the branch, right? The algorithm you're mentioning sounds like there is some logic which is able to tell that a particular range should be distributed between 2 shards instead of 1. So it seems like a trade-off between repartitioning the entire index (on every shard) and having a custom hashing algorithm which is able to handle the situation where 2 or more shards map to a particular range.
>>
>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>
>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>
>>>> I am not familiar with the index splitter that is in contrib, but I'll take a look at it soon. So the process sounds like it would be to run this on all of the current shards' indexes based on the hash algorithm.
>>>
>>> Not something I've thought deeply about myself yet, but I think the idea would be to split as many as you felt you needed to.
>>>
>>> If you wanted to keep the full balance always, this would mean splitting every shard at once, yes. But this depends on how many boxes (partitions) you are willing/able to add at a time.
>>>
>>> You might just split one index to start - now its hash range would be handled by two shards instead of one (if you have 3 replicas per shard, this would mean adding 3 more boxes). When you needed to expand again, you would split another index that was still handling its full starting range. As you grow, once you split every original index, you'd start again, splitting one of the now-halved ranges.
>>>
>>>> Is there also an index merger in contrib which could be used to merge indexes? I'm assuming this would be the process?
>>>
>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command that can do this). But I'm not sure where this fits in?
>>>
>>> - Mark
>>>
>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> Not yet - we don't plan on working on this until a lot of other stuff is working solidly. But someone else could jump in!
>>>>>
>>>>> There are a couple of ways to go about it that I know of:
>>>>>
>>>>> A more long-term solution may be to start using micro shards - each index starts as multiple indexes. This makes it pretty fast to move micro shards around as you decide to change partitions. It's also less flexible, as you are limited by the number of micro shards you start with.
>>>>>
>>>>> A simpler and likely first step is to use an index splitter. We already have one in Lucene contrib - we would just need to modify it so that it splits based on the hash of the document id. This is super flexible, but splitting will obviously take a little while on a huge index. The current index splitter is a multi-pass splitter - good enough to start with, but with most files under codec control these days, we may be able to make a single-pass splitter soon as well.
>>>>>
>>>>> Eventually you could imagine using both options - micro shards that could also be split as needed. Though I still wonder if micro shards will be worth the extra complications myself...
>>>>>
>>>>> Right now though, the idea is that you should pick a good number of partitions to start with, given your expected data ;) Adding more replicas is trivial though.
>>>>>
>>>>> - Mark
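A rough sketch of the split decision Mark describes above: rehash each doc id and send it to one of the two new indexes based on the midpoint of the shard's range. This is not the contrib splitter itself; the [lo, hi] range representation and String.hashCode() standing in for the real routing hash are assumptions made for illustration:

    // Sketch of the hash-based split decision, not the actual contrib splitter.
    public class HashSplitDecision {

        static int hash(String docId) {
            return docId.hashCode(); // placeholder for the real routing hash
        }

        /** true -> doc goes to the upper-half index ("index1" above),
            false -> lower-half index ("index2"). */
        static boolean goesToUpperHalf(String docId, int lo, int hi) {
            int mid = (int) (((long) lo + (long) hi) / 2); // the "X/2" midpoint
            return hash(docId) > mid;
        }

        public static void main(String[] args) {
            int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
            for (String id : new String[] { "doc1", "doc2", "doc3" }) {
                System.out.println(id + " -> "
                        + (goesToUpperHalf(id, lo, hi) ? "index1" : "index2"));
            }
        }
    }

The existing splitter compares raw ids against a pivot id; swapping that comparison for a hash-against-midpoint check like the one above is the modification being discussed.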
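For the merge direction Mark mentions (while noting it may not actually be needed here), the IndexWriter.addIndexes call looks roughly like this against the Lucene 3.x API of the time; the paths and analyzer choice are placeholders:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeIndexes {
        public static void main(String[] args) throws Exception {
            // target index that will receive the merged segments
            Directory target = FSDirectory.open(new File("/path/to/merged"));
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_35,
                    new StandardAnalyzer(Version.LUCENE_35));
            IndexWriter writer = new IndexWriter(target, cfg);
            // pull in the two source indexes
            writer.addIndexes(
                    FSDirectory.open(new File("/path/to/part1")),
                    FSDirectory.open(new File("/path/to/part2")));
            writer.close();
        }
    }

The Solr admin command Mark refers to wraps the same operation at the core level.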
>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>
>>>>>> Another question: is there any support for repartitioning of the index if a new shard is added? What is the recommended approach for handling this? It seemed that the hashing algorithm (and probably any) would require the index to be repartitioned should a new shard be added.
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>> Thanks, I will try this first thing in the morning.
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am currently looking at the latest solrcloud branch and was wondering if there was any documentation on configuring the DistributedUpdateProcessor? What specifically in solrconfig.xml needs to be added/modified to make distributed indexing work?
>>>>>>>>
>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in solr/core/src/test-files
>>>>>>>>
>>>>>>>> You need to enable the update log, add an empty replication handler def, and an update chain with solr.DistributedUpdateProcessFactory in it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> http://www.lucidimagination.com
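To make the three things Mark lists at the bottom of the thread concrete, here is a rough sketch of the corresponding solrconfig.xml pieces. The test file he points to (solr/core/src/test-files/solrconfig-distrib-update.xml on the branch) is the authoritative reference; the chain name, the update-log dir property, and the surrounding processors below are guesses for illustration, and the distributed processor class name is copied exactly as given above:

    <!-- 1. enable the update log -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. an (empty) replication handler definition -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. an update chain containing the distributed update processor -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>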