So I couldn't resist and attempted this tonight. I used the
solrconfig you mentioned (as is, no modifications), set up a 2-shard
cluster in collection1, sent 1 doc to one of the shards, then updated
it and sent the update to the other.  I don't see the modifications,
though; I only see the original document.  The following is the test:

public void update() throws Exception {

    String key = "1";

    // Index the initial document against the first shard.
    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);
    solrDoc.addField("content", "initial value");

    SolrServer server = servers.get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);
    server.commit();

    // Send an updated version of the same document through the other shard,
    // using the distributed update chain.
    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards",
            "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
}

key is my unique field in schema.xml
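
For reference, the relevant bits of my schema.xml look roughly like this (a
sketch; the field types here are just examples, the uniqueKey declaration is
the part I rely on):

    <!-- schema.xml (sketch): "key" is declared as the unique key -->
    <field name="key" type="string" indexed="true" stored="true" required="true"/>
    <field name="content" type="text" indexed="true" stored="true"/>

    <uniqueKey>key</uniqueKey>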

What am I doing wrong?

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson <jej2...@gmail.com> wrote:
> Yes, the ZK method seems much more flexible.  Adding a new shard would
> simply be a matter of updating the range assignments in ZK.  Where is this
> currently on the list of things to accomplish?  I don't have time to
> work on it now, but if you (or anyone) could provide direction I'd
> be willing to work on it when I have spare time.  I guess a JIRA
> detailing where/how to do this would help.  Not sure the design has
> been thought out that far yet, though.
>
> On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> Right now let's say you have one shard - everything there hashes to range X.
>>
>> Now you want to split that shard with an Index Splitter.
>>
>> You divide range X in two - giving you two ranges - then you start
>> splitting. This is where the current Splitter needs a little modification.
>> You decide which doc should go into which new index by rehashing each doc id
>> in the index you are splitting - if its hash is greater than X/2, it goes
>> into index1 - if it's less, index2. I think there are a couple of current
>> Splitter impls, but one of them does something like: give me an id - now if
>> the ids in the index are above that id, go to index1; if below, index2. We
>> need to instead do a quick hash rather than a simple id compare.
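
(A minimal sketch of that per-document decision, not the actual splitter
code; the hash function, range bounds, and class names are placeholders:)

    // Sketch only: choose a destination index for each doc by hashing its id,
    // rather than comparing the raw id as the current splitter does.
    import java.util.List;

    public class HashSplitSketch {

        // Placeholder hash; a real impl would hash the id bytes consistently
        // with whatever the distributed update hashing uses.
        static int hash(String id) {
            return id.hashCode() & Integer.MAX_VALUE;
        }

        // The split point is the middle of the old range ("X/2" above).
        static void split(List<String> docIds, int rangeStart, int rangeEnd,
                          List<String> index1, List<String> index2) {
            int splitPoint = rangeStart + (rangeEnd - rangeStart) / 2;
            for (String id : docIds) {
                if (hash(id) > splitPoint) {
                    index1.add(id);   // upper half of the old range
                } else {
                    index2.add(id);   // lower half of the old range
                }
            }
        }
    }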
>>
>> Why do you need to do this on every shard?
>>
>> The other part we need that we don't have is to store hash range assignments
>> in zookeeper - we don't do that yet because it's not needed yet. Instead we
>> currently just calculate that on the fly (too often at the moment -
>> on every request :) I intend to fix that of course).
>>
>> At the start, zk would say: for range X, go to this shard. After the split,
>> it would say: for the range less than X/2 go to the old node, for the range
>> greater than X/2 go to the new node.
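
(A sketch of the kind of range-to-shard lookup that could live in ZooKeeper;
the data structure and names are illustrative, not the branch's actual code:)

    // Illustrative only: map the upper bound of each hash range to the shard
    // that owns it, and look up a shard by hash.
    import java.util.TreeMap;

    public class RangeAssignmentSketch {

        // upper bound of a hash range -> shard base URL
        private final TreeMap<Integer, String> upperBoundToShard =
                new TreeMap<Integer, String>();

        public void assign(int rangeUpperBound, String shardUrl) {
            upperBoundToShard.put(rangeUpperBound, shardUrl);
        }

        // Assumes every possible hash falls under some assigned upper bound.
        public String shardFor(int hash) {
            return upperBoundToShard.ceilingEntry(hash).getValue();
        }
    }

    // Before the split: one range covering everything.
    //   assignments.assign(MAX_HASH, "http://host1:8983/solr/collection1");
    // After the split: hashes up to MAX_HASH / 2 stay on the old node,
    // the rest go to the new node.
    //   assignments.assign(MAX_HASH / 2, "http://host1:8983/solr/collection1");
    //   assignments.assign(MAX_HASH,     "http://host2:8983/solr/collection1");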
>>
>> - Mark
>>
>> On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
>>
>>> Hmmm... this doesn't sound like the hashing algorithm that's on the
>>> branch, right?  The algorithm you're mentioning sounds like there is
>>> some logic which is able to tell that a particular range should be
>>> distributed between 2 shards instead of 1.  So it seems like a trade-off
>>> between repartitioning the entire index (on every shard) and having a
>>> custom hashing algorithm which is able to handle the situation where 2
>>> or more shards map to a particular range.
>>>
>>> On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>> On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
>>>>
>>>>> I am not familiar with the index splitter that is in contrib, but I'll
>>>>> take a look at it soon.  So the process sounds like it would be to run
>>>>> this on all of the current shards' indexes based on the hash algorithm.
>>>>
>>>> Not something I've thought deeply about myself yet, but I think the idea 
>>>> would be to split as many as you felt you needed to.
>>>>
>>>> If you wanted to keep the full balance always, this would mean splitting 
>>>> every shard at once, yes. But this depends on how many boxes (partitions) 
>>>> you are willing/able to add at a time.
>>>>
>>>> You might just split one index to start - now its hash range would be
>>>> handled by two shards instead of one (if you have 3 replicas per shard,
>>>> this would mean adding 3 more boxes). When you needed to expand again, you
>>>> would split another index that was still handling its full starting range.
>>>> As you grow, once you had split every original index, you'd start again,
>>>> splitting one of the now-halved ranges.
>>>>
>>>>> Is there also an index merger in contrib which could be used to merge
>>>>> indexes?  I'm assuming this would be the process?
>>>>
>>>> You can merge with IndexWriter.addIndexes (Solr also has an admin command 
>>>> that can do this). But I'm not sure where this fits in?
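
(For reference, a minimal sketch of merging with IndexWriter.addIndexes; the
paths, analyzer, and Version constant are placeholders to adjust for your
release:)

    // Minimal sketch: merge two source indexes into a target index.
    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MergeSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder paths for the target and source index directories.
            Directory target = FSDirectory.open(new File("/path/to/merged"));
            Directory source1 = FSDirectory.open(new File("/path/to/index1"));
            Directory source2 = FSDirectory.open(new File("/path/to/index2"));

            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_CURRENT, new StandardAnalyzer(Version.LUCENE_CURRENT));
            IndexWriter writer = new IndexWriter(target, cfg);
            writer.addIndexes(source1, source2);  // copies the source indexes into the target
            writer.close();
        }
    }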
>>>>
>>>> - Mark
>>>>
>>>>>
>>>>> On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> Not yet - we don't plan on working on this until a lot of other stuff is
>>>>>> working solidly. But someone else could jump in!
>>>>>>
>>>>>> There are a couple ways to go about it that I know of:
>>>>>>
>>>>>> A longer-term solution may be to start using micro shards - each index
>>>>>> starts as multiple indexes. This makes it pretty fast to move micro shards
>>>>>> around as you decide to change partitions. It's also less flexible, as you
>>>>>> are limited by the number of micro shards you start with.
>>>>>>
>>>>>> A simpler and likely first step is to use an index splitter. We
>>>>>> already have one in Lucene contrib - we would just need to modify it so
>>>>>> that it splits based on the hash of the document id. This is super
>>>>>> flexible, but splitting will obviously take a little while on a huge index.
>>>>>> The current index splitter is a multi-pass splitter - good enough to start
>>>>>> with, but with most files under codec control these days, we may be able
>>>>>> to make a single-pass splitter soon as well.
>>>>>>
>>>>>> Eventually you could imagine using both options - micro shards that could
>>>>>> also be split as needed. Though I still wonder if micro shards will be
>>>>>> worth the extra complications myself...
>>>>>>
>>>>>> Right now though, the idea is that you should pick a good number of
>>>>>> partitions to start with, given your expected data ;) Adding more replicas
>>>>>> is trivial, though.
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>
>>>>>>> Another question: is there any support for repartitioning the index
>>>>>>> if a new shard is added?  What is the recommended approach for
>>>>>>> handling this?  It seems that the hashing algorithm (and probably
>>>>>>> any algorithm) would require the index to be repartitioned should a
>>>>>>> new shard be added.
>>>>>>>
>>>>>>> On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson <jej2...@gmail.com> wrote:
>>>>>>>> Thanks I will try this first thing in the morning.
>>>>>>>>
>>>>>>>> On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller <markrmil...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson <jej2...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I am currently looking at the latest solrcloud branch and was
>>>>>>>>>> wondering if there is any documentation on configuring the
>>>>>>>>>> DistributedUpdateProcessor.  What specifically in solrconfig.xml needs
>>>>>>>>>> to be added/modified to make distributed indexing work?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Jamie - take a look at solrconfig-distrib-update.xml in
>>>>>>>>> solr/core/src/test-files
>>>>>>>>>
>>>>>>>>> You need to enable the update log, add an empty replication handler 
>>>>>>>>> def,
>>>>>>>>> and an update chain with solr.DistributedUpdateProcessFactory in it.
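
(For anyone following along, a sketch of what those three pieces look like in
solrconfig.xml; this is based on the description above rather than copied from
the test file, so element details and the exact processor class name should be
checked against the branch:)

    <!-- 1. Enable the update log. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.data.dir:}</str>
      </updateLog>
    </updateHandler>

    <!-- 2. An empty replication handler definition. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- 3. An update chain containing the distributed update processor
            (class name as given above; confirm against the branch). -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.DistributedUpdateProcessFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>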
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>> http://www.lucidimagination.com
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> - Mark
>>>>>>
>>>>>> http://www.lucidimagination.com
>>>>>>
>>>>
>>>> - Mark Miller
>>>> lucidimagination.com
>>
>> - Mark Miller
>> lucidimagination.com
>>
>
