[ 
https://issues.apache.org/jira/browse/SOLR-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053804#comment-14053804
 ] 

Mark Miller commented on SOLR-5656:
-----------------------------------

The approach is fairly simple.

The Overseer class gets a new thread that periodically evaluates live nodes and 
cluster state and fires off SolrCore create commands to add replicas when there 
are not enough replicas up to meet a collection's replicationFactor.
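
As a rough illustration of the check that thread performs, here is a minimal 
sketch. It is not the code from the patch: the class name, method name, and the 
plain Map/Set inputs are all hypothetical stand-ins for the real cluster state 
and live nodes structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; not the actual Overseer code from the patch.
class ReplicaShortfallSketch {

    /**
     * Returns the shards whose number of replicas hosted on live nodes is
     * below the requested replicationFactor.
     *
     * @param shardToReplicaNodes shard name -> names of the nodes hosting its replicas
     * @param liveNodes           names of the nodes currently considered live
     * @param replicationFactor   the collection's replicationFactor
     */
    static List<String> shardsNeedingReplicas(Map<String, List<String>> shardToReplicaNodes,
                                              Set<String> liveNodes,
                                              int replicationFactor) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shardToReplicaNodes.entrySet()) {
            // Count only the replicas whose hosting node is still live.
            long up = entry.getValue().stream().filter(liveNodes::contains).count();
            if (up < replicationFactor) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In the feature as described, each shard reported this way would be a candidate 
for a SolrCore create command on a live node.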

The feature is enabled per collection through a new boolean parameter, 
autoAddReplicas, on the Collections API create call.
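
For example, a create request with the feature turned on might be built like 
this. This is only a sketch: the helper class, the base URL, and the collection 
name used below are assumptions for illustration; autoAddReplicas is the new 
parameter, and the remaining parameters are the usual Collections API create 
parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; only builds the request URL, it does not send it.
class CreateCollectionUrlSketch {

    /** Builds a Collections API CREATE URL with autoAddReplicas enabled. */
    static String createUrl(String baseUrl, String name, int numShards, int replicationFactor) {
        try {
            return baseUrl + "/admin/collections?action=CREATE"
                    + "&name=" + URLEncoder.encode(name, "UTF-8")
                    + "&numShards=" + numShards
                    + "&replicationFactor=" + replicationFactor
                    + "&autoAddReplicas=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling createUrl("http://localhost:8983/solr", "demo", 2, 2) yields a URL 
ending in "&replicationFactor=2&autoAddReplicas=true".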

This feature only works with the Collections API.

In this initial implementation, replicas are not removed if you end up with too 
many for some reason, and replicas are not rebalanced when nodes come back to 
life. You must manually move replicas after restoring a node to rebalance.

There are three settings exposed:

autoReplicaFailoverWorkLoopDelay: How often the Overseer inspects the 
clusterstate and possibly takes action.

autoReplicaFailoverWaitAfterExpiration: Once a replica no longer looks live, it 
won't be replaced until at least this much time has passed since that was first 
noticed.

autoReplicaFailoverBadNodeExpiration: Once a replica is marked as looking like 
it needs to be replaced, if it still looks bad on a future cycle, it will be 
replaced. Once a node is marked as looking bad, after this much time it will be 
unmarked.
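
One way to picture how the last two settings interact is the sketch below. It 
is an interpretation of the descriptions above, not the patch itself: the class 
and method names are hypothetical, times are plain milliseconds, and the exact 
ordering of the checks is an assumption. The caller is assumed to invoke 
shouldReplace once per work loop pass (every autoReplicaFailoverWorkLoopDelay) 
for each replica that currently looks down.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the mark / wait / expire bookkeeping.
class FailoverTimingSketch {
    private final long waitAfterExpirationMs; // stands in for autoReplicaFailoverWaitAfterExpiration
    private final long badNodeExpirationMs;   // stands in for autoReplicaFailoverBadNodeExpiration
    private final Map<String, Long> markedAt = new HashMap<>();

    FailoverTimingSketch(long waitAfterExpirationMs, long badNodeExpirationMs) {
        this.waitAfterExpirationMs = waitAfterExpirationMs;
        this.badNodeExpirationMs = badNodeExpirationMs;
    }

    /** Call for a replica that looks down on this pass; true means replace it now. */
    boolean shouldReplace(String replica, long nowMs) {
        Long marked = markedAt.get(replica);
        // Assumption: a mark older than the bad-node expiration is dropped.
        if (marked != null && nowMs - marked > badNodeExpirationMs) {
            markedAt.remove(replica);
            marked = null;
        }
        if (marked == null) {
            // First sighting (or expired mark): mark it and wait for a later pass.
            markedAt.put(replica, nowMs);
            return false;
        }
        // Still bad on a later pass: replace once the wait period has elapsed.
        return nowMs - marked >= waitAfterExpirationMs;
    }

    /** Call when a replica looks live again so a stale mark does not linger. */
    void clear(String replica) {
        markedAt.remove(replica);
    }
}
```

With a 30 second wait and a 60 second expiration, a replica first seen down at 
time 0 is not replaced on a pass at 10 seconds but is on a pass at 30 seconds.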

Additional automated testing still needs to be added, as I have initially 
focused on manual testing. To aid in that, I have improved the cloud-dev 
scripts to make this type of feature much easier to test. I have one more patch 
to put up that expands on that a bit by starting another Solr node that can run 
ZooKeeper external to the cluster and that can be used to view the Solr Admin 
Cloud tab without actually participating in the cluster. This just makes 
monitoring while testing easier and removes the need to run ZooKeeper yourself 
or internally on one of the cluster nodes.



> Add autoAddReplicas feature for shared file systems.
> ----------------------------------------------------
>
>                 Key: SOLR-5656
>                 URL: https://issues.apache.org/jira/browse/SOLR-5656
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>         Attachments: SOLR-5656.patch, SOLR-5656.patch
>
>
> When using HDFS, the Overseer should have the ability to reassign the cores 
> from failed nodes to running nodes.
> Given that the index and transaction logs are in hdfs, it's simple for 
> surviving hardware to take over serving cores for failed hardware.
> There are some tricky issues around having the Overseer handle this for you, 
> but it seems a simple first pass is not too difficult.
> This will add another alternative to replicating both with hdfs and solr.
> It shouldn't be specific to hdfs, and would be an option for any shared file 
> system Solr supports.



