[
https://issues.apache.org/jira/browse/SOLR-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053804#comment-14053804
]
Mark Miller commented on SOLR-5656:
-----------------------------------
The approach is fairly simple.
The Overseer class gets a new thread that periodically evaluates live nodes and
cluster state and fires off SolrCore create commands to add replicas when there
are not enough replicas up to meet a collection's replicationFactor.
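To make the per-cycle check concrete, here is a minimal sketch of the "not enough replicas up" test for a single shard. It is not the code in the patch: the class name, the method name, and the use of plain node-name strings are assumptions for illustration only.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Solr: computes how many replicas one shard is missing.
class ReplicaDeficit {
    /**
     * @param replicaNodes      node name hosting each replica of the shard (assumed representation)
     * @param liveNodes         node names currently considered live
     * @param replicationFactor target number of replicas for the collection
     * @return number of replicas to create; never negative, because this sketch only adds replicas
     */
    static int missing(List<String> replicaNodes, Set<String> liveNodes, int replicationFactor) {
        // Count the replicas whose hosting node is still live.
        long up = replicaNodes.stream().filter(liveNodes::contains).count();
        return (int) Math.max(0L, replicationFactor - up);
    }
}
```

A shard with replicas on n1, n2 and n3, where only n1 is live and replicationFactor is 3, would need two new replicas under this reading.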
The feature is enabled per collection by an additional boolean param on the
Collections API create call, autoAddReplicas.
This feature only works with the Collections API.
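As an illustration of what enabling it could look like, the sketch below builds a Collections API CREATE request URL with autoAddReplicas set. The base URL, collection name, and shard and replica counts are placeholder assumptions, and the helper class itself is hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Solr: assembles a Collections API CREATE URL.
class CreateCollectionUrl {
    static String build(String solrBaseUrl, String name, int numShards,
                        int replicationFactor, boolean autoAddReplicas) {
        // LinkedHashMap keeps the parameters in insertion order for a readable URL.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATE");
        params.put("name", name);
        params.put("numShards", Integer.toString(numShards));
        params.put("replicationFactor", Integer.toString(replicationFactor));
        params.put("autoAddReplicas", Boolean.toString(autoAddReplicas));
        String query = params.entrySet().stream()
            .map(e -> e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
        return solrBaseUrl + "/admin/collections?" + query;
    }
}
```

Calling build("http://localhost:8983/solr", "demo", 2, 2, true) yields a URL ending in "&replicationFactor=2&autoAddReplicas=true", which can then be sent with any HTTP client.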
In this initial implementation, replicas are not removed if you end up with too
many for some reason, and replicas are not rebalanced when nodes come back to
life. You must manually move replicas after restoring a node to rebalance.
There are three settings exposed:
autoReplicaFailoverWorkLoopDelay: How often the Overseer inspects the
clusterstate and possibly takes action.
autoReplicaFailoverWaitAfterExpiration: Once a replica no longer looks live, it
won't be replaced until at least this much time has passed after noticing that.
autoReplicaFailoverBadNodeExpiration: Once a replica is marked as looking like
it needs to be replaced, if it still looks bad on a future cycle, it will be
replaced. Once a node is marked as looking bad, after this much time it will be
unmarked.
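One possible reading of how the last two settings interact is sketched below. This is an interpretation of the descriptions above, not the patch itself: the class, its fields, and the millisecond units are assumptions. The work loop delay is not modeled; it would simply decide how often the caller invokes shouldReplace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not part of Solr: tracks when each replica was first seen as not live.
class FailoverTracker {
    private final long waitAfterExpirationMs;
    private final long badNodeExpirationMs;
    // Replica name -> time (ms) at which it was marked as looking bad.
    private final Map<String, Long> markedBadAtMs = new HashMap<>();

    FailoverTracker(long waitAfterExpirationMs, long badNodeExpirationMs) {
        this.waitAfterExpirationMs = waitAfterExpirationMs;
        this.badNodeExpirationMs = badNodeExpirationMs;
    }

    /** Called once per work-loop cycle; returns true when the replica should be replaced now. */
    boolean shouldReplace(String replica, boolean looksLive, long nowMs) {
        if (looksLive) {
            return false; // healthy replicas are never replaced
        }
        Long markedAt = markedBadAtMs.get(replica);
        if (markedAt == null || nowMs - markedAt > badNodeExpirationMs) {
            // First sighting, or the old mark has expired: (re)mark and wait for a later cycle.
            markedBadAtMs.put(replica, nowMs);
            return false;
        }
        // Still bad on a later cycle: replace once the wait period has elapsed.
        return nowMs - markedAt >= waitAfterExpirationMs;
    }
}
```

With a 30 second wait and a 60 second mark expiration (illustrative values), a replica first seen down at t=0 is left alone at t=10s and becomes eligible for replacement at t=30s.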
Additional automated testing still needs to be added, as so far I have focused on
manual testing. To aid in that, I have improved the cloud-dev scripts to make
this type of feature much easier to test. I have one more patch to put up that
expands on that a bit by starting another Solr node that can run ZooKeeper
external to the cluster and that can be used to view the Solr Admin Cloud tab
without actually participating in the cluster. It just makes monitoring while
testing easier and removes the need to run zk yourself or internally on one
of the cluster nodes.
> Add autoAddReplicas feature for shared file systems.
> ----------------------------------------------------
>
> Key: SOLR-5656
> URL: https://issues.apache.org/jira/browse/SOLR-5656
> Project: Solr
> Issue Type: New Feature
> Reporter: Mark Miller
> Assignee: Mark Miller
> Attachments: SOLR-5656.patch, SOLR-5656.patch
>
>
> When using HDFS, the Overseer should have the ability to reassign the cores
> from failed nodes to running nodes.
> Given that the index and transaction logs are in hdfs, it's simple for
> surviving hardware to take over serving cores for failed hardware.
> There are some tricky issues around having the Overseer handle this for you,
> but seems a simple first pass is not too difficult.
> This will add another alternative to replicating both with hdfs and solr.
> It shouldn't be specific to hdfs, and would be an option for any shared file
> system Solr supports.