[
https://issues.apache.org/jira/browse/SOLR-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879022#comment-15879022
]
Shalin Shekhar Mangar commented on SOLR-9835:
---------------------------------------------
Thanks Dat. Sorry it took me a while to finish reviewing. A few
questions/comments:
# LeaderInitiatedRecoveryThread -- What is the reason behind adding
SocketTimeoutException to the list of communication errors on which no more
retries are made?
# ZkController.register method -- The condition !isLeader &&
onlyLeaderIndexes can be replaced by the isReplicaInOnlyLeaderIndexes variable
(see the first sketch after this list).
# Since there is no log replay on startup on replicas anymore, what if a
replica is killed (which keeps its state as 'active' in ZK) and then the
cluster is restarted and that replica becomes the leader candidate? If we do
not replay the discarded log, couldn't that lead to data loss?
# UpdateLog -- Can you please add javadocs outlining the motivation/purpose of
the new methods such as copyOverBufferingUpdates and switchToNewTlog, e.g. why
does switchToNewTlog require copying over some updates from the old tlog?
# It seems that any commits triggered explicitly by the user can interfere
with index replication. Suppose a replication is in progress and a user
explicitly calls commit, which is distributed to all replicas; in that case the
tlogs will be rolled over, and when ReplicateFromLeader calls
switchToNewTlog(), the previous tlog may not have all the updates that should
have been copied over. We should have a way to either disable explicit commits
or protect against them on the replicas.
# UpdateLog -- why does copyOverBufferingUpdates block updates while calling
switchToNewTlog but ReplicateFromLeader doesn't? How are they both safe? (See
the second sketch after this list.)
# Can we add tests covering CDCR and backup/restore with this new replication
scheme?
# ZkController.startReplicationFromLeader -- Merely using a ConcurrentHashMap
is not enough to prevent two replications for the same core from running
concurrently. You should use the atomic putIfAbsent to put the core into the
map before starting replication (see the third sketch after this list).
# Aren't some of the guarantees of real-time-get relaxed in this new mode,
especially around delete-by-queries, which no longer apply on replicas? Can
you please document them in a comment on this issue so that we can transfer
them to the ref guide in the future?
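
For point 2, roughly this simplification; a sketch only, with just the
variable and condition names taken from the patch, the body being illustrative:

{code}
// before (as in the patch)
if (!isLeader && onlyLeaderIndexes) {
    startReplicationFromLeader(coreName); // illustrative body
}

// after: reuse the variable the method already computes
if (isReplicaInOnlyLeaderIndexes) {
    startReplicationFromLeader(coreName);
}
{code}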
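
For point 6, this is the pattern I'm asking about. A sketch only: the method
names copyOverBufferingUpdates/switchToNewTlog come from the patch, but the
body, the copyOverOldUpdates helper, and the use of VersionInfo's
blockUpdates/unblockUpdates here are illustrative:

{code}
public void copyOverBufferingUpdates(TransactionLog newTlog) {
    // Block concurrent adds/deletes so no new entry can land in the old tlog
    // between the copy and the switch; otherwise that entry would be lost.
    versionInfo.blockUpdates();
    try {
        copyOverOldUpdates(newTlog); // hypothetical helper: copy pending updates
        switchToNewTlog(newTlog);    // make the new tlog current
    } finally {
        versionInfo.unblockUpdates();
    }
}
{code}

If ReplicateFromLeader reaches switchToNewTlog without an equivalent guard, an
update arriving in that window could be dropped, which is the race I'd like
explained.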
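
And for point 8, the kind of guard I have in mind. Again a sketch:
ReplicateFromLeader is from the patch, while the map field and the method body
are illustrative:

{code}
private final ConcurrentHashMap<String, ReplicateFromLeader> replicateFromLeaders =
    new ConcurrentHashMap<>();

public void startReplicationFromLeader(String coreName) throws Exception {
    ReplicateFromLeader replicate = new ReplicateFromLeader(cc, coreName);
    // putIfAbsent is atomic and returns null only for the caller that actually
    // inserted the entry, so two concurrent calls for the same core cannot
    // both start a replication.
    if (replicateFromLeaders.putIfAbsent(coreName, replicate) == null) {
        replicate.startReplication();
    }
}
{code}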
> Create another replication mode for SolrCloud
> ---------------------------------------------
>
> Key: SOLR-9835
> URL: https://issues.apache.org/jira/browse/SOLR-9835
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Shalin Shekhar Mangar
> Attachments: SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch,
> SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch,
> SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch
>
>
> The current replication mechanism of SolrCloud is state machine replication:
> replicas start in the same initial state, and each input is distributed across
> the replicas so that all replicas end up in the same next state. But this type
> of replication has some drawbacks:
> - The commit (which is costly) has to run on all replicas
> - Slow recovery: if a replica misses more than N updates during its downtime,
> it has to download the entire index from its leader.
> So we are creating another replication mode for SolrCloud called state
> transfer, which acts like master/slave replication. Basically:
> - The leader distributes each update to the other replicas, but only the
> leader applies the update to its IndexWriter; the other replicas just store
> the update in their UpdateLog (acting like master/slave replication).
> - Replicas frequently poll the latest segments from the leader.
> Pros:
> - Lightweight indexing, because only the leader runs commits and applies
> updates.
> - Very fast recovery: replicas just have to download the missing segments.
> From a CAP point of view, this ticket tries to promise end users a
> distributed system with:
> - Partition tolerance
> - Weak consistency for normal queries: the cluster can serve stale data. This
> happens when the leader has finished a commit and a slave is still fetching
> the latest segments. This period can be at most {{pollInterval + time to fetch
> the latest segments}}.
> - Consistency for RTG: just like the original SolrCloud mode
> - Weak availability: just like the original SolrCloud mode. If a leader goes
> down, clients must wait until a new leader is elected.
> To use this new replication mode, a new collection must be created with an
> additional parameter {{liveReplicas=1}}:
> {code}
> http://localhost:8983/solr/admin/collections?action=CREATE&name=newCollection&numShards=2&replicationFactor=1&liveReplicas=1
> {code}