[ https://issues.apache.org/jira/browse/SOLR-10751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026620#comment-16026620 ]

Hoss Man commented on SOLR-10751:
---------------------------------

bq. so, "0" is not really the version of the index, but it's what the master 
responds to the slaves when there is no replicable index. 

And to elaborate on our IRC conversation: at the point where we were theorizing 
why the master might return "0" (before Tomás had found this particular bit of 
code and verified it matched our theory) -- I posed the following straw man 
suggestion(s) for dealing with this special "sentinel" value of "I have no 
index"...

# We could change Solr core / updateHandler initialization so there is _never_ a 
situation where a Solr core is responding to requests but has no index / 
commitPoint -- thus completely eliminating the need for the sentinel value & 
special-case logic on slaves, because they will always have _something_ they 
can fetch
#* i.e.: on startup, if there is no index, create & commit one immediately
# We could "fix" the semantics of replication on the slave side (see the 
sketch after this list)...
#* if the master returns indexVersion==0, the slave treats that as "master 
has nothing to replicate, I should do nothing" (and possibly 'fail' if the 
replication was explicitly requested vs. timer based)
#* as opposed to the current logic, which is "master has nothing to replicate, I 
will blindly create my own arbitrary index independent of master (via deleteAll)"
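
For illustration, here's a minimal sketch of what the second option could look 
like on the slave side; the names (decide, SlaveAction, explicitlyRequested) 
are made up for readability and are not the actual IndexFetcher API:

{code}
// Sketch of option 2: treat indexVersion==0 from the master as "nothing to
// replicate" and leave the local index alone, instead of doing a deleteAll.
enum SlaveAction { DO_NOTHING, FAIL_EXPLICIT_REQUEST, FETCH_FROM_MASTER, ALREADY_IN_SYNC }

static SlaveAction decide(long masterVersion, long slaveVersion,
                          boolean explicitlyRequested) {
  if (masterVersion == 0L) {
    // Master has no replicable index: keep serving what we already have.
    // If the fetch was explicitly requested (not timer based), report a
    // failure so the caller knows nothing was replicated.
    return explicitlyRequested ? SlaveAction.FAIL_EXPLICIT_REQUEST
                               : SlaveAction.DO_NOTHING;
  }
  return (masterVersion != slaveVersion) ? SlaveAction.FETCH_FROM_MASTER
                                         : SlaveAction.ALREADY_IN_SYNC;
}
{code}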

I still think either one of these options would be a good idea -- depending on 
what we want the semantics to be:  

# Should a situation where an external force blows away the master index (or 
someone forces a node w/o an index to be a leader) cause slaves/replicas to 
*immediately* purge all data?
# Or should slaves/replicas keep what they've got until the master/leader 
actually has something for them to replicate?  

Personally, I think #2 makes more sense.

As a practical example: assume someone is doing classic master/slave 
replication and their master has a hardware failure.  The slaves are still 
serving queries just fine.  Rather than swap out an existing slave to be the 
new master, the admin creates an entirely new server to be the master and plans 
on rebuilding the index -- but by reusing the master.company.com hostname, the 
new node starts receiving /replication requests immediately from the existing 
slaves.  Should those slaves really immediately delete all docs from their 
local indexes even though the master is explicitly telling them "I have nothing 
for you to replicate"? ... that sounds like a bug to me.

On the flip side: if chaos has rained down on a SolrCloud cluster, and a new 
leader w/o any index at all has popped up -- I think it's "ok" for replicas to 
serve stale data until the leader has new data for them ... but if you think 
that in the cloud case it's important that all replicas should _immediately_ 
"recover" the "theoretically empty if it did exist" version of the index from 
their leader, then perhaps the leader election code should involve a special 
case to force a commit on the leader if it has no existing commit points?
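
If we did want that, here's a rough sketch of the idea (the classes are real 
Solr ones, but where exactly this would hook into leader election is left 
open, and the helper method itself is hypothetical):

{code}
import java.io.IOException;

import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.CommitUpdateCommand;

// Hypothetical helper: if the core that just became leader has no commit
// points at all, force an empty commit so replicas always have *something*
// to replicate (and the sentinel indexVersion==0 never appears).
static void ensureCommitPointExists(SolrCore core) throws IOException {
  if (core.getDeletionPolicy().getLatestCommit() != null) {
    return; // a commit point already exists, nothing to do
  }
  SolrQueryRequest req = new LocalSolrQueryRequest(core, new ModifiableSolrParams());
  try {
    // An empty commit: creates a commit point without adding any documents.
    core.getUpdateHandler().commit(new CommitUpdateCommand(req, false));
  } finally {
    req.close();
  }
}
{code}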

----

Either way, I *ALSO* have the vague impression that Tomás's primary suggestion 
of always checking the generation is correct as well ... but it seems so obvious 
that I'm not sure if there is some good reason why the code doesn't already do 
that that I'm oblivious to?
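
For concreteness, the fix Tomás is describing would amount to something like 
this in the sync check (just a sketch against the pseudocode quoted below, not 
the attached patch):

{code}
// Version alone isn't enough: two independent commits in the same millisecond
// can give master and slave the same version while the indices differ.  Treat
// the pair (version, generation) as the identity of the commit point instead.
boolean inSync = (masterVersion == slaveVersion)
              && (masterGeneration == slaveGeneration);

if (!inSync) {
  fetchIndexFromMaster(masterGeneration);  // same call as in the pseudocode below
}
{code}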


> Master/Slave IndexVersion conflict
> ----------------------------------
>
>                 Key: SOLR-10751
>                 URL: https://issues.apache.org/jira/browse/SOLR-10751
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: master (7.0)
>            Reporter: Tomás Fernández Löbbe
>            Assignee: Tomás Fernández Löbbe
>         Attachments: SOLR-10751.patch
>
>
> I’ve been looking at some failures in the replica types tests. One strange 
> failure I noticed is that master and slave share the same version, but have 
> different generations. The IndexFetcher code does more or less this:
> {code}
> masterVersion = fetchMasterVersion()
> masterGeneration = fetchMasterGeneration()
> if (masterVersion == 0 && slaveGeneration != 0 && forceReplication) {
>    delete my index
>    commit locally
>    return
> } 
> if (masterVersion != slaveVersion) {
>   fetchIndexFromMaster(masterGeneration)
> } else {
>   //do nothing, master and slave are in sync.
> }
> {code}
> The problem I see happens with this sequence of events:
> # delete the index in master (not a DBQ=*:*, I mean a complete removal of the 
> index files and a reload of the core)
> # replication happens in the slave (it sees version 0, deletes the local index 
> and commits)
> # add a document in master and commit
> If the commit in master and the commit in the slave happen in the same 
> millisecond*, they both end up with the same version, but different indices.
> I think that in addition to checking for the same version, we should validate 
> that slave and master have the same generation, and if not, consider them not 
> in sync and proceed with the replication.
> True, this is a situation that's unlikely to happen in a real prod 
> environment and it's more likely to affect tests, but I think the change 
> makes sense.


