[ 
https://issues.apache.org/jira/browse/SOLR-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated SOLR-7408:
-----------------------------
    Attachment: SOLR-7408.patch

The patch adds a test which reproduces the bug, however only if you place 
breakpoints and simulate the context switches manually. Therefore I'll explain 
the steps to reproduce here, and we can decide if we want to keep the test or 
remove it:

* Put a breakpoint in ZkController.registerConfListenerForCore on 
{{synchronized (confDirectoryListeners) \{}}.

* Put another breakpoint in ZkController.unregister on {{synchronized 
(confDirectoryListeners) \{}}.

* Run the test in Debug and wait for the two threads to get to 
registerConfListenerForCore
** Advance one of them until it reaches the other breakpoint in {{unregister}}.
** Let it remove the entry from the map.
** Now advance the second thread that's in {{registerConfigListenerForCore}}, 
it doesn't find the confDir in the map and throws the "conf directory is not 
valid" exception.

What happens is that the last core which references a configDir is removed, at 
the same time a new core is created. Note that even if both threads call 
{{.watchZKConfDir}}, we still remove the entry from the map, and the current 
code fails w/ the exception.

This patch partially fixes the problem, by moving the removal of the entry from 
the map into a sync'd block as well, guards the map in {{watchZKConfDir}} and 
re-registers a watch inside {{registerConfListenerForCore}} if one doesn't 
exist.

However, this doesn't solve the entire problem. While now the application won't 
hit an exception, the above chain of events can cause the listener to be 
removed, after it has been registered. The reason is that SolrCore registers 
the listener in its constructor, however {{CoreContainer.getCores()}} (which is 
used by {{ZkController.getCores()}}) does not return that SolrCore until it has 
been registered in {{SolrCores.cores}}. That happens in 
{{CoreContainer.registerCore}}, line 460, after the SolrCore was already 
created and registered its listener.

I think that the solution is to register the core listener only when we are 
ready to advertise that core, i.e. not until its put in {{SolrCore.cores}}. 
That is the only way to guarantee that {{ZkController.unregister}} will remove 
the confDir listener when all known/advertised cores don't reference it. And 
since {{registerConfListenerForCore}} re-adds the watch if one doesn't exist, 
we should be safe.

However, this area of the code is completely new to me, so would appreciate a 
review. You don't really have to run the test, it only helped me understand 
what's going on.

> Race condition can cause a config directory listener to be removed
> ------------------------------------------------------------------
>
>                 Key: SOLR-7408
>                 URL: https://issues.apache.org/jira/browse/SOLR-7408
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shai Erera
>            Assignee: Shai Erera
>         Attachments: SOLR-7408.patch
>
>
> This has been reported here: http://markmail.org/message/ynkm2axkdprppgef, 
> and I was able to reproduce it in a test, although I am only able to 
> reproduce if I put break points and manually simulate the problematic context 
> switches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to