[
https://issues.apache.org/jira/browse/SOLR-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shai Erera updated SOLR-7408:
-----------------------------
Attachment: SOLR-7408.patch
The patch adds a test which reproduces the bug, however only if you place
breakpoints and simulate the context switches manually. Therefore I'll explain
the steps to reproduce here, and we can decide if we want to keep the test or
remove it:
* Put a breakpoint in ZkController.registerConfListenerForCore on
{{synchronized (confDirectoryListeners) \{}}.
* Put another breakpoint in ZkController.unregister on {{synchronized
(confDirectoryListeners) \{}}.
* Run the test in Debug and wait for the two threads to get to
registerConfListenerForCore
** Advance one of them until it reaches the other breakpoint in {{unregister}}.
** Let it remove the entry from the map.
** Now advance the second thread that's in {{registerConfigListenerForCore}},
it doesn't find the confDir in the map and throws the "conf directory is not
valid" exception.
What happens is that the last core which references a configDir is removed, at
the same time a new core is created. Note that even if both threads call
{{.watchZKConfDir}}, we still remove the entry from the map, and the current
code fails w/ the exception.
This patch partially fixes the problem, by moving the removal of the entry from
the map into a sync'd block as well, guards the map in {{watchZKConfDir}} and
re-registers a watch inside {{registerConfListenerForCore}} if one doesn't
exist.
However, this doesn't solve the entire problem. While now the application won't
hit an exception, the above chain of events can cause the listener to be
removed, after it has been registered. The reason is that SolrCore registers
the listener in its constructor, however {{CoreContainer.getCores()}} (which is
used by {{ZkController.getCores()}}) does not return that SolrCore until it has
been registered in {{SolrCores.cores}}. That happens in
{{CoreContainer.registerCore}}, line 460, after the SolrCore was already
created and registered its listener.
I think that the solution is to register the core listener only when we are
ready to advertise that core, i.e. not until its put in {{SolrCore.cores}}.
That is the only way to guarantee that {{ZkController.unregister}} will remove
the confDir listener when all known/advertised cores don't reference it. And
since {{registerConfListenerForCore}} re-adds the watch if one doesn't exist,
we should be safe.
However, this area of the code is completely new to me, so would appreciate a
review. You don't really have to run the test, it only helped me understand
what's going on.
> Race condition can cause a config directory listener to be removed
> ------------------------------------------------------------------
>
> Key: SOLR-7408
> URL: https://issues.apache.org/jira/browse/SOLR-7408
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Shai Erera
> Assignee: Shai Erera
> Attachments: SOLR-7408.patch
>
>
> This has been reported here: http://markmail.org/message/ynkm2axkdprppgef,
> and I was able to reproduce it in a test, although I am only able to
> reproduce if I put break points and manually simulate the problematic context
> switches.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]