Steven Bower created SOLR-7550:
----------------------------------
Summary: PeerSync fails if a replica returns 500 error
Key: SOLR-7550
URL: https://issues.apache.org/jira/browse/SOLR-7550
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.10.2, 4.8.1
Environment: linux
Reporter: Steven Bower
Priority: Critical
In a 4-node cluster, we stopped a node and then started it back up. Before the
node came back up, a schema change was made that was invalid, so when the node
started the core could not load. While the node was in this state the leader
was restarted as well (so now two nodes were in this bad state). When the
remaining two nodes attempted to become leader and ran PeerSync, they got a
500 error back from the failed-to-start cores and were not able to become
leader, which eventually led to the remaining two nodes ending up in the
"recovery_failed" state and the cluster being offline.
Some logs:
{noformat}
2015-05-14 17:03:20.712 INFO ShardLeaderElectionContext [main-EventThread] - Running the leader process for shard shard1
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - Checking if I should try and be the leader.
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - My last published State was Active, it's okay to be the leader.
2015-05-14 17:03:20.720 INFO ShardLeaderElectionContext [main-EventThread] - I may be the new leader - try and sync
2015-05-14 17:03:20.720 WARN RecoveryStrategy [main-EventThread] - Stopping recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
2015-05-14 17:03:23.220 INFO SyncStrategy [main-EventThread] - Sync replicas to http://host-a2:12345/solr/xxxx/
2015-05-14 17:03:23.221 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr START replicas=[http://host-b1:12345/solr/xxxx/, http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
2015-05-14 17:03:23.238 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr Received 96 versions from http://host-b1:12345/solr/xxxx/
2015-05-14 17:03:23.239 INFO PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr Our versions are newer. ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
2015-05-14 17:03:23.385 WARN PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr exception talking to http://host-a1:12345/solr/xxxx_shard1/, failed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init
failure: Could not load conf for core xxxx_shard1: Plugin init failure for
[schema.xml] fieldType "text_split_colon": Plugin init failure for [schema.xml]
analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema file is
/configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore
'xxxx_shard1' is not available due to init failure: Could not load conf for
core xxxx_shard1: Plugin init failure for [schema.xml] fieldType
"some_field_type": Plugin init failure for [schema.xml] analyzer/filter: Error
loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
...
...
...
{noformat}
It looks as though the error handling is a bit brittle: it tolerates
connection issues and 503 and 404 errors, but any other error response causes
a cluster that needs to elect a leader while a node is in a bad state to fail.
If just adding support for 500 errors is seen as the best approach, that is a
simple fix and I can put a patch up quickly.
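
To illustrate the "add support for 500 errors" option, here is a minimal
standalone sketch. It is not the actual PeerSync code: the class
PeerSyncToleranceSketch, the method countsAsSuccess, and the two status sets
are hypothetical names made up for this example, and it assumes the decision
can be reduced to a check on the HTTP status code alone.

```java
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Solr.
class PeerSyncToleranceSketch {

    // Statuses described above as already tolerated during the sync.
    static final Set<Integer> CURRENTLY_TOLERATED = Set.of(404, 503);

    // Proposed change: also treat a 500 from an unloadable core as tolerable.
    static final Set<Integer> PROPOSED_TOLERATED = Set.of(404, 500, 503);

    // Returns true if a replica's error status should not block the
    // candidate leader from completing its sync.
    static boolean countsAsSuccess(int httpStatus, Set<Integer> tolerated) {
        return tolerated.contains(httpStatus);
    }
}
```

With the current set, countsAsSuccess(500, CURRENTLY_TOLERATED) is false,
which matches the failure in the logs above; with the proposed set it is true,
so a replica whose core failed to load would no longer block the election.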