Shalin Shekhar Mangar created SOLR-6095:
-------------------------------------------

             Summary: SolrCloud cluster can end up without an overseer
                 Key: SOLR-6095
                 URL: https://issues.apache.org/jira/browse/SOLR-6095
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.8
            Reporter: Shalin Shekhar Mangar
             Fix For: 4.9, 5.0


We have a large cluster running on EC2 which occasionally ends up without an 
overseer after a rolling restart. We always restart our overseer nodes last; 
otherwise we end up with a large number of shards that can't recover 
properly.

This cluster is running a custom branch forked from 4.8 with SOLR-5473, 
SOLR-5495 and SOLR-5468 applied. We have a large number of small collections 
(120 collections, each with approx 5M docs) on 16 Solr nodes. We are also using 
the overseer roles feature to designate two specific nodes as overseers. 
However, I think the problem that we're seeing is not specific to the overseer 
roles feature.
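
For context, the overseer role designation mentioned above is done through the 
Collections API ADDROLE action. A minimal sketch of the calls involved (the 
base URL and host names below are placeholders, not our actual nodes):
{code}
# Build the Collections API calls that designate preferred overseer nodes.
# The admin base URL and host names are placeholders for illustration only.
base = "http://localhost:8983/solr/admin/collections"
designated = [
    "ec2-host-a.compute-1.amazonaws.com:8986_solr",
    "ec2-host-b.compute-1.amazonaws.com:8987_solr",
]
for node in designated:
    print(f"{base}?action=ADDROLE&role=overseer&node={node}")
{code}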

As soon as the overseer was shut down, we saw the following on the node which 
was next in line to become the overseer:
{code}
2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  - I am going to be the leader ec2-xxxxxxxxxx.compute-1.amazonaws.com:8987_solr
2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
        at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
        at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
        at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
        at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
        at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
        at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
{code}
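
To make the failure mode concrete, here is a minimal toy simulation of the 
race (plain Python, not Solr code; FakeZk and the session names are made up): 
the old leader's ephemeral /overseer_elect/leader node lingers because its 
session expired rather than closed cleanly, so the next-in-line node hits 
NodeExists and, since nothing retries, the cluster is left leaderless.
{code}
# Toy simulation of the election race (not Solr code). A dict stands in for
# the ZooKeeper namespace; "ephemeral" nodes are removed only when the owning
# session closes cleanly, mirroring ZK session semantics.

class FakeZk:
    def __init__(self):
        self.nodes = {}  # znode path -> owning session

    def create_ephemeral(self, path, session):
        if path in self.nodes:
            # models KeeperException$NodeExistsException
            raise FileExistsError(f"NodeExists for {path}")
        self.nodes[path] = session

    def close_session(self, session, clean):
        # A clean close removes the session's ephemerals immediately; an
        # expired session leaves them behind until ZK notices the expiry.
        if clean:
            self.nodes = {p: s for p, s in self.nodes.items() if s != session}

def try_become_leader(zk, session):
    # Models the next-in-line node's runLeaderProcess: create the leader
    # node once and, on NodeExists, only log a warning and give up.
    try:
        zk.create_ephemeral("/overseer_elect/leader", session)
        return True
    except FileExistsError:
        return False  # nobody retries or re-watches -> no overseer

zk = FakeZk()
zk.create_ephemeral("/overseer_elect/leader", "overseer-8986")
# The overseer is shut down, but its ZK session expires instead of closing
# cleanly, so the stale ephemeral leader node lingers:
zk.close_session("overseer-8986", clean=False)
# The next-in-line node's election watcher fires and it tries to lead:
print(try_become_leader(zk, "node-8987"))  # False -> cluster has no overseer
{code}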

When the overseer leader node is shut down gracefully, we get the following in 
the logs:
{code}
2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in Overseer main queue loop
org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
        at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
        at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
        at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
        at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
        at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
        ... 4 more
2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer Loop exiting : ec2-xxxxxxxxxx.compute-1.amazonaws.com:8986_solr
2014-05-20 09:55:39,256 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper watch triggered, but Solr cannot talk to ZK
2014-05-20 09:55:39,259 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped o.e.j.w.WebAppContext{/solr,file:/vol0/cloud86/solr-webapp/webapp/},/vol0/cloud86/webapps/solr.war
{code}

Notice how the overseer kept running almost until the very end, i.e. until 
the Jetty context stopped. On some runs, we got the following on the overseer 
leader node during graceful shutdown:
{code}
2014-05-19 21:33:43,657 [Thread-75] ERROR solr.cloud.Overseer  - Exception in Overseer main queue loop
org.apache.solr.common.SolrException: Could not load collection from ZK:sm71
        at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
        at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
        at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
        at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:769)
        ... 4 more
2014-05-19 21:33:43,662 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper watch triggered, but Solr cannot talk to ZK
2014-05-19 21:33:43,663 [Thread-75] INFO  solr.cloud.Overseer  - Overseer Loop exiting : ec2-xxxxxxxxxxxx.compute-1.amazonaws.com:8987_solr
2014-05-19 21:33:43,664 [OverseerExitThread] ERROR solr.cloud.Overseer  - could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:329)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:85)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:293)
2014-05-19 21:33:43,665 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped o.e.j.w.WebAppContext{/solr,file:/vol0/cloud87/solr-webapp/webapp/},/vol0/cloud87/webapps/solr.war
{code}
Again the overseer kept running until the last moment; by the time it exited, 
the ZK session had expired and it couldn't delete the /overseer_elect/leader 
node. The exception on the next-in-line node was the same, i.e. NodeExists for 
/overseer_elect/leader.
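
The shutdown ordering can be sketched as a toy model (plain Python, not Solr 
code): the leader-node cleanup in checkIfIamStillLeader runs so late that the 
ZK session may already have expired, in which case the ephemeral 
/overseer_elect/leader node is never deleted.
{code}
# Toy model of the graceful-shutdown ordering seen in the logs (not Solr
# code). Whether the stale leader node survives depends only on whether the
# ZK session expires before the exit-time cleanup gets to run.

def graceful_shutdown(session_already_expired):
    """Return (leader_node_deleted, observed_steps) for one shutdown."""
    steps = ["overseer loop exiting"]
    if session_already_expired:
        # getData on /overseer_elect/leader fails with SessionExpired,
        # so the node can be neither verified nor deleted.
        steps.append("SessionExpiredException for /overseer_elect/leader")
        deleted = False
    else:
        steps.append("deleted /overseer_elect/leader")
        deleted = True
    steps.append("jetty context stopped")  # shutdown finally completes
    return deleted, steps

# The runs above correspond to the expired-session case:
deleted, steps = graceful_shutdown(session_already_expired=True)
print(deleted)  # False: the stale leader node is left behind
{code}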

In both cases, we are left with no overseer after the restart. I can easily 
reproduce this problem by repeatedly restarting the overseer leader nodes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
