[ https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201049#comment-14201049 ]

James Hardwick commented on SOLR-6707:
--------------------------------------

My assumption about the feature was wrong. Here is the initial error that 
kicked off the sequence:

{noformat}
2014-11-03 11:13:37,734 [updateExecutor-1-thread-4] ERROR update.StreamingSolrServers  - error
org.apache.solr.common.SolrException: Internal Server Error



request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,056 [http-bio-8081-exec-336] WARN  processor.DistributedUpdateProcessor  - Error sending update
org.apache.solr.common.SolrException: Internal Server Error



request: http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - No uncommitted changes. Skipping IW.commit.
2014-11-03 11:13:38,365 [http-bio-8081-exec-324] INFO  search.SolrIndexSearcher  - Opening Searcher@60515a83[appindex] main
2014-11-03 11:13:38,372 [http-bio-8081-exec-324] INFO  update.UpdateHandler  - end_commit_flush
2014-11-03 11:13:38,373 [updateExecutor-1-thread-6] ERROR update.SolrCmdDistributor  - 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
        at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

2014-11-03 11:13:40,812 [http-bio-8081-exec-336] WARN  processor.DistributedUpdateProcessor  - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
        at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,814 [http-bio-8081-exec-336] INFO  cloud.SolrZkClient  - makePath: /collections/appindex/leader_initiated_recovery/shard1/core_node3
2014-11-03 11:13:40,826 [http-bio-8081-exec-336] INFO  cloud.ZkController  - Wrote down to /collections/appindex/leader_initiated_recovery/shard1/core_node3
2014-11-03 11:13:40,826 [http-bio-8081-exec-336] INFO  cloud.ZkController  - Put replica core= appindex coreNodeName=core_node3 on xxx.xxx.xxx.xxx:8081_app-search into leader-initiated recovery.
2014-11-03 11:13:40,827 [http-bio-8081-exec-336] WARN  cloud.ZkController  - Leader is publishing core= appindex coreNodeName =core_node3 state=down on behalf of un-reachable replica http://xxx.xxx.xxx.xxx:8081/app-search/appindex/; forcePublishState? false
2014-11-03 11:13:40,852 [http-bio-8081-exec-336] ERROR processor.DistributedUpdateProcessor  - Setting up to try to start recovery on replica http://xxx.xxx.xxx.xxx:8081/app-search/appindex/ after: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
2014-11-03 11:13:40,864 [updateExecutor-1-thread-5] INFO  cloud.LeaderInitiatedRecoveryThread  - LeaderInitiatedRecoveryThread-appindex started running to send REQUESTRECOVERY command to http://xxx.xxx.xxx.xxx:8081/app-search/appindex/; will try for a max of 600 secs
2014-11-03 11:13:40,865 [updateExecutor-1-thread-5] INFO  cloud.LeaderInitiatedRecoveryThread  - Asking core=appindex coreNodeName=core_node3 on http://xxx.xxx.xxx.xxx:8081/app-search to recover
2014-11-03 11:13:40,866 [http-bio-8081-exec-336] WARN  processor.DistributedUpdateProcessor  - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space left on device
        at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
        at org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
        at org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
        at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,933 [updateExecutor-1-thread-5] INFO  cloud.LeaderInitiatedRecoveryThread  - Successfully sent REQUESTRECOVERY command to core=appindex coreNodeName=core_node3 on http://xxx.xxx.xxx.xxx:8081/app-search
2014-11-03 11:13:40,949 [updateExecutor-1-thread-5] INFO  cloud.LeaderInitiatedRecoveryThread  - LeaderInitiatedRecoveryThread-appindex completed successfully after running for 0 secs
{noformat}

Despite the error claiming "No space left on device", we had ~10 GB free. 
Regardless, the subsequent recovery process for our dead core sent things into 
a tailspin. 
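As an aside, ENOSPC can surface even when free blocks are reported, e.g. if 
the filesystem has run out of inodes or a quota was hit, and what the JVM 
considers "usable" can differ from raw free space. Here is a minimal sketch 
(not from this codebase) for logging what the JVM itself sees for a given 
directory; the default path below is hypothetical:

{code:java}
import java.io.File;

// Prints the disk space figures the JVM sees for a path. getUsableSpace()
// accounts for write restrictions such as quotas, while getFreeSpace() does
// not; neither reflects inode exhaustion, which also surfaces as ENOSPC.
public class DiskSpaceCheck {
    public static void main(String[] args) {
        // Hypothetical default; pass the core's actual data dir instead.
        File dir = new File(args.length > 0 ? args[0] : "/var/solr/appindex/data");
        System.out.printf("total:  %,d bytes%n", dir.getTotalSpace());
        System.out.printf("free:   %,d bytes%n", dir.getFreeSpace());
        System.out.printf("usable: %,d bytes%n", dir.getUsableSpace());
    }
}
{code}

On the OS side, comparing `df -h` with `df -i` distinguishes block exhaustion 
from inode exhaustion.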

> Recovery/election for invalid core results in rapid-fire re-attempts until /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-6707
>                 URL: https://issues.apache.org/jira/browse/SOLR-6707
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10
>            Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server 
> down, and this is what we found after investigating:
> - A running Solr instance with two separate cores, one of which is perpetually 
> down because its configs are not yet completely updated for SolrCloud. This 
> was thought to be harmless since it's not currently in use. 
> - Solr experienced an "internal server error", supposedly because of "No space 
> left on device", even though we appeared to have ~10 GB free. 
> - Solr immediately went into recovery, with subsequent leader elections for 
> each shard of each core. 
> - Our primary core recovered immediately. Our additional core, which was never 
> active in the first place, attempted to recover but of course couldn't due to 
> the improper configs. 
> - Solr then began rapid-fire recovery re-attempts on said node, maybe 20-30 
> times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion (the sketch 
> after this description shows one way to watch that queue's depth).
> - At some point /overseer/queue became so backed up that normal cluster 
> coordination could no longer play out, and Solr toppled over. 
> I know this is a bit of an unusual circumstance due to us keeping the dead 
> core around, and our quick solution has been to remove said core. However, I 
> can see other potential scenarios that might cause the same issue to arise. 
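For anyone hitting this failure mode, a quick way to see whether the overseer 
queue is backing up is to count its children directly in ZooKeeper. A minimal 
sketch using the plain ZooKeeper client follows; the connect string is a 
placeholder for your actual ensemble:

{code:java}
import org.apache.zookeeper.ZooKeeper;

// Counts the pending operations sitting in Solr's overseer queue. A count
// that keeps climbing matches the clogging described in this issue.
public class OverseerQueueDepth {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point it at the real ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
        try {
            int depth = zk.getChildren("/overseer/queue", false).size();
            System.out.println("/overseer/queue depth: " + depth);
        } finally {
            zk.close();
        }
    }
}
{code}

Running this in a loop during an incident makes the rapid-fire enqueueing 
visible long before the cluster actually topples over.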



