[
https://issues.apache.org/jira/browse/SOLR-6707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201049#comment-14201049
]
James Hardwick commented on SOLR-6707:
--------------------------------------
My assumption about the feature was wrong. Here is the initial error that
kicked off the sequence:
{noformat}
2014-11-03 11:13:37,734 [updateExecutor-1-thread-4] ERROR
update.StreamingSolrServers - error
org.apache.solr.common.SolrException: Internal Server Error
request:
http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,056 [http-bio-8081-exec-336] WARN
processor.DistributedUpdateProcessor - Error sending update
org.apache.solr.common.SolrException: Internal Server Error
request:
http://xxx.xxx.xxx.xxx:8081/app-search/appindex/update?update.chain=updateRequestProcessorChain&update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fxxx.xxx.xxx.xxx%3A8081%2Fapp-search%2Fappindex%2F&wt=javabin&version=2
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO update.UpdateHandler -
start
commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-11-03 11:13:38,364 [http-bio-8081-exec-324] INFO update.UpdateHandler -
No uncommitted changes. Skipping IW.commit.
2014-11-03 11:13:38,365 [http-bio-8081-exec-324] INFO search.SolrIndexSearcher
- Opening Searcher@60515a83[appindex] main
2014-11-03 11:13:38,372 [http-bio-8081-exec-324] INFO update.UpdateHandler -
end_commit_flush
2014-11-03 11:13:38,373 [updateExecutor-1-thread-6] ERROR
update.SolrCmdDistributor -
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space
left on device
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
at
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
at
org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,812 [http-bio-8081-exec-336] WARN
processor.DistributedUpdateProcessor - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space
left on device
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
at
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
at
org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,814 [http-bio-8081-exec-336] INFO cloud.SolrZkClient -
makePath: /collections/appindex/leader_initiated_recovery/shard1/core_node3
2014-11-03 11:13:40,826 [http-bio-8081-exec-336] INFO cloud.ZkController -
Wrote down to /collections/appindex/leader_initiated_recovery/shard1/core_node3
2014-11-03 11:13:40,826 [http-bio-8081-exec-336] INFO cloud.ZkController -
Put replica core= appindex coreNodeName=core_node3 on
xxx.xxx.xxx.xxx:8081_app-search into leader-initiated recovery.
2014-11-03 11:13:40,827 [http-bio-8081-exec-336] WARN cloud.ZkController -
Leader is publishing core= appindex coreNodeName =core_node3 state=down on
behalf of un-reachable replica
http://xxx.xxx.xxx.xxx:8081/app-search/appindex/; forcePublishState? false
2014-11-03 11:13:40,852 [http-bio-8081-exec-336] ERROR
processor.DistributedUpdateProcessor - Setting up to try to start recovery on
replica http://xxx.xxx.xxx.xxx:8081/app-search/appindex/ after:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space
left on device
2014-11-03 11:13:40,864 [updateExecutor-1-thread-5] INFO
cloud.LeaderInitiatedRecoveryThread - LeaderInitiatedRecoveryThread-appindex
started running to send REQUESTRECOVERY command to
http://xxx.xxx.xxx.xxx:8081/app-search/appindex/; will try for a max of 600 secs
2014-11-03 11:13:40,865 [updateExecutor-1-thread-5] INFO
cloud.LeaderInitiatedRecoveryThread - Asking core=appindex
coreNodeName=core_node3 on http://xxx.xxx.xxx.xxx:8081/app-search to recover
2014-11-03 11:13:40,866 [http-bio-8081-exec-336] WARN
processor.DistributedUpdateProcessor - Error sending update
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No space
left on device
at
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:550)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:292)
at
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:296)
at
org.apache.solr.update.SolrCmdDistributor.access$000(SolrCmdDistributor.java:53)
at
org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:283)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-11-03 11:13:40,933 [updateExecutor-1-thread-5] INFO
cloud.LeaderInitiatedRecoveryThread - Successfully sent REQUESTRECOVERY
command to core=appindex coreNodeName=core_node3 on
http://xxx.xxx.xxx.xxx:8081/app-search
2014-11-03 11:13:40,949 [updateExecutor-1-thread-5] INFO
cloud.LeaderInitiatedRecoveryThread - LeaderInitiatedRecoveryThread-appindex
completed successfully after running for 0 secs
{noformat}
Although the error claims "No space left on device", we had ~10 GB free.
Regardless, the subsequent recovery process for our dead core sent things into
a tailspin.
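
One thing worth ruling out when "No space left on device" appears despite free
space is that the free space was measured on a different file store than the
one holding the index. The following is only a minimal sketch of such a check
using the standard java.nio.file API; the class name DiskSpaceCheck and the
default path "." are assumptions for illustration, not anything taken from
Solr. Note that it reports bytes only, so it would not reveal an exhausted
inode table, which can also produce this error.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Solr.
final class DiskSpaceCheck {

    /** Usable bytes on the file store that actually holds the given path. */
    static long usableBytes(Path path) {
        try {
            FileStore store = Files.getFileStore(path);
            return store.getUsableSpace();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the real index data directory as the first argument;
        // "." is only a placeholder default.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        FileStore store = Files.getFileStore(dir);
        long mb = 1024L * 1024L;
        System.out.printf("%s is on %s: %d MB usable of %d MB%n",
                dir.toAbsolutePath(), store,
                usableBytes(dir) / mb, store.getTotalSpace() / mb);
    }
}
```

Running it against the index directory rather than the root of the machine
shows which mount the index really lives on.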
> Recovery/election for invalid core results in rapid-fire re-attempts until
> /overseer/queue is clogged
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-6707
> URL: https://issues.apache.org/jira/browse/SOLR-6707
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10
> Reporter: James Hardwick
>
> We experienced an issue the other day that brought a production Solr server
> down, and this is what we found after investigating:
> - Running Solr instance with two separate cores, one of which is perpetually
> down because its configs are not yet completely updated for SolrCloud. This
> was thought to be harmless since it's not currently in use.
> - Solr experienced an "internal server error" supposedly because of "No space
> left on device" even though we appeared to have ~10GB free.
> - Solr immediately went into recovery, followed by leader election for
> each shard of each core.
> - Our primary core recovered immediately. Our additional core, which was never
> active in the first place, attempted to recover but of course couldn't due to
> the improper configs.
> - Solr then began rapid-fire reattempting recovery of said node, trying maybe
> 20-30 times per second.
> - This in turn bombarded ZooKeeper's /overseer/queue into oblivion.
> - At some point /overseer/queue becomes so backed up that normal cluster
> coordination can no longer play out, and Solr topples over.
> I know this is a bit of an unusual circumstance due to us keeping the dead
> core around, and our quick solution has been to remove said core. However, I
> can see other potential scenarios that might cause the same issue to arise.
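
The rapid-fire re-attempts described above (20-30 per second) are the kind of
behaviour that a capped exponential backoff between attempts is generally used
to prevent. The sketch below only illustrates that general technique; it is
not Solr's recovery code, and the class RecoveryBackoff, its methods, and the
1 s / 60 s values are hypothetical names and numbers chosen for the example.

```java
// Hypothetical illustration of capped exponential backoff; not Solr code.
final class RecoveryBackoff {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private int failures = 0;

    RecoveryBackoff(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay before the next attempt: base * 2^failures, capped at maxDelayMs. */
    long nextDelayMs() {
        // Limit the shift so the doubling cannot overflow for sane base values.
        int shift = Math.min(failures, 20);
        long delay = Math.min(maxDelayMs, baseDelayMs << shift);
        failures++;
        return delay;
    }

    /** Call after a successful attempt so the next failure starts from the base delay. */
    void reset() {
        failures = 0;
    }

    public static void main(String[] args) {
        RecoveryBackoff backoff = new RecoveryBackoff(1000, 60000);
        // Print the wait that would precede each of the first eight retries.
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("retry " + attempt + " after " + backoff.nextDelayMs() + " ms");
        }
    }
}
```

With these example values the waits double from 1 s up to the 60 s cap, so a
core that can never recover would produce at most one attempt per minute
instead of dozens per second.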