Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger Tue, 21 Nov 2017 10:53:27 -0800

A clever idea. Normally what we do when we need to do a restart is to halt indexing and then wait about 30 minutes. If we do not wait and stop the cluster, the default script's 180-second timeout is not enough and we'll have lock files to clean up. We use puppet to start and stop the nodes, but at this point that is not working well since we need to start one node at a time. With each one taking hours, this is a lengthy process! I'd love to see your script!

This new error is now coming up - see the screenshot. For some reason some of the shards have no leader assigned:


http://lovehorsepower.com/SolrClusterErrors.jpg

-Joe
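
One way to list the leaderless shards without the screenshot is to read the Collections API CLUSTERSTATUS response. The following is only a rough sketch: the base URL, collection name, and sample data are made up for illustration, and it only checks whether any replica of a shard carries the leader flag.

```python
import json
import urllib.request


def fetch_cluster_status(base_url):
    """Fetch CLUSTERSTATUS as a dict. base_url is an assumed value such as
    "http://localhost:8983/solr"; adjust for your cluster."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def shards_without_leader(cluster_status):
    """Return sorted (collection, shard) pairs where no replica is flagged leader."""
    missing = []
    for coll_name, coll in cluster_status["cluster"]["collections"].items():
        for shard_name, shard in coll["shards"].items():
            replicas = shard.get("replicas", {}).values()
            # The flag is normally the string "true"; accept a boolean as well.
            has_leader = any(
                str(r.get("leader", "")).lower() == "true" for r in replicas
            )
            if not has_leader:
                missing.append((coll_name, shard_name))
    return sorted(missing)


# Hypothetical sample shaped like a CLUSTERSTATUS response.
sample_status = {
    "cluster": {
        "collections": {
            "main": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active", "leader": "true"},
                            "core_node2": {"state": "active"},
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node3": {"state": "recovering"},
                            "core_node4": {"state": "down"},
                        }
                    },
                }
            }
        }
    }
}

print(shards_without_leader(sample_status))
```

Against a live cluster this would be called as `shards_without_leader(fetch_cluster_status(base_url))`.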


On 11/21/2017 1:34 PM, Hendrik Haddorp wrote:

Hi,
the write.lock issue I see as well when Solr has not been stopped gracefully. The write.lock files are then left in HDFS, as they do not get removed automatically when the client disconnects, like an ephemeral node in ZooKeeper would. Unfortunately Solr also does not realize that it should own the lock, even though it is marked as the owner in the state stored in ZooKeeper, and it is also not willing to retry, which is why you need to restart the whole Solr instance after the cleanup. I added some logic to my Solr startup script which scans the log files in HDFS, compares that with the state in ZooKeeper, and then deletes all lock files that belong to the node that I'm starting.
regards,
Hendrik
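
The script itself is not shown in this thread, so the following is only a guess at the selection step, not Hendrik's actual logic. It assumes the cluster state has the shape of a CLUSTERSTATUS response and that index directories in HDFS follow a `<collection>/<core_node>/data/index*/write.lock` layout. The node name and paths are hypothetical, and nothing is deleted here; the function only returns candidates.

```python
def lock_files_for_node(cluster_status, node_name, lock_paths):
    """Pick the write.lock paths whose replica is assigned to node_name.

    Assumes paths end in <collection>/<core_node>/data/index*/write.lock,
    which is an assumption about the HDFS layout, not a confirmed fact.
    """
    # Collect (collection, core_node) pairs that the cluster state assigns to this node.
    owned = set()
    for coll_name, coll in cluster_status["cluster"]["collections"].items():
        for shard in coll["shards"].values():
            for core_node, replica in shard["replicas"].items():
                if replica.get("node_name") == node_name:
                    owned.add((coll_name, core_node))

    selected = []
    for path in lock_paths:
        parts = path.rstrip("/").split("/")
        if len(parts) < 5 or parts[-1] != "write.lock":
            continue
        collection, core_node, data_dir, index_dir = parts[-5:-1]
        if (
            data_dir == "data"
            and index_dir.startswith("index")
            and (collection, core_node) in owned
        ):
            selected.append(path)
    return selected


# Hypothetical cluster state and HDFS listing for illustration.
lock_sample_status = {
    "cluster": {
        "collections": {
            "main": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "solr01:8983_solr"},
                            "core_node2": {"node_name": "solr02:8983_solr"},
                        }
                    }
                }
            }
        }
    }
}
lock_sample_paths = [
    "/solr/main/core_node1/data/index/write.lock",
    "/solr/main/core_node2/data/index/write.lock",
    "/solr/main/core_node1/data/index/segments_1",
]

print(lock_files_for_node(lock_sample_status, "solr01:8983_solr", lock_sample_paths))
```

The returned paths could then be reviewed and removed (for example with `hdfs dfs -rm`) before the node is started.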

On 21.11.2017 14:07, Joe Obernberger wrote:
Hi All - we have a system with 45 physical boxes running solr 6.6.1 using HDFS as the index. The current index size is about 31 TBytes. With 3x replication that takes up 93 TBytes of disk. Our main collection is split across 100 shards with 3 replicas each. The issue that we're running into is when restarting the solr6 cluster. The shards go into recovery and start to utilize nearly all of their network interfaces. If we start too many of the nodes at once, the shards will go into a recovery, fail, and retry loop and never come up. The errors are related to HDFS not responding fast enough and warnings from the DFSClient. If we stop a node when this is happening, the script will force a stop (180 second timeout) and upon restart, we have lock files (write.lock) inside of HDFS.
The process at this point is to start one node, find the lock files, wait for it to come up completely (hours), stop it, delete the write.lock files, and restart. Usually this second restart is faster, but it still can take 20-60 minutes.
The smaller indexes recover much faster (less than 5 minutes). Should we not have used so many replicas with HDFS? Is there a better way we should have built the solr6 cluster?
Thank you for any insight!

-Joe
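
The "start one node, wait for it to come up completely" step could be automated by polling the cluster state until every replica on that node reports `active`. This is a minimal sketch under stated assumptions: `get_status` is any callable returning a CLUSTERSTATUS-shaped dict, and the node name, poll interval, and timeout are illustrative values, not recommendations.

```python
import time


def replicas_not_active(cluster_status, node_name):
    """Return (collection, shard, core_node, state) for replicas on node_name
    whose state is anything other than "active"."""
    pending = []
    for coll_name, coll in cluster_status["cluster"]["collections"].items():
        for shard_name, shard in coll["shards"].items():
            for core_node, replica in shard["replicas"].items():
                state = replica.get("state", "unknown")
                if replica.get("node_name") == node_name and state != "active":
                    pending.append((coll_name, shard_name, core_node, state))
    return sorted(pending)


def wait_until_active(get_status, node_name, poll_seconds=30,
                      timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll get_status() until node_name has no non-active replicas.

    Returns True once all replicas are active, False if the timeout is hit.
    """
    waited = 0
    while True:
        if not replicas_not_active(get_status(), node_name):
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


def make_status(state):
    """Build a tiny hypothetical cluster state with one replica in `state`."""
    return {
        "cluster": {
            "collections": {
                "main": {
                    "shards": {
                        "shard1": {
                            "replicas": {
                                "core_node1": {
                                    "node_name": "solr01:8983_solr",
                                    "state": state,
                                }
                            }
                        }
                    }
                }
            }
        }
    }


print(replicas_not_active(make_status("recovering"), "solr01:8983_solr"))
```

A restart script would call `wait_until_active` after starting each node and only move on to the next node once it returns True.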
