[
https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100991#comment-14100991
]
Steve Davids commented on SOLR-5986:
------------------------------------
We came across the issue again and added a lot more probes to get a grasp on
what exactly is happening. I believe further tickets might be necessary to
address the various pieces.
#1) We are setting the "timeout" request parameter, which tells the
TimeLimitingCollector to throw a TimeExceededException. In our logs, however,
the error messages for one of the queries we tried appear after about an hour,
even though the timeout is set to a couple of minutes. This is presumably
because query parsing takes about an hour, and once the query is finally
parsed and handed to the collector, the TimeLimitingCollector immediately
throws an exception. We should have something similar that throws the same
exception during the query building phase (this way the partial results
warnings will continue to just work). It looks like the current work is aimed
at solving this issue, which may also fix the problems described in #2.
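A minimal sketch of what #1 asks for, assuming a deadline object that is
created when the request arrives and checked while terms are enumerated. The
names QueryDeadline, QueryTimeExceededException and WildcardExpander are
hypothetical and are not Solr or Lucene classes; the injected clock exists
only so the behavior can be exercised without waiting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the exception the collector throws; not Solr's class. */
class QueryTimeExceededException extends RuntimeException {
    QueryTimeExceededException(String message) {
        super(message);
    }
}

/** Hypothetical time budget shared by the query building and collecting phases. */
class QueryDeadline {
    private final LongSupplier nanoClock;
    private final long deadlineNanos;

    QueryDeadline(long timeAllowedMillis, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + timeAllowedMillis * 1_000_000L;
    }

    /** Throws once the budget is used up; the subtraction form tolerates nanoTime wrap-around. */
    void check() {
        if (nanoClock.getAsLong() - deadlineNanos > 0) {
            throw new QueryTimeExceededException("time budget exhausted while building the query");
        }
    }
}

class WildcardExpander {
    /** Collects the terms that start with the prefix, checking the deadline before each term. */
    static List<String> expand(Iterable<String> terms, String prefix, QueryDeadline deadline) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            deadline.check();
            if (term.startsWith(prefix)) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        QueryDeadline deadline = new QueryDeadline(120_000, System::nanoTime);
        List<String> terms = Arrays.asList("apple", "apricot", "banana");
        System.out.println(expand(terms, "ap", deadline));
    }
}
```

Because the check throws the same kind of exception during expansion as the
collector would later, the caller could treat both cases identically, which is
what keeps the partial results handling unchanged.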
#2) We set socket read timeouts on HTTPClient, which causes the same query to
be sent into the cluster multiple times, giving the cluster a slow, painful
death. This is even more problematic when using the SolrJ API: SolrJ's
LBHttpSolrServer loops through *every* host in the cluster, and if a socket
read timeout happens it tries the next host in the list. Internally, every
request made to the cluster from an outside SolrJ client tries to gather the
results for all shards in the cluster, and once a socket read timeout happens
inside the cluster, the same retry logic attempts to gather results from the
next replica in the list. So, if we hypothetically had 10 shards with 3
replicas each (30 hosts) and made a request from an outside client, it would
make 30 (external SolrJ calls, one to each host, requesting a distributed
search) * 30 (each host is called at least once for the internal distributed
request) = 900 overall requests (each individual search host will handle 30
requests). This should probably become its own ticket, to either a) not retry
on a socket read timeout or b) specify a retry timeout of some sort in the
LBHttpSolrServer (this is something we did internally for simplicity's sake).
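To make the arithmetic and option b) concrete, here is a rough sketch. It is
not the LBHttpSolrServer code and not our internal change; FanOut and
BudgetedRetry are hypothetical names, the host call is an arbitrary function,
and the clock is injected purely so the loop can be exercised without real
timeouts.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.LongSupplier;

class FanOut {
    /** Worst-case request count when every attempt ends in a socket read timeout. */
    static long worstCaseRequests(int shards, int replicasPerShard) {
        long hosts = (long) shards * replicasPerShard;
        // external attempts (one per host) * internal requests per attempt (each host hit once)
        return hosts * hosts;
    }
}

class BudgetedRetry {
    /**
     * Tries hosts in order, but stops starting new attempts once the retry
     * budget has elapsed, instead of walking the whole host list.
     */
    static <T> T firstSuccess(List<String> hosts, Function<String, T> call,
                              long budgetMillis, LongSupplier millisClock) {
        long start = millisClock.getAsLong();
        RuntimeException last = null;
        for (String host : hosts) {
            if (millisClock.getAsLong() - start >= budgetMillis) {
                break; // budget spent: do not send the query to yet another host
            }
            try {
                return call.apply(host);
            } catch (RuntimeException e) {
                last = e; // e.g. a wrapped socket read timeout; move on to the next host
            }
        }
        throw new IllegalStateException("no host answered within the retry budget", last);
    }

    public static void main(String[] args) {
        System.out.println(FanOut.worstCaseRequests(10, 3));
    }
}
```

With a budget close to the query timeout, a runaway query would be attempted
on only a few hosts rather than on all 30, which bounds the fan-out without
disabling retries for hosts that are genuinely down.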
> Don't allow runaway queries from harming Solr cluster health or search
> performance
> ----------------------------------------------------------------------------------
>
> Key: SOLR-5986
> URL: https://issues.apache.org/jira/browse/SOLR-5986
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Steve Davids
> Assignee: Anshum Gupta
> Priority: Critical
> Fix For: 4.10
>
> Attachments: SOLR-5986.patch
>
>
> The intent of this ticket is to have all distributed search requests stop
> wasting CPU cycles on requests that have already timed out or are so
> complicated that they won't be able to execute. We have come across a case
> where a nasty wildcard query within a proximity clause was causing the
> cluster to enumerate terms for hours even though the query timeout was set to
> minutes. This caused a noticeable slowdown within the system, which made us
> restart the replicas that happened to service that one request. In the worst
> case, users with a relatively low zk timeout value will have nodes start
> dropping from the cluster due to long GC pauses.
> [~amccurry] built a mechanism into Apache Blur to help with the issue in
> BLUR-142 (see commit comment for code, though look at the latest code on the
> trunk for newer bug fixes).
> Solr should be able to either prevent these problematic queries from running
> by some heuristic (possibly estimated size of heap usage) or be able to
> execute a thread interrupt on all query threads once the time threshold is
> met. This issue mirrors what others have discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E