[ 
https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862794#comment-16862794
 ] 

Suril Shah commented on SOLR-13532:
-----------------------------------

[~gus_heck]: We can at least increase the timeout values to 15000 ms as a 
temporary fix for this particular issue. A 1000 ms timeout is too short: 
recovery won't start if the leader does not answer the ping within 1000 ms, 
which can happen for many reasons, such as network issues or slow commits.
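
As a rough sketch of what a configurable value could look like (not a patch): 
the snippet below reads the timeout from a system property and falls back to 
15000 ms. The property name solr.recovery.pingTimeoutMs and the class 
PingTimeoutConfig are hypothetical and do not exist in Solr; the result would 
be passed to withSocketTimeout(...) and withConnectionTimeout(...) in 
RecoveryStrategy.

```java
final class PingTimeoutConfig {
    // Hypothetical property name, used for illustration only; not a Solr setting.
    static final String TIMEOUT_PROPERTY = "solr.recovery.pingTimeoutMs";

    // Default proposed in this comment, replacing the hard-coded 1000 ms.
    static final int DEFAULT_TIMEOUT_MS = 15000;

    /** Returns the configured ping timeout in milliseconds, or the default. */
    static int pingTimeoutMs() {
        // Integer.getInteger returns the default when the property is unset
        // or is not a valid integer.
        int value = Integer.getInteger(TIMEOUT_PROPERTY, DEFAULT_TIMEOUT_MS);
        // Reject zero and negative values instead of passing them to the client.
        return value > 0 ? value : DEFAULT_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        System.out.println("ping timeout (ms): " + pingTimeoutMs());
    }
}
```

With this in place the value could be overridden at startup, for example 
with -Dsolr.recovery.pingTimeoutMs=30000, without a code change.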
Let me know your thoughts on this.

> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>
> Discovered the following issue with core recovery:
>  * Core recovery is not being initiated, and the following message is 
> logged:
> {code:java}
> 2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
> {code}
>  * The above error occurs when the ping request takes longer than the 
> timeout, which is hard-coded to one second in the Solr source code. However, 
> in a typical production setting it is common for a ping to take more than 
> one second, so core recovery never starts and the failure is logged instead.
>  * The other major concern is that this failure is logged as an info 
> message, so it is very difficult to spot if info logging is not enabled.
>  * Please refer to the following code snippet from the [source 
> code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803]
>  to understand the above issue.
> {code:java}
>       try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>           .withSocketTimeout(1000)
>           .withConnectionTimeout(1000)
>           .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>           .build()) {
>         SolrPingResponse resp = httpSolrClient.ping();
>         return leaderReplica;
>       } catch (IOException e) {
>         log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>         Thread.sleep(500);
>       } catch (Exception e) {
>         if (e.getCause() instanceof IOException) {
>           log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>           Thread.sleep(500);
>         } else {
>           return leaderReplica;
>         }
>       }
> {code}
> The above issue will have a high impact on production clusters, since 
> cores that cannot recover may lead to data loss.
> The following improvements would be really helpful:
>  1. The [timeout for ping 
> request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
>  in *RecoveryStrategy.java* should be configurable, with the default set to 
> a higher value such as 15 seconds.
>  2. The message logged at [line 
> 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
>  and [line 
> 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
>  in *RecoveryStrategy.java* should be logged at *error* level instead of 
> *info* level.
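
To illustrate improvement 2, here is a minimal, self-contained sketch of a 
retry loop shaped like the quoted snippet, logging ping failures at error 
level. It is not the Solr implementation: the Ping interface, pingWithRetry, 
and the maxAttempts and sleepMs parameters are hypothetical, and 
java.util.logging (Level.SEVERE) stands in for Solr's logger so the example 
has no dependencies.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LeaderPingRetry {
    private static final Logger LOG = Logger.getLogger(LeaderPingRetry.class.getName());

    /** Hypothetical stand-in for httpSolrClient.ping(). */
    interface Ping {
        void run() throws Exception;
    }

    /**
     * Returns true once the ping succeeds, or fails for a non-I/O reason
     * (the quoted snippet also proceeds in that case). Returns false after
     * maxAttempts I/O failures.
     */
    static boolean pingWithRetry(Ping ping, String leaderUrl, int maxAttempts, long sleepMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ping.run();
                return true;
            } catch (InterruptedException e) {
                throw e; // never swallow interruption
            } catch (Exception e) {
                boolean ioFailure = e instanceof IOException
                        || e.getCause() instanceof IOException;
                if (!ioFailure) {
                    return true; // mirrors the else branch of the quoted snippet
                }
                // Error level, so the failure is visible even when info logging is off.
                LOG.log(Level.SEVERE, "Failed to connect leader " + leaderUrl
                        + " on recovery (attempt " + attempt + "), try again", e);
                Thread.sleep(sleepMs);
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Fake ping that times out twice, then succeeds on the third call.
        boolean reachable = pingWithRetry(() -> {
            if (++calls[0] <= 2) {
                throw new IOException("Read timed out");
            }
        }, "http://leader:8983/solr", 5, 10);
        System.out.println("leader reachable: " + reachable);
    }
}
```

In this sketch each failed attempt is logged with its stack trace, so a 
timeout can be told apart from a refused connection.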



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)