[ 
https://issues.apache.org/jira/browse/SOLR-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15808282#comment-15808282
 ] 

Mark Miller commented on SOLR-9936:
-----------------------------------

We would want to do some vetting and testing, but it's probably okay at this 
point to limit the recovery executor's pool size.

> Allow configuration for recoveryExecutor thread pool size
> ---------------------------------------------------------
>
>                 Key: SOLR-9936
>                 URL: https://issues.apache.org/jira/browse/SOLR-9936
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: replication (java)
>    Affects Versions: 6.3
>            Reporter: Tim Owen
>         Attachments: SOLR-9936.patch
>
>
> There are two executor services in {{UpdateShardHandler}}: the 
> {{updateExecutor}}, whose size is unbounded for reasons explained in the code 
> comments, and the {{recoveryExecutor}}, which was added later. The latter 
> executes the {{RecoveryStrategy}} code that actually fetches index files and 
> writes them to disk, eventually calling an {{fsync}} thread to ensure the 
> data is durably written.
> We found that with a fast network such as 10GbE it's very easy to overload 
> the local disk storage when restarting Solr instances after some downtime, 
> if they have many cores to load. Typically each of our physical servers 
> contains 6 SSDs and 6 Solr instances, so each Solr has its home dir on a 
> dedicated SSD. With 100+ cores (shard replicas) on each instance, startup 
> can really hammer the SSD, since it's writing in parallel from as many cores 
> as Solr is recovering. This made recovery slow enough that replicas were 
> down for a long time, and shards were even marked as down when none of their 
> replicas had recovered (usually after many machines had been restarted). The 
> very slow IO times (tens of seconds or worse) also caused JVM pauses, which 
> led to ZooKeeper disconnects and hindered recovery further.
> This patch lets us throttle how many threads write to a disk in parallel. In 
> practice we're using a pool size of 4 threads, which prevents the SSD from 
> being overloaded and was enough to recover all cores in a reasonable time.
> Given the code comment explaining why the other thread pool must be 
> unbounded, I'd like some feedback on whether it's OK to bound the 
> {{recoveryExecutor}} in this way.
> It's configured in solr.xml with e.g.
> {noformat}
>   <updateshardhandler>
>     <int name="maxRecoveryThreads">${solr.recovery.threads:4}</int>
>   </updateshardhandler>
> {noformat}
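For illustration, here is a minimal sketch of how such a setting could map to a bounded executor. The class and method names below are hypothetical, not taken from the patch; only the maxRecoveryThreads setting comes from the proposed configuration above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RecoveryExecutorSketch {

    // Hypothetical factory: a positive maxRecoveryThreads bounds the pool,
    // while any other value falls back to an unbounded cached pool
    // (the current behavior described in the issue).
    public static ExecutorService newRecoveryExecutor(int maxRecoveryThreads) {
        if (maxRecoveryThreads > 0) {
            return Executors.newFixedThreadPool(maxRecoveryThreads);
        }
        return Executors.newCachedThreadPool();
    }
}
```

With solr.recovery.threads=4, at most 4 recovery tasks would copy index files to the same disk concurrently, regardless of how many cores are recovering.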



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
