[ 
https://issues.apache.org/jira/browse/SOLR-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347411#comment-14347411
 ] 

Ted Sullivan commented on SOLR-7188:
------------------------------------

[[email protected]] thanks - Solr has similar issues (OOM errors, 
stop-the-world GC events, etc.) when the Solr JVM is burdened with custom 
processing.

[~elyograg] - separating out a DIH component that could be used with both Solr 
and SolrJ makes the most sense to me. A cursory examination of the code 
suggests that a lot of what SolrCore provides is support for things like 
resource loading, schema, ZooKeeper awareness and such. I haven't done a full 
analysis yet, but the bulk of the dependencies appears to be in DataImporter 
and DocBuilder. That solution would need to be backward compatible with 
existing Solr DIH architectures, which makes it a larger problem than the one 
I set out to solve (when has this happened before, right?).

There is a trivial reason that my DataImportHandlerClient class is in the 
dataimport package: DataImporter includes some package-private methods, which 
are only visible to classes in the same package. I did not want to make 
changes to existing code for this solution.

> Run Data Import Handler processes in a SolrJ client
> ---------------------------------------------------
>
>                 Key: SOLR-7188
>                 URL: https://issues.apache.org/jira/browse/SOLR-7188
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7188.patch, SOLR-7188.patch
>
>
> Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer and 
> adds a DIHCloudWriter implementation of DIHWriter that sends documents to a 
> remote SolrCloud cluster.  This enables existing DIH processes to run outside 
> of the Solr JVM which should enable better scalability.
>
> The current architecture of DIH imposes several restrictions on scalability. 
> First, the DIH runs in the same process space as Solr itself and competes for 
> resources (CPU and memory) with normal Solr processes devoted to indexing and 
> querying. Second, the DIH cannot be multi-threaded which means that 
> parallelizing it requires splitting the processing amongst nodes in a 
> SolrCloud cluster. Since the incoming data is sent through an 
> UpdateRequestProcessor chain (via the SolrWriter implementation of 
> DIHWriter), additional routing is done internally as the documents are 
> forwarded to the current shard leader nodes once the ID hash is computed. 
> This causes additional network traffic within the SolrCloud cluster. Scaling 
> the DIH is limited by the number of nodes in the cluster and any heavy-duty 
> processing due to entity processors or transformation elements shares the 
> processing resources of Solr itself. This is known to be a source of 
> bottlenecks in Solr installations (SolrCloud or Master-Slave) that use DIH.
>
> The DataImportHandlerClient uses native DIH functionality (DataImporter, 
> etc.) but can be run externally to Solr. This means that as many processes as 
> are needed to achieve necessary performance at scale can be added and the 
> processing that occurs within the DataImportHandler is done outside of the 
> Solr JVM. The same benefits that accrue with multiple SolrJ clients can now 
> be realized with DIH without the necessity of porting code from DIH to a 
> SolrJ client.
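
To illustrate the idea in the description above, here is a minimal sketch of 
an import loop that runs outside the search server and hands each document to 
a writer that picks the target shard on the client side. It is not code from 
the patch: ExternalImportSketch, DocWriter, ShardRoutingWriter and runImport 
are hypothetical names, the in-memory lists stand in for connections to shard 
leaders, and String.hashCode stands in for whatever ID hash the cluster really 
uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class ExternalImportSketch {

    /** Hypothetical stand-in for a document writer; NOT the real DIHWriter interface. */
    public interface DocWriter {
        void upload(Map<String, Object> doc);
    }

    /**
     * Hypothetical writer that chooses the target shard in the client process.
     * Each inner list stands in for the connection to one shard leader.
     */
    public static final class ShardRoutingWriter implements DocWriter {
        private final List<List<Map<String, Object>>> shards = new ArrayList<>();

        public ShardRoutingWriter(int numShards) {
            if (numShards < 1) {
                throw new IllegalArgumentException("numShards must be >= 1");
            }
            for (int i = 0; i < numShards; i++) {
                shards.add(new ArrayList<>());
            }
        }

        /** Assumption: String.hashCode stands in for the cluster's real ID hash. */
        public static int shardFor(String id, int numShards) {
            return Math.floorMod(id.hashCode(), numShards);
        }

        @Override
        public void upload(Map<String, Object> doc) {
            String id = String.valueOf(doc.get("id"));
            shards.get(shardFor(id, shards.size())).add(doc);
        }

        public int count(int shard) {
            return shards.get(shard).size();
        }

        public int total() {
            int n = 0;
            for (List<Map<String, Object>> shard : shards) {
                n += shard.size();
            }
            return n;
        }
    }

    /** Applies the transform in this (external) process, then hands each document to the writer. */
    public static int runImport(List<Map<String, Object>> rows,
                                UnaryOperator<Map<String, Object>> transform,
                                DocWriter writer) {
        int written = 0;
        for (Map<String, Object> row : rows) {
            writer.upload(transform.apply(row));
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        ShardRoutingWriter writer = new ShardRoutingWriter(2);
        List<Map<String, Object>> rows = List.of(
                Map.<String, Object>of("id", "doc-1", "title", "first"),
                Map.<String, Object>of("id", "doc-2", "title", "second"));
        int written = runImport(rows, row -> row, writer);
        System.out.println("documents written: " + written);
    }
}
```

Because the transform and the routing decision both happen in the client 
process, several such processes could in principle run side by side without 
adding load to the Solr JVM, which is the scalability argument made above.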



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

