[jira] [Commented] (SOLR-7188) Run Data Import Handler processes in a SolrJ client

Shawn Heisey (JIRA) Wed, 04 Mar 2015 09:23:07 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347177#comment-14347177 ]


Shawn Heisey commented on SOLR-7188:
------------------------------------

bq. Wouldn't it be a better option to refactor DIH as a component that uses 
only SolrJ client APIs instead of internal Solr APIs?

I mean that DIH should be changed so that its only Solr/Lucene dependency is 
solr-solrj, specifically the SolrClient class (and any transitive 
dependencies).  The dependency on solr-core should disappear.

bq. What are your concerns about using EmbeddedSolrServer?

The embedded object has dependencies outside of the solrj directory that are 
not obvious and, as far as I can tell, not well documented.  I don't know 
whether any of the Lucene jars are required for the usage you have 
implemented, but I suspect they are.  If so, none of those jars are included 
as distinct files in the Solr download.

Rebuilding DIH so that it compiles against SolrClient alone would let it work 
with any SolrClient implementation.  With some care (and additional jars added 
by the user), that would probably include EmbeddedSolrServer.
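
A minimal sketch of that shape, not taken from the patch and using only the 
JDK so it runs on its own.  Every name in it (DocumentSink, SinkBackedWriter, 
InMemorySink, SinkDemo) is hypothetical.  DocumentSink stands in for the small 
slice of SolrClient a writer would need (adding documents and committing); in 
real code that role would be played by SolrClient itself, and the in-memory 
class by whichever implementation the user supplies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the part of a client API the writer relies on. */
interface DocumentSink {
    void add(Map<String, Object> doc);
    void commit();
}

/** Hypothetical writer: it knows only DocumentSink, nothing about server internals. */
final class SinkBackedWriter {
    private final DocumentSink sink;
    private int written;

    SinkBackedWriter(DocumentSink sink) {
        this.sink = sink;
    }

    /** Hands one imported row to the sink as a document. */
    void upload(Map<String, Object> row) {
        sink.add(new LinkedHashMap<>(row)); // copy so later changes to row do not leak in
        written++;
    }

    /** Finishes the import by committing whatever was sent. */
    void close() {
        sink.commit();
    }

    int written() {
        return written;
    }
}

/** One possible implementation; a remote or embedded one could be swapped in. */
final class InMemorySink implements DocumentSink {
    private final List<Map<String, Object>> pending = new ArrayList<>();
    private final List<Map<String, Object>> committed = new ArrayList<>();

    @Override
    public void add(Map<String, Object> doc) {
        pending.add(doc);
    }

    @Override
    public void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    List<Map<String, Object>> committed() {
        return Collections.unmodifiableList(committed);
    }
}

final class SinkDemo {
    public static void main(String[] args) {
        InMemorySink sink = new InMemorySink();
        SinkBackedWriter writer = new SinkBackedWriter(sink);

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", "1");
        row.put("title", "hello");

        writer.upload(row);
        writer.close();
        System.out.println(sink.committed().size() + " document(s) committed");
    }
}
```

The writer compiles against the interface alone, so the choice of 
implementation (and the jars it drags in) is left to whoever constructs it.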


> Run Data Import Handler processes in a SolrJ client
> ---------------------------------------------------
>
>                 Key: SOLR-7188
>                 URL: https://issues.apache.org/jira/browse/SOLR-7188
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7188.patch, SOLR-7188.patch
>
>
> Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer and 
> adds a DIHCloudWriter implementation of DIHWriter that sends documents to a 
> remote SolrCloud cluster.  This enables existing DIH processes to run outside 
> of the Solr JVM, which should enable better scalability.
> The current architecture of DIH imposes several restrictions on scalability. 
> First, the DIH runs in the same process space as Solr itself and competes for 
> resources (CPU and memory) with normal Solr processes devoted to indexing and 
> querying. Second, the DIH cannot be multi-threaded, which means that 
> parallelizing it requires splitting the processing amongst nodes in a 
> SolrCloud cluster. Since the incoming data is sent through an 
> UpdateRequestProcessor chain (via the SolrWriter implementation of 
> DIHWriter), additional routing is done internally as the documents are 
> forwarded to the current shard leader nodes once the ID hash is computed. 
> This causes additional network traffic within the SolrCloud cluster. Scaling 
> the DIH is limited by the number of nodes in the cluster and any heavy-duty 
> processing due to entity processors or transformation elements shares the 
> processing resources of Solr itself. This is known to be a source of 
> bottlenecks in Solr installations (SolrCloud or Master-Slave) that use DIH.
> The DataImportHandlerClient uses native DIH functionality (DataImporter, 
> etc.) but can be run externally to Solr. This means that as many processes as 
> are needed to achieve necessary performance at scale can be added and the 
> processing that occurs within the DataImportHandler is done outside of the 
> Solr JVM. The same benefits that accrue with multiple SolrJ clients can now 
> be realized with DIH without the necessity of porting code from DIH to a 
> SolrJ client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

