Ted Sullivan created SOLR-7188:
----------------------------------

             Summary: Run Data Import Handler processes in a SolrJ client
                 Key: SOLR-7188
                 URL: https://issues.apache.org/jira/browse/SOLR-7188
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
            Reporter: Ted Sullivan


Adds a DataImportHandlerClient class that wraps an EmbeddedSolrServer, and a 
DIHCloudWriter implementation of DIHWriter that sends documents to a remote 
SolrCloud cluster. This lets existing DIH processes run outside of the Solr 
JVM, which should enable better scalability.
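
A rough sketch of the writer idea, in plain Java with no Solr dependency. The 
names below (DocSink, BatchingRemoteWriter, WriterSketch) are hypothetical 
stand-ins and not the actual DIHWriter or DIHCloudWriter signatures. The 
sender callback stands in for whatever SolrJ client would ship documents to 
the cluster. The sketch only illustrates that the import pipeline talks to a 
writer abstraction, so a local writer can be swapped for one that batches 
documents to a remote cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class WriterSketch {

    // Hypothetical stand-in for a DIH-style writer; NOT the real DIHWriter interface.
    interface DocSink {
        void upload(Map<String, Object> doc);
        void commit();
    }

    // Buffers documents and hands full batches to a sender callback.
    // Assumption: a real client would pass the batch to a SolrJ cloud client here.
    static final class BatchingRemoteWriter implements DocSink {
        private final int batchSize;
        private final Consumer<List<Map<String, Object>>> sender;
        private final List<Map<String, Object>> buffer = new ArrayList<>();

        BatchingRemoteWriter(int batchSize, Consumer<List<Map<String, Object>>> sender) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.batchSize = batchSize;
            this.sender = sender;
        }

        @Override
        public void upload(Map<String, Object> doc) {
            buffer.add(doc);
            if (buffer.size() >= batchSize) {
                flush();
            }
        }

        @Override
        public void commit() {
            flush(); // push any partial batch before committing
        }

        private void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sender.accept(new ArrayList<>(buffer)); // copy so the sender owns its batch
            buffer.clear();
        }
    }

    // Feeds n synthetic documents through the writer and returns the batch sizes sent.
    static List<Integer> run(int n, int batchSize) {
        List<Integer> sizes = new ArrayList<>();
        DocSink writer = new BatchingRemoteWriter(batchSize, batch -> sizes.add(batch.size()));
        for (int i = 0; i < n; i++) {
            writer.upload(Collections.<String, Object>singletonMap("id", "doc-" + i));
        }
        writer.commit();
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(run(5, 2)); // prints [2, 2, 1]
    }
}
```

Because the pipeline only sees the DocSink abstraction in this sketch, the 
same import logic could feed either an in-process index or a remote cluster.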

The current architecture of DIH imposes several restrictions on scalability. 
First, the DIH runs in the same process space as Solr itself and competes for 
resources (CPU and memory) with normal Solr processes devoted to indexing and 
querying. Second, the DIH cannot be multi-threaded, which means that 
parallelizing it requires splitting the processing among nodes in a SolrCloud 
cluster. Since the incoming data is sent through an UpdateRequestProcessor 
chain (via the SolrWriter implementation of DIHWriter), additional routing is 
done internally: once the ID hash is computed, the documents are forwarded to 
the current shard leader nodes. This causes additional network traffic within 
the SolrCloud cluster. Scaling the DIH is limited by the number of nodes in 
the cluster, and any heavy-duty processing due to entity processors or 
transformation elements shares the processing resources of Solr itself. This 
is known to be a source of bottlenecks in Solr installations (SolrCloud or 
Master-Slave) that use DIH.
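
To illustrate the forwarding cost described above, here is a small sketch that 
counts how many documents a single importing node would have to pass on to 
other shards. It is an illustration only: String.hashCode is used as a 
stand-in for the cluster's ID hash (Solr's real router uses a different hash 
function), and the class and method names are hypothetical.

```java
final class RoutingSketch {

    // Stand-in for the cluster's ID hash (assumption: not Solr's actual hash function).
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    // Counts documents that a node hosting `localShard` would have to forward elsewhere.
    static int countForwarded(int numDocs, int numShards, int localShard) {
        int forwarded = 0;
        for (int i = 0; i < numDocs; i++) {
            if (shardFor("doc-" + i, numShards) != localShard) {
                forwarded++;
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        int docs = 10_000;
        int shards = 4;
        int forwarded = countForwarded(docs, shards, 0);
        System.out.printf("%d of %d documents leave the importing node%n", forwarded, docs);
    }
}
```

With an evenly distributed hash, roughly (N-1)/N of the documents imported on 
one node of an N-shard cluster belong to some other shard, which is the 
internal traffic that the paragraph above refers to.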

The DataImportHandlerClient uses native DIH functionality (DataImporter, etc.) 
but can be run externally to Solr. This means that as many processes as are 
needed to achieve the necessary performance at scale can be added, and the 
processing that occurs within the DataImportHandler is done outside of the 
Solr JVM. The same benefits that accrue with multiple SolrJ clients can now be 
realized with DIH without the necessity of porting code from DIH to a SolrJ 
client.
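
As a sketch of how work might be divided among several external importers, 
the following splits a range of source rows into contiguous slices and runs 
one task per slice. Everything here is hypothetical: the threads stand in for 
separate client processes, importSlice is a placeholder for a real import 
run, and none of the names come from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelImportSketch {

    // Splits [0, total) into `workers` contiguous half-open ranges {start, end}.
    static List<long[]> partition(long total, int workers) {
        List<long[]> ranges = new ArrayList<>();
        long base = total / workers;
        long extra = total % workers;
        long start = 0;
        for (int w = 0; w < workers; w++) {
            long size = base + (w < extra ? 1 : 0); // spread the remainder over the first slices
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    // Placeholder for one importer run over a slice; returns the number of rows "imported".
    static long importSlice(long start, long end) {
        return end - start;
    }

    // Runs one task per slice and returns the total number of rows handled.
    static long runAll(long total, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] r : partition(total, workers)) {
                Callable<Long> task = () -> importSlice(r[0], r[1]);
                results.add(pool.submit(task));
            }
            long sum = 0;
            for (Future<Long> f : results) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for importers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("an importer task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(1_000, 3)); // prints 1000
    }
}
```

In practice each slice would correspond to its own client process with its 
own DIH configuration, so adding capacity means starting more clients rather 
than adding Solr nodes.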



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

