Hi
I am trying to index 12MM docs faster than is currently happening in Solr
(using SolrJ). We have identified Solr's add method as the bottleneck (not
commit, which is tuned well enough through mergeFactor, ramBufferSizeMB,
and JVM RAM).
Adding 1000 docs takes approximately 25 seconds. We make sure to add and
commit in batches, and we have tried both CommonsHttpSolrServer and
EmbeddedSolrServer (assuming that removing the HTTP overhead would speed
things up), but the difference is marginal.
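For reference, below is roughly the shape of our add loop. The sketch uses
StreamingUpdateSolrServer (a bounded queue drained by background threads)
in place of CommonsHttpSolrServer, which is one variant we have tried; the
URL, queue size, thread count, and batch size are illustrative values, and
loadContent() stands in for however we actually fetch document bodies.

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Queue up to 20000 docs and drain them with 4 background threads,
        // so document building overlaps with the HTTP round trips.
        // (Queue size and thread count are guesses we would tune.)
        SolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 20000, 4);

        Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 12000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("content", loadContent(i)); // ~2K tokens on average
            // shingledContent is populated by copyField on the Solr side
            batch.add(doc);

            if (batch.size() == 1000) { // our current batch size
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit(); // single commit at the end
    }

    private static String loadContent(int i) {
        // Placeholder for however the real document bodies are fetched.
        return "...";
    }
}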
The docs being indexed have on average 20 fields, most of them indexed but
none stored. The major size contributors are two fields:
- content, and
- shingledContent (populated via copyField from content).
The length of the content field is (likely) Gaussian distributed (a few
large docs of 50-80K tokens, but the majority around 2K tokens). We use
shingledContent to support phrase queries and content for unigram queries
(following the advice in Solr Enterprise Search Server, p. 305, section
"The Solution: Shingling").
Clearly the size of the docs contributes to the slow adds (confirmed by
removing these two fields, which halved the indexing time). We have also
tried compressed=true, but that has not helped.
Any guidance on how to support our application logic (without changing the
schema too much) while speeding up indexing (from the current projected 212
days for 12MM docs) would be much appreciated.
Thank you,
Peyman