Yes, with the default of 10 it performs very much better. I didn't take into account that DIH uses updateDocument for adding new documents, but after thinking about the "why" I assume this is because it cannot know whether a document already exists in the index. Conclusion: using DIH together with a high segmentsPerTier value is a killer.
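For the record, a minimal sketch of what I understand the difference to be at the IndexWriter level (against the Lucene 4.10 API; the class name and index path are made up for illustration):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddVsUpdate {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(
            Version.LUCENE_4_10_4, new StandardAnalyzer(Version.LUCENE_4_10_4));
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/testindex")), iwc);

        Document doc = new Document();
        doc.add(new StringField("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1",
            Field.Store.YES));

        // Pure append: no delete is buffered, so applyDeletes has
        // nothing to do at flush time.
        writer.addDocument(doc);

        // Add-or-replace: buffers a delete-by-term for the id even if
        // the id is brand new, because IndexWriter cannot know that
        // in advance.
        writer.updateDocument(
            new Term("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1"), doc);

        writer.close();
    }
}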
One question still remains about the messages in the infoStream. I have lines saying:

BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
...
BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets=1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0

Do you know what these deleted terms and deleted queries are?
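My working guess, as a minimal sketch against the plain IndexWriter API (hedged; I have not traced Solr's exact call path, and the id value is made up):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class BufferedDeleteKinds {
    // "deleted terms" count buffered delete-by-term entries: one is
    // buffered for every updateDocument (add-or-replace) call.
    // "deleted queries" count buffered delete-by-query entries: one is
    // buffered for every deleteDocuments(Query) call.
    static void demo(IndexWriter writer, Document doc) throws IOException {
        writer.updateDocument(new Term("id", "some-id"), doc);            // -> deleted term
        writer.deleteDocuments(new TermQuery(new Term("id", "some-id"))); // -> deleted query
    }
}

Either kind sits in RAM until applyDeletes resolves it against all existing segments, which is the phase that was eating the indexing time here.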
Best regards,
Bernd

On 28.07.2016 17:34, Michael McCandless wrote:
> Hmm, your merge policy changes are dangerous: they will cause too many
> segments in the index, which makes applying deletes take longer.
>
> Can you revert that and re-test?
>
> I'm not sure why DIH is using updateDocument instead of addDocument ...
> maybe ask on the solr-user list?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Currently I use concurrent DIH but will write some SolrJ for testing,
>> or even as a replacement for DIH.
>> I don't know what's behind DIH if only documents are added.
>>
>> I haven't tried any newer release yet, but after reading LUCENE-6161
>> I really should, at least a version > 5.1, maybe before writing some
>> SolrJ.
>>
>> Yes, IndexWriterConfig is changed from the defaults:
>> <indexConfig>
>>   <maxIndexingThreads>8</maxIndexingThreads>
>>   <ramBufferSizeMB>1024</ramBufferSizeMB>
>>   <maxBufferedDocs>-1</maxBufferedDocs>
>>   <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>     <int name="maxMergeAtOnce">8</int>
>>     <int name="segmentsPerTier">100</int>
>>     <int name="maxMergedSegmentMB">512</int>
>>   </mergePolicy>
>>   <mergeFactor>8</mergeFactor>
>>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>>   <lockType>${solr.lock.type:native}</lockType>
>>   ...
>> </indexConfig>
>>
>> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>> Somewhere between 20 and 50 characters in length.
>>
>> Thanks for your help,
>> Bernd
>>
>> On 28.07.2016 15:35, Michael McCandless wrote:
>>> Hmm, not good.
>>>
>>> If you are really only adding documents, you should be using
>>> IndexWriter.addDocument, which won't buffer any deleted terms, and
>>> that method call should be a no-op. It also makes flushes more
>>> efficient, since all of your indexing buffer goes to the added
>>> documents, not buffered delete terms. Are you using updateDocument?
>>>
>>> Can you reproduce this slowness on a newer release? There have been
>>> performance issues in this method fixed in newer releases, e.g.
>>> https://issues.apache.org/jira/browse/LUCENE-6161
>>>
>>> Have you changed any IndexWriterConfig settings from the defaults?
>>>
>>> What are your unique id fields like? How many bytes in length?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
>>>> While trying to get higher indexing performance it turned out that
>>>> BufferedUpdateStreams is breaking indexing performance:
>>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>>>
>>>> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene
>>>> 4.10.4 API states:
>>>> "Determines the amount of RAM that may be used for buffering added
>>>> documents and deletions before they are flushed to the Directory.
>>>> Generally for faster indexing performance it's best to flush by RAM
>>>> usage instead of document count and use as large a RAM buffer as
>>>> you can."
>>>>
>>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>>>
>>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>>>> infos=...
>>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
>>>> took 3411845 msec
>>>>
>>>> About 56 minutes of no indexing, only applying deletes.
>>>> What is it deleting?
>>>>
>>>> The bigger the index gets, the longer this takes; currently 2.5
>>>> hours of waiting.
>>>> I'm adding 96 million docs with unique ids, no duplicates, only
>>>> adds, no deletes.
>>>>
>>>> Any suggestions for a config that _really_ goes for high-performance
>>>> indexing?
>>>>
>>>> Best regards,
>>>> Bernd

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org