Re: BufferedUpdateStreams breaks high performance indexing

2016-08-04 Thread Bernd Fehling
After updating to version 5.5.3 it looks good now.
Thanks a lot for your help and advice.

Best regards
Bernd

On 29.07.2016 at 15:04, Michael McCandless wrote:
> The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
> when you do deleteDocuments(Term).
> 
> Deleted queries are when you delete by query, but I don't think DIH would
> be doing that unless you asked it to ... maybe a Solr user/dev knows better?
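
A minimal sketch of the IndexWriter calls behind those two counters; the field
name and values are illustrative only, not from the original setup:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

class BufferedDeleteExamples {
    static void indexingCalls(IndexWriter writer, Document doc) throws IOException {
        doc.add(new StringField("id", "doc-1", Field.Store.YES));

        // Pure add: buffers no deleted terms, so applyDeletes has nothing to do.
        writer.addDocument(doc);

        // Update = delete-by-term plus add: each call buffers one deleted term.
        writer.updateDocument(new Term("id", "doc-1"), doc);

        // Explicit delete-by-term: also counted under "deleted terms".
        writer.deleteDocuments(new Term("id", "doc-1"));

        // Delete-by-query: counted under "deleted queries" in the infostream.
        writer.deleteDocuments(new TermQuery(new Term("id", "doc-1")));
    }
}
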
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> 
>> Yes, with the default of 10 it performs much better.
>> I didn't take into account that DIH uses updateDocument for adding new
>> documents, but after thinking about the "why" I assume
>> this might be because you don't know whether a document already exists in the
>> index.
>> Conclusion: using DIH and setting segmentsPerTier to a high value is a
>> killer.
>>
>> One question still remains about messages in INFOSTREAM, I have lines
>> saying
>> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
>> deleted queries
>>bytesUsed=2313024 delGen=2265 packetCount=69
>> totBytesUsed=262526720
>> ...
>> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted
>> terms (unique count=0)
>>97142 deleted queries bytesUsed=3108576]; coalesced deletes=
>>
>>  
>> [CoalescedUpdates(termSets=1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
>> newDelCount=0
>>
>> Do you know what these deleted terms and deleted queries are?
>>
>> Best regards,
>> Bernd
>>
>>
>> On 28.07.2016 at 17:34, Michael McCandless wrote:
>>> Hmm, your merge policy changes are dangerous: that will cause too many
>>> segments in the index, which makes it longer to apply deletes.
>>>
>>> Can you revert that and re-test?
>>>
>>> I'm not sure why DIH is using updateDocument instead of addDocument ...
>>> maybe ask on the solr-user list?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
>>> bernd.fehl...@uni-bielefeld.de> wrote:
>>>
 Currently I use concurrent DIH but will write some SolrJ for testing
 or even as replacement for DIH.
 Don't know what's behind DIH if only documents are added.

 I haven't tried any newer release yet, but after reading LUCENE-6161 I really
 should, at least a version > 5.1, maybe before writing some SolrJ.


 Yes IndexWriterConfig is changed from default:
 
 8
 1024
 -1
 
   8
   100
   512
 
 8
 >>> class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
 ${solr.lock.type:native}
 ...
 
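
Assuming the bare values above map to the usual indexConfig settings
(maxIndexingThreads, ramBufferSizeMB, maxBufferedDocs, and a TieredMergePolicy
with maxMergeAtOnce / segmentsPerTier), a rough Lucene-level sketch of the same
configuration, written against the 5.x API with a placeholder analyzer, would be:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

class IndexConfigSketch {
    static IndexWriterConfig build() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());

        iwc.setRAMBufferSizeMB(1024);                                  // flush by RAM usage
        iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // -1: no flush by doc count

        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setMaxMergeAtOnce(8);
        tmp.setSegmentsPerTier(100);   // default is 10; 100 lets far more segments accumulate
        iwc.setMergePolicy(tmp);

        iwc.setMergeScheduler(new ConcurrentMergeScheduler());
        return iwc;
    }
}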

 A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
 Somewhere between 20 and 50 characters in length.

 Thanks for your help,
 Bernd


 On 28.07.2016 at 15:35, Michael McCandless wrote:
> Hmm not good.
>
> If you are really only adding documents, you should be using
> IndexWriter.addDocument, which won't buffer any deleted terms and that
> method call should be a no-op.  It also makes flushes more efficient
 since
> all of your indexing buffer goes to the added documents, not buffered
> delete terms.  Are you using updateDocument?
>
> Can you reproduce this slowness on a newer release?  There have been
> performance issues fixed in newer releases in this method, e.g
> https://issues.apache.org/jira/browse/LUCENE-6161
>
> Have you changed any IndexWriterConfig settings from defaults?
>
> What are your unique id fields like?  How many bytes in length?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> While trying to get higher performance for indexing it turned out that
>> BufferedUpdateStreams is breaking indexing performance.
>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>
>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
 4.10.4
>> API states:
>> "Determines the amount of RAM that may be used for buffering added
>> documents and deletions before they are flushed to the Directory.
>> Generally for faster indexing performance it's best to flush by RAM
>> usage instead of document count and use as large a RAM buffer as you
 can."
>>
>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>
>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes:
>> infos=...
>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes
 took
>> 3411845 msec
>>
>> About 56 minutes of no indexing, only applying deletes.
>> What is it deleting?
>>
>> If the index gets bigg

no concurrent merging?

2016-08-04 Thread Bernd Fehling
While increasing the indexing load of version 5.5.3 I see
threads where one merging thread is blocking other merging threads.
But is this concurrent merging?

Bernd

"Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State: BLOCKED
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008)
 - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) owned by 
"Lucene Merge Thread #8" t@53896
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
 Locked ownable synchronizers:  - None

"Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State: BLOCKED
 at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166)
 - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) owned by 
"Lucene Merge Thread #8" t@53896
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
 Locked ownable synchronizers:  - None

"Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State: RUNNABLE
 at java.lang.System.identityHashCode(Native Method)
 at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
 at java.util.WeakHashMap.hash(WeakHashMap.java:298)
 at java.util.WeakHashMap.put(WeakHashMap.java:449)
 at java.util.Collections$SetFromMap.add(Collections.java:5461)
 at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
 - locked <4c8b5399> (a java.util.Collections$SynchronizedSet)
 at 
org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
 at org.apache.lucene.index.FilterLeafReader.<init>(FilterLeafReader.java:306)
 at 
org.apache.lucene.uninverting.UninvertingReader.<init>(UninvertingReader.java:184)
 at 
org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52)
 at 
org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72)
 at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
 at 
org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
 at 
org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713)
 at 
org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246)
 - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream)
 at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834)
 - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
 at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792)
 - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
 at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
 at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BufferedUpdateStreams breaks high performance indexing

2016-08-04 Thread Michael McCandless
Wonderful, thanks for bringing closure!

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> After updating to version 5.5.3 it looks good now.
> Thanks a lot for your help and advise.
>
> Best regards
> Bernd
>
> On 29.07.2016 at 15:04, Michael McCandless wrote:
> > The deleted terms accumulate whenever you use updateDocument(Term, Doc),
> or
> > when you do deleteDocuments(Term).
> >
> > Deleted queries are when you delete by query, but I don't think DIH would
> > be doing that unless you asked it to ... maybe a Solr user/dev knows
> better?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> Yes, with default of 10 it performs very much better.
> >> I didn't take into count that DIH uses updateDocument for adding new
> >> documents but after thinking about the "why" I assume that
> >> this might be because you don't know if a document already exists in the
> >> index.
> >> Conclusion, using DIH and setting segmentsPerTier to a high value is a
> >> killer.
> >>
> >> One question still remains about messages in INFOSTREAM, I have lines
> >> saying
> >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345
> >> deleted queries
> >>bytesUsed=2313024 delGen=2265 packetCount=69
> >> totBytesUsed=262526720
> >> ...
> >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted
> >> terms (unique count=0)
> >>97142 deleted queries bytesUsed=3108576]; coalesced deletes=
> >>
> >>
> [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)]
> >> newDelCount=0
> >>
> >> Do you know what these deleted terms and deleted queries are?
> >>
> >> Best regards,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 17:34, Michael McCandless wrote:
> >>> Hmm, your merge policy changes are dangerous: that will cause too many
> >>> segments in the index, which makes it longer to apply deletes.
> >>>
> >>> Can you revert that and re-test?
> >>>
> >>> I'm not sure why DIH is using updateDocument instead of addDocument ...
> >>> maybe ask on the solr-user list?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> >>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
>  Currently I use concurrent DIH but will write some SolrJ for testing
>  or even as replacement for DIH.
>  Don't know whats behind DIH if only documents are added.
> 
>  Not tried any newer release yet, but after reading LUCENE-6161 I
> really
>  should.
>  At least a version > 5.1
>  May be before writing some SolrJ.
> 
> 
>  Yes IndexWriterConfig is changed from default:
>  
>  8
>  1024
>  -1
>  
>    8
>    100
>    512
>  
>  8
>    class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>  ${solr.lock.type:native}
>  ...
>  
> 
>  A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
>  Somewhere between 20 and 50 characters in length.
> 
>  Thanks for your help,
>  Bernd
> 
> 
>  On 28.07.2016 at 15:35, Michael McCandless wrote:
> > Hmm not good.
> >
> > If you are really only adding documents, you should be using
> > IndexWriter.addDocument, which won't buffer any deleted terms and
> that
> > method call should be a no-op.  It also makes flushes more efficient
>  since
> > all of your indexing buffer goes to the added documents, not buffered
> > delete terms.  Are you using updateDocument?
> >
> > Can you reproduce this slowness on a newer release?  There have been
> > performance issues fixed in newer releases in this method, e.g
> > https://issues.apache.org/jira/browse/LUCENE-6161
> >
> > Have you changed any IndexWriterConfig settings from defaults?
> >
> > What are your unique id fields like?  How many bytes in length?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> While trying to get higher performance for indexing it turned out
> that
> >> BufferedUpdateStreams is breaking indexing performance.
> >> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>
> >> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene
>  4.10.4
> >> API states:
> >> "Determines the amount of RAM that may be used for buffering added
> >> documents and deletions before they are flushed to the Directory.
> >> Generally for faster indexing performance it's best to flush by RAM
> >> usage instead of document count and use as large a R

Re: no concurrent merging?

2016-08-04 Thread Michael McCandless
Lucene's merging is concurrent, but Solr unfortunately uses
UninvertingReader on each DBQ ... I'm not sure why.  I think you should ask
on the solr-user list?

Or maybe try to change your deletes to be by Term instead of Query?
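
For illustration, a SolrJ sketch of the two paths; the URL, core name, and the
commented-out query string are placeholders, and the id is just an example
unique key:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

class DeleteByTermNotByQuery {
    public static void main(String[] args) throws Exception {
        // URL and core name are placeholders.
        try (SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/base")) {
            // Delete by unique key: reaches Lucene as a plain term delete.
            solr.deleteById("ftoxfordilej:ar.1770.x.x.13.x.x.u1");

            // Delete by query: Solr wraps each one in DeleteByQueryWrapper, which opens
            // an UninvertingReader while deletes are applied (the stack trace above).
            // solr.deleteByQuery("collection:ftoxfordilej");   // illustrative field/value only

            solr.commit();
        }
    }
}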

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> While increasing the indexing load of version 5.5.3 I see
> threads where one merging thread is blocking other merging threads.
> But is this concurrent merging?
>
> Bernd
>
> "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State:
> BLOCKED
> at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008)
>  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> owned by "Lucene Merge Thread #8" t@53896
>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>  Locked ownable synchronizers:  - None
>
> "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State:
> BLOCKED
>  at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166)
>  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> owned by "Lucene Merge Thread #8" t@53896
>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>  Locked ownable synchronizers:  - None
>
> "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State:
> RUNNABLE
>  at java.lang.System.identityHashCode(Native Method)
>  at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
>  at java.util.WeakHashMap.hash(WeakHashMap.java:298)
>  at java.util.WeakHashMap.put(WeakHashMap.java:449)
>  at java.util.Collections$SetFromMap.add(Collections.java:5461)
>  at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
>  - locked <4c8b5399> (a java.util.Collections$SynchronizedSet)
>  at
> org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
>  at
> org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306)
>  at
> org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184)
>  at
> org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52)
>  at
> org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72)
>  at
> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
>  at
> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
>  at
> org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713)
>  at
> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246)
>  - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream)
>  at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834)
>  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>  at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792)
>  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>  at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: no concurrent merging?

2016-08-04 Thread Mikhail Khludnev
Hello,
There is https://issues.apache.org/jira/browse/LUCENE-7049


On Thu, Aug 4, 2016 at 4:35 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Lucene's merging is concurrent, but Solr unfortunately uses
> UninvertingReader on each DBQ ... I'm not sure why.  I think you should ask
> on the solr-user list?
>
> Or maybe try to change your deletes to be by Term instead of Query?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
> > While increasing the indexing load of version 5.5.3 I see
> > threads where one merging thread is blocking other merging threads.
> > But is this concurrent merging?
> >
> > Bernd
> >
> > "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State:
> > BLOCKED
> > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008)
> >  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> > owned by "Lucene Merge Thread #8" t@53896
> >  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
> >  Locked ownable synchronizers:  - None
> >
> > "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State:
> > BLOCKED
> >  at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166)
> >  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> > owned by "Lucene Merge Thread #8" t@53896
> >  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
> >  Locked ownable synchronizers:  - None
> >
> > "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State:
> > RUNNABLE
> >  at java.lang.System.identityHashCode(Native Method)
> >  at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
> >  at java.util.WeakHashMap.hash(WeakHashMap.java:298)
> >  at java.util.WeakHashMap.put(WeakHashMap.java:449)
> >  at java.util.Collections$SetFromMap.add(Collections.java:5461)
> >  at
> java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
> >  - locked <4c8b5399> (a java.util.Collections$SynchronizedSet)
> >  at
> >
> org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
> >  at
> >
> org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306)
> >  at
> >
> org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184)
> >  at
> >
> org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52)
> >  at
> >
> org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72)
> >  at
> >
> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
> >  at
> >
> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
> >  at
> >
> org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713)
> >  at
> >
> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246)
> >  - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream)
> >  at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834)
> >  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> >  at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792)
> >  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
> >  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
> >  at
> >
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: no concurrent merging?

2016-08-04 Thread Bernd Fehling
Yes, exactly, that's it.
But is it a Lucene or a Solr problem?

Should Solr use a different reader for DBQ, or can Lucene
do something to solve this, since it is reported as a
Lucene issue?

Regards
Bernd


On 04.08.2016 at 16:02, Mikhail Khludnev wrote:
> Hello,
> There is https://issues.apache.org/jira/browse/LUCENE-7049
> 
> 
> On Thu, Aug 4, 2016 at 4:35 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
> 
>> Lucene's merging is concurrent, but Solr unfortunately uses
>> UninvertingReader on each DBQ ... I'm not sure why.  I think you should ask
>> on the solr-user list?
>>
>> Or maybe try to change your deletes to be by Term instead of Query?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling <
>> bernd.fehl...@uni-bielefeld.de> wrote:
>>
>>> While increasing the indexing load of version 5.5.3 I see
>>> threads where one merging thread is blocking other merging threads.
>>> But is this concurrent merging?
>>>
>>> Bernd
>>>
>>> "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State:
>>> BLOCKED
>>> at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008)
>>>  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>>> owned by "Lucene Merge Thread #8" t@53896
>>>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>>>  Locked ownable synchronizers:  - None
>>>
>>> "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State:
>>> BLOCKED
>>>  at
>> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166)
>>>  - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>>> owned by "Lucene Merge Thread #8" t@53896
>>>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>>>  Locked ownable synchronizers:  - None
>>>
>>> "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State:
>>> RUNNABLE
>>>  at java.lang.System.identityHashCode(Native Method)
>>>  at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
>>>  at java.util.WeakHashMap.hash(WeakHashMap.java:298)
>>>  at java.util.WeakHashMap.put(WeakHashMap.java:449)
>>>  at java.util.Collections$SetFromMap.add(Collections.java:5461)
>>>  at
>> java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
>>>  - locked <4c8b5399> (a java.util.Collections$SynchronizedSet)
>>>  at
>>>
>> org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
>>>  at
>>>
>> org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306)
>>>  at
>>>
>> org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184)
>>>  at
>>>
>> org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52)
>>>  at
>>>
>> org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72)
>>>  at
>>>
>> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
>>>  at
>>>
>> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
>>>  at
>>>
>> org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713)
>>>  at
>>>
>> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246)
>>>  - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream)
>>>  at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834)
>>>  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>>>  at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792)
>>>  - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
>>>  at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>>>  at
>>>
>> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
> 
> 
> 

-- 
*************************************************************
Bernd Fehling                  Bielefeld University Library
Dipl.-Inform. (FH)             LibTec - Library Technology
Universitätsstr. 25            and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060          bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

Re: Why we need org.apache.lucene.codecs.Codec

2016-08-04 Thread Greg Bowyer
Codecs are loaded with the java service loader interface. That file is
the hook used to tell the service loader that this jar implements Codec.

Lucene internally calls service loader and asks what codecs are there.
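
As a rough illustration (the class and package names are made up), the pieces
involved in shipping your own codec are:

package com.example;   // illustrative package

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;

// A do-nothing custom codec: it only registers its own name and delegates
// everything to whatever the current default codec is.
public final class MyCodec extends FilterCodec {
    public MyCodec() {
        super("MyCodec", Codec.getDefault());
    }
}

// For ServiceLoader to find it, the same jar must contain a resource named
//   META-INF/services/org.apache.lucene.codecs.Codec
// whose content is one fully qualified class name per line, e.g.
//   com.example.MyCodec
// Without that entry, Codec.forName("MyCodec") cannot resolve the class.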

On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote:
> I don't understand why we need to add the custom codec name to this file
> 
> Thanks & Regards
> Aravinth
> 
> On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami <
> aravinththangas...@gmail.com> wrote:
> 
> > Hi all,
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why we need org.apache.lucene.codecs.Codec

2016-08-04 Thread Aravinth T
I understand that; my question is different: why are we loading it with SPI,
and why are we explicitly controlling the loading of Codecs?

 On Thu, 04 Aug 2016 20:39:46 +0530 Greg Bowyer wrote 

Codecs are loaded with the java service loader interface. That file is
the hook used to tell the service loader that this jar implements Codec.

Lucene internally calls service loader and asks what codecs are there.

On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote:
> I don't understand why we need to add custom codec name in this file
>
> Thanks & Regards
> Aravinth
>
> On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami <
> aravinththangas...@gmail.com> wrote:
>
> > Hi all,
> >
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


Re: Why we need org.apache.lucene.codecs.Codec

2016-08-04 Thread Greg Bowyer
Not quite sure what you mean; Lucene needs some way to load a codec, and
given parts of an index written with different codecs it would need
to load and select the right codec at the right time.

Consider, for example, the upgrade path. Let's say you have segments
written with a 5.x codec and we in-place upgrade to 6.x; Lucene is going to
need to know how to load up the codec for both 5.x and 6.x.

On Thu, Aug 4, 2016, at 09:03 AM, Aravinth T wrote:
> I understand that, my question is different why we are loading it with
> SPI,
> 
> why we explicitly controlling the loading of Codecs 
> 
>  On Thu, 04 Aug 2016 20:39:46 +0530 Greg Bowyer wrote 
> 
> Codecs are loaded with the java service loader interface. That file is
> the hook used to tell the service loader that this jar implements Codec.
> 
> Lucene internally calls service loader and asks what codecs are there.
> 
> On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote:
> > I don't understand why we need to add custom codec name in this file
> >
> > Thanks & Regards
> > Aravinth
> >
> > On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami <
> > aravinththangas...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > >
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why we need org.apache.lucene.codecs.Codec

2016-08-04 Thread Uwe Schindler
Hi,

The Codec class is the abstract base class for all index codecs. The 
implementation is loaded via SPI from the classpath. To understand how this works, 
read the API docs of Java's ServiceLoader, which describe the process.
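
A small sketch of the lookup side, assuming a 5.5.x setup where "Lucene54" is
the default codec's SPI name; the analyzer is a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriterConfig;

class CodecLookupSketch {
    static IndexWriterConfig withExplicitCodec() {
        // Codec.forName consults the same SPI registry that the services file feeds.
        Codec codec = Codec.forName("Lucene54");

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setCodec(codec);   // new segments record this codec's name in their segment info
        return iwc;
    }
}

When a segment is opened again later, Lucene reads the codec name stored with
the segment and resolves it through Codec.forName in the same way, which is why
every codec that wrote a still-live segment must stay resolvable on the classpath.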

Uwe

On 4 August 2016 at 17:09:46 CEST, Greg Bowyer wrote:
>Codecs are loaded with the java service loader interface. That file is
>the hook used to tell the service loader that this jar implements
>Codec.
>
>Lucene internally calls service loader and asks what codecs are there.
>
>On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote:
>> I don't understand why we need to add custom codec name in this file
>> 
>> Thanks & Regards
>> Aravinth
>> 
>> On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami <
>> aravinththangas...@gmail.com> wrote:
>> 
>> > Hi all,
>> >
>> >
>
>-
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Dubious error message?

2016-08-04 Thread Trejkaz
Trying to add a document, someone saw:

java.lang.IllegalArgumentException: Document contains at least one
immense term in field="bcc-address" (whose UTF8 encoding is longer
than the max length 32766), all of which were skipped.  Please correct
the analyzer to not produce such terms.  The prefix of the first
immense term is: '[00, --omitted--]...', original message: bytes can
be at most 32766 in length; got 115597

Question 1: It says the bytes are being skipped, but to me "skipped"
means it's just going to continue, yet I get this exception. Is that
intentional?

Question 2: Can we turn this check off?

Question 2.1: Why limit in the first place? Every time I have ever
seen someone introduce a limit, it has only been a matter of time
until someone hits it, no matter how improbable it seemed when it was
put in.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dubious error message?

2016-08-04 Thread Erick Erickson
Question 2: Not that I know of

Question 2.1. It's actually pretty difficult to understand why a single _term_
can be over 32K and still make sense. This is not to say that a
single _text_ field can't be over 32K; each term within that field
is (usually) much less than that.

Do you have a real-world use-case where you have a 115K term
that can _only_ be matched by searching for exactly that
sequence of 115K characters? Not substrings. Not wildcards. A
"string" type (as opposed to anything based on solr.Textfield).

As far as the error message is concerned, that does seem somewhat opaque.
Care to raise a JIRA on it (and, if you're really ambitious attach a patch)?

Best,
Erick

On Thu, Aug 4, 2016 at 8:20 PM, Trejkaz  wrote:
> Trying to add a document, someone saw:
>
> java.lang.IllegalArgumentException: Document contains at least one
> immense term in field="bcc-address" (whose UTF8 encoding is longer
> than the max length 32766), all of which were skipped.  Please correct
> the analyzer to not produce such terms.  The prefix of the first
> immense term is: '[00, --omitted--]...', original message: bytes can
> be at most 32766 in length; got 115597
>
> Question 1: It says the bytes are being skipped, but to me "skipped"
> means it's just going to continue, yet I get this exception. Is that
> intentional?
>
> Question 2: Can we turn this check off?
>
> Question 2.1: Why limit in the first place? Every time I have ever
> seen someone introduce a limit, it has only been a matter of time
> until someone hits it, no matter how improbable it seemed when it was
> put in.
>
> TX
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Dubious error message?

2016-08-04 Thread Trejkaz
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson  wrote:
> Question 2: Not that I know of
>
> Question 2.1. It's actually pretty difficult to understand why a single _term_
> can be over 32K and still make sense. This is not to say that a
> single _text_ field can't be over 32K, each term within that field
> is (usually) much less than that.
>
> Do you have a real-world use-case where you have a 115K term
> that can _only_ be matched by searching for exactly that
> sequence of 115K characters? Not substrings. Not wildcards. A
> "string" type (as opposed to anything based on solr.Textfield).

This particular field is used to store unique addresses, and for
precision reasons we wanted to search for addresses without tokenising
them, because if you tokenised them, b...@example.com could accidentally
match b...@example.com.au, even though they're two different people. It
also makes statistics faster to calculate.

Now, addresses in SMTP email are fairly short, limited to something
like 254 characters, but sometimes you get data that violates the
standard, and we store more than just that one kind of address, and
maybe one of the other sorts can be longer.

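For what it's worth, a minimal sketch of one way to guard such a field; the
field name matches the error above, but the skip-or-truncate choice is only an
illustration, not what our application actually does:

import java.nio.charset.StandardCharsets;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

class AddressFieldSketch {
    // Hard per-term limit in UTF-8 bytes (32766), exposed as a constant.
    static final int MAX_TERM_BYTES = IndexWriter.MAX_TERM_LENGTH;

    static void addAddress(Document doc, String address) {
        byte[] utf8 = address.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > MAX_TERM_BYTES) {
            // Over-long value: skip it (or truncate / hash it) up front instead of
            // letting IndexWriter reject the whole document with IllegalArgumentException.
            return;
        }
        // Untokenised exact-match field, so one address never matches another by prefix.
        doc.add(new StringField("bcc-address", address, Field.Store.NO));
    }
}
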
In this situation, it isn't clear whether you can truncate the data,
because if you truncate it, now two addresses are considered equal
when they're not the same string. But then again, if the old version
of Lucene was already truncating it, people might be fine with it
being truncated in the new version. But if they didn't know that,
there would definitely be someone who objects.

So I'm not really saying that the term "makes sense" - I'm just saying
we encountered it in real-world data, and an error occurred. Someone
then complained about the error.

> As far as the error message is concerned, that does seem somewhat opaque.
> Care to raise a JIRA on it (and, if you're really ambitious attach a patch)?

I'll see. :)

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org