Re: BufferedUpdateStreams breaks high performance indexing
After updating to version 5.5.3 it looks good now. Thanks a lot for your help and advise. Best regards Bernd Am 29.07.2016 um 15:04 schrieb Michael McCandless: > The deleted terms accumulate whenever you use updateDocument(Term, Doc), or > when you do deleteDocuments(Term). > > Deleted queries are when you delete by query, but I don't think DIH would > be doing that unless you asked it to ... maybe a Solr user/dev knows better? > > Mike McCandless > > http://blog.mikemccandless.com > > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> wrote: > >> Yes, with default of 10 it performs very much better. >> I didn't take into count that DIH uses updateDocument for adding new >> documents but after thinking about the "why" I assume that >> this might be because you don't know if a document already exists in the >> index. >> Conclusion, using DIH and setting segmentsPerTier to a high value is a >> killer. >> >> One question still remains about messages in INFOSTREAM, I have lines >> saying >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 >> deleted queries >>bytesUsed=2313024 delGen=2265 packetCount=69 >> totBytesUsed=262526720 >> ... >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted >> terms (unique count=0) >>97142 deleted queries bytesUsed=3108576]; coalesced deletes= >> >> >> [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] >> newDelCount=0 >> >> Do you know what these deleted terms and deleted queries are? >> >> Best regards, >> Bernd >> >> >> Am 28.07.2016 um 17:34 schrieb Michael McCandless: >>> Hmm, your merge policy changes are dangerous: that will cause too many >>> segments in the index, which makes it longer to apply deletes. >>> >>> Can you revert that and re-test? >>> >>> I'm not sure why DIH is using updateDocument instead of addDocument ... >>> maybe ask on the solr-user list? >>> >>> Mike McCandless >>> >>> http://blog.mikemccandless.com >>> >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling < >>> bernd.fehl...@uni-bielefeld.de> wrote: >>> Currently I use concurrent DIH but will write some SolrJ for testing or even as replacement for DIH. Don't know whats behind DIH if only documents are added. Not tried any newer release yet, but after reading LUCENE-6161 I really should. At least a version > 5.1 May be before writing some SolrJ. Yes IndexWriterConfig is changed from default: 8 1024 -1 8 100 512 8 >>> class="org.apache.lucene.index.ConcurrentMergeScheduler"/> ${solr.lock.type:native} ... A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1" Somewhere between 20 and 50 characters in length. Thanks for your help, Bernd Am 28.07.2016 um 15:35 schrieb Michael McCandless: > Hmm not good. > > If you are really only adding documents, you should be using > IndexWriter.addDocument, which won't buffer any deleted terms and that > method call should be a no-op. It also makes flushes more efficient since > all of your indexing buffer goes to the added documents, not buffered > delete terms. Are you using updateDocument? > > Can you reproduce this slowness on a newer release? There have been > performance issues fixed in newer releases in this method, e.g > https://issues.apache.org/jira/browse/LUCENE-6161 > > Have you changed any IndexWriterConfig settings from defaults? > > What are your unique id fields like? How many bytes in length? 
> > Mike McCandless > > http://blog.mikemccandless.com > > On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> wrote: > >> While trying to get higher performance for indexing it turned out that >> BufferedUpdateStreams is breaking indexing performance. >> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...) >> >> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4 >> API states: >> "Determines the amount of RAM that may be used for buffering added >> documents and deletions before they are flushed to the Directory. >> Generally for faster indexing performance it's best to flush by RAM >> usage instead of document count and use as large a RAM buffer as you can." >> >> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1. >> >> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: >> infos=... >> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took >> 3411845 msec >> >> About 56 minutes no indexing and only applying deletes. >> What is it deleting? >> >> If the index gets bigg
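To make the addDocument vs. updateDocument distinction in this thread concrete, here is a minimal sketch against the Lucene 5.x API; the index path, class name, and "body" field are illustrative, only the unique id value is taken from the thread. addDocument purely appends, while updateDocument(Term, doc) also buffers a delete-by-term for the given id on every call (which is what DIH effectively does), and those buffered terms are what BufferedUpdatesStream.applyDeletesAndUpdates later has to resolve.

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddVsUpdateSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/test-index")); // illustrative path
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setRAMBufferSizeMB(1024);                                   // flush by RAM usage ...
        iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);   // ... not by doc count
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            Document doc = new Document();
            doc.add(new StringField("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1", Field.Store.YES));
            doc.add(new TextField("body", "some text", Field.Store.NO));

            // Pure append: nothing is buffered for later delete resolution.
            writer.addDocument(doc);

            // Upsert: buffers a delete-by-term for "id" on every call, even when the
            // document is new; these buffered terms are what applyDeletesAndUpdates
            // later resolves against every segment.
            writer.updateDocument(new Term("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1"), doc);

            writer.commit();
        }
    }
}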
no concurrent merging?
While increasing the indexing load of version 5.5.3 I see threads where one merging thread is blocking other merging threads.
But is this concurrent merging?

Bernd

"Lucene Merge Thread #6" - Thread t@40280
java.lang.Thread.State: BLOCKED
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008)
- waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) owned by "Lucene Merge Thread #8" t@53896
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
Locked ownable synchronizers: - None

"Lucene Merge Thread #7" - Thread t@40281
java.lang.Thread.State: BLOCKED
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166)
- waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) owned by "Lucene Merge Thread #8" t@53896
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
Locked ownable synchronizers: - None

"Lucene Merge Thread #8" - Thread t@53896
java.lang.Thread.State: RUNNABLE
at java.lang.System.identityHashCode(Native Method)
at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302)
at java.util.WeakHashMap.hash(WeakHashMap.java:298)
at java.util.WeakHashMap.put(WeakHashMap.java:449)
at java.util.Collections$SetFromMap.add(Collections.java:5461)
at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
- locked <4c8b5399> (a java.util.Collections$SynchronizedSet)
at org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138)
at org.apache.lucene.index.FilterLeafReader.<init>(FilterLeafReader.java:306)
at org.apache.lucene.uninverting.UninvertingReader.<init>(UninvertingReader.java:184)
at org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52)
at org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72)
at org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904)
at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246)
- locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834)
- locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792)
- locked <6d75db> (a org.apache.solr.update.SolrIndexWriter)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: BufferedUpdateStreams breaks high performance indexing
Wonderful, thanks for bringing closure! Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2016 at 3:14 AM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > After updating to version 5.5.3 it looks good now. > Thanks a lot for your help and advise. > > Best regards > Bernd > > Am 29.07.2016 um 15:04 schrieb Michael McCandless: > > The deleted terms accumulate whenever you use updateDocument(Term, Doc), > or > > when you do deleteDocuments(Term). > > > > Deleted queries are when you delete by query, but I don't think DIH would > > be doing that unless you asked it to ... maybe a Solr user/dev knows > better? > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling < > > bernd.fehl...@uni-bielefeld.de> wrote: > > > >> Yes, with default of 10 it performs very much better. > >> I didn't take into count that DIH uses updateDocument for adding new > >> documents but after thinking about the "why" I assume that > >> this might be because you don't know if a document already exists in the > >> index. > >> Conclusion, using DIH and setting segmentsPerTier to a high value is a > >> killer. > >> > >> One question still remains about messages in INFOSTREAM, I have lines > >> saying > >> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 > >> deleted queries > >>bytesUsed=2313024 delGen=2265 packetCount=69 > >> totBytesUsed=262526720 > >> ... > >> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted > >> terms (unique count=0) > >>97142 deleted queries bytesUsed=3108576]; coalesced deletes= > >> > >> > [CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] > >> newDelCount=0 > >> > >> Do you know what these deleted terms and deleted queries are? > >> > >> Best regards, > >> Bernd > >> > >> > >> Am 28.07.2016 um 17:34 schrieb Michael McCandless: > >>> Hmm, your merge policy changes are dangerous: that will cause too many > >>> segments in the index, which makes it longer to apply deletes. > >>> > >>> Can you revert that and re-test? > >>> > >>> I'm not sure why DIH is using updateDocument instead of addDocument ... > >>> maybe ask on the solr-user list? > >>> > >>> Mike McCandless > >>> > >>> http://blog.mikemccandless.com > >>> > >>> On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling < > >>> bernd.fehl...@uni-bielefeld.de> wrote: > >>> > Currently I use concurrent DIH but will write some SolrJ for testing > or even as replacement for DIH. > Don't know whats behind DIH if only documents are added. > > Not tried any newer release yet, but after reading LUCENE-6161 I > really > should. > At least a version > 5.1 > May be before writing some SolrJ. > > > Yes IndexWriterConfig is changed from default: > > 8 > 1024 > -1 > > 8 > 100 > 512 > > 8 > class="org.apache.lucene.index.ConcurrentMergeScheduler"/> > ${solr.lock.type:native} > ... > > > A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1" > Somewhere between 20 and 50 characters in length. > > Thanks for your help, > Bernd > > > Am 28.07.2016 um 15:35 schrieb Michael McCandless: > > Hmm not good. > > > > If you are really only adding documents, you should be using > > IndexWriter.addDocument, which won't buffer any deleted terms and > that > > method call should be a no-op. It also makes flushes more efficient > since > > all of your indexing buffer goes to the added documents, not buffered > > delete terms. Are you using updateDocument? > > > > Can you reproduce this slowness on a newer release? 
There have been > > performance issues fixed in newer releases in this method, e.g > > https://issues.apache.org/jira/browse/LUCENE-6161 > > > > Have you changed any IndexWriterConfig settings from defaults? > > > > What are your unique id fields like? How many bytes in length? > > > > Mike McCandless > > > > http://blog.mikemccandless.com > > > > On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling < > > bernd.fehl...@uni-bielefeld.de> wrote: > > > >> While trying to get higher performance for indexing it turned out > that > >> BufferedUpdateStreams is breaking indexing performance. > >> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...) > >> > >> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene > 4.10.4 > >> API states: > >> "Determines the amount of RAM that may be used for buffering added > >> documents and deletions before they are flushed to the Directory. > >> Generally for faster indexing performance it's best to flush by RAM > >> usage instead of document count and use as large a R
Re: no concurrent merging?
Lucene's merging is concurrent, but Solr unfortunately uses UninvertingReader on each DBQ ... I'm not sure why. I think you should ask on the solr-user list? Or maybe try to change your deletes to be by Term instead of Query? Mike McCandless http://blog.mikemccandless.com On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > While increasing the indexing load of version 5.5.3 I see > threads where one merging thread is blocking other merging threads. > But is this concurrent merging? > > Bernd > > "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State: > BLOCKED > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008) > - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) > owned by "Lucene Merge Thread #8" t@53896 > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > Locked ownable synchronizers: - None > > "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State: > BLOCKED > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166) > - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) > owned by "Lucene Merge Thread #8" t@53896 > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > Locked ownable synchronizers: - None > > "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State: > RUNNABLE > at java.lang.System.identityHashCode(Native Method) > at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302) > at java.util.WeakHashMap.hash(WeakHashMap.java:298) > at java.util.WeakHashMap.put(WeakHashMap.java:449) > at java.util.Collections$SetFromMap.add(Collections.java:5461) > at java.util.Collections$SynchronizedCollection.add(Collections.java:2035) > - locked <4c8b5399> (a java.util.Collections$SynchronizedSet) > at > org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138) > at > org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306) > at > org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184) > at > org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52) > at > org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72) > at > org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904) > at > org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887) > at > org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713) > at > org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246) > - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream) > at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834) > - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter) > at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792) > - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646) > at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > 
at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
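A rough sketch of the two delete paths Mike is contrasting (Lucene-level API; the "id" field name and class are illustrative, not from the thread): a delete-by-term is resolved directly against the terms dictionary when deletes are applied, while every buffered delete-by-query is turned into a Weight and executed per segment, which is the path where Solr's DeleteByQueryWrapper and UninvertingReader show up in the trace above.

import java.io.IOException;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class DeleteSketch {
    // 'writer' is an already-opened IndexWriter; "id" is an illustrative unique-key field.
    static void deleteById(IndexWriter writer, String id) throws IOException {
        // Delete by Term: resolved against the terms dictionary when deletes are applied;
        // no Weight or searcher is built for it.
        writer.deleteDocuments(new Term("id", id));
    }

    static void deleteByIdQuery(IndexWriter writer, String id) throws IOException {
        // Delete by Query: every buffered query is rewritten into a Weight and executed
        // per segment in BufferedUpdatesStream.applyQueryDeletes -- the path that Solr's
        // DeleteByQueryWrapper/UninvertingReader appears in above.
        writer.deleteDocuments(new TermQuery(new Term("id", id)));
    }
}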
Re: no concurrent merging?
Hello, There is https://issues.apache.org/jira/browse/LUCENE-7049 On Thu, Aug 4, 2016 at 4:35 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Lucene's merging is concurrent, but Solr unfortunately uses > UninvertingReader on each DBQ ... I'm not sure why. I think you should ask > on the solr-user list? > > Or maybe try to change your deletes to be by Term instead of Query? > > Mike McCandless > > http://blog.mikemccandless.com > > On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> wrote: > > > While increasing the indexing load of version 5.5.3 I see > > threads where one merging thread is blocking other merging threads. > > But is this concurrent merging? > > > > Bernd > > > > "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State: > > BLOCKED > > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008) > > - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) > > owned by "Lucene Merge Thread #8" t@53896 > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > > Locked ownable synchronizers: - None > > > > "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State: > > BLOCKED > > at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166) > > - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) > > owned by "Lucene Merge Thread #8" t@53896 > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > > Locked ownable synchronizers: - None > > > > "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State: > > RUNNABLE > > at java.lang.System.identityHashCode(Native Method) > > at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302) > > at java.util.WeakHashMap.hash(WeakHashMap.java:298) > > at java.util.WeakHashMap.put(WeakHashMap.java:449) > > at java.util.Collections$SetFromMap.add(Collections.java:5461) > > at > java.util.Collections$SynchronizedCollection.add(Collections.java:2035) > > - locked <4c8b5399> (a java.util.Collections$SynchronizedSet) > > at > > > org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138) > > at > > > org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306) > > at > > > org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184) > > at > > > org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52) > > at > > > org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72) > > at > > > org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904) > > at > > > org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887) > > at > > > org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713) > > at > > > org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246) > > - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream) > > at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834) > > - locked <6d75db> (a 
org.apache.solr.update.SolrIndexWriter) > > at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792) > > - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter) > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) > > at > > > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) > > > > > > - > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > -- Sincerely yours Mikhail Khludnev
Re: no concurrent merging?
Yes, excactly, that's it. But is it a Lucene or a Solr problem? Should Solr use a different reader from DBQ or can Lucene do something to solve this because it is reported as a Lucene issue? Regards Bernd Am 04.08.2016 um 16:02 schrieb Mikhail Khludnev: > Hello, > There is https://issues.apache.org/jira/browse/LUCENE-7049 > > > On Thu, Aug 4, 2016 at 4:35 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Lucene's merging is concurrent, but Solr unfortunately uses >> UninvertingReader on each DBQ ... I'm not sure why. I think you should ask >> on the solr-user list? >> >> Or maybe try to change your deletes to be by Term instead of Query? >> >> Mike McCandless >> >> http://blog.mikemccandless.com >> >> On Thu, Aug 4, 2016 at 7:03 AM, Bernd Fehling < >> bernd.fehl...@uni-bielefeld.de> wrote: >> >>> While increasing the indexing load of version 5.5.3 I see >>> threads where one merging thread is blocking other merging threads. >>> But is this concurrent merging? >>> >>> Bernd >>> >>> "Lucene Merge Thread #6" - Thread t@40280java.lang.Thread.State: >>> BLOCKED >>> at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4008) >>> - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) >>> owned by "Lucene Merge Thread #8" t@53896 >>> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >>> Locked ownable synchronizers: - None >>> >>> "Lucene Merge Thread #7" - Thread t@40281java.lang.Thread.State: >>> BLOCKED >>> at >> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4166) >>> - waiting to lock <6d75db> (a org.apache.solr.update.SolrIndexWriter) >>> owned by "Lucene Merge Thread #8" t@53896 >>> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3655) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >>> Locked ownable synchronizers: - None >>> >>> "Lucene Merge Thread #8" - Thread t@53896java.lang.Thread.State: >>> RUNNABLE >>> at java.lang.System.identityHashCode(Native Method) >>> at org.apache.lucene.index.IndexReader.hashCode(IndexReader.java:302) >>> at java.util.WeakHashMap.hash(WeakHashMap.java:298) >>> at java.util.WeakHashMap.put(WeakHashMap.java:449) >>> at java.util.Collections$SetFromMap.add(Collections.java:5461) >>> at >> java.util.Collections$SynchronizedCollection.add(Collections.java:2035) >>> - locked <4c8b5399> (a java.util.Collections$SynchronizedSet) >>> at >>> >> org.apache.lucene.index.IndexReader.registerParentReader(IndexReader.java:138) >>> at >>> >> org.apache.lucene.index.FilterLeafReader.(FilterLeafReader.java:306) >>> at >>> >> org.apache.lucene.uninverting.UninvertingReader.(UninvertingReader.java:184) >>> at >>> >> org.apache.solr.update.DeleteByQueryWrapper.wrap(DeleteByQueryWrapper.java:52) >>> at >>> >> org.apache.solr.update.DeleteByQueryWrapper.createWeight(DeleteByQueryWrapper.java:72) >>> at >>> >> org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:904) >>> at >>> >> org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:887) >>> at >>> >> org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:713) >>> at >>> 
>> org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:246) >>> - locked <9f8d81c> (a org.apache.lucene.index.BufferedUpdatesStream) >>> at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3834) >>> - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter) >>> at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3792) >>> - locked <6d75db> (a org.apache.solr.update.SolrIndexWriter) >>> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3646) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588) >>> at >>> >> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626) >>> >>> >>> - >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> > > > -- * Bernd FehlingBielefeld University Library Dipl.-Inform. (FH)LibTec - Library Technology Universitätsstr. 25 and Knowledge Management 33615 Bielefeld Tel. +49 521 106-4060 bernd.fehling(at)uni-bielefeld.de BASE - Bielefeld Academic Search Engine - www.base-search.net ***
Re: Why we need org.apache.lucene.codecs.Codec
Codecs are loaded with the java service loader interface. That file is the hook used to tell the service loader that this jar implements Codec. Lucene internally calls service loader and asks what codecs are there. On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote: > I don't understand why we need to add custom codec name in this file > > Thanks & Regards > Aravinth > > On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami < > aravinththangas...@gmail.com> wrote: > > > Hi all, > > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
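As an illustration of what that services file does (the codec name MyCodec and the delegate choice below are hypothetical, not from this thread): the file lists implementation classes for the service loader to discover, and the name passed to the Codec constructor is what Codec.forName() resolves at read time.

// META-INF/services/org.apache.lucene.codecs.Codec (one fully-qualified class name per line):
//   com.example.MyCodec          <-- hypothetical custom codec class

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.lucene54.Lucene54Codec;

public class MyCodec extends FilterCodec {
    public MyCodec() {
        // "MyCodec" is the SPI name; the wrapped codec supplies the actual formats.
        super("MyCodec", new Lucene54Codec());
    }
}

class CodecLookupExample {
    public static void main(String[] args) {
        // Names discovered through the services files on the classpath.
        System.out.println(Codec.availableCodecs());
        // Resolves only if some jar's services file lists a class registered as "MyCodec".
        Codec c = Codec.forName("MyCodec");
        System.out.println(c.getName());
    }
}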
Re: Why we need org.apache.lucene.codecs.Codec
I understand that, but my question is different: why are we loading it with SPI, and why are we explicitly controlling the loading of Codecs? On Thu, 04 Aug 2016 20:39:46 +0530 Greg Bowyer wrote: Codecs are loaded with the java service loader interface. That file is the hook used to tell the service loader that this jar implements Codec. Lucene internally calls service loader and asks what codecs are there. On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote: > I don't understand why we need to add custom codec name in this file > > Thanks & Regards > Aravinth > > On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami < > aravinththangas...@gmail.com> wrote: > > > Hi all, > > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Why we need org.apache.lucene.codecs.Codec
Not quite sure what you mean; Lucene needs some way to load a codec, and given parts of an index written with different codecs, it would need to load and select the right codec at the right time. Consider, for example, the upgrade path. Let's say you have segments written with codec 5.x and we in-place upgrade to 6.x; Lucene is going to need to know how to load up the codec for both 5.x and 6.x. On Thu, Aug 4, 2016, at 09:03 AM, Aravinth T wrote: > I understand that, my question is different why we are loading it with > SPI, > > why we explicitly controlling the loading of Codecs > > > > > On Thu, 04 Aug 2016 20:39:46 +0530 Greg Bowyer >wrote > > > > > Codecs are loaded with the java service loader interface. That file is > > the hook used to tell the service loader that this jar implements Codec. > > > > Lucene internally calls service loader and asks what codecs are there. > > > > On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote: > > > I don't understand why we need to add custom codec name in this file > > > > > > Thanks & Regards > > > Aravinth > > > > > > On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami < > > > aravinththangas...@gmail.com> wrote: > > > > > > > Hi all, > > > > > > > > > > > > - > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > > > > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
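A small sketch of that upgrade-path point, assuming an existing index directory (the path is illustrative): the codec name is recorded per segment, so one index can contain segments written by different codecs, and each name is resolved through SPI when the segment is opened.

import java.nio.file.Paths;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SegmentCodecsSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"))) { // illustrative path
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            for (SegmentCommitInfo sci : infos) {
                // Each segment records which codec wrote it; Codec.forName(name) via SPI
                // is how the right implementation is picked when the segment is opened.
                System.out.println(sci.info.name + " -> " + sci.info.getCodec().getName());
            }
        }
    }
}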
Re: Why we need org.apache.lucene.codecs.Codec
Hi, The Codec class is the abstract base class for all index codecs. The implementation is loaded via SPI from the classpath. To understand how this works, read the API docs of Java's ServiceLoader, which describe the process. Uwe On 4 August 2016 at 17:09:46 MESZ, Greg Bowyer wrote: >Codecs are loaded with the java service loader interface. That file is >the hook used to tell the service loader that this jar implements >Codec. > >Lucene internally calls service loader and asks what codecs are there. > >On Wed, Aug 3, 2016, at 11:23 PM, aravinth thangasami wrote: >> I don't understand why we need to add custom codec name in this file >> >> Thanks & Regards >> Aravinth >> >> On Thu, Aug 4, 2016 at 11:52 AM, aravinth thangasami < >> aravinththangas...@gmail.com> wrote: >> >> > Hi all, >> > >> > > >- >To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >For additional commands, e-mail: java-user-h...@lucene.apache.org -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Dubious error message?
Trying to add a document, someone saw: java.lang.IllegalArgumentException: Document contains at least one immense term in field="bcc-address" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[00, --omitted--]...', original message: bytes can be at most 32766 in length; got 115597 Question 1: It says the bytes are being skipped, but to me "skipped" means it's just going to continue, yet I get this exception. Is that intentional? Question 2: Can we turn this check off? Question 2.1: Why limit in the first place? Every time I have ever seen someone introduce a limit, it has only been a matter of time until someone hits it, no matter how improbable it seemed when it was put in. TX - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Dubious error message?
Question 2: Not that I know of Question 2.1. It's actually pretty difficult to understand why a single _term_ can be over 32K and still make sense. This is not to say that a single _text_ field can't be over 32K, each term within that field is (usually) much less than that. Do you have a real-world use-case where you have a 115K term that can _only_ be matched by searching for exactly that sequence of 115K characters? Not substrings. Not wildcards. A "string" type (as opposed to anything based on solr.Textfield). As far as the error message is concerned, that does seem somewhat opaque. Care to raise a JIRA on it (and, if you're really ambitious attach a patch)? Best, Erick On Thu, Aug 4, 2016 at 8:20 PM, Trejkaz wrote: > Trying to add a document, someone saw: > > java.lang.IllegalArgumentException: Document contains at least one > immense term in field="bcc-address" (whose UTF8 encoding is longer > than the max length 32766), all of which were skipped. Please correct > the analyzer to not produce such terms. The prefix of the first > immense term is: '[00, --omitted--]...', original message: bytes can > be at most 32766 in length; got 115597 > > Question 1: It says the bytes are being skipped, but to me "skipped" > means it's just going to continue, yet I get this exception. Is that > intentional? > > Question 2: Can we turn this check off? > > Question 2.1: Why limit in the first place? Every time I have ever > seen someone introduce a limit, it has only been a matter of time > until someone hits it, no matter how improbable it seemed when it was > put in. > > TX > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Dubious error message?
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson wrote: > Question 2: Not that I know of > > Question 2.1. It's actually pretty difficult to understand why a single _term_ > can be over 32K and still make sense. This is not to say that a > single _text_ field can't be over 32K, each term within that field > is (usually) much less than that. > > Do you have a real-world use-case where you have a 115K term > that can _only_ be matched by searching for exactly that > sequence of 115K characters? Not substrings. Not wildcards. A > "string" type (as opposed to anything based on solr.Textfield). This particular field is used to store unique addresses, and for precision reasons we wanted to search for addresses without tokenising them, as if you tokenised them, b...@example.com could accidentally match b...@example.com.au, even though they're two different people. It also makes statistics faster to calculate. Now, addresses in SMTP email are fairly short, limited to something like 254 characters, but sometimes you get data that violates the standard, and we store more than just that one kind of address, and maybe one of the other sorts can be longer. In this situation, it isn't clear whether you can truncate the data, because if you truncate it, now two addresses are considered equal when they're not the same string. But then again, if the old version of Lucene was already truncating it, people might be fine with it being truncated in the new version. But if they didn't know that, there would definitely be someone who objects. So I'm not really saying that the term "makes sense" - I'm just saying we encountered it in real-world data, and an error occurred. Someone then complained about the error. > As far as the error message is concerned, that does seem somewhat opaque. > Care to raise a JIRA on it (and, if you're really ambitious attach a patch)? I'll see. :) TX - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
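Since the check apparently cannot be turned off, one possible workaround, sketched here under the assumption that the address field is indexed untokenized as in the error above, is to test the UTF-8 length against IndexWriter.MAX_TERM_LENGTH before indexing and index a digest for over-long values instead of truncating them, so that two long but different addresses do not collapse into the same term. The helper class and the "bcc-address-original" stored field are hypothetical, not part of any existing API.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

public class AddressFieldSketch {
    // Adds an untokenized "bcc-address" term, falling back to a digest when the value
    // would exceed Lucene's hard per-term limit (IndexWriter.MAX_TERM_LENGTH bytes of UTF-8).
    static void addAddress(Document doc, String address) throws Exception {
        byte[] utf8 = address.getBytes(StandardCharsets.UTF_8);
        String termValue = address;
        if (utf8.length > IndexWriter.MAX_TERM_LENGTH) {
            // Hash instead of truncating so two long-but-different addresses
            // do not become equal after indexing.
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(utf8);
            StringBuilder sb = new StringBuilder("sha256:");
            for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
            termValue = sb.toString();
        }
        doc.add(new StringField("bcc-address", termValue, Field.Store.NO));
        doc.add(new StoredField("bcc-address-original", address)); // keep the full value retrievable
    }
}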