Hi Uwe, thank you very much! That indeed was the issue and did the trick!
Best Regards

Kai

--- Original Message ---
From: Uwe Schindler
Date: 24.02.14 20:42

> Hi,
>
> it looks like your filters are implemented incorrectly:
>
> - First, in Lucene 3 and 4, filters are applied per segment. That means they
> have to calculate the DocIdSet of matched documents for each index segment
> separately. On an update, the document is "deleted" (hidden) in the old
> segment and re-added to a new index segment. This is why you see it twice
> in the filter.
> - Second, in Lucene 4, filters now get (Bits acceptDocs) in their getDocIdSet
> method. This is new: previously the deleted documents were applied *after*
> the filters, now together with the filters. If acceptDocs is non-null, it
> marks the "hidden" deleted documents. If your filter does not apply those
> acceptDocs correctly to the returned DocIdSet, the deleted documents
> suddenly reappear. In Lucene 4, deletions are just an additional filter
> applied while searching: a filter that marks the still-accessible documents
> and hides all deleted ones. If your filter does not chain in this additional
> filter, the deletions are ignored. A quick fix is to use "return
> BitsFilteredDocIdSet.wrap(yourFilterBitSet, acceptDocs)" instead of "return
> yourFilterBitSet".
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: nos...@kaigrabfelder.de [mailto:nos...@kaigrabfelder.de]
>> Sent: Monday, February 24, 2014 7:14 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: updateDocument (sometimes) no longer deleting documents
>> after update to 4.6
>>
>> Hm, it looks like this is somehow caused by the filters we are using for
>> searching.
>>
>> I took one of the MY_UNIQUE_BUSINESS_ID ids used in our application's
>> search functionality and debugged the Lucene search a little more. If I
>> specify null for the filters I only get one result (which is correct).
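Uwe's quick fix can be sketched as a complete per-segment filter. This is a hedged illustration assuming Lucene 4.6; the class name and the matching logic are placeholders, not the application's actual filter:

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Hypothetical filter showing where acceptDocs must be chained in.
public class MyBusinessFilter extends Filter {

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException {
        // Called once per segment: compute matches against context.reader()
        // only, using segment-local doc IDs.
        FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
        // ... set bits for the matching documents of THIS segment ...

        // Crucial in 4.x: wrap with acceptDocs so documents that were
        // deleted (hidden) in this segment do not reappear in the results.
        return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
    }
}
```

Returning the raw `bits` instead of the wrapped set is exactly the bug described above: the filter then ignores the deletions and the old version of an updated document shows up again.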
>> If I add the two filters that we usually use in our application, I notice
>> that the filters are triggered twice - for two different segments - and
>> the result is contained in both segments. It looks like the first segment
>> contains all documents in the index, with the second segment containing
>> only one - the document that should have been deleted upfront.
>>
>> This can be reproduced even after restarting the application, and even
>> after indexWriter.commit is triggered.
>>
>> Could this be a bug? Or is this the desired behaviour?
>>
>> Best Regards
>>
>> Kai
>>
>> On 2014-02-24 13:54, nos...@kaigrabfelder.de wrote:
>> > I'll see if I can dig a little bit deeper into the 3.6 behavior; for
>> > now I'm trying to get it running on 4.6 (as the index file is also a
>> > lot smaller - on 3.6 it was about 2 GB for about 9000 documents, with
>> > 4.6 it's only about 200 MB).
>> >
>> > And yes, the business ID is indexed - otherwise I wouldn't be able to
>> > find it at all. The problem is not that I can't find it, but that I
>> > find it twice. And to make matters worse, not consistently all the
>> > time but only sometimes. Somehow it looks like the delete (before the
>> > update) sometimes works and sometimes not. Do you have any idea why
>> > this could happen? Maybe something related to the MergePolicy (which
>> > we don't set, i.e. we are using the default)?
>> >
>> > Best Regards
>> >
>> > Kai
>> >
>> > On 2014-02-24 12:10, Michael McCandless wrote:
>> >> The 30 second turnaround time in 3.6.x is absurd; if you turn on
>> >> IndexWriter's infoStream maybe it'd give a clue. Or, capture a few
>> >> stack traces and post them.
>> >>
>> >> How are you creating the luceneDocumentToIndex? You must ensure that
>> >> the business ID is in fact indexed as a field in the document,
>> >> otherwise the update won't find it.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Mon, Feb 24, 2014 at 5:33 AM, <nos...@kaigrabfelder.de> wrote:
>> >>> Hi there,
>> >>>
>> >>> we recently updated our application from Lucene 3.0 to 3.6, with the
>> >>> effect that (albeit using the SearcherManager functionality as
>> >>> described on
>> >>> http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html)
>> >>> calls to searcherManager.maybeRefresh() were incredibly slow, e.g.
>> >>> taking about 30 seconds after adding one document to an index of
>> >>> about 9000 documents. I assumed that we did something wrong with the
>> >>> configuration, as 30 seconds cannot be what is meant by NRT ;-)
>> >>>
>> >>> Thus we migrated to the latest 4.6 version, and indexing speed was
>> >>> indeed very good now (with the searcherManager.maybeRefreshBlocking()
>> >>> call only taking milliseconds to complete). But after some more
>> >>> testing we discovered that somehow the
>> >>> indexWriter.updateDocument( term, documentToIndex ) functionality
>> >>> wasn't working as expected anymore - at least sometimes. It looks
>> >>> like the updateDocument method no longer reliably deletes the old
>> >>> document before adding a new one - with the result that older
>> >>> documents are being returned by searches, breaking our application.
>> >>>
>> >>> Unfortunately I'm not able to reproduce the issue in a simple unit
>> >>> test, but maybe one of the Lucene experts knows what we are doing
>> >>> wrong here. Not sure if it is of any relevance, but we are running
>> >>> on Windows with a 64-bit JDK 7, thus MMapDirectory is being used.
>> >>>
>> >>> Our IndexWriter is configured like this:
>> >>>
>> >>>     IndexWriterConfig conf = new IndexWriterConfig( Version.LUCENE_46,
>> >>>         new LimitTokenCountAnalyzer( new DefaultAnalyzer(),
>> >>>             Integer.MAX_VALUE ) );
>> >>>     conf.setOpenMode( OpenMode.APPEND );
>> >>>
>> >>>     IndexWriter indexWriter = new IndexWriter(
>> >>>         FSDirectory.open( new File( directoryPath ) ), conf );
>> >>>
>> >>> The SearcherManager is configured like this:
>> >>>
>> >>>     searcherManager = new SearcherManager( indexWriter, true, null );
>> >>>
>> >>> The analyzer that we are using looks like this:
>> >>>
>> >>>     public class DefaultAnalyzer extends Analyzer
>> >>>     {
>> >>>         @Override
>> >>>         protected TokenStreamComponents createComponents(
>> >>>                 final String fieldName, final Reader reader ) {
>> >>>             return new TokenStreamComponents( new WhitespaceTokenizer(
>> >>>                 LuceneSearchService.LUCENE_VERSION, reader ) );
>> >>>         }
>> >>>     }
>> >>>
>> >>> The update of the index looks like this:
>> >>>
>> >>>     // instead of 42 the unique business identifier is used
>> >>>     Long myUniqueBusinessId = 42L;
>> >>>     BytesRef ref = new BytesRef( NumericUtils.BUF_SIZE_LONG );
>> >>>     NumericUtils.longToPrefixCoded( myUniqueBusinessId.longValue(), 0, ref );
>> >>>     Term term = new Term( "MY_UNIQUE_BUSINESS_ID", ref );
>> >>>
>> >>>     // this method may be called multiple times with the same term
>> >>>     // and luceneDocumentToIndex parameter
>> >>>     indexWriter.updateDocument( term, luceneDocumentToIndex );
>> >>>
>> >>>     // after performing a couple of updates we execute
>> >>>     searcherManager.maybeRefreshBlocking();
>> >>>
>> >>> For searching we are using the following code:
>> >>>
>> >>>     searcher = searcherManager.acquire();
>> >>>     // luceneQuery is the query, filter is some sort of filtering
>> >>>     // that we apply, luceneSort is some sorting
>> >>>     TopDocs topDocs = searcher.search( luceneQuery, filter, 1000,
>> >>>         luceneSort );
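For the delete-by-term in the code above to match, the same prefix-coded bytes must actually exist as an indexed term. A minimal sketch of the matching pair, assuming Lucene 4.6 and assuming the ID is indexed as a trie-encoded LongField (the class and method names here are hypothetical helpers, not the application's code):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

public final class UpdateByBusinessId {

    static void update(IndexWriter writer, long id, Document doc)
            throws Exception {
        // LongField indexes trie terms; the shift-0 (full precision) term
        // is among them, so the delete term below can find the document.
        doc.add(new LongField("MY_UNIQUE_BUSINESS_ID", id, Field.Store.YES));

        BytesRef ref = new BytesRef(NumericUtils.BUF_SIZE_LONG);
        NumericUtils.longToPrefixCoded(id, 0, ref); // shift 0 = full precision
        writer.updateDocument(new Term("MY_UNIQUE_BUSINESS_ID", ref), doc);
    }
}
```

If the field were instead indexed as plain text (e.g. a StringField holding "42"), the prefix-coded term would match nothing and the old document would never be deleted.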
>> >>> If we perform a query for MY_UNIQUE_BUSINESS_ID it will return
>> >>> multiple results instead of just one - this was the case with
>> >>> neither Lucene 3.0 nor 3.6.
>> >>>
>> >>> In order to fix the issue I tried a couple of things, but to no
>> >>> avail. It still happens (not all the time, though) that Lucene
>> >>> returns two documents when querying for MY_UNIQUE_BUSINESS_ID
>> >>> instead of just one:
>> >>> - setting setMaxBufferedDeleteTerms to 1 in the config:
>> >>>   conf.setMaxBufferedDeleteTerms( 1 );
>> >>> - explicitly deleting instead of just updating:
>> >>>   indexWriter.deleteDocuments( term );
>> >>> - ensuring that the field MY_UNIQUE_BUSINESS_ID is stored in the
>> >>>   index and not just analysed
>> >>> - trying to delete the document via indexWriter.tryDeleteDocument()
>> >>> - calling indexWriter.maybeMerge() after the update
>> >>> - calling indexWriter.commit() after the update
>> >>>
>> >>> Sorry for the lengthy post, but I wanted to include as much
>> >>> information as possible. Let me know if something is missing...
>> >>>
>> >>> Thanks for helping in advance ;-)
>> >>>
>> >>> Kai
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
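One side note on the search code quoted above: it acquires a searcher but never shows it being released. A sketch of the usual acquire/release discipline for SearcherManager, assuming Lucene 4.6 (the method wrapper here is illustrative, not from the thread):

```java
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public final class SearchHelper {

    static TopDocs search(SearcherManager searcherManager, Query query,
                          Filter filter, Sort sort) throws Exception {
        // acquire() pins the current point-in-time searcher; a concurrent
        // maybeRefreshBlocking() will not invalidate it mid-search.
        IndexSearcher searcher = searcherManager.acquire();
        try {
            return searcher.search(query, filter, 1000, sort);
        } finally {
            // Always release, even on exceptions, or the old index files
            // backing this searcher can never be closed.
            searcherManager.release(searcher);
        }
    }
}
```

Forgetting the release does not cause the duplicate-result symptom discussed here, but it keeps superseded segments alive and can inflate disk usage over time.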