Hi Uwe, thank you very much! That indeed was the issue and did the trick!
Best Regards

Kai

--- Original Message ---
From: Uwe Schindler
Date: 24.02.14 20:42

> Hi,
>
> it looks like your filters are implemented incorrectly:
>
> - First, in Lucene 3 and 4, filters are applied per segment. That means they
> have to calculate the DocIdSet of matched documents for each index segment
> separately. On an update, the document is "deleted" (hidden) in the old
> segment and re-added to a new index segment. This is why you see it twice
> in the filter.
> - Second, in Lucene 4, filters now get (Bits acceptDocs) in their getDocIdSet
> method. This is new: previously the deleted documents were applied *after*
> the filters, now together with the filters. If acceptDocs is non-null, it
> marks the "hidden" deleted documents. If your filter does not apply those
> acceptDocs correctly to the returned DocIdSet, the deleted documents
> suddenly reappear. In Lucene 4, deletions are just an additional filter
> applied while searching: a filter that marks the still-accessible documents
> and hides all deleted ones. If your filter does not chain in this additional
> filter, the deletions are ignored. A quick fix is to use "return
> BitsFilteredDocIdSet.wrap(yourFilterBitSet, acceptDocs)" instead of "return
> yourFilterBitSet".
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: nos...@kaigrabfelder.de [mailto:nos...@kaigrabfelder.de]
>> Sent: Monday, February 24, 2014 7:14 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: updateDocument (sometimes) no longer deleting documents
>> after update to 4.6
>>
>> Hm, it looks like this is somehow caused by the filters we are using for
>> searching.
>>
>> I took one of the MY_UNIQUE_BUSINESS_ID ids used in our application's
>> search functionality and debugged the Lucene search a little more. If I
>> specify null for the filters I only get one result (which is correct).
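Uwe's quick fix can be sketched as a complete per-segment filter. This is a hedged illustration assuming Lucene 4.6; the class name and the matching logic are placeholders, not the application's actual filter:

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.BitsFilteredDocIdSet;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Hypothetical filter showing where acceptDocs must be chained in.
public class MyBusinessFilter extends Filter {

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException {
        // Called once per segment: compute matches against context.reader()
        // only, using segment-local doc IDs.
        FixedBitSet bits = new FixedBitSet(context.reader().maxDoc());
        // ... set bits for the matching documents of THIS segment ...

        // Crucial in 4.x: wrap with acceptDocs so documents that were
        // deleted (hidden) in this segment do not reappear in the results.
        return BitsFilteredDocIdSet.wrap(bits, acceptDocs);
    }
}
```

Returning the raw `bits` instead of the wrapped set is exactly the bug described above: the filter then ignores the deletions and the old version of an updated document shows up again.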
>> If I add the two filters that we usually use in our application, I notice
>> that the filters are triggered twice - for two different segments - and
>> the result is contained in both segments. It looks like the first segment
>> contains all documents in the index, with the second segment containing
>> only one - the document that should have been deleted upfront.
>>
>> This can be reproduced even after restarting the application, and even
>> after indexWriter.commit is triggered.
>>
>> Could this be a bug? Or is this the desired behaviour?
>>
>> Best Regards
>>
>> Kai
>>
>> On 2014-02-24 13:54, nos...@kaigrabfelder.de wrote:
>> > I'll see if I can dig a little bit deeper into the 3.6 behavior; for
>> > now I'm trying to get it running on 4.6 (as the index file is also a
>> > lot smaller - on 3.6 it was about 2 GB for about 9000 documents, with
>> > 4.6 it's only about 200 MB).
>> >
>> > And yes, the business ID is indexed - otherwise I wouldn't be able to
>> > find it at all. The problem is not that I can't find it, but that I
>> > find it twice. And to make matters worse, not consistently all the
>> > time but only sometimes. Somehow it looks like the delete (before the
>> > update) sometimes works and sometimes not. Do you have any idea why
>> > this could happen? Maybe something related to the MergePolicy (which
>> > we don't set, i.e. we are using the default)?
>> >
>> > Best Regards
>> >
>> > Kai
>> >
>> > On 2014-02-24 12:10, Michael McCandless wrote:
>> >> The 30 second turnaround time in 3.6.x is absurd; if you turn on
>> >> IndexWriter's infoStream maybe it'd give a clue. Or, capture a few
>> >> stack traces and post them.
>> >>
>> >> How are you creating the luceneDocumentToIndex? You must ensure that
>> >> the business ID is in fact indexed as a field in the document,
>> >> otherwise the update won't find it.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Mon, Feb 24, 2014 at 5:33 AM, <nos...@kaigrabfelder.de> wrote:
>> >>> Hi there,
>> >>>
>> >>> we recently updated our application from Lucene 3.0 to 3.6, with the
>> >>> effect that (albeit using the SearcherManager functionality as
>> >>> described on
>> >>> http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html)
>> >>> calls to searcherManager.maybeRefresh() were incredibly slow, e.g.
>> >>> taking about 30 seconds after adding one document to an index of
>> >>> about 9000 documents. I assumed that we did something wrong with the
>> >>> configuration, as 30 seconds cannot be what is meant by NRT ;-)
>> >>>
>> >>> Thus we migrated to the latest 4.6 version, and indexing speed was
>> >>> indeed very good now (with the searcherManager.maybeRefreshBlocking()
>> >>> call only taking milliseconds to complete). But after some more
>> >>> testing we discovered that somehow the
>> >>> indexWriter.updateDocument( term, documentToIndex ) functionality
>> >>> wasn't working as expected anymore - at least sometimes. It looks
>> >>> like the updateDocument method no longer reliably deletes the old
>> >>> document before adding a new one - with the result that older
>> >>> documents are being returned by searches, breaking our application.
>> >>>
>> >>> Unfortunately I'm not able to reproduce the issue in a simple unit
>> >>> test, but maybe one of the Lucene experts knows what we are doing
>> >>> wrong here. Not sure if it is of any relevance, but we are running
>> >>> on Windows with a 64-bit JDK 7, thus MMapDirectory is being used.
>> >>>
>> >>> Our IndexWriter is configured like this:
>> >>>
>> >>>     IndexWriterConfig conf = new IndexWriterConfig( Version.LUCENE_46,
>> >>>         new LimitTokenCountAnalyzer( new DefaultAnalyzer(),
>> >>>             Integer.MAX_VALUE ) );
>> >>>     conf.setOpenMode( OpenMode.APPEND );
>> >>>
>> >>>     IndexWriter indexWriter = new IndexWriter(
>> >>>         FSDirectory.open( new File( directoryPath ) ), conf );
>> >>>
>> >>> The SearcherManager is configured like this:
>> >>>
>> >>>     searcherManager = new SearcherManager( indexWriter, true, null );
>> >>>
>> >>> The analyzer that we are using looks like this:
>> >>>
>> >>>     public class DefaultAnalyzer extends Analyzer
>> >>>     {
>> >>>         @Override
>> >>>         protected TokenStreamComponents createComponents(
>> >>>                 final String fieldName, final Reader reader ) {
>> >>>             return new TokenStreamComponents( new WhitespaceTokenizer(
>> >>>                 LuceneSearchService.LUCENE_VERSION, reader ) );
>> >>>         }
>> >>>     }
>> >>>
>> >>> The update of the index looks like this:
>> >>>
>> >>>     // instead of 42 the unique business identifier is used
>> >>>     Long myUniqueBusinessId = 42L;
>> >>>     BytesRef ref = new BytesRef( NumericUtils.BUF_SIZE_LONG );
>> >>>     NumericUtils.longToPrefixCoded( myUniqueBusinessId.longValue(), 0, ref );
>> >>>     Term term = new Term( "MY_UNIQUE_BUSINESS_ID", ref );
>> >>>
>> >>>     // this method may be called multiple times with the same term
>> >>>     // and luceneDocumentToIndex parameter
>> >>>     indexWriter.updateDocument( term, luceneDocumentToIndex );
>> >>>
>> >>>     // after performing a couple of updates we execute
>> >>>     searcherManager.maybeRefreshBlocking();
>> >>>
>> >>> For searching we are using the following code:
>> >>>
>> >>>     searcher = searcherManager.acquire();
>> >>>     // luceneQuery is the query, filter is some sort of filtering
>> >>>     // that we apply, luceneSort is some sorting
>> >>>     TopDocs topDocs = searcher.search( luceneQuery, filter, 1000,
>> >>>         luceneSort );
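For the delete-by-term in the code above to match, the same prefix-coded bytes must actually exist as an indexed term. A minimal sketch of the matching pair, assuming Lucene 4.6 and assuming the ID is indexed as a trie-encoded LongField (the class and method names here are hypothetical helpers, not the application's code):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

public final class UpdateByBusinessId {

    static void update(IndexWriter writer, long id, Document doc)
            throws Exception {
        // LongField indexes trie terms; the shift-0 (full precision) term
        // is among them, so the delete term below can find the document.
        doc.add(new LongField("MY_UNIQUE_BUSINESS_ID", id, Field.Store.YES));

        BytesRef ref = new BytesRef(NumericUtils.BUF_SIZE_LONG);
        NumericUtils.longToPrefixCoded(id, 0, ref); // shift 0 = full precision
        writer.updateDocument(new Term("MY_UNIQUE_BUSINESS_ID", ref), doc);
    }
}
```

If the field were instead indexed as plain text (e.g. a StringField holding "42"), the prefix-coded term would match nothing and the old document would never be deleted.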
>> >>> If we perform a query for MY_UNIQUE_BUSINESS_ID it will return
>> >>> multiple results instead of just one - this was the case with
>> >>> neither Lucene 3.0 nor 3.6.
>> >>>
>> >>> In order to fix the issue I tried a couple of things, but to no
>> >>> avail. It still happens (not all the time, though) that Lucene
>> >>> returns two documents when querying for MY_UNIQUE_BUSINESS_ID
>> >>> instead of just one:
>> >>> - setting setMaxBufferedDeleteTerms to 1 in the config:
>> >>>   conf.setMaxBufferedDeleteTerms( 1 );
>> >>> - explicitly deleting instead of just updating:
>> >>>   indexWriter.deleteDocuments( term );
>> >>> - ensuring that the field MY_UNIQUE_BUSINESS_ID is stored in the
>> >>>   index and not just analysed
>> >>> - trying to delete the document via indexWriter.tryDeleteDocument()
>> >>> - calling indexWriter.maybeMerge() after the update
>> >>> - calling indexWriter.commit() after the update
>> >>>
>> >>> Sorry for the lengthy post, but I wanted to include as much
>> >>> information as possible. Let me know if something is missing...
>> >>>
>> >>> Thanks for helping in advance ;-)
>> >>>
>> >>> Kai
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
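One side note on the search code quoted above: it acquires a searcher but never shows it being released. A sketch of the usual acquire/release discipline for SearcherManager, assuming Lucene 4.6 (the method wrapper here is illustrative, not from the thread):

```java
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;

public final class SearchHelper {

    static TopDocs search(SearcherManager searcherManager, Query query,
                          Filter filter, Sort sort) throws Exception {
        // acquire() pins the current point-in-time searcher; a concurrent
        // maybeRefreshBlocking() will not invalidate it mid-search.
        IndexSearcher searcher = searcherManager.acquire();
        try {
            return searcher.search(query, filter, 1000, sort);
        } finally {
            // Always release, even on exceptions, or the old index files
            // backing this searcher can never be closed.
            searcherManager.release(searcher);
        }
    }
}
```

Forgetting the release does not cause the duplicate-result symptom discussed here, but it keeps superseded segments alive and can inflate disk usage over time.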