Hi there,

we recently updated our application from Lucene 3.0 to 3.6, with the effect that (albeit using the SearcherManager functionality as described on http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html) calls to searcherManager.maybeRefresh() were incredibly slow, e.g. taking about 30 seconds after adding a single document to an index of about 9000 documents. I assumed that we had done something wrong with the configuration, as 30 seconds can hardly be what NRT means ;-)

Thus we migrated to the latest 4.6 version, and indexing speed was indeed very good now (with the searcherManager.maybeRefreshBlocking() call only taking milliseconds to complete). But after some more testing we discovered that the indexWriter.updateDocument( term, documentToIndex ) functionality no longer works as expected - at least sometimes. It looks like the updateDocument method no longer reliably deletes the old document before adding the new one, with the result that older documents are being returned by searches, breaking our application. Unfortunately I'm not able to reproduce the issue in a simple unit test, but maybe one of the Lucene experts knows what we are doing wrong here. Not sure if it is of any relevance, but we are running on Windows with a 64-bit JDK 7, so MMapDirectory is being used.
Our IndexWriter is configured like this:

    IndexWriterConfig conf = new IndexWriterConfig(
            Version.LUCENE_46,
            new LimitTokenCountAnalyzer( new DefaultAnalyzer(), Integer.MAX_VALUE ) );
    conf.setOpenMode( OpenMode.APPEND );
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open( new File( directoryPath ) ), conf );
The SearcherManager is configured like this:

    searcherManager = new SearcherManager( indexWriter, true, null );
The analyzer that we are using looks like this:

    public class DefaultAnalyzer extends Analyzer
    {
        @Override
        protected TokenStreamComponents createComponents( final String fieldName,
                final Reader reader ) {
            return new TokenStreamComponents(
                    new WhitespaceTokenizer( LuceneSearchService.LUCENE_VERSION, reader ) );
        }
    }
The update of the index looks like this:

    // instead of 42 the unique business identifier is used
    Long myUniqueBusinessId = 42L;
    BytesRef ref = new BytesRef( NumericUtils.BUF_SIZE_LONG );
    NumericUtils.longToPrefixCoded( myUniqueBusinessId.longValue(), 0, ref );
    Term term = new Term( "MY_UNIQUE_BUSINESS_ID", ref );
    // this method may be called multiple times with the same
    // term and luceneDocumentToIndex parameters
    indexWriter.updateDocument( term, luceneDocumentToIndex );
After performing a couple of updates we execute:

    searcherManager.maybeRefreshBlocking();

For searching we are using the following code:

    searcher = searcherManager.acquire();
    // luceneQuery is the query, filter is some sort of filtering that
    // we apply, luceneSort is some sorting criterion
    TopDocs topDocs = searcher.search( luceneQuery, filter, 1000, luceneSort );
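In case it is relevant, the full acquire/release handling around that search call looks roughly like this (a simplified sketch; luceneQuery, filter and luceneSort are the same placeholders as above, and we do pair each acquire() with a release() in a finally block as the SearcherManager javadoc requires):

```java
// Simplified view of our search path: acquire a searcher, use it,
// and release it in a finally block so reference counting stays correct.
IndexSearcher searcher = searcherManager.acquire();
try {
    TopDocs topDocs = searcher.search( luceneQuery, filter, 1000, luceneSort );
    // topDocs is consumed while the searcher is still acquired
} finally {
    searcherManager.release( searcher );
    searcher = null; // guard against accidentally reusing it after release
}
```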
If we perform a query for MY_UNIQUE_BUSINESS_ID, it will return multiple results instead of just one - this was the case with neither Lucene 3.0 nor 3.6.

In order to fix the issue I tried a couple of things, but to no avail. It still happens (not all the time, though) that Lucene returns two documents instead of just one when querying for MY_UNIQUE_BUSINESS_ID:
- setting setMaxBufferedDeleteTerms to 1 in the config: conf.setMaxBufferedDeleteTerms( 1 );
- explicitly deleting instead of just updating: indexWriter.deleteDocuments( term );
- ensuring that the field MY_UNIQUE_BUSINESS_ID is stored in the index and not just analysed
- trying to delete the document via indexWriter.tryDeleteDocument()
- calling indexWriter.maybeMerge() after the update
- calling indexWriter.commit() after the update
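For reference, the unit test I used when trying (and failing) to reproduce the problem has roughly this shape - a simplified sketch that uses a RAMDirectory and a plain WhitespaceAnalyzer instead of our real setup, and indexes the id as a LongField, which may well differ from what our production documents do:

```java
// Update the same business id twice, refresh, and expect one hit.
Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter( dir,
        new IndexWriterConfig( Version.LUCENE_46,
                new WhitespaceAnalyzer( Version.LUCENE_46 ) ) );
SearcherManager sm = new SearcherManager( writer, true, null );

BytesRef ref = new BytesRef( NumericUtils.BUF_SIZE_LONG );
NumericUtils.longToPrefixCoded( 42L, 0, ref );
Term term = new Term( "MY_UNIQUE_BUSINESS_ID", ref );

// the second updateDocument() call should replace the document
// written by the first one
for ( int i = 0; i < 2; i++ ) {
    Document doc = new Document();
    doc.add( new LongField( "MY_UNIQUE_BUSINESS_ID", 42L, Field.Store.YES ) );
    writer.updateDocument( term, doc );
    sm.maybeRefreshBlocking();
}

IndexSearcher searcher = sm.acquire();
try {
    TopDocs hits = searcher.search( new TermQuery( term ), 10 );
    // in this isolated test hits.totalHits is 1, as expected;
    // in the real application we sometimes see 2
} finally {
    sm.release( searcher );
}
```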
Sorry for the lengthy post, but I wanted to include as much information as possible. Let me know if something is missing...

Thanks in advance for your help ;-)

Kai
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org