PhraseQuery Performance Issues [Lucene 2.9.0]

Daniel Shane Fri, 19 Mar 2010 13:57:21 -0700

I'm running a medium size web search with a index size just shy of 9GB with 
800000 docs in it.


We are suing Lucene version 2.9.0 (we have not checked yet to see if this 
applies to older versions as well).

By looking at my logs, I'm finding that phrase queries are especially long to 
perform. In our index, we do not remove stopwords, so things like "the" and 
"is" are getting indexed on purpose.

If I try a phrase search like "The The" it will take about 10 seconds in Luke 
to get some results back, and a bit less afterwards (7sec). 

More complete phrases that match maybe only 1 document can also take >10 secs 
if they have many stopwords in them.

I was wondering if this a normal behavior considering the fact that we do not 
remove stopwords? 

Also, on some phrase queries (not all), the difference between the first call 
and any subsequent calls can be very big. For example, it could take 5 seconds 
to do one query and then less than 1 second to perform it again. 

Does Lucene, by default, cache anything when a (phrase) query is made or is 
this simply file system caching at work?

If this is a normal behavior, I assume that the solution is either to remove 
stopwords from the index or shard it and ParallelMultiSearch it.

What do you think?

Daniel Shane




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

PhraseQuery Performance Issues [Lucene 2.9.0]

Reply via email to