Indeed! I found a very good article on this as well at :
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1 It really sums up what you are saying. Thanks for the help! Daniel Shane ----- Original Message ----- From: "Michael McCandless" <luc...@mikemccandless.com> To: java-user@lucene.apache.org Sent: Friday, March 19, 2010 7:14:06 PM Subject: Re: PhraseQuery Performance Issues [Lucene 2.9.0] Nutch/Solr's CommonGrams is the right way to solve this. It combines frequent terms (eg stopwords) with adjacent terms. So "the wizard of oz" will be indexed eg as the_wizard wizard_of of_oz. It'll require a full re-index though, and you have to fixup searching so that the same term expansion works. Lucene does no caching that'd speed up PhraseQuery (unless that was the very first query against the field since you opened the reader) -- this is probably the OS's IO cache. Mike On Fri, Mar 19, 2010 at 3:56 PM, Daniel Shane <sha...@lexum.umontreal.ca> wrote: > I'm running a medium size web search with a index size just shy of 9GB with > 800000 docs in it. > > We are suing Lucene version 2.9.0 (we have not checked yet to see if this > applies to older versions as well). > > By looking at my logs, I'm finding that phrase queries are especially long to > perform. In our index, we do not remove stopwords, so things like "the" and > "is" are getting indexed on purpose. > > If I try a phrase search like "The The" it will take about 10 seconds in Luke > to get some results back, and a bit less afterwards (7sec). > > More complete phrases that match maybe only 1 document can also take >10 secs > if they have many stopwords in them. > > I was wondering if this a normal behavior considering the fact that we do not > remove stopwords? > > Also, on some phrase queries (not all), the difference between the first call > and any subsequent calls can be very big. For example, it could take 5 > seconds to do one query and then less than 1 second to perform it again. > > Does Lucene, by default, cache anything when a (phrase) query is made or is > this simply file system caching at work? > > If this is a normal behavior, I assume that the solution is either to remove > stopwords from the index or shard it and ParallelMultiSearch it. > > What do you think? > > Daniel Shane > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org