Nigel, Based on the description, I'd suspect unnecessarily(?) large JVM heap and insufficient RAM for caching the actual index. Run vmstat while querying the index and watch columns: bi, bo, si, so, wa, and id. :) If what I said above is correct, then you should see more data loaded from disk during those clow queries and probably a jump in the wa column if you are running multiple concurrent queries.
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ----- Original Message ---- > From: Nigel <nigelspl...@gmail.com> > To: java-user@lucene.apache.org > Sent: Tuesday, June 23, 2009 4:53:09 PM > Subject: Analyzing performance and memory consumption for boolean queries > > Our query performance is surprisingly inconsistent, and I'm trying to figure > out why. I've realized that I need to better understand what's going on > internally in Lucene when we're searching. I'd be grateful for any answers > (including pointers to existing docs, if any). > > Our situation is this: We have roughly 250 million docs spread across four > indexes. Each doc has about a dozen fields, all stored and most indexed. > (They're the usual document things like author, date, title, contents, > etc.) Queries differ in complexity but always have at least a few terms in > boolean combination, up to some larger queries with dozens or even hundreds > of terms combined with ands, ors, nots, and parens. There's no sorting, > even by relevance: we just want to know what matches. Query performance is > often sub-second, but not infrequently it can take over 20 seconds (we the > time-limited hit collector, so anything over 20 seconds is stopped). > Obviously the more complex queries are slower on average, but a given query > can sometimes be much slower or much faster. > > My assumption is that we're having memory problems or disk utilization > problems or both. Our app has a 5gb JVM heap on an 8gb server with no other > user processes running, so we shouldn't be paging and should have some room > for Linux disk cache. The server is lightly loaded and concurrent queries > are the exception rather than the norm. Two of the four indexes are updated > a few times a day via rsync and subsequently closed and re-opened, but poor > query performance doesn't seem to be correlated with these times. > > So, getting to some specific questions: > > 1) How is the inverted index for a given field structured in terms of what's > in memory and what's on disk? Is it dynamic, based on available memory, or > tuneable, or fixed? Is there a rule of thumb that could be used to estimate > how much memory is required per indexed field, based on the number of terms > and documents? Likewise, is there a rule of thumb to estimate how many disk > accesses are required to retrieve the hits for that field? (I'm thinking, > by perhaps false analogy, of how a database maintains a b-tree structure > that may reside partially in RAM cache and partially in disk pages.) > > 2) When boolean queries are searched, is it as simple as iterating the hits > for each ANDed or ORed term and applying the appropriate logical operators > to the results? For example, is searching for "foo AND bar" pretty much the > same resource-wise as doing two separate searches, and therefore should the > query performance be a linear function of the number the number of search > terms? Or is there some other caching and/or decision logic (perhaps kind > of like a database's query optimizer) at work here that makes the I/O and > RAM requirements more difficult to model from the query? (Remember that > we're not doing any sorting.) > > I'm hoping that with some of this knowledge, I'll be able to better model > the RAM and I/O usage of the indexes and queries, and thus eventually > understand why things are slow or fast. > > Thanks, > Chris --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org