Somnath,

In addition to everything said, you could try:
- BooleanQuery.setUseScorer14(true), which works for long OR-like queries
  such as yours,
- a search method that returns a TopDocs instead of a Hits, and with that
- a FieldCache to retrieve the (primary key) values of the documents.

Regards,
Paul Elschot

P.S. Instead of setUseScorer14(true) you might try this patch,
which should be just as quick:
http://issues.apache.org/jira/browse/LUCENE-730

On Monday 22 January 2007 15:36, mark harwood wrote:
> "MoreLikeThis.java" is in the "contrib" section of SVN and this will help
> optimise your queries to searching for only the most discriminating terms.
> On a large index, very common terms can really kill performance (reading lots
> of docIds from disk) and the MoreLikeThis class will help to trim your query
> terms down to avoid such words.
>
> This will still take some time to run but at least the job sounds like one
> that can easily be split to run in parallel on multiple machines (assuming
> you have some to hand!)
>
> Cheers
> Mark
>
>
> ----- Original Message ----
> From: Somnath Banerjee <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, 22 January, 2007 1:28:20 PM
> Subject: Re: Long Query Performance
>
> Thanks for the reply. Good guess I think.
>
> DB (Index) is basically a collection of encyclopedia documents. Queries are
> also a collection of documents but of various domains. My task is to find
> out for each "query document" top 100 matching encyclopedia contents.
>
> I tried by using only the title of (5-8 words) the query documents instead
> of full text of the document. But that is also taking 0.5-1 sec for each
> query. That's mean it will also take nearly 6 and half days to run
> 0.72M queries (and expectedly the precision will suffer).
>
> Thanks,
> Somnath
>
> On 1/22/07, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
> >
> > Somnath Banerjee wrote:
> >
> > > I have created a 8GB index of almost 2 million documents. My
> > > requirement is to run nearly 0.72 million query on this index. Each
> > > query consists of 200 - 400 words. I have created a Boolean Query by
> > > ORing these words. But each query is taking nearly 5 - 10 seconds to
> > > execute (2.78 GHz, 1.5 GB RAM). That's mean the entire batch of 0.72M
> > > query will take more than 70 days to execute. Is it expected or there
> > > is a way to improve the performance? From earlier posts I gathered
> > > that complex query is expected to take more time (this much???).
> >
> > A back of the envelope calculation:
> >     8GB / 2M docs = 4KB per doc, on avg
> >     / 5 B per word, on avg = 800 words per doc, on avg
> >
> > So, each query is a quarter to half the size of the average document. I
> > suspect that just about every query is hitting almost every document in
> > the db, i.e. the queries are not very selective at all. That's going to
> > be slow, no two ways about it.
> >
> > Could you tell us a bit more about the db and what your application is
> > looking for in it, at a higher level of abstraction?
> >
> > --MDC
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
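
The back-of-the-envelope figures quoted in this thread can be checked with a few lines of plain Java. This is only a sketch of the arithmetic, not anything from Lucene: the class and method names (BatchEstimate, bytesPerDoc, batchDays) are made up for illustration, GB is taken as 10^9 bytes, the index is treated as exactly 8 GB over 2 million documents, and the queries are assumed to run one after another on a single machine.

```java
public class BatchEstimate {
    // Hypothetical helper class for illustration; not part of Lucene.
    static final double SECONDS_PER_DAY = 24 * 60 * 60;

    /** Average number of index bytes per document. */
    static double bytesPerDoc(double indexBytes, double docCount) {
        return indexBytes / docCount;
    }

    /** Days needed to run queryCount queries strictly one after another. */
    static double batchDays(double queryCount, double secondsPerQuery) {
        return queryCount * secondsPerQuery / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        // Assumption: 8 GB = 8e9 bytes, 2M docs, 5 bytes per word on average.
        double perDoc = bytesPerDoc(8e9, 2e6);   // 4000 bytes, i.e. about 4KB
        double wordsPerDoc = perDoc / 5.0;       // 800 words

        System.out.printf("bytes per doc: %.0f%n", perDoc);
        System.out.printf("words per doc: %.0f%n", wordsPerDoc);

        // Query length relative to the average document (200 - 400 words).
        System.out.printf("query/doc ratio: %.2f to %.2f%n",
                200 / wordsPerDoc, 400 / wordsPerDoc);

        // Full-text queries at 5 - 10 s each, title-only queries at 0.5 - 1 s each.
        double queries = 0.72e6;
        System.out.printf("full text: %.1f to %.1f days%n",
                batchDays(queries, 5), batchDays(queries, 10));
        System.out.printf("titles only: %.1f to %.1f days%n",
                batchDays(queries, 0.5), batchDays(queries, 1));
    }
}
```

Because batchDays is linear in the query count, splitting the batch evenly over N machines, as Mark suggests, divides the estimate by N.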