Somnath,

In addition to everything said, you could try and use:
- BooleanQuery.setUseScorer14(true), which works for long OR like
  queries as yours,
- a search method that returns a TopDocs instead of a Hits, and with that
- a FieldCache to retrieve the (primary key) values of the documents.

Regards,
Paul Elschot

P.S. Instead of setUseScorer14(true) you might try this patch, which  should be 
just as quick:  http://issues.apache.org/jira/browse/LUCENE-730 .


On Monday 22 January 2007 15:36, mark harwood wrote:
> "MoreLikeThis.java" is in the "contrib" section of SVN and this will help 
> optimise your queries to searching for only the most discriminating terms.
> On a large index, very common terms can really kill performance (reading lots 
> of docIds from disk) and the MoreLikeThis class will help to trim your query 
> terms down to avoid such words.
> 
> This will still take some time to run but at least the job sounds like one 
> that can easily be split to run in parallel on multiple machines (assuming 
> you have some to hand!)
> 
> Cheers
> Mark
> 
> 
> ----- Original Message ----
> From: Somnath Banerjee <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Monday, 22 January, 2007 1:28:20 PM
> Subject: Re: Long Query Performance
> 
> Thanks for the reply. Good guess I think.
> 
> DB (Index) is basically a collection of encyclopedia documents. Queries are
> also a collection of documents but of various domains. My task is to find
> out for each "query document" top 100 matching encyclopedia contents.
> 
> I tried by using only the title of (5-8 words) the query documents instead
> of full text of the document. But that is also taking 0.5-1 sec for each
> query. That's mean it will also take nearly 6 and half days to run
> 0.72Mqueries (and expectedly the precision will suffer).
> 
> Thanks,
> Somnath
> 
> On 1/22/07, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
> >
> > Somnath Banerjee wrote:
> >
> > >            I have created a 8GB index of almost 2 million documents. My
> > > requirement is to run nearly 0.72 million query on this index. Each
> > query
> > > consists of 200 - 400 words. I have created a Boolean Query by ORing
> > these
> > > words. But each query is taking nearly 5 - 10 seconds to execute ( 2.78
> > > GHz,
> > > 1.5 GB RAM). That's mean the entire batch of 0.72M query will take more
> > > than
> > > 70 days to execute. Is it expected or there is a way to improve the
> > > performance? From earlier posts I gathered that complex query is
> > > expected to
> > > take more time (this much???).
> >
> > A back of the envelope calculation:
> >      8GB / 2M docs = 4KB per doc, on avg
> >                    / 5 B per word, on avg = 800 words per doc, on avg
> >
> > So, each query is a quarter to half the size of the average document.  I
> > suspect that just about every query is hitting almost every document in
> > the db, i.e. the queries are not very selective at all.  That's going to
> > be slow, no two ways about it.
> >
> > Could you tell us a bit more about the db and what your application is
> > looking for in it, at a higher level of abstraction?
> >
> > --MDC
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 
> 
> 
> 
> 
>       
>       
>               
> ___________________________________________________________ 
> New Yahoo! Mail is the ultimate force in competitive emailing. Find out more 
> at the Yahoo! Mail Championships. Plus: play games and win prizes. 
> http://uk.rd.yahoo.com/evt=44106/*http://mail.yahoo.net/uk 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 
> 

Reply via email to