Re: number of hits of pages containing two terms

2009-03-17 Thread Chris Hostetter
: The final "production" computation is one-time, still, I have to recurrently
: come back and correct some errors, then retry...
this doesn't really seem like a problem ideally suited for Lucene ... this seems like the type of problem sequential batch crunching could solve better... first pas

Re: number of hits of pages containing two terms

2009-03-17 Thread Paul Elschot
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39 M docs will probably not fit in memory in one go, but that can be worked around by kee
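The bitset idea can be sketched with the JDK alone. This is only an illustration that uses `java.util.BitSet` as a stand-in for Lucene's OpenBitSet; the class name `CooccurrenceBits`, its helper methods and the toy document ids are hypothetical. It also works out the memory estimate: one bit per document for 39 M documents is about 4.9 MB per term, so 10k uncompressed sets would need roughly 49 GB.

```java
import java.util.BitSet;

class CooccurrenceBits {
    // Builds a bitset with one bit set per document id that contains the term.
    static BitSet docsOf(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Counts documents containing both terms without retrieving any of them.
    static int intersectionCount(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone(); // copy so the inputs stay untouched
        both.and(b);
        return both.cardinality();
    }

    // Uncompressed size in bytes of termCount bitsets over docCount documents.
    static long bytesNeeded(long termCount, long docCount) {
        return termCount * ((docCount + 7) / 8);
    }

    public static void main(String[] args) {
        BitSet first = docsOf(1, 4, 7, 9);    // toy data, not real postings
        BitSet second = docsOf(4, 5, 9, 12);
        System.out.println(intersectionCount(first, second)); // prints 2
        System.out.println(bytesNeeded(10_000, 39_000_000));  // prints 48750000000
    }
}
```

Keeping only a subset of the bitsets in memory at a time, as suggested, trades this footprint for repeated passes over the term list.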

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? The final "production" computation is one-time, still, I have to recurrently come back and correct some errors, then retry... With the simple approach (doing 100

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? With the simple approach (doing 100 million 2-term AND queries), how long do you estimate it'd take? I think you could do this with your own analyzer (as you suggested)... it would run norm

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions. -- Ian. On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu wrote: > Ian Lea wrote: >> >> Adrian - have you looked any further

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Ian Lea wrote: Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not "too slow" in absolute terms, it is only for my particular needs of attempting the num

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: I don't understand how this would address the "docFreq does not reflect deletions". Bad mail-quoting, sorry. I am not interested in document deletion, I just index Wikipedia once, and want to get a co-occurrence-based similarity distance between words called NGD (norm
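To make the role of the pair counts concrete, here is a small sketch of the distance computation, assuming NGD here means the Normalized Google Distance in its usual formulation: (max(log f(x), log f(y)) - log f(x,y)) / (log N - min(log f(x), log f(y))). The class name `Ngd` and the parameter names are hypothetical; f(x) and f(y) are the page counts of each term, f(x,y) the count of pages containing both, and N the total number of pages.

```java
class Ngd {
    // fx, fy: pages containing each term; fxy: pages containing both;
    // n: total number of pages. Assumes fx, fy >= 1 and min(fx, fy) < n.
    static double ngd(long fx, long fy, long fxy, long n) {
        if (fxy == 0) {
            return Double.POSITIVE_INFINITY; // the terms never co-occur
        }
        double lx = Math.log(fx);
        double ly = Math.log(fy);
        double lxy = Math.log(fxy);
        return (Math.max(lx, ly) - lxy) / (Math.log(n) - Math.min(lx, ly));
    }

    public static void main(String[] args) {
        // Toy counts, not Wikipedia figures.
        System.out.println(ngd(10, 100, 10, 1000));
    }
}
```

Only f(x,y) requires a two-term count; f(x), f(y) and N are single-term or index-wide figures, which is why the pair count dominates the cost.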

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
This is all getting very complicated! Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Standard questions: have you warmed up the searcher? How large is the index? How many occurrences of yo

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Adrian Dimulescu wrote: Thank you. I suppose the solution for this is to not create an index but to store co-occurrence frequencies at Analyzer level. I don't understand how this would address the "docFreq does not reflect deletions". You can use the shingles analyzer (under contrib/analyzer

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Thank you. I suppose the solution for this is to not create an index but to store co-occurrence frequencies at Analyzer level. Adrian. On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > > Be careful: docFreq does not take deletions into account. >

Re: number of hits of pages containing two terms

2009-03-16 Thread Michael McCandless
Adrian Dimulescu wrote: Hello, I need the number of pages that contain two terms. Only the number of hits, I don't care about retrieving the pages. Right now I am using the following code in order to get it: Term first, second; TermQuery q1 = new TermQuery(first); TermQuery q2 = new Te
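Counting the pages that contain both terms amounts to counting the document ids the two terms share. The following is a plain-Java sketch of that idea, independent of Lucene and not a continuation of the code quoted above; the class name `PairCount` and the sample ids are hypothetical, and each array is assumed to be sorted ascending with no duplicates.

```java
class PairCount {
    // Walks both sorted id lists in step and counts ids present in both.
    static int countBoth(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        int count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;            // a is behind, advance it
            } else if (a[i] > b[j]) {
                j++;            // b is behind, advance it
            } else {
                count++;        // same document id in both lists
                i++;
                j++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] first = {1, 4, 7, 9};   // toy ids for the first term
        int[] second = {4, 5, 9, 12}; // toy ids for the second term
        System.out.println(countBoth(first, second)); // prints 2
    }
}
```

No hit objects are built and no scores are computed, which matches the stated need for the number only.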