Do deleted documents affect scores?

2010-02-10 Thread Yuval Feinstein
I want to focus my previous question. Say we have two Lucene indexes, A and B. Index A contains documents a and b. Index B used to contain documents a, b, and c, but c was deleted. All documents share some vocabulary. If we search using terms common to documents b and c, can we get a different score

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Marvin Humphrey
On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote: > In Lucene, skipping is done through the aggregator. I had a look at MultiDocsEnum in the flex branch. It doesn't know when a sub-enum is reading skip data. > > I suppose another possibility would have been to have the aggregato

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Michael McCandless
On Wed, Feb 10, 2010 at 8:27 AM, Marvin Humphrey wrote: >> But why didn't you have the Multi*Enums layer add the offset (so >> that the codec need not know who's consuming it)? Performance? > > That would have involved something like this within the aggregator: > >posting.setDocID(posting.ge

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Michael McCandless
On Wed, Feb 10, 2010 at 9:47 AM, Renaud Delbru wrote: > On 10/02/10 13:15, Uwe Schindler wrote: >>> >>> Could you provide pointers to search code that uses the segment-level >>> enum ? >>> As I explained in my last answer to Michael, the TermScorer is using >>> the >>> DocsEnum interface, and ther

Re: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread Michael McCandless
OK I opened this issue: https://issues.apache.org/jira/browse/LUCENE-2259 and put a patch up. If you can try the patch, that'd be great :) You should be able to apply the patch, build a new jar, then run your test again unmodified, and 0.cfs and 1.cfs should then be removed. Mike 2010/2/10 Mic

Re: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread Michael McCandless
From this test, I would expect all 3 files to be left, because IndexWriter never gets another chance to remove the files. IndexWriter only attempts to remove unreferenced files in roughly 3 places: * On open * On flushing a new segment * On finishing a merge So, the moment your optimize f
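The three cleanup points Mike lists can be sketched in plain Java. This is an illustrative mock, not Lucene's actual IndexFileDeleter: the point is only that unreferenced files linger on disk until one of the writer's cleanup events (open, flush, merge finish) runs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Sketch of deferred file deletion: files orphaned after the last cleanup
// point stay on disk until the next one (e.g. reopening the writer).
public class DeferredDeleter {
    final TreeSet<String> referenced = new TreeSet<>();
    final TreeSet<String> onDisk = new TreeSet<>();

    void writeFile(String name)   { onDisk.add(name); referenced.add(name); }
    void dereference(String name) { referenced.remove(name); }

    // Runs only at a cleanup point; returns what was actually deleted.
    List<String> cleanupPoint() {
        List<String> deleted = new ArrayList<>();
        for (Iterator<String> it = onDisk.iterator(); it.hasNext(); ) {
            String f = it.next();
            if (!referenced.contains(f)) { it.remove(); deleted.add(f); }
        }
        return deleted;
    }

    public static void main(String[] args) {
        DeferredDeleter d = new DeferredDeleter();
        d.writeFile("_0.cfs");
        d.writeFile("_1.cfs");
        d.writeFile("_2.cfs");                // result of optimize()
        d.dereference("_0.cfs");              // old segments no longer referenced
        d.dereference("_1.cfs");
        System.out.println(d.onDisk);         // all three files still on disk
        System.out.println(d.cleanupPoint()); // next cleanup event deletes the old two
        System.out.println(d.onDisk);         // only _2.cfs remains
    }
}
```

This mirrors the thread's symptom: if the program exits right after optimize(), no further cleanup point ever runs, so the old .cfs files survive until the next IndexWriter open.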

RE: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread luocanrao
Here is a small test case. I watched there are three compound files: 0.cfs 6786 KB, 1.cfs 2044 KB, 2.cfs 8790 KB (the optimized file). I think in this test case only 2.cfs is left (the optimized file). Is that right? import java.io.File; import java.io.IOException; import org.apache.lucene.analysis.

RE: here a small test case problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread luocan19826164
I watched there are three compound files: 0.cfs 6786 KB, 1.cfs 2044 KB, 2.cfs 8790 KB (the optimized file). I think in this test case only 2.cfs is left (the optimized file). Is that right? import java.io.File; import java.io.IOException; import org.apache.lucene.analysis.standard.StandardAnalyzer;

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Renaud Delbru
On 10/02/10 13:15, Uwe Schindler wrote: Could you provide pointers to search code that uses the segment-level enum? As I explained in my last answer to Michael, the TermScorer is using the DocsEnum interface, and therefore does not know if it manipulates a segment-level enum or a Multi*Enums. What s

Re: read more tokens during analysis

2010-02-10 Thread Grant Ingersoll
On Feb 10, 2010, at 8:33 AM, Rohit Banga wrote: > basically i want to use my own filter wrapping around a standard analyzer. > > the kind explained on page 166 of Lucene in Action, uses input.next() which > is perhaps not available in lucene 3.0 > > what is the substitute method. captureState(
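The substitute Grant points at is the attribute-based API: in Lucene 3.0 the old TokenStream.next() is gone, and a filter instead overrides incrementToken(), pulling tokens from its wrapped input and reading or writing attributes. A minimal plain-Java mock of that pull pattern (class and method names are borrowed from Lucene for illustration; this is not the real API):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

// Stands in for Lucene 3.0's TokenStream: advance with incrementToken(),
// read the current token from an attribute-like field.
abstract class MockTokenStream {
    String term;                       // stands in for TermAttribute
    abstract boolean incrementToken(); // returns false at end of stream
}

// A trivial tokenizer over a fixed token list.
class ListTokenizer extends MockTokenStream {
    private final Iterator<String> it;
    ListTokenizer(String... tokens) { it = Arrays.asList(tokens).iterator(); }
    boolean incrementToken() {
        if (!it.hasNext()) return false;
        term = it.next();
        return true;
    }
}

// A filter wraps another stream: pull from input, then modify the term.
class MockLowerCaseFilter extends MockTokenStream {
    private final MockTokenStream input;
    MockLowerCaseFilter(MockTokenStream input) { this.input = input; }
    boolean incrementToken() {
        if (!input.incrementToken()) return false; // advance wrapped stream
        term = input.term.toLowerCase(Locale.ROOT);
        return true;
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        MockTokenStream s = new MockLowerCaseFilter(new ListTokenizer("Arun", "Kumar"));
        while (s.incrementToken()) System.out.println(s.term);
        // arun
        // kumar
    }
}
```

In real Lucene 3.0 the filter and its input share attribute instances rather than copying a field, and captureState()/restoreState() save and restore a token's full attribute state when a filter needs to buffer or re-emit tokens.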

Re: TREC Data and Topic-Specific Index

2010-02-10 Thread Robert Muir
Hi, so you mean around 15% and 24% respectively? I think you could fairly say either of these is an improvement over your baseline of 0.141. What I mean by large difference is: while I think it's safe to say that using either of these methods improves over your baseline, I am not sure you can conclu
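The 15% and 24% figures follow from the absolute MAP scores reported in the thread (0.163 for Sweet Spot Similarity and 0.175 for LnbLtcSimilarity, against the 0.141 baseline), taken as relative gains. A quick check:

```java
public class MapGain {
    // Relative improvement of a MAP score over a baseline, in percent.
    static double relativeGain(double baseline, double score) {
        return (score - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double baseline = 0.141;
        System.out.printf("Sweet Spot Similarity: %.1f%%%n", relativeGain(baseline, 0.163)); // ~15.6%
        System.out.printf("LnbLtcSimilarity:      %.1f%%%n", relativeGain(baseline, 0.175)); // ~24.1%
    }
}
```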

Re: TREC Data and Topic-Specific Index

2010-02-10 Thread Ivan Provalov
Robert, Thank you for your reply. What would be considered a large difference? We started applying the Sweet Spot Similarity. It gives us an improvement of 0.163-0.141=0.022 MAP so far. LnbLtcSimilarity gets us more improvement: 0.175-0.141=0.034. Thanks, Ivan --- On Sun, 2/7/10, Robert

Re: read more tokens during analysis

2010-02-10 Thread Rohit Banga
Basically I want to use my own filter wrapping around a standard analyzer. The kind explained on page 166 of Lucene in Action uses input.next(), which is perhaps not available in Lucene 3.0. What is the substitute method? Rohit Banga On Wed, Feb 10, 2010 at 6:46 PM, Rohit Banga wrote: > i want

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Marvin Humphrey
On Wed, Feb 10, 2010 at 06:58:01AM -0500, Michael McCandless wrote: > But why didn't you have the Multi*Enums layer add the offset (so that > the codec need not know who's consuming it)? Performance? That would have involved something like this within the aggregator: posting.setDocID(pos

read more tokens during analysis

2010-02-10 Thread Rohit Banga
I want to consider the current word & the next as a single term. When analyzing "Arun Kumar" I want my analyzer to consider "Arun" and "Arun Kumar" as synonyms. In the tokenStream method, how do we read the next token, "Kumar"? I am going through the setPositionIncrement method for considering them
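The look-ahead logic being asked about can be sketched independently of Lucene: emit each word, then emit the word-pair at the same position by giving it a position increment of 0, which is how Lucene marks a token as a synonym of the previous one. This is plain Java illustrating the idea, not Lucene's TokenFilter API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: expand a token list into (token, positionIncrement) pairs, with
// each two-word shingle emitted at increment 0 (same position = synonym).
public class PairSynonyms {
    static List<Map.Entry<String, Integer>> expand(List<String> tokens) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(Map.entry(tokens.get(i), 1));          // the word itself advances a position
            if (i + 1 < tokens.size()) {                   // look ahead one token
                out.add(Map.entry(tokens.get(i) + " " + tokens.get(i + 1), 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("Arun", "Kumar")));
        // [Arun=1, Arun Kumar=0, Kumar=1]
    }
}
```

Inside a real Lucene 3.0 TokenFilter, the equivalent buffering of the next token is done with incrementToken() plus captureState()/restoreState(), and the increment is set via PositionIncrementAttribute.setPositionIncrement(0).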

RE: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Uwe Schindler
> Could you provide pointers to search code that uses the segment-level > enum ? > As I explained in my last answer to Michael, the TermScorer is using > the > DocsEnum interface, and therefore do not know if it manipulates > segment-level enum or a Multi*Enums. What search (or query operators) > i

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Renaud Delbru
On 10/02/10 09:47, Uwe Schindler wrote: Positions as attributes would be good. For positions we need a new Attribute (not PositionIncrement), but e.g. for offsets and payloads we can use the standard attributes from the analysis, which is really cool. This would also make it possible to add al

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Renaud Delbru
Hi Michael, On 09/02/10 20:47, Michael McCandless wrote: But, then, it's very convenient when you need it and don't care about performance. EG in Renaud's usage, a test case that is trying to assert that all indexed docs look right, why should you be forced to operate per segment? He shouldn't

Re: Contrib Lucene Analyzers & Stemming

2010-02-10 Thread Robert Muir
hi, what does your test code look like? The Russian stemmer still stems as of 3.0: assertAnalyzesToReuse(a, "Но знание это хранилось в тайне", new String[] { "знан", "хран", "тайн" }); On Wed, Feb 10, 2010 at 4:16 AM, Jamie wrote: > Hi There > > We are having problems with some of the

Re: Problems with IndexWriter#commit() on Linux

2010-02-10 Thread Michael McCandless
Yes. Mike On Wed, Feb 10, 2010 at 6:36 AM, Naama Kraus wrote: > Do you mean by calling > > IndexWriter#*setInfoStream*(PrintStream > > infoStream) > > ? > > Naama > > > On Mon, Feb 8, 2010 at 3:22 PM, Michael McCandless < > luc...@

Re: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Michael McCandless
On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey wrote: >> Interesting... and segment merging just does its own private >> concatenation/mapping-around-deletes of the doc/positions? > > I think the answer is yes, but I'm not sure I understand the > question completely since I'm not sure why you'd

Re: Problems with IndexWriter#commit() on Linux

2010-02-10 Thread Naama Kraus
Do you mean by calling IndexWriter#*setInfoStream*(PrintStream infoStream) ? Naama On Mon, Feb 8, 2010 at 3:22 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > Hmmm... I think that means you're using the default data

Re: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread Michael McCandless
My guess is there is accidentally still a reader open, at the time that IW tries to delete these unreferenced files. Eg if you close & reopen your reader, always, then there is always a reader open on the index. Try closing all readers, then close IW, then open & close a new IW, and see if the fi

Re: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread luocan19826164
Thanks for your reply! But I don't think there is an IndexReader still reading those files, because I close and reopen the IndexReader every 1 minute. IW also deletes unreferenced files, but why does it delete the optimized file and not the old index files? The merged file is what I wanted. ((aft

Re: problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread Michael McCandless
This happens, on Windows, when there is an IndexReader still reading those files. IndexWriter will periodically (after a merge completes or a new segment is flushed) retry deleting those files, but it won't succeed until no reader has a given file open anymore. IW also deletes unreferenced files

RE: Flex & Docs/AndPositionsEnum

2010-02-10 Thread Uwe Schindler
> > And we don't return "objects or aggregates" with Multi*Enum now... > > Yeah, this is different. In KS right now, we use a generic > PostingList, which > conveys different information depending on what class of Posting it > contains. > > > In flex right now the codec is unaware that it's being

problem:lucene did not delete old index file after optimize method called

2010-02-10 Thread luocan19826164
Lucene did not delete old index files after the optimize method was called. PS: I call IndexWriter.getReader() and then call the old IndexReader.close() every 1 minute. A long time passed, and I watched: the old index files did not disappear. After I restart my program, the optimized index file disappears, but the old index file

Contrib Lucene Analyzers & Stemming

2010-02-10 Thread Jamie
Hi There We are having problems with some of the Lucene analyzers in the contributions package. For instance, it appears that the Russian analyzer supports stemming, although when we test it, it does not. Is there a specific switch that we must set to enable the stemming of words? When we