RE: Lucene 2.9.0 / BooleanQuery problem

2009-10-29 Thread Uwe Schindler
The BooleanQuery does not work because the "sector" field is analyzed and you
are searching with a simple TermQuery, which is not analyzed. So "Computing"
is not lowercased and will not hit any terms (try Luke and look at the
terms you have indexed). A field like "sector" should be made
NOT_ANALYZED during indexing when it is intended to be hit with TermQueries
that are case-sensitive.
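
For illustration, a minimal sketch of the NOT_ANALYZED setup described above
(Lucene 2.9 API; the field value is just an example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Index side: keep the value as one untokenized, case-preserved term.
Document doc = new Document();
doc.add(new Field("sector", "Computing", Field.Store.YES,
        Field.Index.NOT_ANALYZED));

// Search side: a raw TermQuery now matches the raw indexed term exactly.
TermQuery q = new TermQuery(new Term("sector", "Computing"));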

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michel Nadeau [mailto:aka...@gmail.com]
> Sent: Thursday, October 29, 2009 5:08 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene 2.9.0 / BooleanQuery problem
> 
> OMG, it's SO OBVIOUS! For the normal search (sector:IT AND group:group)
> the problem was indeed that IT is "it", a stopword. Thanks, I was so not
> seeing it!
> 
> But what about the BooleanQuery? It should work fine too now...
> 
> //
> // Test BooleanQuery
> //
> BooleanQuery query2 = new BooleanQuery();
> query2.add(new TermQuery(new Term("sector", "Computing")), Occur.MUST);
> query2.add(new TermQuery(new Term("group", "group")), Occur.MUST);
> 
> // Search!
> System.out.println("BooleanQuery:\n");
> doStreamingSearch(searcher, query2);
> System.out.println("Done!");
> 
> Thanks,
> 
> - Mike
> aka...@gmail.com
> 
> 
> On Wed, Oct 28, 2009 at 11:57 PM, Jake Mannix 
> wrote:
> 
> > Hi Michel,
> >
> >  I don't have time to look in too much detail right now, but I'll bet ya $5
> > it's because your query is for "sector:IT" - 'IT' lowercases to 'it', which
> > is in the default stopword list, and if you're not careful about how you
> > query with this, you'll end up with TermQuery instances which hit nothing,
> > because this term may be stop-listed out of your index!
> >
> >  Can you run the test again with no stop words in your query, and see what
> > it gives?
> >
> >  -jake
> >
> > On Wed, Oct 28, 2009 at 7:12 PM, Michel Nadeau  wrote:
> >
> > > Hi !
> > >
> > > I spent all night trying to get a simple BooleanQuery working and I really
> > > can't figure out what my problem is. See this very simple program:
> > >
> > > public class test {
> > >
> > >     @SuppressWarnings("deprecation")
> > >     public static void main(String[] args) throws ParseException,
> > >             CorruptIndexException, LockObtainFailedException, IOException
> > >     {
> > >         // Open index directory
> > >         File idx = new File("d:/Java/index/");
> > >
> > >         // Create IndexWriter
> > >         IndexWriter writer = new IndexWriter(idx,
> > >                 new StandardAnalyzer(Version.LUCENE_CURRENT), true,
> > >                 IndexWriter.MaxFieldLength.LIMITED);
> > >
> > >         //
> > >         // Add some documents to the index
> > >         //
> > >
> > >         Document doc = new Document();
> > >         doc.add(new Field("firstname", "Mike", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("lastname", "Nadeau", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("phone", "111-222-", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("sector", "IT", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("group", "group", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("content", "blue this is some content",
> > >                 Field.Store.YES, Field.Index.ANALYZED));
> > >         writer.addDocument(doc);
> > >
> > >         doc = new Document();
> > >         doc.add(new Field("firstname", "Pascale", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("lastname", "Lavoie", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("phone", "333-222-", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("sector", "Accounting", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("group", "othergroup", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("content", "red this is some content",
> > >                 Field.Store.YES, Field.Index.ANALYZED));
> > >         writer.addDocument(doc);
> > >
> > >         doc = new Document();
> > >         doc.add(new Field("firstname", "Kaven", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("lastname", "Rouseau", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("phone", "222-333-", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("sector", "IT", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("group", "group", Field.Store.YES,
> > >                 Field.Index.ANALYZED));
> > >         doc.add(new Field("content", "red this is some content",
> > >                 Field.Store.YES, Field.Index.ANALYZED));
> > >         writer.addDocument(doc);
> > >
> > >         doc = new Document();

search problem

2009-10-29 Thread m.harig

hello all

 I have a doubt about search. I have a word in my index, welcomelucene (without
spaces); when I search for welcome lucene (with a space), I am not able to
get any hits. It should pick the document welcomelucene. Is there any way to
do it? I've used the wildcard option too, but no results. Please, can anyone
help me?
-- 
View this message in context: 
http://www.nabble.com/search-problem-tp26111084p26111084.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search problem

2009-10-29 Thread Erick Erickson
Why would you expect to get a hit on your document? There are
three distinct tokens here:
welcomelucene
welcome
lucene

Lucene searches for *matching* tokens, so searching
for the tokens 'welcome' and 'lucene' essentially asks
"are there two tokens in the document that exactly match
these?" and the answer is "no", so no hits (assuming
AND here, the OR argument is similar).
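
For illustration, a small sketch (assuming the Lucene 2.9 attribute API) that
prints the tokens StandardAnalyzer actually produces, which makes the
mismatch visible:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Print each token the analyzer emits for the given text.
public static void dumpTokens(String text) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
        System.out.println(term.term());
    }
}

// dumpTokens("welcomelucene")  prints: welcomelucene
// dumpTokens("welcome lucene") prints: welcome, then lucene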


As for the wildcard question, we'd need to see the code
to be able to help there.

You might go up on the Wiki and review tokenization to
understand this issue better.

Best
Erick



On Thu, Oct 29, 2009 at 7:12 AM, m.harig  wrote:

>
> hello all
>
> i've a doubt in search , i've a word in my index welcomelucene (without
> spaces) , when i search for welcome lucene(with a space) , am not able to
> get the hits. It should pick the document welcomelucene.. is there anyway
> to
> do it ? i've used wildcard option too. but no results , please anyone help
> me..
> --
> View this message in context:
> http://www.nabble.com/search-problem-tp26111084p26111084.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: IO exception during merge/optimize

2009-10-29 Thread Peter Keegan
A handful of the source documents did contain the U+ character. The
patch from LUCENE-2016 fixed the problem.
Thanks Mike!

Peter

On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hmm, only a few affected terms, and all this particular
> "literals:cfid196$" term, with optional suffixes.  Really strange.
>
> One thing that's odd is the exact term "literals:cfid196$" is printed
> twice, which should never happen (every unique term should be stored
> only once, in the terms dict).
>
> And, otherwise, CheckIndex got through the index just fine.
>
> Try searching a TermQuery with these affected terms and see if it
> succeeds?  If so, maybe try making an index with one or two of
> them, alone, and see if that index shows the problem?
>
> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> produce an enormous amount of output, but if you can excise the few
> lines around when that warning comes out & post back that'd be great.
>
> Mike
>
> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan 
> wrote:
> > Just to be safe, I ran with the official jar file from one of the mirrors
> > and reproduced the problem.
> > The debug session is not showing any characters = '\u' (checking this
> in
> > Tokenizer).
> > The output from the modified CheckIndex follows. There are only a few
> terms
> > with the inconsistency. They are all legitimate terms from the app's
> > context. With this info, I might be able to isolate the source documents.
> > What should I be looking for when they are indexed?
> >
> > CheckInput output:
> >
> > Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >
> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene
> > 2.9]
> >  1 of 3: name=_0 docCount=413585
> >compound=false
> >hasProx=true
> >numFiles=8
> >size (MB)=1,148.817
> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >docStoreOffset=0
> >docStoreSegment=_0
> >docStoreIsCompoundFile=false
> >no deletions
> >test: open reader.OK
> >test: fields..OK [33 fields]
> >test: field norms.OK [33 fields]
> >test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
> pairs;
> > 340244234 tokens]
> >test: stored fields...OK [1240755 total field count; avg 3 fields
> > per doc]
> >test: term vectorsOK [0 total vector count; avg 0 term/freq
> > vector fields per doc]
> >
> >  2 of 3: name=_1 docCount=359068
> >compound=false
> >hasProx=true
> >numFiles=8
> >size (MB)=1,125.161
> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >docStoreOffset=413585
> >docStoreSegment=_0
> >docStoreIsCompoundFile=false
> >no deletions
> >test: open reader.OK
> >test: fields..OK [33 fields]
> >test: field norms.OK [33 fields]
> >test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43
> !=
> > num docs seen 4 + num docs deleted 0
> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
> > deleted 0
> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
> > deleted 0
> > WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9 +
> > num docs deleted 0
> > WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num
> > docs deleted 0
> > OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
> >test: stored fields...OK [1077204 total field count; avg 3 fields
> > per doc]
> >test: term vectorsOK [0 total vector count; avg 0 term/freq
> > vector fields per doc]
> >
> >  3 of 3: name=_2 docCount=304849
> >compound=false
> >hasProx=true
> >numFiles=8
> >size (MB)=962.004
> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >docStoreOffset=772653
> >docStoreSegment=_0
> >docStoreIsCompoundFile=false
> >no deletions
> >test: open reader.OK
> >test: fields..OK [33 fields]
> >test: field norms.OK [33 fields]
> >test: terms, freq, prox...WARNING: term  contents:? docFreq=1 != num
> > docs seen 246 + num docs deleted 0
> > WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num docs
> > deleted 0
> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
> > deleted 0
> > WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37 +
> num
> > docs deleted 0
> > WAR

Re: IO exception during merge/optimize

2009-10-29 Thread Peter Keegan
Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
optimization in just under 30 min.
I used setRAMBufferSizeMB=1.9G
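
(For reference, a minimal sketch of that setting on a 2.9 IndexWriter; the
directory and analyzer here are just examples:)

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Version;

IndexWriter writer = new IndexWriter(new File("index"),
        new StandardAnalyzer(Version.LUCENE_CURRENT), true,
        IndexWriter.MaxFieldLength.LIMITED);
// Buffer roughly 1.9 GB of documents in RAM before flushing a segment.
writer.setRAMBufferSizeMB(1900.0);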

Peter

On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan wrote:

> A handful of the source documents did contain the U+ character. The
> patch from LUCENE-2016 fixed the problem.
> Thanks Mike!
>
> Peter
>
>
> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm, only a few affected terms, and all this particular
>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>
>> One thing that's odd is the exact term "literals:cfid196$" is printed
>> twice, which should never happen (every unique term should be stored
>> only once, in the terms dict).
>>
>> And, otherwise, CheckIndex got through the index just fine.
>>
>> Try searching a TermQuery with these affected terms and see if it
>> succeeds?  If so, maybe try making an index with one or two of
>> them, alone, and see if that index shows the problem?
>>
>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>> produce an enormous amount of output, but if you can excise the few
>> lines around when that warning comes out & post back that'd be great.
>>
>> Mike
>>
>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan 
>> wrote:
>> > Just to be safe, I ran with the official jar file from one of the
>> mirrors
>> > and reproduced the problem.
>> > The debug session is not showing any characters = '\u' (checking
>> this in
>> > Tokenizer).
>> > The output from the modified CheckIndex follows. There are only a few
>> terms
>> > with the inconsistency. They are all legitimate terms from the app's
>> > context. With this info, I might be able to isolate the source
>> documents.
>> > What should I be looking for when they are indexed?
>> >
>> > CheckInput output:
>> >
>> > Opening index @
>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>> >
>> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
>> [Lucene
>> > 2.9]
>> >  1 of 3: name=_0 docCount=413585
>> >compound=false
>> >hasProx=true
>> >numFiles=8
>> >size (MB)=1,148.817
>> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >docStoreOffset=0
>> >docStoreSegment=_0
>> >docStoreIsCompoundFile=false
>> >no deletions
>> >test: open reader.OK
>> >test: fields..OK [33 fields]
>> >test: field norms.OK [33 fields]
>> >test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
>> pairs;
>> > 340244234 tokens]
>> >test: stored fields...OK [1240755 total field count; avg 3 fields
>> > per doc]
>> >test: term vectorsOK [0 total vector count; avg 0 term/freq
>> > vector fields per doc]
>> >
>> >  2 of 3: name=_1 docCount=359068
>> >compound=false
>> >hasProx=true
>> >numFiles=8
>> >size (MB)=1,125.161
>> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >docStoreOffset=413585
>> >docStoreSegment=_0
>> >docStoreIsCompoundFile=false
>> >no deletions
>> >test: open reader.OK
>> >test: fields..OK [33 fields]
>> >test: field norms.OK [33 fields]
>> >test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43
>> !=
>> > num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
>> > deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
>> > deleted 0
>> > WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9
>> +
>> > num docs deleted 0
>> > WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num
>> > docs deleted 0
>> > OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>> >test: stored fields...OK [1077204 total field count; avg 3 fields
>> > per doc]
>> >test: term vectorsOK [0 total vector count; avg 0 term/freq
>> > vector fields per doc]
>> >
>> >  3 of 3: name=_2 docCount=304849
>> >compound=false
>> >hasProx=true
>> >numFiles=8
>> >size (MB)=962.004
>> >diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >docStoreOffset=772653
>> >docStoreSegment=_0
>> >docStoreIsCompoundFile=false
>> >no deletions
>> >test: open reader.OK
>> >test: fields..OK [33 fields]
>> >test: field norms.OK [33 fields]
>> >test: terms, freq, prox...WARNING: term  contents:? docFreq

Re: IO exception during merge/optimize

2009-10-29 Thread Michael McCandless
I'm glad we finally got to the bottom of this :)  This fix will be in 2.9.1.

This is a nice fast indexing result, too...

Mike

On Thu, Oct 29, 2009 at 3:55 PM, Peter Keegan  wrote:
> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> optimization in just under 30 min.
> I used setRAMBufferSizeMB=1.9G
>
> Peter
>
> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan wrote:
>
>> A handful of the source documents did contain the U+ character. The
>> patch from LUCENE-2016 fixed the problem.
>> Thanks Mike!
>>
>> Peter
>>
>>
>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> Hmm, only a few affected terms, and all this particular
>>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>>
>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>> twice, which should never happen (every unique term should be stored
>>> only once, in the terms dict).
>>>
>>> And, otherwise, CheckIndex got through the index just fine.
>>>
>>> Try searching a TermQuery with these affected terms and see if it
>>> succeeds?  If so, maybe try making an index with one or two of
>>> them, alone, and see if that index shows the problem?
>>>
>>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>>> produce an enormous amount of output, but if you can excise the few
>>> lines around when that warning comes out & post back that'd be great.
>>>
>>> Mike
>>>
>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan 
>>> wrote:
>>> > Just to be safe, I ran with the official jar file from one of the
>>> mirrors
>>> > and reproduced the problem.
>>> > The debug session is not showing any characters = '\u' (checking
>>> this in
>>> > Tokenizer).
>>> > The output from the modified CheckIndex follows. There are only a few
>>> terms
>>> > with the inconsistency. They are all legitimate terms from the app's
>>> > context. With this info, I might be able to isolate the source
>>> documents.
>>> > What should I be looking for when they are indexed?
>>> >
>>> > CheckInput output:
>>> >
>>> > Opening index @
>>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>> >
>>> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
>>> [Lucene
>>> > 2.9]
>>> >  1 of 3: name=_0 docCount=413585
>>> >    compound=false
>>> >    hasProx=true
>>> >    numFiles=8
>>> >    size (MB)=1,148.817
>>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>> >    docStoreOffset=0
>>> >    docStoreSegment=_0
>>> >    docStoreIsCompoundFile=false
>>> >    no deletions
>>> >    test: open reader.OK
>>> >    test: fields..OK [33 fields]
>>> >    test: field norms.OK [33 fields]
>>> >    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
>>> pairs;
>>> > 340244234 tokens]
>>> >    test: stored fields...OK [1240755 total field count; avg 3 fields
>>> > per doc]
>>> >    test: term vectorsOK [0 total vector count; avg 0 term/freq
>>> > vector fields per doc]
>>> >
>>> >  2 of 3: name=_1 docCount=359068
>>> >    compound=false
>>> >    hasProx=true
>>> >    numFiles=8
>>> >    size (MB)=1,125.161
>>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>> >    docStoreOffset=413585
>>> >    docStoreSegment=_0
>>> >    docStoreIsCompoundFile=false
>>> >    no deletions
>>> >    test: open reader.OK
>>> >    test: fields..OK [33 fields]
>>> >    test: field norms.OK [33 fields]
>>> >    test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43
>>> !=
>>> > num docs seen 4 + num docs deleted 0
>>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
>>> > deleted 0
>>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
>>> > deleted 0
>>> > WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9
>>> +
>>> > num docs deleted 0
>>> > WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num
>>> > docs deleted 0
>>> > OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>>> >    test: stored fields...OK [1077204 total field count; avg 3 fields
>>> > per doc]
>>> >    test: term vectorsOK [0 total vector count; avg 0 term/freq
>>> > vector fields per doc]
>>> >
>>> >  3 of 3: name=_2 docCount=304849
>>> >    compound=false
>>> >    hasProx=true
>>> >    numFiles=8
>>> >    size (MB)=962.004
>>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>> 

Re: IO exception during merge/optimize

2009-10-29 Thread Mark Miller
Any chance I could get you to try that again with a buffer of like 800MB
to a gig and do a comparison?

I've been investigating the returns you get with a larger buffer size.
The returns appear to diminish pretty quickly over 100MB or so - at higher
than that, I've gotten both slower speeds for some sizes and larger
gains for others, but only better by 5-10 docs a second up to a gig. But
I can't reliably test at over a gig - I have only 4 GB of RAM, and even
with that, at over a gig it starts to page and the performance gets hit.
I'd love to see what kind of benefit you see going from around a gig to
just under 2.

Peter Keegan wrote:
> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> optimization in just under 30 min.
> I used setRAMBufferSizeMB=1.9G
>
> Peter
>
> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan wrote:
>
>   
>> A handful of the source documents did contain the U+ character. The
>> patch from LUCENE-2016 fixed the problem.
>> Thanks Mike!
>>
>> Peter
>>
>>
>> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>> 
>>> Hmm, only a few affected terms, and all this particular
>>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>>
>>> One thing that's odd is the exact term "literals:cfid196$" is printed
>>> twice, which should never happen (every unique term should be stored
>>> only once, in the terms dict).
>>>
>>> And, otherwise, CheckIndex got through the index just fine.
>>>
>>> Try searching a TermQuery with these affected terms and see if it
>>> succeeds?  If so, maybe try making an index with one or two of
>>> them, alone, and see if that index shows the problem?
>>>
>>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>>> produce an enormous amount of output, but if you can excise the few
>>> lines around when that warning comes out & post back that'd be great.
>>>
>>> Mike
>>>
>>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan 
>>> wrote:
>>>   
 Just to be safe, I ran with the official jar file from one of the
 
>>> mirrors
>>>   
 and reproduced the problem.
 The debug session is not showing any characters = '\u' (checking
 
>>> this in
>>>   
 Tokenizer).
 The output from the modified CheckIndex follows. There are only a few
 
>>> terms
>>>   
 with the inconsistency. They are all legitimate terms from the app's
 context. With this info, I might be able to isolate the source
 
>>> documents.
>>>   
 What should I be looking for when they are indexed?

 CheckInput output:

 Opening index @
 
>>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>>>   
 Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
 
>>> [Lucene
>>>   
 2.9]
  1 of 3: name=_0 docCount=413585
compound=false
hasProx=true
numFiles=8
size (MB)=1,148.817
diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
 java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
docStoreOffset=0
docStoreSegment=_0
docStoreIsCompoundFile=false
no deletions
test: open reader.OK
test: fields..OK [33 fields]
test: field norms.OK [33 fields]
test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
 
>>> pairs;
>>>   
 340244234 tokens]
test: stored fields...OK [1240755 total field count; avg 3 fields
 per doc]
test: term vectorsOK [0 total vector count; avg 0 term/freq
 vector fields per doc]

  2 of 3: name=_1 docCount=359068
compound=false
hasProx=true
numFiles=8
size (MB)=1,125.161
diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
 java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
docStoreOffset=413585
docStoreSegment=_0
docStoreIsCompoundFile=false
no deletions
test: open reader.OK
test: fields..OK [33 fields]
test: field norms.OK [33 fields]
test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43
 
>>> !=
>>>   
 num docs seen 4 + num docs deleted 0
 WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
 deleted 0
 WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs
 deleted 0
 WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9
 
>>> +
>>>   
 num docs deleted 0
 WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num

Re: search problem

2009-10-29 Thread Karl Wettin


On 29 Oct 2009, at 12.12, m.harig wrote:


i've a doubt in search , i've a word in my index welcomelucene (without
spaces) , when i search for welcome lucene(with a space) , am not able to
get the hits. It should pick the document welcomelucene.. is there anyway to
do it ? i've used wildcard option too. but no results , please anyone help
me..


Using a bigram shingle filter with no spacer characters at query time
should match the document, but as Erick says, you might want to consider
(and tell us) why you want to do this.
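
For illustration, a hand-rolled query-time version of that bigram idea
(ShingleFilter automates this; the field name and the manual split below are
assumptions, not an exact recipe):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Match each user term as-is, and also each adjacent pair glued together,
// so a query for "welcome lucene" can hit the indexed term "welcomelucene".
public static BooleanQuery bigramQuery(String field, String userInput) {
    String[] words = userInput.toLowerCase().split("\\s+");
    BooleanQuery query = new BooleanQuery();
    for (String word : words) {
        query.add(new TermQuery(new Term(field, word)), Occur.SHOULD);
    }
    for (int i = 0; i + 1 < words.length; i++) {
        query.add(new TermQuery(new Term(field, words[i] + words[i + 1])),
                Occur.SHOULD);
    }
    return query;
}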



  karl

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IO exception during merge/optimize

2009-10-29 Thread Peter Keegan
Mark,

With 1.9G, I had to increase the JVM heap significantly (to 8G)  to avoid
paging and GC hits. Here is a table comparing indexing times, optimizing
times and peak memory usage as a function of the  RAMBufferSize. This was
run on a 64-bit server with 32GB RAM:

RamSize   Index(min)   Optimize(min)   Max VM
1.9G      24           5               5G
800M      24           5               4G

Not much difference. I'll make a couple more runs with lower values.
Btw, the indexing times are really about 5 min. shorter because of some
non-Lucene related delays after the last document.

Peter



On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller  wrote:

> Any chance I could get you to try that again with a buffer of like 800MB
> to a gig and do a comparison?
>
> I've been investigating the returns you get with a larger buffer size.
> It appears to be pretty diminishing returns over 100MB or so - at higher
> than that, I've gotten both slower speeds for some sizes, and larger
> gains for others. But only better by 5-10 docs a second up to a gig. But
> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> with that, at over a gig it starts to page and the performance gets hit.
> I'd love to see what kind of benefit you see going from around a gig to
> just under 2.
>
> Peter Keegan wrote:
> > Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> > optimization in just under 30 min.
> > I used setRAMBufferSizeMB=1.9G
> >
> > Peter
> >
> > On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan  >wrote:
> >
> >
> >> A handful of the source documents did contain the U+ character. The
> >> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
> >> fixed the problem.
> >> Thanks Mike!
> >>
> >> Peter
> >>
> >>
> >> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
> >> luc...@mikemccandless.com> wrote:
> >>
> >>
> >>> Hmm, only a few affected terms, and all this particular
> >>> "literals:cfid196$" term, with optional suffixes.  Really strange.
> >>>
> >>> One thing that's odd is the exact term "literals:cfid196$" is printed
> >>> twice, which should never happen (every unique term should be stored
> >>> only once, in the terms dict).
> >>>
> >>> And, otherwise, CheckIndex got through the index just fine.
> >>>
> >>> Try searching a TermQuery with these affected terms and see if it
> >>> succeeds?  If so, maybe try making an index with one or two of
> >>> them, alone, and see if that index shows the problem?
> >>>
> >>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> >>> produce an enormous amount of output, but if you can excise the few
> >>> lines around when that warning comes out & post back that'd be great.
> >>>
> >>> Mike
> >>>
> >>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan  >
> >>> wrote:
> >>>
>  Just to be safe, I ran with the official jar file from one of the
> 
> >>> mirrors
> >>>
>  and reproduced the problem.
>  The debug session is not showing any characters = '\u' (checking
> 
> >>> this in
> >>>
>  Tokenizer).
>  The output from the modified CheckIndex follows. There are only a few
> 
> >>> terms
> >>>
>  with the inconsistency. They are all legitimate terms from the app's
>  context. With this info, I might be able to isolate the source
> 
> >>> documents.
> >>>
>  What should I be looking for when they are indexed?
> 
>  CheckInput output:
> 
>  Opening index @
> 
> >>> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >>>
>  Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
> 
> >>> [Lucene
> >>>
>  2.9]
>   1 of 3: name=_0 docCount=413585
> compound=false
> hasProx=true
> numFiles=8
> size (MB)=1,148.817
> diagnostics = {os.version=5.2, os=Windows 2003,
> lucene.version=2.9.0
>  817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>  java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> docStoreOffset=0
> docStoreSegment=_0
> docStoreIsCompoundFile=false
> no deletions
> test: open reader.OK
> test: fields..OK [33 fields]
> test: field norms.OK [33 fields]
> test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
> 
> >>> pairs;
> >>>
>  340244234 tokens]
> test: stored fields...OK [1240755 total field count; avg 3
> fields
>  per doc]
> test: term vectorsOK [0 total vector count; avg 0 term/freq
>  vector fields per doc]
> 
>   2 of 3: name=_1 docCount=359068
> compound=false
> hasProx=true
> numFiles=8
> size (MB)=1,125.161
> diagnostics = {os.version=5.2, os=Windows 2003,
> lucene.version=2.9.0
>  817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>  java.version=1.6.0_16, java.vendor=Sun Microsystems In

Re: IO exception during merge/optimize

2009-10-29 Thread Mark Miller
Thanks a lot Peter! Really appreciate it.

Peter Keegan wrote:
> Mark,
>
> With 1.9G, I had to increase the JVM heap significantly (to 8G)  to avoid
> paging and GC hits. Here is a table comparing indexing times, optimizing
> times and peak memory usage as a function of the  RAMBufferSize. This was
> run on a 64-bit server with 32GB RAM:
>
> RamSize   Index(min)   Optimize(min)   Max VM
> 1.9G      24           5               5G
> 800M      24           5               4G
>
> Not much difference. I'll make a couple more runs with lower values.
> Btw, the indexing times are really about 5 min. shorter because of some
> non-Lucene related delays after the last document.
>
> Peter
>
>
>
> On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller  wrote:
>
>   
>> Any chance I could get you to try that again with a buffer of like 800MB
>> to a gig and do a comparison?
>>
>> I've been investigating the returns you get with a larger buffer size.
>> It appears to be pretty diminishing returns over 100MB or so - at higher
>> than that, I've gotten both slower speeds for some sizes, and larger
>> gains for others. But only better by 5-10 docs a second up to a gig. But
>> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
>> with that, at over a gig it starts to page and the performance gets hit.
>> I'd love to see what kind of benefit you see going from around a gig to
>> just under 2.
>>
>> Peter Keegan wrote:
>> 
>>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
>>> optimization in just under 30 min.
>>> I used setRAMBufferSizeMB=1.9G
>>>
>>> Peter
>>>
>>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan >> wrote:
>>>
>>>
>>>   
 A handful of the source documents did contain the U+ character. The
 patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
 fixed the problem.
 Thanks Mike!

 Peter


 On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
 luc...@mikemccandless.com> wrote:


 
> Hmm, only a few affected terms, and all this particular
> "literals:cfid196$" term, with optional suffixes.  Really strange.
>
> One thing that's odd is the exact term "literals:cfid196$" is printed
> twice, which should never happen (every unique term should be stored
> only once, in the terms dict).
>
> And, otherwise, CheckIndex got through the index just fine.
>
> Try searching a TermQuery with these affected terms and see if it
> succeeds?  If so, maybe try making an index with one or two of
> them, alone, and see if that index shows the problem?
>
> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> produce an enormous amount of output, but if you can excise the few
> lines around when that warning comes out & post back that'd be great.
>
> Mike
>
> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan    
> wrote:
>
>   
>> Just to be safe, I ran with the official jar file from one of the
>>
>> 
> mirrors
>
>   
>> and reproduced the problem.
>> The debug session is not showing any characters = '\u' (checking
>>
>> 
> this in
>
>   
>> Tokenizer).
>> The output from the modified CheckIndex follows. There are only a few
>>
>> 
> terms
>
>   
>> with the inconsistency. They are all legitimate terms from the app's
>> context. With this info, I might be able to isolate the source
>>
>> 
> documents.
>
>   
>> What should I be looking for when they are indexed?
>>
>> CheckInput output:
>>
>> Opening index @
>>
>> 
> D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>
>   
>> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
>>
>> 
> [Lucene
>
>   
>> 2.9]
>>  1 of 3: name=_0 docCount=413585
>>compound=false
>>hasProx=true
>>numFiles=8
>>size (MB)=1,148.817
>>diagnostics = {os.version=5.2, os=Windows 2003,
>> 
>> lucene.version=2.9.0
>> 
>> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>>docStoreOffset=0
>>docStoreSegment=_0
>>docStoreIsCompoundFile=false
>>no deletions
>>test: open reader.OK
>>test: fields..OK [33 fields]
>>test: field norms.OK [33 fields]
>>test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs
>>
>> 
> pairs;
>
>   
>> 340244234 tokens]
>>test: stored fields...OK [1240755 total field count; avg 

Re: IO exception during merge/optimize

2009-10-29 Thread Peter Keegan
A couple more data points:

RamSize   Index(min)   Optimize(min)   Peak mem
1.9G      24           5               5G
800M      24           5               4G
400M      25           5               3.5G
100M      25           5               3G
50M       26           4               3G

Peter


On Thu, Oct 29, 2009 at 8:49 PM, Mark Miller  wrote:

> Thanks a lot Peter! Really appreciate it.
>
> Peter Keegan wrote:
> > Mark,
> >
> > With 1.9G, I had to increase the JVM heap significantly (to 8G)  to avoid
> > paging and GC hits. Here is a table comparing indexing times, optimizing
> > times and peak memory usage as a function of the  RAMBufferSize. This was
> > run on a 64-bit server with 32GB RAM:
> >
> > RamSize   Index(min)   Optimize(min)   Max VM
> > 1.9G      24           5               5G
> > 800M      24           5               4G
> >
> > Not much difference. I'll make a couple more runs with lower values.
> > Btw, the indexing times are really about 5 min. shorter because of some
> > non-Lucene related delays after the last document.
> >
> > Peter
> >
> >
> >
> > On Thu, Oct 29, 2009 at 4:30 PM, Mark Miller 
> wrote:
> >
> >
> >> Any chance I could get you to try that again with a buffer of like 800MB
> >> to a gig and do a comparison?
> >>
> >> I've been investigating the returns you get with a larger buffer size.
> >> It appears to be pretty diminishing returns over 100MB or so - at higher
> >> than that, I've gotten both slower speeds for some sizes, and larger
> >> gains for others. But only better by 5-10 docs a second up to a gig. But
> >> I can't reliably test at over a gig - I have only 4 GB of RAM, and even
> >> with that, at over a gig it starts to page and the performance gets hit.
> >> I'd love to see what kind of benefit you see going from around a gig to
> >> just under 2.
> >>
> >> Peter Keegan wrote:
> >>
> >>> Btw, this 2.9 indexer is fast! I indexed 4Gb (1.07 million docs) with
> >>> optimization in just under 30 min.
> >>> I used setRAMBufferSizeMB=1.9G
> >>>
> >>> Peter
> >>>
> >>> On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan  >>> wrote:
> >>>
> >>>
> >>>
>  A handful of the source documents did contain the U+ character. The
>  patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
>  fixed the problem.
>  Thanks Mike!
> 
>  Peter
> 
> 
>  On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
>  luc...@mikemccandless.com> wrote:
> 
> 
> 
> > Hmm, only a few affected terms, and all this particular
> > "literals:cfid196$" term, with optional suffixes.  Really strange.
> >
> > One thing that's odd is the exact term "literals:cfid196$" is
> printed
> > twice, which should never happen (every unique term should be stored
> > only once, in the terms dict).
> >
> > And, otherwise, CheckIndex got through the index just fine.
> >
> > Try searching a TermQuery with these affected terms and see if it
> > succeeds?  If so, maybe try making an index with one or two of
> > them, alone, and see if that index shows the problem?
> >
> > OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
> > produce an enormous amount of output, but if you can excise the few
> > lines around when that warning comes out & post back that'd be great.
> >
> > Mike
> >
> > On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <
> peterlkee...@gmail.com
> >
> > wrote:
> >
> >
> >> Just to be safe, I ran with the official jar file from one of the
> >>
> >>
> > mirrors
> >
> >
> >> and reproduced the problem.
> >> The debug session is not showing any characters = '\u' (checking
> >>
> >>
> > this in
> >
> >
> >> Tokenizer).
> >> The output from the modified CheckIndex follows. There are only a
> few
> >>
> >>
> > terms
> >
> >
> >> with the inconsistency. They are all legitimate terms from the app's
> >> context. With this info, I might be able to isolate the source
> >>
> >>
> > documents.
> >
> >
> >> What should I be looking for when they are indexed?
> >>
> >> CheckInput output:
> >>
> >> Opening index @
> >>
> >>
> > D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
> >
> >
> >> Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS
> >>
> >>
> > [Lucene
> >
> >
> >> 2.9]
> >>  1 of 3: name=_0 docCount=413585
> >>compound=false
> >>hasProx=true
> >>numFiles=8
> >>size (MB)=1,148.817
> >>diagnostics = {os.version=5.2, os=Windows 2003,
> >>
> >> lucene.version=2.9.0
> >>
> >> 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
> >> java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
> >>docStoreOffset=0
> >>docStoreSegment=_0
> >>>

Re: search problem

2009-10-29 Thread m.harig

Thanks Erick,

   I understand the issue, but my doubt is about a keyword that is originally a
single word. For example, metacity is really a single keyword; when I search
for meta city I am not able to get the results. That is my doubt.

 If you go to Google and search for meta city, it will give you the results
for metacity. How do I solve this issue? Is there some indexing concept behind
this? Please, anyone, let me know.
-- 
View this message in context: 
http://old.nabble.com/search-problem-tp26111084p26124737.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is multiple indexing and how does it work in Lucene [Java]

2009-10-29 Thread DHIVYA M
The question was indeed wrong. Sorry for the inconvenience. Actually, I should
have asked this way!

I am trying out the demo of Lucene 1.4.3.
When I run a file for the first time, the index is properly created.
When I run the indexing a second time with a different file, the file
"_1", a CFS file in the index folder, gets overwritten, i.e. I cannot
find the index created for the previous file I used.

So kindly let me know the cause of this problem, and a solution too.

I would also be happy if anyone could let me know from which version the index
will be appended.
 
Thanks in advance.
M.Dhivya



clucene user

2009-10-29 Thread Vithya Arumugasami
Hi,

I am working with CLucene. Kindly tell me which forum can help me find a solution.

Thanks
Vithya


Re: What is multiple indexing and how does it work in Lucene [Java]

2009-10-29 Thread Anshum
In case you are trying to say that the previous state of the index goes away on
subsequent runs, it's because the IndexWriter gets opened with the 'create'
flag set to true. In other words, the index is newly created, overwriting any
existing index at the directory location.
The solution is to open the IndexWriter with the create flag set to false:

public IndexWriter(String path,
                   Analyzer a,
                   boolean create)
            throws IOException

This should solve your problem.
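
For illustration, a minimal sketch of that constructor in append mode
(Lucene 1.4.3 API; the path is just an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

// create == false: open the existing index and append to it, instead of
// overwriting it.
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), false);
// ... writer.addDocument(doc) calls here ...
writer.close();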

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Fri, Oct 30, 2009 at 11:43 AM, DHIVYA M wrote:

> The question is indeed wrong. Sry for the inconvenience. Actually i should
> have asked this way!
>
> Am trying out executing the demo of lucene 1.4.3.
> When i run a file for the first time, the index is properly getting
> created.
> When i run the indexing for the second time with a different file, the file
> "_1" , a CFS file in the index folder is getting overwritten, i.e. i couldnt
> find the index created for the previous file i used.
>
> So kindly let me know about the cause of this problem and a solution too.
>
> I would also be happy if anyone let me know from which version the index
> will be appended.
>
> Thanks in advance.
> M.Dhivya
>
>


soln found for overwritten problem

2009-10-29 Thread DHIVYA M
Let me try it out and get back to you, sir.
 
Thanks
M.Dhivya

--- On Fri, 30/10/09, Anshum  wrote:


From: Anshum 
Subject: Re: What is multiple indexing and how does it work in Lucene [Java]
To: java-user@lucene.apache.org
Date: Friday, 30 October, 2009, 6:23 AM


In case you are trying to say that in subsequent runs, the previous state of
the index just goes off, its because the indexwriter gets opened with
'create new' flag as true. In other words, the index would be newly created
overwriting any existing index at the directory location.
The solution to this would be to open the indexwriter with the createnew
flag as false.
public IndexWriter(String path,
                   Analyzer a,
                   boolean create)
            throws IOException
This should solve your problem.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Fri, Oct 30, 2009 at 11:43 AM, DHIVYA M wrote:

> The question is indeed wrong. Sry for the inconvenience. Actually i should
> have asked this way!
>
> Am trying out executing the demo of lucene 1.4.3.
> When i run a file for the first time, the index is properly getting
> created.
> When i run the indexing for the second time with a different file, the file
> "_1" , a CFS file in the index folder is getting overwritten, i.e. i couldnt
> find the index created for the previous file i used.
>
> So kindly let me know about the cause of this problem and a solution too.
>
> I would also be happy if anyone let me know from which version the index
> will be appended.
>
> Thanks in advance.
> M.Dhivya
>
>




Re: soln found for index overwritting problem

2009-10-29 Thread DHIVYA M
Thanks a lot, sir. It's working out well.

But I have one more doubt.
Is it possible to check whether the same documents are indexed again and again?
Because, due to the appending of indexes, when I search a query, the result is
displayed as many times as an index entry was created for that document.

How do I solve this, sir?

Is it possible to remove duplicates with a flag in IndexWriter?

Kindly let me know about this.
 
Thanks in advance
M.Dhivya
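
There is no duplicate-removing flag on IndexWriter in 1.4.3. One common
approach, sketched here under the assumption that each document carries a
unique key field (the "path" field and value below are hypothetical): delete
the old copy before re-adding the document.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

// Delete every document whose "path" field matches the key, then re-add
// the fresh version with an IndexWriter opened with create == false.
IndexReader reader = IndexReader.open("index");
reader.delete(new Term("path", "docs/report.txt")); // hypothetical key
reader.close();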

