RE: can't find common words -- using Lucene 3.4.0

Ilya Zavorin Wed, 28 Mar 2012 07:47:10 -0700

Steve,

I had to pull different pieces of the code below from different places in my 
system, but here's what I do:


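    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // Analyzer used at index time
    Analyzer anIndx = new StandardAnalyzer(Version.LUCENE_34);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, anIndx);
    // Overwrite an existing index or append to it, depending on the 'create' flag
    iwc.setOpenMode(create ? OpenMode.CREATE : OpenMode.APPEND);
    // fPath is the path of the index directory on disk
    Directory dir = FSDirectory.open(new File(fPath));
    IndexWriter writer = new IndexWriter(dir, iwc);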

Anything suspicious here?

Thanks


Ilya Zavorin


-----Original Message-----
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: Monday, March 26, 2012 1:48 PM
To: java-user@lucene.apache.org
Subject: RE: can't find common words -- using Lucene 3.4.0 

On 3/26/2012 at 12:21 PM, Ilya Zavorin wrote:
> I am not seeing anything suspicious. Here's what I see in the HEX:
>
> "n.e" from "pain.electricity": 6E-2E-0D-0A-0D-0A-65 (n-.-CR-LF-CR-LF-e)
> "e.H" from "sentence.He": 65-2E-0D-0A-48

I agree, standard DOS/Windows line endings.
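
For anyone who wants to repeat this check, here is a minimal sketch of how such a hex dump could be produced in plain Java. It assumes the text is already in a String and is encoded as UTF-8; the class name HexDump and the method toHex are hypothetical and do not come from the code discussed in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: prints the bytes of a string as dash-separated hex,
// which makes CR (0D) and LF (0A) line endings visible.
class HexDump {

    // Encode the text as UTF-8 (an assumption) and format each byte as two hex digits.
    static String toHex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            // Mask with 0xFF so that bytes above 0x7F are not sign-extended.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two fragments quoted above, with DOS/Windows line endings.
        System.out.println(toHex("n.\r\n\r\ne")); // prints 6E-2E-0D-0A-0D-0A-65
        System.out.println(toHex("e.\r\nH"));     // prints 65-2E-0D-0A-48
    }
}
```

A 0D-0A pair is one CR-LF line ending, so the first fragment has a blank line between the two words and the second has a single line break.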

> I am pretty sure I am using the std analyzer

Interesting.  I'm quite sure something else is going on besides 
StandardAnalyzer, since StandardAnalyzer (more specifically, StandardTokenizer) 
always breaks tokens on whitespace, and excludes punctuation at the end of 
tokens.  In case you're interested, the "standard" to which StandardTokenizer 
(v3.1 - v3.5) conforms is the Word Boundaries rules from Unicode 6.0.0 standard 
annex #29 aka UAX#29: 
<http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
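
To illustrate the general idea of breaking on whitespace and dropping trailing punctuation, here is a rough sketch that uses the JDK's own java.text.BreakIterator rather than Lucene. This is a stand-in, not StandardTokenizer: the JDK's word-boundary rules are similar in spirit to UAX#29 but are not guaranteed to match it, and the sketch does not lowercase or remove stop words as StandardAnalyzer does. The class name WordBoundarySketch and its methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: approximates word-boundary tokenization
// with the JDK's BreakIterator. It is NOT Lucene's StandardTokenizer.
class WordBoundarySketch {

    // Split the text at word boundaries and keep only segments that contain
    // at least one letter or digit, so punctuation and whitespace are dropped.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                out.add(segment);
            }
        }
        return out;
    }

    // True if any character in the segment is a letter or a digit.
    static boolean containsLetterOrDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetterOrDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same shape as the fragments in the hex dump: a period followed by CR-LF.
        System.out.println(words("pain.\r\n\r\nelectricity"));
        System.out.println(words("sentence.\r\nHe"));
    }
}
```

With a line break after the period, the two words come out as separate segments and the period is discarded, which is the behavior described above.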

Can you share the code where you construct your analyzer and IndexWriterConfig?

> Here's how I add a doc to the index (oc is String containing the whole 
> document):
>
> doc.add(new Field("contents", 
>               oc, 
>               Field.Store.YES,
>               Field.Index.ANALYZED, 
>               Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> Can this affect the indexing?

The way you add the Field looks fine.

Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


