Re: Question about indexing (BrazilianAnalyzer)

Thomas Arni Wed, 04 Jun 2008 00:39:02 -0700

Hi,

First of all, please always make sure that you use
exactly the same Analyzer during indexing and searching.
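
To show why this matters, here is a minimal standalone sketch in plain Java, not Lucene code. The class SameAnalyzerDemo and its analyze() and matches() methods are hypothetical stand-ins for an Analyzer and a term lookup. A query term only hits if it goes through the same processing as the indexed text.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SameAnalyzerDemo {
    // Hypothetical stand-in for an Analyzer: lowercase, then split on non-letters.
    public static Set<String> analyze(String text) {
        Set<String> terms = new HashSet<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // A query term matches only if it has exactly the form stored in the index.
    public static boolean matches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        Set<String> indexed = analyze("Direito de Heranca");
        // Raw query term, not analyzed: differs in case from the indexed term.
        System.out.println(matches(indexed, "Heranca"));
        // Query term run through the same analysis as the indexed text.
        System.out.println(matches(indexed, analyze("Heranca").iterator().next()));
    }
}
```

The first lookup misses and the second one hits, only because the second query went through the same analysis as the indexed text.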


I am not familiar with the BrazilianAnalyzer, but I saw
in the source code that it does not use an ISOLatin1AccentFilter,
which replaces accented characters with their unaccented equivalents (ç -> c).

The problem is probably with these accents.
You can check this by adapting the tokenStream() method
of the BrazilianAnalyzer to include the ISOLatin1AccentFilter
in the filter chain.
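
To see what accent folding does to a term such as "herança", here is a small sketch that uses only the JDK (java.text.Normalizer). It is not the ISOLatin1AccentFilter implementation, and the class AccentFolding and its fold() method are hypothetical names chosen for illustration. It only shows the kind of mapping (ç -> c) such a filter applies to each token.

```java
import java.text.Normalizer;

public class AccentFolding {
    // Hypothetical helper: decompose accented characters (NFD form),
    // then drop the combining marks so only the base letters remain.
    public static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "heran\u00e7a" is "herança", written with an escape to avoid
        // source-encoding problems.
        System.out.println(fold("heran\u00e7a")); // prints "heranca"
    }
}
```

If the same folding is applied at indexing and at search time, a query for "herança" and one for "heranca" end up looking for the same term.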

Thomas



Vinicius Carvalho said the following on 03/06/08 20:51:

Hello there! I'm indexing documents using the BrazilianAnalyzer, and I've
noticed that many words are not being indexed. I store and index the entire
doc (I'm doing this in order to present fragments in the results; I don't
know if it's the best way, especially for large docs. Any ideas?). Using Luke
to check the index, I open the stored doc, and its contents contain 17
occurrences of the word "herança", for instance. But there's no term for
this word or its stemmed version, "heranc", so searching for this word would
not return this document.

I'm pretty sure I'm missing something in the indexing process:


        try {
            doc.add(new Field("contents", docText, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.YES));
            // gotta improve this later
            IndexWriter writer = new IndexWriter("/java/lucene/portal/cms",
                    new BrazilianAnalyzer());
            writer.addDocument(doc);
            writer.close();
        }


So why would these words (and others) not be indexed?

Regards


