Hi,

First of all, please always make sure that you use exactly the same Analyzer during indexing and searching.
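To illustrate why the analyzer has to match on both sides, here is a minimal sketch in plain Java. It does not use Lucene at all: the class `AnalyzerMismatchDemo` and its methods are hypothetical stand-ins for two different analysis chains, one that strips accents and one that does not. The toy "index" is just a set of terms.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class (an assumption for illustration, not Lucene code).
class AnalyzerMismatchDemo {

    // Stand-in for an analysis chain that lowercases and strips accents
    // (for example c-cedilla becomes c).
    static String foldingAnalyze(String token) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // Drop the combining marks, then lowercase.
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Stand-in for an analysis chain that only lowercases and keeps accents.
    static String plainAnalyze(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Builds a toy "index": the set of terms produced by the folding chain.
    static Set<String> buildIndex(String text) {
        Set<String> terms = new HashSet<String>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(foldingAnalyze(token));
            }
        }
        return terms;
    }
}
```

If the query term goes through `foldingAnalyze`, it is found in the set built by `buildIndex`; if it goes through `plainAnalyze`, the accented form is looked up and nothing matches, even though the word is in the original text. The same kind of mismatch can happen with real analyzers.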
I am not familiar with the BrazilianAnalyzer, but I saw in the source code that it does not use an ISOLatin1AccentFilter, which replaces accented characters with their unaccented equivalents (ç -> c). The problem is probably with these accents. You can check this by adapting the tokenStream() method of the BrazilianAnalyzer to include the ISOLatin1AccentFilter in the filter chain.

Thomas

Vinicius Carvalho said the following on 03/06/08 20:51:
Hello there!

I'm indexing documents using the BrazilianAnalyzer, and I've noticed that many words are not being indexed. I store and index the entire doc (I'm doing this in order to present the fragments in the results; I don't know if it's the best way, especially for large docs. Any ideas?).

Using Luke to check the index, I opened the stored doc, and its contents contain 17 occurrences of the word "herança", for instance. But there's no term for this word or its stemmed version, "heranc", so searching for this word would not return this document.

I'm pretty sure I'm missing something in the indexing process:

    try {
        doc.add(new Field("contents", docText, Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
        IndexWriter writer = new IndexWriter("/java/lucene/portal/cms", new BrazilianAnalyzer()); // gotta improve this later
        writer.addDocument(doc);
        writer.close();
    }

So, why would these words (and others) not be indexed?

Regards