Hello Otis,

Thank you for the hint. I have written a custom analyzer that uses a custom tokenizer similar to CharTokenizer. It treats square brackets as token characters, so that a word such as "[divinit]atis" is not split when it is added to the index, and then removes the brackets in the next() method. With plain SimpleAnalyzer the words were split at the brackets. It seems to work OK, but it still needs more testing.
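For anyone finding this thread later, the idea can be sketched roughly in plain Java. This is not the actual analyzer and it does not use the Lucene Tokenizer API: the class name BracketTokenizerSketch and its methods are made up for illustration, and the lowercasing is an assumption borrowed from what LowerCaseTokenizer does.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration only; hypothetical names, not the Lucene API. */
final class BracketTokenizerSketch {

    /** Letters always belong to a token; square brackets do too when joinBrackets is true. */
    static boolean isTokenChar(char c, boolean joinBrackets) {
        return Character.isLetter(c) || (joinBrackets && (c == '[' || c == ']'));
    }

    /**
     * Splits text into lowercase tokens. With joinBrackets = true a bracket
     * keeps the current token open but is never copied into it; with
     * joinBrackets = false a bracket ends the token, like any other non-letter.
     */
    static List<String> tokenize(String text, boolean joinBrackets) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, joinBrackets)) {
                if (c != '[' && c != ']') {
                    // Assumption: lowercase as LowerCaseTokenizer would.
                    current.append(Character.toLowerCase(c));
                }
            } else if (current.length() > 0) {
                // A non-token character closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Calling tokenize("[divinit]atis", true) gives the single token "divinitatis", while tokenize("[divinit]atis", false) gives "divinit" and "atis", which is the splitting seen with SimpleAnalyzer.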
Mile

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 31, 2006 7:36 PM
To: java-user@lucene.apache.org
Subject: Re: Removing brackets before indexing

Mile,

Any Analyzer that uses a Tokenizer that throws out non-letter characters will do. For example, take a look at SimpleAnalyzer. It uses LowerCaseTokenizer. If you read the javadoc for LowerCaseTokenizer, I think you will see it suits you.

Otis

----- Original Message ----
From: Mile Rosu <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, May 31, 2006 11:47:12 AM
Subject: Removing brackets before indexing

Hello!

I am currently trying to index Latin-language documents in which missing letters are added to words in square brackets, like this: "[divinit]atis". Could you please tell me what the best practice would be for removing the brackets before adding the text to the Lucene index (in the example, to store the word "divinitatis")?

Thank you a lot,
Mile Rosu

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]