> Can someone please point me in the right direction. > > We are creating an application that needs to beable to > search on C++ and get > back doc's that have C++ in it. The StandardAnalyzer > does not seem to index > the "+", so a search for "C++" will bring back docs that > contain, C++, C, > C#, etc..... The WhiteSpaceAnalyzer will index the > "+", but if we have the > term "C++." that is, if C++ is at the end of a sentence, it > will index > "C++." so a search for "C++" will not return the doc. > I have heard of maybe > a CustomAnalyzer; however, it seems like there would > actually need to be a > CustomFilter/CustomTokenizer, I looked at: > - StandardAnalyzer.java > - StandardFilter.java > - StandardTokenizer.java > - StandardTokenizerImpl.java > - StandardTokenizerImpl.jflex > > I would guess that the StandardTokenizer is where the > changes would need to > be made to allow the "+" character, but I am unclear as to > how. > > Any and all help is greatly appreciated.
One option is to modify StandardTokenizerImpl.jflex and generate CustomTokenizerImpl.java so that it will recognize C++ and C# as one token. You need to write a new Tokenizer that uses that CustomTokenizerImpl.java. Other option can be to extend CharTokenizer. Modify the source code of LetterTokenizer : @Override protected boolean isTokenChar(char c) { return Character.isLetter(c) || c=='+' || c=='#'; } Hope this helps. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org