I am back to doing something with Lucene after a short break from it. I am trying to index/search hyphenated words, and retrieve them from a token stream.
Here is what I did:

1. I modified the StandardTokenizer.jj file. Essentially, I added the following token definition to StandardTokenizer.jj:

       | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>

2. I ran JavaCC to generate a set of .java files, including a tokenizer.
3. I modified the generated files to use the org.apache.lucene.analysis.standard classes, such as Token and CharStream, instead of the ones provided by JavaCC.
4. I was able to index and retrieve words like merry-go-round (as opposed to merry go round).

So I was quite happy. Now I want to get "merry-go-round" back from the token stream, and that doesn't seem to work. Retrieving words with one hyphen works, but words with two hyphens are a problem. Reading the tokens from the stream, I get "Merry-go-r" and "ound" instead of "Merry-go-round", and "editor-in-c" and "hief" instead of "editor-in-chief". "green-monster" works, but words with more than one hyphen do not. This behaviour seems strange to me: I don't understand how the indexer and the query processing know about "merry-go-round" while the TokenStream doesn't.

I tried two snippets of code; neither returned the desired result.

Snippet 1:

    MyStandardAnalyzer bsa = new MyStandardAnalyzer();
    TokenStream ts = bsa.tokenStream("content", rdr);  // rdr is the Reader over my input text
    while (true) {
        org.apache.lucene.analysis.Token t = ts.next();
        if (t == null) break;  // end of stream
        System.out.println(t.termText());
    }

MyStandardAnalyzer contains my special tokenizer generated from the new .jj file. Essentially, it is StandardAnalyzer with StandardTokenizer replaced by MyStandardTokenizer.

Result: Merry-go-round becomes two tokens:

    Merry-go-r
    ound

Snippet 2:

    StandardAnalyzer sa1 = new StandardAnalyzer();
    ts = sa1.tokenStream("content", new StringReader("this is a merry-go-round with 3 children"));
    while (true) {
        org.apache.lucene.analysis.Token t = ts.next();
        if (t == null) break;  // end of stream
        System.out.println(t.termText());
    }

Result: merry-go-round becomes three tokens:

    merry
    go
    round

Could someone give me some suggestions?
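One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the HYPHENWORD1 definition above, the trailing group ("-"<LETTER>)* allows only a single <LETTER> after each additional hyphen. That would match "Merry-go-r" and leave "ound" to be picked up as the next token. The small Java program below mimics this with java.util.regex instead of JavaCC. The class name HyphenTokenDemo, the ASCII-only letter class, and the catch-all alphanumeric alternative are assumptions made for illustration; the real <LETTER> definition and the other token rules in StandardTokenizer.jj are broader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the JavaCC-generated tokenizer.
class HyphenTokenDemo {
    // Assumption: ASCII letters and digits approximate a plain word token.
    static final String WORD = "[A-Za-z0-9]+";

    // Regex analogue of (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*
    // Note the single letter allowed after each extra hyphen.
    static final Pattern AS_POSTED =
        Pattern.compile("[A-Za-z]+-[A-Za-z]+(-[A-Za-z])*|" + WORD);

    // Variant with one or more letters after each extra hyphen:
    // (<LETTER>)+"-"(<LETTER>)+("-"(<LETTER>)+)*
    static final Pattern WITH_PLUS =
        Pattern.compile("[A-Za-z]+-[A-Za-z]+(-[A-Za-z]+)*|" + WORD);

    // Collects every successive match of the pattern, left to right.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String w : Arrays.asList("green-monster", "Merry-go-round", "editor-in-chief")) {
            System.out.println(w + " as posted: " + tokenize(AS_POSTED, w));
            System.out.println(w + " with plus: " + tokenize(WITH_PLUS, w));
        }
    }
}
```

With the pattern as posted, "Merry-go-round" comes out as [Merry-go-r, ound] and "editor-in-chief" as [editor-in-c, hief], while "green-monster" stays whole, which matches the behaviour described above. If this is indeed the cause, changing the last group to ("-"(<LETTER>)+)* in the .jj file and regenerating the tokenizer would be the corresponding change to try. It might also explain why indexing and searching appeared to work: if the index and the query are both split the same way, the pieces would still line up.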
The reason I need the tokens is so that I can get the words before and after the selected words to form some context. (By the way, I currently convert a hyphenated word into a phrase, but to me that seems like special-casing hyphenated words, and I want to stay away from special-casing. People have been asking for all sorts of punctuation, such as _ or /. I thought that if I learned how to modify the .jj files and produce the right tokens, I would be better off.) Thank you very much in advance.