words with more than 1 hyphen ?

Beady Geraghty Wed, 07 Dec 2005 10:36:43 -0800

I am back to doing something with Lucene after a short break from it.

I am trying to index/search hyphenated words,
and retrieve them from a token stream.


1. I modified the StandardTokenizer.jj file.

   Essentially, I added the following to StandardTokenizer.jj
 | <HYPHENWORD1: (<LETTER>)+"-"(<LETTER>)+("-"<LETTER>)*>

2. I used JavaCC to get a set of .java files including a
    tokenizer.

3. I modified the generated files to use the
   org.apache.lucene.analysis.standard classes, such as Token and CharStream,
   instead of the ones provided by JavaCC.

4. I was able to index and retrieve words like
merry-go-round (as opposed to merry go round).  So, I
was quite happy.
Now I want to get "merry-go-round" from the token
stream.  And that doesn't seem to work.
Note that retrieving words with 1 hyphen seems to work,
but words with 2 hyphens seem to be a problem.

In getting the tokens from the stream, I get
"Merry-go-r" and "ound"  instead of "Merry-go-round"
"editor-in-c" and "hief"  instead of "editor-in-chief".
This behaviour is so strange, and I don't understand how
the indexer and query processing know about "merry-go-round",
and yet the TokenStream doesn't.

"green-monster" would work.  But not words with more than
one hyphen.
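
One way to probe the rule outside Lucene is to approximate it with
java.util.regex. This is only a sketch under assumptions: <LETTER> is
approximated by \p{L}, JavaCC's longest-match behaviour is approximated by
greedy regex matching, and the class name, constant names and the VARIANT
pattern are hypothetical, not part of the grammar above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenRuleCheck {
    // Approximation of the rule as posted:
    //   (<LETTER>)+ "-" (<LETTER>)+ ("-" <LETTER>)*
    // Assumption: \p{L} stands in for <LETTER>.
    static final String AS_POSTED = "\\p{L}+-\\p{L}+(-\\p{L})*";

    // Hypothetical variant for comparison: the last group allows
    // one or more letters after each additional hyphen.
    static final String VARIANT = "\\p{L}+-\\p{L}+(-\\p{L}+)*";

    // Returns the first substring of text matched by regex, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] words = {"green-monster", "Merry-go-round", "editor-in-chief"};
        for (String w : words) {
            System.out.println(w + " -> as posted: " + firstMatch(AS_POSTED, w)
                    + ", variant: " + firstMatch(VARIANT, w));
        }
    }
}
```

In this approximation the pattern as posted stops at "Merry-go-r" and
"editor-in-c", while the variant matches the whole word. That suggests the
last group of the rule, ("-"<LETTER>)*, is worth checking in the .jj file,
though the regex is not a substitute for testing the generated tokenizer.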

I tried two snippets of code, and neither
returned the desired result:

Snippet 1:
 // rdr is a java.io.Reader over the text being tokenized
 MyStandardAnalyzer bsa = new MyStandardAnalyzer();
 TokenStream ts = bsa.tokenStream("content", rdr);
 while (true) {
  org.apache.lucene.analysis.Token t = ts.next();
  if (t == null)
   break;
  // print the text of each token produced by the analyzer
  System.out.println(t.termText());
 }

   MyStandardAnalyzer contains my special tokenizer generated from the new
.jj file.
   Essentially, it replaces StandardTokenizer with MyStandardTokenizer.
   Result: Merry-go-round becomes "Merry-go-r" and "ound".


Snippet 2:
 StandardAnalyzer sa1 = new StandardAnalyzer();
 ts = sa1.tokenStream("content",
   new StringReader("this is a merry-go-round with 3 children"));
 while (true) {
  org.apache.lucene.analysis.Token t = ts.next();
  if (t == null)
   break;
  System.out.println(t.termText());
 }


   Result: merry-go-round becomes 3 tokens: "merry", "go" and "round".


Could someone give me some suggestions?
The reason I need the tokens is so that I can get the words before and after
the selected words to form some context.
(By the way, currently I convert a hyphenated word into a phrase,
but to me that seems like special-casing hyphenated words, and I
just want to stay away from special-casing.  People have been asking
for all sorts of punctuation, such as _ or / etc.  I thought that if I learn
how to modify the .jj files and produce the right tokens, I would be better
off.)
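
The context extraction described above could look roughly like the
following. This is a minimal sketch, assuming the token texts have already
been collected from the TokenStream into a list; the class name, method name
and the hand-written token list are hypothetical and used only for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ContextWindow {
    // Returns the first occurrence of target together with up to `radius`
    // tokens on each side of it, or an empty list if target is not present.
    static List<String> around(List<String> tokens, String target, int radius) {
        int i = tokens.indexOf(target);
        if (i < 0) {
            return new ArrayList<String>();
        }
        int from = Math.max(0, i - radius);                 // clamp at start
        int to = Math.min(tokens.size(), i + radius + 1);   // clamp at end
        return new ArrayList<String>(tokens.subList(from, to));
    }

    public static void main(String[] args) {
        // Hand-written tokens; a real analyzer may drop or alter some of them.
        List<String> tokens = Arrays.asList(
                "this", "is", "a", "merry-go-round", "with", "3", "children");
        System.out.println(around(tokens, "merry-go-round", 2));
    }
}
```

This only works if the hyphenated word comes out of the tokenizer as a
single token, which is why the split described above matters.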


Thank you very much in advance.
