Hi there,
Any ideas you have about the following would be greatly appreciated.

I'd like apostrophes to break a word into two tokens for indexing; i.e., the
French l'observatoire would be indexed as two separate tokens, l and
observatoire. My understanding from reading the documentation and list
archives is that StandardAnalyzer should do this. However, it is not
working that way for me, and l'observatoire is being indexed as a single
token. Interestingly, l`observatoire (i.e., with the other, less common
apostrophe character) is indexed properly, as l and observatoire.
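
For what it's worth, the only workaround I can see so far is to switch to
SimpleAnalyzer, which (if I understand its tokenizer correctly) splits on
anything that isn't a letter, so the apostrophe itself acts as a break. A
rough, self-contained sketch of what I mean (the class name is just for
illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import java.io.IOException;
import java.io.StringReader;

public class SimpleAnalyzerSketch {
    public static void main(String[] args) throws IOException {
        // SimpleAnalyzer splits on any non-letter character (as far as I
        // understand it), so the apostrophe itself acts as a separator.
        Analyzer analyzer = new SimpleAnalyzer();
        TokenStream stream = analyzer.tokenStream("contents",
                new StringReader("l'observatoire"));
        while (true) {
            Token token = stream.next();
            if (token == null) break;
            System.out.print("[" + token.termText() + "] ");
        }
        // What I'd expect this to print: [l] [observatoire]
    }
}

But I'd much rather keep the rest of StandardAnalyzer's behaviour if I can.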

Here is the test I've written and the output I'm getting.

TEST


import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;

public class LuceneTokenizingTest {

    static Analyzer analyzer;

    public static void main(String[] args) throws IOException {
        // Empty stop-word list so no tokens get dropped by the analyzer.
        analyzer = new StandardAnalyzer(new String[] {});

        testTokenizeUnusualApostrophe();
        testTokenizeUsualApostrophe();
    }

    public static void testTokenizeUnusualApostrophe() {
        System.out.println("Testing how l`observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire]");
        System.out.println("Getting: " + analyze("l`observatoire") + "\n\n");
    }

    public static void testTokenizeUsualApostrophe() {
        System.out.println("Testing how l'observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire]");
        System.out.println("Getting: " + analyze("l'observatoire"));
    }

    // Runs the analyzer over the given text and returns the tokens it
    // produces as a bracketed string, e.g. "[l] [observatoire] ".
    private static String analyze(String text) {
        String returnString = "";
        try {
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            while (true) {
                Token token = stream.next();
                if (token == null) break;
                returnString = returnString + "[" + token.termText() + "] ";
            }
        } catch (IOException e) {
            System.out.println("Exception: " + e.toString());
        }
        return returnString;
    }
}



OUTPUT

Testing how l`observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l] [observatoire]


Testing how l'observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l'observatoire]
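
The only other workaround I can think of is to pre-process the text and
replace apostrophes with spaces before handing it to the analyzer, along
these lines (just a sketch of a helper I could add to the test class above;
the method name is made up):

    // Sketch of a pre-processing workaround: normalize both apostrophe
    // characters to spaces before the text reaches the analyzer.
    private static String stripApostrophes(String text) {
        return text.replace('\'', ' ').replace('`', ' ');
    }

    // e.g. analyze(stripApostrophes("l'observatoire"))
    // would then, I assume, give: [l] [observatoire]

That feels like a hack, though, so I'd rather understand why the analyzer
isn't doing it for me.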




Thanks so much for your help!

Sarah
