Hi there,

Any ideas you have about the following would be greatly appreciated.

I'd like apostrophes to break a word into two tokens for indexing; i.e., the French l'observatoire would be indexed as two separate tokens, l and observatoire. My understanding from reading the documentation and list archives is that StandardAnalyzer should do this. However, it is not working that way for me: l'observatoire is indexed as a single token. Interestingly, l`observatoire (i.e., with the other, less common apostrophe, the backtick) is indexed properly, as l observatoire.

Here is the test I've written and the output I'm getting.

TEST

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;

public class LuceneTokenizingTest {

    static Analyzer analyzer;

    public static void main(String[] args) throws IOException {
        // Empty stop-word list, so no tokens are filtered out as stop words.
        analyzer = new StandardAnalyzer(new String[] {});
        testTokenizeUnusualApostrophe();
        testTokenizeUsualApostrophe();
    }

    // Backtick variant: this one splits the way I expect.
    public static void testTokenizeUnusualApostrophe() {
        System.out.println("Testing how l`observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire] ");
        System.out.println("Getting: " + analyze("l`observatoire") + "\n\n");
    }

    // Straight apostrophe variant: this one comes back as a single token.
    public static void testTokenizeUsualApostrophe() {
        System.out.println("Testing how l'observatoire is tokenized.");
        System.out.println("Expecting: [l] [observatoire] ");
        System.out.println("Getting: " + analyze("l'observatoire"));
    }

    // Runs the text through the analyzer and returns each token wrapped in brackets.
    private static String analyze(String text) {
        String returnString = "";
        try {
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            while (true) {
                Token token = stream.next();
                if (token == null) break;   // end of the token stream
                returnString = returnString + "[" + token.termText() + "] ";
            }
        } catch (IOException e) {
            System.out.println("Exception: " + e.toString());
        }
        return returnString;
    }
}

OUTPUT

Testing how l`observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l] [observatoire]


Testing how l'observatoire is tokenized.
Expecting: [l] [observatoire]
Getting: [l'observatoire]

Thanks so much for your help!

Sarah
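
P.S. In case it helps to pin down exactly what I'm after, here is a rough plain-Java sketch of the split I'd like to see. It does not use Lucene at all and is not meant to describe how StandardAnalyzer works internally; the class and method names (ApostropheSplitSketch, splitOnApostrophes) are made up for illustration, and treating the backtick and the typographic apostrophe the same as the straight one is my own assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows the token split I'm after.
class ApostropheSplitSketch {

    // Split on whitespace, the straight apostrophe ('), the backtick (`)
    // and the typographic apostrophe (U+2019).
    static List<String> splitOnApostrophes(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.split("[\\s'`\\u2019]+")) {
            if (!part.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnApostrophes("l'observatoire"));
        System.out.println(splitOnApostrophes("l`observatoire"));
    }
}
```

With this, both spellings of l'observatoire should come out as the same two tokens, which is the behaviour I was hoping to get from the analyzer.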