The problem is in StandardTokenizer, so use an Analyzer with a method like this:
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new LowerCaseTokenizer(reader);
    result = new StopFilter(result, stopSet);
    return result;
}
if you need everything standard analyzer does
Fr
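To illustrate why a letter-based tokenizer such as LowerCaseTokenizer gives the split asked for, here is a minimal plain-Java sketch with no Lucene dependency. The class LetterSplitSketch and its tokenize method are hypothetical names made up for this example, not Lucene API; they only imitate the "break at every non-letter, then lowercase" behaviour and leave out the stop-word filtering step.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterSplitSketch {

    // Hypothetical helper, not part of Lucene: splits the text at every
    // non-letter character (apostrophes included) and lowercases each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the token that was being built.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [l, observatoire, de, paris]
        System.out.println(tokenize("L'observatoire de Paris"));
    }
}
```

The trade-off is that digits and any other non-letter characters also end a token, so this suits plain text better than part numbers, e-mail addresses and the like.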
That was my first thought as well, but it looks like APOSTROPHE is
already the one that I want. As you can see, from StandardAnalyzer.jj
---
TOKEN : { // token patterns
// basic word: a sequence of digits & letters
<ALPHANUM: (<LETTER>|<DIGIT>|<KOREAN>)+ >
// internal ap
The apostrophe is recognized as part of a word, because StandardAnalyzer is
mostly English-oriented.
One way around this is to swap the "normal" apostrophe for an unusual one.
StandardAnalyzer.java line 40-44
APOSTROPHE:
token = jj_consume_token(APOSTROPHE);
-
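One possible reading of the swap suggestion is to rewrite the apostrophe in the input before the tokenizer ever sees it, rather than editing the generated parser. The sketch below does that with a java.io.FilterReader. The class name ApostropheSwapReader, the swapAll helper, and the choice of replacement character are all assumptions made for this example; whether a given replacement character makes StandardTokenizer split the word depends on its grammar, so a space is used here as the safest choice.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical wrapper, not part of Lucene: replaces every ASCII apostrophe
// in the underlying stream with another character.
public class ApostropheSwapReader extends FilterReader {
    private final char replacement;

    public ApostropheSwapReader(Reader in, char replacement) {
        super(in);
        this.replacement = replacement;
    }

    private char swap(char c) {
        return c == '\'' ? replacement : c;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == -1 ? -1 : swap((char) c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = swap(buf[i]);
        }
        return n;
    }

    // Convenience helper for trying the reader on a whole string.
    public static String swapAll(String text, char replacement) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        try (Reader r = new ApostropheSwapReader(new StringReader(text), replacement)) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "l observatoire"
        System.out.println(swapAll("l'observatoire", ' '));
    }
}
```

In an Analyzer, the wrapped reader would be handed to the tokenizer in place of the original one inside tokenStream(String fieldName, Reader reader). Because one character is replaced by exactly one character, token offsets stay aligned with the original text.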
Hi there,
Any ideas you have about the following would be greatly appreciated.
I'd like apostrophes to break up a word into two for indexing, i.e., the
French l'observatoire would be indexed as two separate tokens, l and
observatoire. My understanding from reading documentation and list
archives is tha