Lucene's standard analyzer removes almost all punctuation.
In some cases we want to keep some of it. For example, in music
search, a singer name or album name can itself be punctuation.
Is there an analyzer that lets us customize which punctuation is removed?
Hi,
Whitespace analyser/tokenizer for example.
Ahmet
On Monday, March 6, 2017 10:21 AM, Yonghui Zhao wrote:
> Lucene standard analyzer will remove almost all punctuation.
> In some cases, we want to keep some punctuation, for example in music
> search, some singer name and album name could be a punc
Yes, the whitespace analyzer keeps punctuation, but it only splits words on
whitespace.
I didn't explain my requirement clearly.
I want an analyzer that behaves like the standard analyzer but can keep a
configurable set of punctuation characters.
2017-03-06 18:03 GMT+08:00 Ahmet Arslan :
> Hi,
>
> Whitespace analyser/tokenizer for
You could use ICUTokenizer and make a custom RuleBasedBreakIterator .rbbi
file to control precisely when splitting should happen, but that language
is complex to configure ;)
Another option is to maybe make a CharFilter ahead of StandardTokenizer
that tries to rewrite the punctuation you want to k
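To illustrate the idea of rewriting punctuation before tokenizing, here is a minimal plain-Java sketch. It does not use Lucene: the class name, the placeholder strings and the regex-based tokenizer are all assumptions made up for illustration, and the tokenizer is only a crude stand-in for StandardTokenizer. In a real analysis chain the rewrite step would live in a CharFilter and the restore step in a TokenFilter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration only; not part of Lucene. */
public class PunctuationRewriteSketch {

    // Assumed mapping: punctuation to keep -> letters-only placeholder.
    private static final Map<String, String> PLACEHOLDERS = new LinkedHashMap<>();
    static {
        PLACEHOLDERS.put("!", "xxexclxx");
        PLACEHOLDERS.put("$", "xxdollarxx");
    }

    /** Step 1 (the CharFilter's role): turn kept punctuation into placeholders. */
    static String rewrite(String text) {
        String out = text;
        for (Map.Entry<String, String> e : PLACEHOLDERS.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    /** Step 2: crude stand-in for a tokenizer that drops all punctuation. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Step 3 (a TokenFilter's role): map placeholders back to punctuation. */
    static String restore(String token) {
        String out = token;
        for (Map.Entry<String, String> e : PLACEHOLDERS.entrySet()) {
            out = out.replace(e.getValue(), e.getKey());
        }
        return out;
    }

    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String t : tokenize(rewrite(text))) {
            result.add(restore(t));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(analyze("P!nk - Try, Ke$ha (live)"));
    }
}
```

The placeholders have to be strings that cannot occur in real text, and the same rewriting must be applied at index time and at query time, otherwise the terms will not match.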
Hi Zhao,
A whitespace tokenizer followed by a customized word delimiter filter
would be a solution.
Please see the types attribute of the word delimiter filter factory for
customizing how individual characters are treated.
ahmet
On Monday, March 6, 2017 12:22 PM, Yonghui Zhao wrote:
> Yes whitespace analyzer will keep punctuat
What you can do is add a custom search field with the singer name
to the document being indexed:
doc.add(new StringField("singername", myValue, Store.NO));
Then you query your index like this:
String myquery = "(singername:\"" + searchphrase + "\") OR (" +
searchphrase + ")";
in
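A minimal sketch of building that query string with balanced quotes, in plain Java. The class and method names are hypothetical, and the escaping shown only covers backslashes and double quotes inside the quoted phrase. The unquoted part is passed through unchanged, as in the snippet above, so characters such as "!" that the query parser treats as operators would still need escaping there (the classic QueryParser has a static escape method that may help).

```java
/** Hypothetical helper for illustration; not part of Lucene. */
public class SingerQuerySketch {

    /** Escapes backslashes and double quotes so the phrase stays inside its quotes. */
    static String escapePhrase(String phrase) {
        return phrase.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds: (singername:"<phrase>") OR (<phrase>) */
    static String buildQuery(String searchPhrase) {
        return "(singername:\"" + escapePhrase(searchPhrase) + "\") OR ("
                + searchPhrase + ")";
    }

    public static void main(String[] args) {
        // prints (singername:"P!nk") OR (P!nk)
        System.out.println(buildQuery("P!nk"));
    }
}
```

The operator is written as uppercase OR because the classic query syntax only recognizes boolean operators in uppercase; a lowercase "or" would be parsed as an ordinary term.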