Hi Shai,
What I really wanted to do was reduce the frq file size.
Oddly (when tokenizing 3 separate fields) with the
WhitespaceTokenizer, more terms are produced than with the CJK
analyzer and the CJK frq filesize is much larger ... examples below:
with WhitespaceTokenizer:
89M _0.tis
If I understand correctly, you are comparing the size of the .frq file when
WhitespaceTokenizer is used vs. when the CJK analyzer is used?
I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
CJK one. Whitespace tokenizes the text by separating on whitespace, while
CJK does sort of N-Gram tokenization,