Re: Large .frq file

2011-01-18 Thread dan sutton

Hi Shai, What I really wanted to do was reduce the .frq file size. Oddly (when tokenizing 3 separate fields), the WhitespaceTokenizer produces more terms than the CJK analyzer, yet the CJK .frq file is much larger ... examples below: with WhitespaceTokenizer: 89M _0.tis

Re: Large .frq file

2011-01-18 Thread Shai Erera

If I understand correctly, you are comparing the size of the .frq when WhitespaceTokenizer is used vs. the CJK one? I'd bet this is because WhitespaceTokenizer creates far fewer terms than the CJK one. Whitespace tokenizes the text by separating on whitespace, while CJK does a sort of N-Gram tokenization,
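
To illustrate the difference Shai describes, here is a rough sketch in plain Python. It is not Lucene code: `whitespace_tokens` and `bigram_tokens` are hypothetical helpers, and the bigram function only approximates N-gram style tokenization by emitting overlapping character pairs for each whitespace-separated run. The sample string is made up for the example.

```python
def whitespace_tokens(text):
    """Split on whitespace only, one token per run of non-space characters."""
    return text.split()


def bigram_tokens(text):
    """Emit overlapping two-character tokens for each whitespace-separated run.

    A simplified stand-in for N-gram (bigram) tokenization; a real analyzer
    would also handle script detection, case, punctuation, and so on.
    """
    tokens = []
    for run in text.split():
        if len(run) < 2:
            # A single character cannot form a pair, so keep it as-is.
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


sample = "全文検索 エンジン"  # hypothetical sample text
ws = whitespace_tokens(sample)
bg = bigram_tokens(sample)

print(len(ws), ws)
print(len(bg), bg)
```

On this sample the bigram approach yields three times as many token occurrences as whitespace splitting, and every extra occurrence adds a posting. Whether that also means more distinct terms depends on the data, which may be why the two term counts in this thread look surprising.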