Hi Shai, What I really wanted to do was reduce the frq file size
Oddly (when tokenizing 3 seperate fields) with the WhitespaceTokenizer, more terms are produced than with the CJK analyzer and the CJK frq filesize is much larger ... examples below: with WhitespaceTokenizer: 89M _0.tis 1.4M _0.tii 71 _0.fnm 5.8M _0.fdx 741K _0.fdt 20 segments.gen 293 segments_2 119M _0.frq with CJKTokenizer: 31M _0.tis 633K _0.tii 71 _0.fnm 5.8M _0.fdx 741K _0.fdt 20 segments.gen 293 segments_2 166M _0.frq Also I believe solr calls addDocument with payLoads turned off. I'm not sure why the size is much larger. Cheers, Dan On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera <ser...@gmail.com> wrote: > If I understand correctly, you compare the size of the .frq when > WhitespaceTokenizer is used, vs the CJK ones? > > I'd bet this is because WhitespaceTokenizer creates far less terms than the > CJK one. Whitespace tokenizes the text by separating on whitespace, while > CJK does sort of N-Gram tokenization, which usually leads to much more terms > created. This affects the .frq file in that there are much more posting > lists created, which are stored in the .frq file. > > See if the .tii and .tis files differ and if their difference is the same > order of the .frq differences (e.g. if they are 2x larger w/ CJK, so .frq > should be of the same order of difference), then I believe this is the > reason. > > Shai > > On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <danbsut...@gmail.com> wrote: > >> Hi, >> >> We're trying to create a large index via solr for trends and notice >> that we have a large '.frq' file after doing the following: >> >> >> make all text fields index="true", stored="false", >> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false" >> termOffsets="false" termVectors="false" >> >> We are using a variation on org.apache.lucene.analysis.cjk and notice >> that the .frq is about 4 time larger than, for example, the >> WhiteSpaceTokenizer. >> >> >> Considering that with omitTermFreqAndPositions="true" for the text >> fields I'd have thought this should be : "If omitTf were true it would >> be this sequence of VInts instead:" >> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies) >> >> >> Can anyone suggest how I can reduce the size of this file? >> >> >> Many thanks, >> Dan >> >> Lucene Specification Version: 2.9.1 >> Solr Specification Version: 1.4.0.2010.09.10.17.10.36 >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org