Re: Large .frq file

dan sutton Tue, 18 Jan 2011 08:10:59 -0800

Hi Shai,

What I really wanted to do was reduce the .frq file size.


Oddly, when tokenizing 3 separate fields, the WhitespaceTokenizer
produces more terms than the CJK analyzer, yet the CJK .frq file is
much larger. Listings below:

with WhitespaceTokenizer:
89M     _0.tis
1.4M    _0.tii
71      _0.fnm
5.8M    _0.fdx
741K    _0.fdt
20      segments.gen
293     segments_2
119M    _0.frq

with CJKTokenizer:
31M     _0.tis
633K    _0.tii
71      _0.fnm
5.8M    _0.fdx
741K    _0.fdt
20      segments.gen
293     segments_2
166M    _0.frq
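
One possible reading of these numbers: the .tis/.tii files grow with the
number of unique terms, while the .frq file grows with the number of
postings (one entry per distinct term per document). The toy sketch below
is plain Java with no Lucene calls. The class PostingCount and its bigram
splitter are made up for illustration and only approximate what
CJKTokenizer does, so treat it as a way to see the effect, not as a model
of the real index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
final class PostingCount {

    // Split on runs of whitespace, roughly what a whitespace tokenizer emits.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Overlapping character bigrams inside each whitespace-delimited run.
    // ASSUMPTION: a simplification of CJK-style bigram tokenization.
    static List<String> bigramTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String run : whitespaceTokens(text)) {
            if (run.length() < 2) {
                out.add(run);
                continue;
            }
            for (int i = 0; i + 2 <= run.length(); i++) {
                out.add(run.substring(i, i + 2));
            }
        }
        return out;
    }

    // Returns {uniqueTerms, postings}, where postings counts the distinct
    // (term, document) pairs across all documents.
    static int[] countFor(String[] docs, boolean bigrams) {
        Set<String> terms = new HashSet<>();
        int postings = 0;
        for (String doc : docs) {
            List<String> tokens = bigrams ? bigramTokens(doc) : whitespaceTokens(doc);
            Set<String> distinct = new HashSet<>(tokens);
            postings += distinct.size();
            terms.addAll(distinct);
        }
        return new int[] { terms.size(), postings };
    }

    public static void main(String[] args) {
        String[] docs = { "abab baba", "abba baab", "aabb bbaa" };
        int[] ws = countFor(docs, false);
        int[] bg = countFor(docs, true);
        System.out.println("whitespace: terms=" + ws[0] + " postings=" + ws[1]);
        System.out.println("bigram:     terms=" + bg[0] + " postings=" + bg[1]);
    }
}
```

On this toy corpus the whitespace split gives 6 unique terms and 6
postings, while the bigram split gives 4 unique terms and 10 postings:
a smaller term dictionary but longer posting lists.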

Also, I believe Solr calls addDocument with payloads turned off. I'm
not sure why the .frq file is so much larger.
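
For reference, the Lucene 2.9 file-formats page describes the .frq
entries with omitTf enabled as one VInt doc delta per document: its
example term in documents 7 and 11 is stored as 7, 4. A rough sketch of
that per-posting cost follows. The class FrqEstimate is hypothetical, it
ignores skip data, and it assumes omitTf is in effect for every field.

```java
// Hypothetical helper, not part of Lucene or Solr.
final class FrqEstimate {

    // Number of bytes a non-negative int occupies as a VInt
    // (7 payload bits per byte, high bit marks continuation).
    static int vintBytes(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }

    // Doc deltas for an ascending list of document numbers;
    // the first delta is taken relative to 0.
    static int[] docDeltas(int[] docIds) {
        int[] deltas = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            deltas[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return deltas;
    }

    // Estimated bytes for one term's posting list with omitTf enabled.
    // ASSUMPTION: skip data is ignored.
    static long postingBytes(int[] docIds) {
        long total = 0;
        for (int delta : docDeltas(docIds)) {
            total += vintBytes(delta);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docs = { 7, 11 };
        int[] deltas = docDeltas(docs);
        System.out.println("deltas: " + deltas[0] + ", " + deltas[1]);
        System.out.println("bytes:  " + postingBytes(docs));
    }
}
```

Under these assumptions every posting costs at least one byte even with
frequencies and positions omitted, so the file size follows the total
number of postings rather than the number of unique terms.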

Cheers,
Dan

On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera <ser...@gmail.com> wrote:
> If I understand correctly, you compare the size of the .frq when
> WhitespaceTokenizer is used, vs the CJK ones?
>
> I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
> CJK one. Whitespace tokenizes the text by separating on whitespace, while
> CJK does a sort of N-gram tokenization, which usually leads to many more
> terms being created. This affects the .frq file in that many more posting
> lists are created, and those are stored in the .frq file.
>
> Check whether the .tii and .tis files differ as well. If their difference
> is of the same order as the .frq difference (e.g. if they are 2x larger w/
> CJK, the .frq should be larger by a similar factor), then I believe this
> is the reason.
>
> Shai
>
> On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <danbsut...@gmail.com> wrote:
>
>> Hi,
>>
>> We're trying to create a large index via solr for trends and notice
>> that we have a large '.frq' file after doing the following:
>>
>>
>> make all text fields indexed="true", stored="false",
>> omitTermFreqAndPositions="true", omitNorms="true", termPositions="false",
>> termOffsets="false", termVectors="false"
>>
>> We are using a variation on org.apache.lucene.analysis.cjk and notice
>> that the .frq is about 4 times larger than with, for example, the
>> WhitespaceTokenizer.
>>
>>
>> With omitTermFreqAndPositions="true" on the text fields, I'd have
>> expected the file to hold only what the file-format docs describe:
>> "If omitTf were true it would be this sequence of VInts instead:"
>> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>>
>>
>> Can anyone suggest how I can reduce the size of this file?
>>
>>
>> Many thanks,
>> Dan
>>
>> Lucene Specification Version: 2.9.1
>> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
