Avoid storing the document in full-text indices

Marco Matessi Thu, 22 Aug 2024 04:45:42 -0700

Hi,
I'm using OAK with Lucene.
I have a big plain text file (413MB) and I would like to run
full-text queries on the entire content. To achieve this, I have
changed the default values of:
- `maxFieldLength=-1` (disabled)
- `maxExtractLength=1000000000` (1GB)


This file contains words and a lot of numbers. The goal is to parse
the whole file and create an index for all words. I am not interested
in indexing the numbers, so I have replaced all numbers with an empty
string using a pattern replace filter.
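
As a rough illustration of the number stripping described above, here is a minimal standalone sketch using only `java.util.regex`. It is not the Lucene pattern replace filter itself; the class name `NumberStripper` is hypothetical, and it assumes that a "number" means any run of ASCII digits.

```java
import java.util.regex.Pattern;

public class NumberStripper {
    // Assumption: a "number" is any run of ASCII digits.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    /** Replaces every digit run with the empty string; all other text is kept as-is. */
    public static String stripNumbers(String text) {
        return DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Only the words (and the surrounding whitespace) survive.
        System.out.println(stripNumbers("error 404 on node 17"));
    }
}
```

Punctuation between digits (for example the dot in a decimal number) is left in place by this pattern, so the real filter may need a broader expression.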

With this configuration (using the default OAK codec), the `.cfs`
index file size (414MB) seems too big. I have investigated the problem
and tried to reduce it.

I have tried to avoid copying the document content inside the `.cfs`
file. In the method `FieldFactory.newFulltextField(value, stored)`, I
have forced the `stored` value to `false`. This way, the document is
not stored, and the `.cfs` index file size is reduced to 269kB.

I have also tried to avoid storing frequencies and positions by
replacing the index options of full-text fields from
`IndexOptions.DOCS_AND_FREQS_AND_POSITIONS` to
`IndexOptions.DOCS_ONLY`. This reduced the size to 14.6kB.
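
To show why dropping frequencies and positions shrinks the postings for a single large document, here is a toy model in plain Java. It only counts how many values would be recorded per document under each option; it does not reflect Lucene's actual on-disk encoding, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;

public class PostingsSizeSketch {
    /** Docs only: one doc id per distinct term in the document. */
    public static int docsOnlyEntries(String text) {
        String[] tokens = text.trim().split("\\s+");
        return new HashSet<>(Arrays.asList(tokens)).size();
    }

    /** Docs, freqs and positions: per distinct term a doc id and a frequency,
     *  plus one position for every single occurrence. */
    public static int docsFreqsPositionsEntries(String text) {
        String[] tokens = text.trim().split("\\s+");
        int distinct = new HashSet<>(Arrays.asList(tokens)).size();
        return 2 * distinct + tokens.length;
    }

    public static void main(String[] args) {
        String text = "a b a c a";
        System.out.println(docsOnlyEntries(text));
        System.out.println(docsFreqsPositionsEntries(text));
    }
}
```

In this model the docs-only count depends only on the vocabulary size, while the count with positions grows with the total number of tokens.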

My proposal is to expose these two settings, whether the full-text
value is stored and which index options are used, in the OAK index
definition.
Does this make sense?
If it would be useful, I can provide a merge request for it.

Test results: `.cfs` file size

DOCS_AND_FREQS_AND_POSITIONS
    Store.YES
        ALL VALUES: 713MB
        ONLY WORDS (no numbers): 414MB
    Store.NO
        ALL VALUES: 299MB
        ONLY WORDS (no numbers): 269kB
DOCS_ONLY
    Store.YES
        ALL VALUES: 611MB
        ONLY WORDS (no numbers): 413MB
    Store.NO
        ALL VALUES: 198MB
        ONLY WORDS (no numbers): 15kB
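
Reading the figures above, the gap between Store.YES and Store.NO is about 413 to 414MB in every row, which is roughly the size of the source file. The small sketch below just redoes that subtraction for the two "ALL VALUES" rows; the class name is hypothetical and the inputs are the rounded MB values from the table.

```java
public class StoredCostSketch {
    /** Difference in MB between the Store.YES and Store.NO measurements. */
    public static double storedCostMb(double storeYesMb, double storeNoMb) {
        return storeYesMb - storeNoMb;
    }

    public static void main(String[] args) {
        // DOCS_AND_FREQS_AND_POSITIONS, ALL VALUES: 713MB vs 299MB
        System.out.println(storedCostMb(713, 299));
        // DOCS_ONLY, ALL VALUES: 611MB vs 198MB
        System.out.println(storedCostMb(611, 198));
    }
}
```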


Thank you.

-- 
 <https://25.esteco.com>
