Hi,
I'm using OAK with Lucene.
I have a big plain text file (413MB) and I would like to make
full-text queries on the entire content. To achieve this, I have
changed the default values of:
- `maxFieldLength=-1` (disabled)
- `maxExtractLength=1000000000` (1GB)
This file contains words and a lot of numbers. The goal is to parse
the whole file and create an index for all words. I am not interested
in indexing the numbers, so I have replaced all numbers with an empty
string using a pattern replace filter.
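To illustrate the effect of the replacement, here is a minimal sketch using plain `java.util.regex`. It is not the Lucene filter itself; the pattern, the class name `NumberStripper`, and the method `stripNumbers` are assumptions made for this example only, and the actual pattern in my configuration may differ.

```java
import java.util.regex.Pattern;

public class NumberStripper {
    // Assumption: a "number" is a run of ASCII digits, optionally
    // continued by groups such as ".325" or ",000".
    private static final Pattern NUMBERS =
            Pattern.compile("\\d+(?:[.,]\\d+)*");

    // Replaces every match with the empty string, leaving the words
    // (and the surrounding whitespace) untouched.
    static String stripNumbers(String text) {
        return NUMBERS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripNumbers("pressure 101.325 kPa at node 42"));
    }
}
```

Only the words are left to be tokenized and indexed; the whitespace that surrounded each number stays in place.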
With this configuration (using the default OAK codec), the `.cfs`
index file is 414MB, roughly the size of the source file itself, which
seems too big. I have investigated the problem and tried two changes
to reduce it.
I have tried to avoid copying the document content inside the `.cfs`
file. In the method `FieldFactory.newFulltextField(value, stored)`, I
have forced the `stored` value to `false`. This way, the document is
not stored, and the `.cfs` index file size is reduced to 269kB.
I have also tried to avoid storing frequencies and positions by
changing the index options of full-text fields from
`IndexOptions.DOCS_AND_FREQS_AND_POSITIONS` to
`IndexOptions.DOCS_ONLY`. This reduced the size further, to 14.6kB.
My proposal is to expose these two settings (whether values are
stored, and which index options are used) in the OAK index definition.
Would this make sense?
If it would be useful, I can provide a merge request for it.
Test results for the `.cfs` file size:
- `DOCS_AND_FREQS_AND_POSITIONS`
  - `Store.YES`: all values 713MB, only words (no numbers) 414MB
  - `Store.NO`: all values 299MB, only words (no numbers) 269kB
- `DOCS_ONLY`
  - `Store.YES`: all values 611MB, only words (no numbers) 413MB
  - `Store.NO`: all values 198MB, only words (no numbers) 15kB
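As a quick sanity check on these figures, the sketch below computes how much each change saves in the "all values" case. The sizes are copied from the results above; the class name `CfsSizeDeltas` and the helper `saving` are just names chosen for this example.

```java
public class CfsSizeDeltas {
    // Sizes in MB for the "all values" case, copied from the results above.
    static final long FULL_STORED = 713;        // DOCS_AND_FREQS_AND_POSITIONS, Store.YES
    static final long FULL_UNSTORED = 299;      // DOCS_AND_FREQS_AND_POSITIONS, Store.NO
    static final long DOCS_ONLY_STORED = 611;   // DOCS_ONLY, Store.YES
    static final long DOCS_ONLY_UNSTORED = 198; // DOCS_ONLY, Store.NO

    // Difference in MB between two measured index sizes.
    static long saving(long before, long after) {
        return before - after;
    }

    public static void main(String[] args) {
        System.out.println("Store.NO saving, full postings: "
                + saving(FULL_STORED, FULL_UNSTORED) + "MB");
        System.out.println("Store.NO saving, DOCS_ONLY: "
                + saving(DOCS_ONLY_STORED, DOCS_ONLY_UNSTORED) + "MB");
        System.out.println("DOCS_ONLY saving, Store.YES: "
                + saving(FULL_STORED, DOCS_ONLY_STORED) + "MB");
        System.out.println("DOCS_ONLY saving, Store.NO: "
                + saving(FULL_UNSTORED, DOCS_ONLY_UNSTORED) + "MB");
    }
}
```

On these numbers, not storing the value saves about as much as the source file itself (413MB to 414MB), while dropping frequencies and positions saves roughly 100MB.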
Thank you.
--
<https://25.esteco.com>