Dear Developers,
I learned that *omitting norms for a field during indexing saves one byte per
document* in Lucene. However, in my testing, disabling norms for string fields
during indexing changed the overall size of the Lucene index (collection of
documents) inconsistently from one data set to another.
Here are the configuration details for reference:
- *Lucene Version:* 5.3.1
- *Java Version:* OpenJDK 17.0.8.1
- *Indexer Configuration:*
  - index.merge_factor: 10
  - index.partition_max_doc: 5,000,000
  - indexer.commit_interval_sec: 60
  - indexer.commit_max_doc: 100,000
- *Merge Policy:* LogByteSizeMergePolicy
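*Test Results:*
Each index contains 5M documents. Sizes are average index sizes in MB, and
each difference is stated relative to the norms-enabled size.
- *DATA - I* (all documents contain the same set of fields and the same values)
  - Unique fields in the index: 103
  - String fields for which norms were enabled or disabled: 74
  - Avg size of index in MB [norms enabled]: 1869
  - Avg size of index in MB [norms disabled]: 1876
  - Difference: no significant difference
- *DATA - II* (all documents contain the same set of fields but with random
  values)
  - Unique fields in the index: 128
  - String fields for which norms were enabled or disabled: 113
  - Avg size of index in MB [norms enabled]: 25412
  - Avg size of index in MB [norms disabled]: 31890
  - Difference: increased by about 25%
- *DATA - III* (documents contain different sets of field-value pairs, subsets
  of all field-value pairs)
  - Unique fields in the index: 184
  - String fields for which norms were enabled or disabled: 87
  - Avg size of index in MB [norms enabled]: 2295
  - Avg size of index in MB [norms disabled]: 2005
  - Difference: reduced by about 13%
- *DATA - IV* (documents contain different sets of field-value pairs, subsets
  of all field-value pairs)
  - Unique fields in the index: 1091
  - String fields for which norms were enabled or disabled: 1026
  - Avg size of index in MB [norms enabled]: 10512
  - Avg size of index in MB [norms disabled]: 5905
  - Difference: reduced by about 44%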
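For reference, the arithmetic behind the one-byte-per-document expectation can be sketched as below. This is only a rough estimate under stated assumptions: that each norms-bearing field costs exactly one uncompressed byte per document, that every document contains every such field, and that 1 MB means 1024 * 1024 bytes. The function names are hypothetical helpers, not part of Lucene, and the real on-disk cost depends on how the codec encodes norms.

```python
# Rough arithmetic only. Assumes one uncompressed norm byte per document per
# field and that every document contains every norms-bearing field, so the
# result is an upper bound, not a prediction of what Lucene writes to disk.

BYTES_PER_MB = 1024 * 1024  # assumption: "MB" in the results means MiB
NUM_DOCS = 5_000_000        # documents per index, as stated in the results


def naive_norms_saving_mb(num_docs: int, num_fields: int) -> float:
    """Upper-bound saving in MB if each (document, field) pair costs one byte."""
    return num_docs * num_fields / BYTES_PER_MB


def relative_change_pct(enabled_mb: float, disabled_mb: float) -> float:
    """Change in index size after disabling norms, relative to norms enabled."""
    return (disabled_mb - enabled_mb) / enabled_mb * 100.0


# (string fields, avg MB with norms enabled, avg MB with norms disabled),
# listed in the same order as the four data sets reported above.
RESULTS = [
    (74, 1869, 1876),
    (113, 25412, 31890),
    (87, 2295, 2005),
    (1026, 10512, 5905),
]

for position, (fields, enabled, disabled) in enumerate(RESULTS, start=1):
    bound = naive_norms_saving_mb(NUM_DOCS, fields)
    change = relative_change_pct(enabled, disabled)
    print(f"data set {position}: naive saving <= {bound:.0f} MB, "
          f"observed change {change:+.1f}%")
```

Comparing the naive upper bound with the observed change for each data set shows how far the measurements are from the simple one-byte-per-document model.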
Could you please provide insights or clarify whether this behavior aligns
with the expected impact on index size? Additionally, could you explain why
the change in size varies so much between data sets, and why the index even
grows in one case?
Thank you for your assistance!
With Regards,
Balaram Sharma