Mike Sokolov created LUCENE-8920:
------------------------------------
Summary: Reduce size of FSTs due to use of direct-addressing encoding
Key: LUCENE-8920
URL: https://issues.apache.org/jira/browse/LUCENE-8920
Project: Lucene - Core
Issue Type: Improvement
Reporter: Mike Sokolov
With some data, the direct-addressing optimization can lead to worst-case ~4x RAM usage.
Several ideas were suggested to combat this on the mailing list:
bq. I think we can improve the situation here by tracking, per-FST instance, the
size increase we're seeing while building (or perhaps do a preliminary pass
before building) in order to decide whether to apply the encoding.
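As a rough illustration of the first suggestion, the per-FST tracking could look something like the sketch below. Everything here is an assumption made for illustration: the class `DirectAddressingBudget`, its `offerNode` method, and the `maxOversizeFactor` knob are hypothetical names and are not part of Lucene. The idea is to accumulate the size the FST would have had with the compact encoding, and to allow direct addressing for a node only while the total written size stays within the allowed factor of that baseline.

```java
/** Hypothetical helper (not part of Lucene): caps the overall growth caused by direct addressing. */
final class DirectAddressingBudget {
    private final double maxOversizeFactor; // assumed tuning knob, e.g. 1.25 = at most 25% growth
    private long compactBytes; // size if every node so far had used the compact encoding
    private long actualBytes;  // size of what has actually been written so far

    DirectAddressingBudget(double maxOversizeFactor) {
        if (maxOversizeFactor < 1.0) {
            throw new IllegalArgumentException("maxOversizeFactor must be >= 1.0");
        }
        this.maxOversizeFactor = maxOversizeFactor;
    }

    /**
     * Records one node and decides its encoding.
     *
     * @param compactSize bytes the node needs in the compact (non-direct) encoding
     * @param directSize  bytes the node needs with direct addressing, gaps included
     * @return true if direct addressing still fits within the budget
     */
    boolean offerNode(long compactSize, long directSize) {
        compactBytes += compactSize;
        // Accept direct addressing only if the running total stays under the cap.
        boolean useDirect = actualBytes + directSize <= maxOversizeFactor * compactBytes;
        actualBytes += useDirect ? directSize : compactSize;
        return useDirect;
    }

    /** Ratio of bytes written so far to the all-compact baseline. */
    double oversizeFactor() {
        return compactBytes == 0 ? 1.0 : (double) actualBytes / compactBytes;
    }
}
```

With this rule the running total never exceeds `maxOversizeFactor` times the compact baseline, and a node whose direct encoding is no larger than its compact encoding is always accepted. A preliminary pass, as the quote also suggests, could use the same counters without writing anything.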
bq. we could also make the encoding a
bit more efficient. For instance I noticed that arc metadata is pretty
large in some cases (in the 10-20 byte range), which makes gaps very costly.
Associating each label with a dense id and having an intermediate
lookup, i.e. lookup label -> id and then id -> arc offset instead of
doing label -> arc directly, could save a lot of space in some cases?
Also it seems that we are repeating the label in the arc metadata when
array-with-gaps is used, even though it shouldn't be necessary since
the label is implicit from the address?
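To make the label -> id -> arc offset idea concrete, here is a minimal sketch under stated assumptions. `IndirectArcTable` and its methods are hypothetical and do not reflect Lucene's actual FST layout; the sketch assumes labels are sorted, that a node has at most 128 arcs so a dense id fits in one byte, and that every arc slot has a fixed size.

```java
import java.util.Arrays;

/** Hypothetical illustration (not Lucene's FST code) of label -> dense id -> arc offset. */
final class IndirectArcTable {
    private final int firstLabel;
    private final byte[] labelToId; // one byte per label in the range; -1 marks a gap
    private final int[] arcOffsets; // indexed by dense id, so it has no gaps

    IndirectArcTable(int[] sortedLabels, int[] arcOffsets) {
        if (sortedLabels.length == 0 || sortedLabels.length != arcOffsets.length
                || sortedLabels.length > 128) {
            throw new IllegalArgumentException("need 1..128 labels, one offset per label");
        }
        this.firstLabel = sortedLabels[0];
        int range = sortedLabels[sortedLabels.length - 1] - firstLabel + 1;
        this.labelToId = new byte[range];
        Arrays.fill(labelToId, (byte) -1);
        for (int id = 0; id < sortedLabels.length; id++) {
            labelToId[sortedLabels[id] - firstLabel] = (byte) id;
        }
        this.arcOffsets = arcOffsets.clone();
    }

    /** Returns the arc offset for the label, or -1 if the node has no such arc. */
    int arcOffset(int label) {
        int slot = label - firstLabel;
        if (slot < 0 || slot >= labelToId.length) {
            return -1;
        }
        int id = labelToId[slot];
        return id < 0 ? -1 : arcOffsets[id];
    }

    /** Direct addressing: every label in the range pays for a full arc slot. */
    static long directBytes(int labelRange, int bytesPerArc) {
        return (long) labelRange * bytesPerArc;
    }

    /** Indirection: one id byte per label in the range, full arc slots only for real arcs. */
    static long indirectBytes(int labelRange, int numArcs, int bytesPerArc) {
        return labelRange + (long) numArcs * bytesPerArc;
    }
}
```

For example, a node with 3 arcs spread over a 26-label range and 16 bytes per arc costs 26 * 16 = 416 bytes with direct addressing, versus 26 + 3 * 16 = 74 bytes with the intermediate lookup. A gap then costs one byte instead of a full arc slot, at the price of one extra memory access per lookup.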