[
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Sokolov updated LUCENE-8920:
---------------------------------
Description:
Some data can lead to worst-case ~4x RAM usage due to this optimization.
Several ideas were suggested to combat this on the mailing list:
bq. I think we can improve the situation here by tracking, per-FST instance, the
size increase we're seeing while building (or perhaps do a preliminary pass
before building) in order to decide whether to apply the encoding.
bq. we could also make the encoding a bit more efficient. For instance I
noticed that arc metadata is pretty large in some cases (in the 10-20 byte
range), which makes gaps very costly. Associating each label with a dense id and
having an intermediate lookup, i.e. lookup label -> id and then id -> arc offset
instead of doing label -> arc directly, could save a lot of space in some cases?
Also it seems that we are repeating the label in the arc metadata when
array-with-gaps is used, even though it shouldn't be necessary since the label
is implicit from the address?
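The dense-id indirection quoted above can be sketched as follows. This is a hypothetical illustration, not Lucene's actual FST code: `ArcLookupSketch`, its method names, and the example label set are all made up to show why a gap-free arc array plus a label -> id table can beat direct addressing when labels are sparse over a wide range.

```java
import java.util.Arrays;

// Hypothetical sketch (not Lucene's FST implementation): compares the slot
// count of direct addressing (one slot per label in [minLabel, maxLabel])
// against a dense-id indirection (label -> id -> arc offset) for a sparse
// label set. With ~10-20 bytes of arc metadata per slot, each gap slot is
// pure waste under direct addressing.
public class ArcLookupSketch {

    // Direct addressing: the arc array spans the whole label range, so every
    // gap between present labels still occupies a slot.
    static int directAddressingSlots(int[] sortedLabels) {
        int min = sortedLabels[0];
        int max = sortedLabels[sortedLabels.length - 1];
        return max - min + 1;
    }

    // Dense-id indirection: the arc array is gap-free, one slot per real arc;
    // the extra cost is one entry per arc in the label table.
    static int denseIdSlots(int[] sortedLabels) {
        return sortedLabels.length;
    }

    // label -> dense id via binary search over the sorted label table
    // (returns -1 when the label has no arc). The id then indexes the
    // gap-free arc array directly, so no label needs to be stored per arc.
    static int labelToId(int[] sortedLabels, int label) {
        int idx = Arrays.binarySearch(sortedLabels, label);
        return idx >= 0 ? idx : -1;
    }

    public static void main(String[] args) {
        // 4 arcs spread over a 201-wide label range: very gappy.
        int[] labels = {10, 50, 120, 210};
        System.out.println("direct slots: " + directAddressingSlots(labels)); // 201
        System.out.println("dense slots:  " + denseIdSlots(labels));         // 4
        System.out.println("id of 120:    " + labelToId(labels, 120));       // 2
    }
}
```

The trade-off sketched here is the one raised in the quote: the indirect lookup pays one binary search per transition, but the arc array stores no gaps and no repeated labels.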
> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]