[
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969588#comment-13969588
]
Robert Muir commented on LUCENE-5584:
-------------------------------------
It depends on your app, but usually something like monotonic packed ints
storing the address of every Nth term, with prefix coding within each block,
will work well. There are examples of this kind of thing all over the Lucene
codebase. It's probably even better compression too, because the FST's
compression of these sequence outputs is not very efficient (and neither is
traversal over a large number of bytes, as you see, unless you are using the
intermediate values and actually derive efficiency from that).
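
To make the layout concrete, here is a minimal sketch of the idea, not code from the Lucene codebase. The class name BlockedTermDict is hypothetical, a plain long[] stands in for monotonic packed ints, and lengths are stored as single bytes, so it assumes sorted terms of at most 255 UTF-8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch: addresses for every Nth term, prefix coding inside each block. */
final class BlockedTermDict {
    private final byte[] data;       // all encoded blocks, concatenated
    private final long[] blockStart; // increasing offsets; stand-in for monotonic packed ints
    private final int blockSize;
    private final int termCount;

    /** Terms must already be sorted; each term must be at most 255 UTF-8 bytes. */
    BlockedTermDict(List<String> sortedTerms, int blockSize) {
        this.blockSize = blockSize;
        this.termCount = sortedTerms.size();
        this.blockStart = new long[(termCount + blockSize - 1) / blockSize];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = new byte[0];
        for (int i = 0; i < termCount; i++) {
            byte[] term = sortedTerms.get(i).getBytes(StandardCharsets.UTF_8);
            if (term.length > 255) {
                throw new IllegalArgumentException("term too long for this sketch");
            }
            int shared = 0;
            if (i % blockSize == 0) {
                // First term of a block: record its address and store it in full.
                blockStart[i / blockSize] = out.size();
            } else {
                // Other terms: store only the suffix that differs from the previous term.
                int max = Math.min(prev.length, term.length);
                while (shared < max && prev[shared] == term[shared]) {
                    shared++;
                }
            }
            out.write(shared);
            out.write(term.length - shared);
            out.write(term, shared, term.length - shared);
            prev = term;
        }
        this.data = out.toByteArray();
    }

    /** Looks up a term by ordinal: jump to its block, then decode at most blockSize entries. */
    String get(int ord) {
        if (ord < 0 || ord >= termCount) {
            throw new IndexOutOfBoundsException("ord=" + ord);
        }
        int pos = (int) blockStart[ord / blockSize];
        byte[] buf = new byte[255]; // keeps the shared prefix from the previous term
        int len = 0;
        for (int i = 0; i <= ord % blockSize; i++) {
            int shared = data[pos++] & 0xFF;
            int suffix = data[pos++] & 0xFF;
            System.arraycopy(data, pos, buf, shared, suffix);
            pos += suffix;
            len = shared + suffix;
        }
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }
}
```

A larger block size means fewer stored addresses and more shared prefixes, at the cost of decoding more entries per lookup.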
> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
> Key: LUCENE-5584
> URL: https://issues.apache.org/jira/browse/LUCENE-5584
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 4.7.1
> Reporter: Christian Ziech
> Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The
> output of an Arc, however, is not reused. This can be especially important
> when traversing large portions of an FST and using the ByteSequenceOutputs
> and CharSequenceOutputs. Those classes create a new byte[] or char[] for
> every node read (which has an output).
> In our use case we intersect a Lucene Automaton with an FST<BytesRef> much
> like it is done in
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(),
> and since the Automaton and the FST are both rather large, tens or even
> hundreds of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if
> you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
> * #write(Object, DataOutput)}, reusing the object passed in if possible. */
> public abstract T read(DataInput in, T reuse) throws IOException;
> /** Decode an output value previously written with {@link
> * #writeFinalOutput(Object, DataOutput)}. By default this
> * just calls {@link #read(DataInput, Object)}. This tries to reuse the
> * object passed in if possible. */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
>   return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method
> passing in the output of the reused Arc. For most inputs they could even just
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch
> and/or tests for the feature.
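
The reuse pattern proposed in the quoted description can be illustrated outside Lucene with a small sketch. Everything here is an assumption made for illustration: ReusableBytes and ReusingReader are hypothetical names, not Lucene classes, and the 4-byte length prefix is an invented encoding rather than the FST's actual output format.

```java
import java.io.DataInput;
import java.io.IOException;

/** Hypothetical mutable holder, loosely analogous to a reusable byte-sequence output. */
final class ReusableBytes {
    byte[] bytes = new byte[0];
    int length;
}

/** Hypothetical reader showing a read(in, reuse) style method. */
final class ReusingReader {
    /**
     * Reads a length-prefixed byte sequence. If reuse is non-null it is refilled
     * and returned, so no new holder is allocated, and its backing array is only
     * replaced when it is too small for the incoming value.
     */
    static ReusableBytes read(DataInput in, ReusableBytes reuse) throws IOException {
        ReusableBytes out = (reuse != null) ? reuse : new ReusableBytes();
        int len = in.readInt(); // assumed encoding: 4-byte length prefix
        if (out.bytes.length < len) {
            out.bytes = new byte[len]; // grow only when needed
        }
        in.readFully(out.bytes, 0, len);
        out.length = len;
        return out;
    }
}
```

With this shape a caller must copy the bytes before the next read if it wants to keep a value, because the same holder is overwritten on every call.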