[
https://issues.apache.org/jira/browse/LUCENE-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996247#comment-13996247
]
Michael McCandless commented on LUCENE-5584:
--------------------------------------------
bq. If the outputs of the FST wouldn't actually be a field of the FST itself
but if they would be under control of the caller of the FST read*Arc methods
just like the BytesReader is, we wouldn't have the problem (maybe instead of
the BytesReader).
This would essentially push thread-privateness of the Outputs out to the
caller. It's true we did this for the BytesReader (and we've wondered in the
past about using a ThreadPrivate instead), but it makes me nervous also pushing
thread-privateness of Outputs to the caller.
I'm also confused about why a custom Outputs impl that "secretly" reuses isn't
sufficient here.
Actually, let me amend the suggestion I made before, to this:
{noformat}
public BytesRef add(BytesRef a, BytesRef b) {
  BytesRef result;
  if (a == NO_OUTPUT) {
    result = new BytesRef();
  } else {
    result = a;
  }
  result.append(b);
  return result;
}
{noformat}
(Not tested). I think something like this would not require any
thread-privateness yet would allow multiple threads to work correctly because
each thread would first do that "result = new BytesRef()" and then re-use that
output from then on, without requiring explicit ThreadLocal anywhere?
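To make the idea concrete, here is a minimal, self-contained sketch of the same pattern. It does not use Lucene: ByteBuf is a hypothetical stand-in for BytesRef, and ReusingAddSketch and its NO_OUTPUT are illustrative names only. The sketch shows that the first add() on a path allocates a private buffer and later calls keep appending to it. It also mutates its first argument in place, so in this sketch the caller has to own that buffer.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a growable byte output; not Lucene's BytesRef. */
final class ByteBuf {
    byte[] bytes = new byte[0];
    int length = 0;

    /** Convenience factory for the example. */
    static ByteBuf of(int... values) {
        ByteBuf buf = new ByteBuf();
        buf.bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            buf.bytes[i] = (byte) values[i];
        }
        buf.length = values.length;
        return buf;
    }

    /** Appends the valid bytes of other, growing the backing array if needed. */
    void append(ByteBuf other) {
        byte[] src = other.bytes;
        int n = other.length;
        if (length + n > bytes.length) {
            bytes = Arrays.copyOf(bytes, Math.max(length + n, bytes.length * 2));
        }
        System.arraycopy(src, 0, bytes, length, n);
        length += n;
    }

    /** Returns a copy of the valid bytes. */
    byte[] toArray() {
        return Arrays.copyOf(bytes, length);
    }
}

final class ReusingAddSketch {
    /** Shared "no output" sentinel; it is never mutated by add(). */
    static final ByteBuf NO_OUTPUT = new ByteBuf();

    /** Allocates only when starting from NO_OUTPUT; otherwise appends to a in place. */
    static ByteBuf add(ByteBuf a, ByteBuf b) {
        ByteBuf result = (a == NO_OUTPUT) ? new ByteBuf() : a;
        result.append(b);
        return result;
    }

    public static void main(String[] args) {
        ByteBuf first = add(NO_OUTPUT, ByteBuf.of(1, 2));
        ByteBuf second = add(first, ByteBuf.of(3));
        System.out.println(first == second);
        System.out.println(Arrays.toString(second.toArray()));
    }
}
```

Because each traversal starts from NO_OUTPUT, each thread ends up with its own buffer in this sketch without any ThreadLocal. Whether the outputs stored on arcs can safely be passed as the first argument is a separate question the sketch does not answer.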
> Allow FST read method to also recycle the output value when traversing FST
> --------------------------------------------------------------------------
>
> Key: LUCENE-5584
> URL: https://issues.apache.org/jira/browse/LUCENE-5584
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/FSTs
> Affects Versions: 4.7.1
> Reporter: Christian Ziech
> Attachments: fst-itersect-benchmark.tgz
>
>
> The FST class heavily reuses Arc instances when traversing the FST. The
> output of an Arc, however, is not reused. This can be especially important when
> traversing large portions of an FST and using the ByteSequenceOutputs and
> CharSequenceOutputs. Those classes create a new byte[] or char[] for every
> node read (which has an output).
> In our use case we intersect a Lucene Automaton with an FST<BytesRef>, much
> like it is done in
> org.apache.lucene.search.suggest.analyzing.FSTUtil.intersectPrefixPaths(), and
> since the Automaton and the FST are both rather large, tens or even hundreds
> of thousands of temporary byte array objects are created.
> One possible solution to the problem would be to change the
> org.apache.lucene.util.fst.Outputs class to have two additional methods (if
> you don't want to change the existing methods for compatibility):
> {code}
> /** Decode an output value previously written with {@link
>  *  #write(Object, DataOutput)}, reusing the object passed in if possible. */
> public abstract T read(DataInput in, T reuse) throws IOException;
>
> /** Decode an output value previously written with {@link
>  *  #writeFinalOutput(Object, DataOutput)}. By default this
>  *  just calls {@link #read(DataInput, Object)}. This tries to reuse the
>  *  object passed in if possible. */
> public T readFinalOutput(DataInput in, T reuse) throws IOException {
>   return read(in, reuse);
> }
> {code}
> The new methods could then be used in the FST in the readNextRealArc() method
> passing in the output of the reused Arc. For most inputs they could even just
> invoke the original read(in) method.
> If you should decide to make that change I'd be happy to supply a patch
> and/or tests for the feature.
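For illustration only, a reusing read along the lines proposed above might look like the following sketch. It is not Lucene code: it uses java.io.DataInput rather than Lucene's DataInput, assumes a made-up encoding (a 4-byte length followed by the bytes), and ReusableBytes and ReusingReadSketch are hypothetical names. It shows the buffer being reallocated only when the reused one is too small.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.IOException;

/** Hypothetical stand-in for a byte-sequence output; not Lucene's BytesRef. */
final class ReusableBytes {
    byte[] bytes = new byte[0];
    int length;
}

final class ReusingReadSketch {
    /**
     * Reads a length-prefixed byte sequence (assumed format: int length, then bytes),
     * filling reuse if it is non-null and allocating only when its array is too small.
     */
    static ReusableBytes read(DataInput in, ReusableBytes reuse) throws IOException {
        int len = in.readInt();
        if (len < 0) {
            throw new IOException("negative length: " + len);
        }
        ReusableBytes out = (reuse != null) ? reuse : new ReusableBytes();
        if (out.bytes.length < len) {
            out.bytes = new byte[len];
        }
        in.readFully(out.bytes, 0, len);
        out.length = len;
        return out;
    }

    /** Wraps a raw byte array as a DataInput for the example. */
    static DataInput inputOf(byte[] raw) {
        return new DataInputStream(new ByteArrayInputStream(raw));
    }

    public static void main(String[] args) throws IOException {
        DataInput in = inputOf(new byte[] {0, 0, 0, 2, 7, 8, 0, 0, 0, 1, 9});
        ReusableBytes first = read(in, null);
        ReusableBytes second = read(in, first);
        System.out.println(first == second);
        System.out.println(second.length);
    }
}
```

A caller that passes the previous arc's output as reuse has to be done with that value before the next read, since the sketch overwrites it in place.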