Hi Dawid,

Yes, the performance differences between Lucene's FST,  morfologik's fsa
and the PatriciaTrie are rather small.

I've managed to get something working well for my use-case. Thanks Dawid
and Michael for all your insight!

On 20 February 2017 at 19:21, Dawid Weiss <dawid.we...@gmail.com> wrote:

> > PatriciaTrie. In particular building an FST with doShareSuffix = false is
> > the fastest of any option,
>
> If you don't share the suffix then you are building a kind of Patricia
> trie... But suffix sharing is cheap and can give you a memory saving
> (and resulting cache locality sometimes) that is non-trivial to
> neglect.
>
> > and morfologik's fsa has the lowest heap usage of
> > all (both in terms of garbage and the final size of the FSA).
>
> I think these would be minuscule differences, really. Something not
> worth the effort of maintaining a compatibility with two libraries,
> for example. Lucene's FST are transducers, so they do have an
> additional benefit in that you can store something extra with each
> entry (this may be handy sometimes -- for example frequency
> information attached to each term, etc.).
>
> > to work with BytesRef) and it supports both prefix
> > (MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
> > (MatchResult.EXACT_MATCH).
>
> So does Lucene's FST, although at a lower level (you'd need to descend
> manually).
>
> > Given the efficiencies and functionality similarity of the FST vs
> Automation
> > in Lucene, I'm kind of curious as to why you would ever use Automaton?
>
> The Automaton class was ported from Brics  library and it supports a
> much more generic class of automata, including optimizations from
> NFA->DFA, etc. The FST class is built directly from one specific kind
> of data (unique sorted keys on input) and it's heavily optimized
> towards this case.
>
> Dawid
>

Reply via email to