[
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492904#comment-16492904
]
David Smiley commented on LUCENE-8332:
--------------------------------------
Oh, I wanted to mention one thing; perhaps just here, though I could put it in
the docs.
An alternative approach to this tagger might be to use the SynonymGraphFilter
(with other steps/configuration),
which has a lot of similarities with the Tagger's algorithm. I've heard of
others that have done this (Dice.com?), and before I created the tagger I
thought about this approach too. There are some issues/barriers to "just"
using the synonym filter:
* if the filter finds multiple overlapping matches, it returns only one, and
you have no control over which. (Compare to the STT's "overlaps" param, which
offers several choices and is pluggable.)
* the filter doesn't hold any metadata; it's just a set of names, though you
could use synonyms to map to an ID that you then look up in something else
(e.g. some DB or Solr index).
* the synonym filter must re-construct its FST on startup each time;
customizations are necessary to load an existing one from disk.
* you have to arrange for any text processing/analysis (e.g. tokenization rules
or phonetic filters) of the dictionary to create synonym entries. With the STT
this is all configurable in a standard way like any text field.
* and of course you'd have to glue it all together somehow.
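
To make the first point concrete, here is a minimal sketch of what having a choice of overlap policy means. It uses plain Java with no Lucene or Solr classes, and every name in it (OverlapSketch, Span, all, longestDominantRight) is hypothetical. The second policy is only loosely modeled on the idea of preferring the longest match; it is not the tagger's actual implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: two ways to resolve overlapping dictionary matches. */
final class OverlapSketch {

    /** A match covering token positions [start, end), tagged with a dictionary ID. */
    static final class Span {
        final int start;
        final int end;
        final String id;

        Span(int start, int end, String id) {
            this.start = start;
            this.end = end;
            this.id = id;
        }

        int length() {
            return end - start;
        }

        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }
    }

    /** Policy 1: keep every match, overlapping or not. */
    static List<Span> all(List<Span> matches) {
        return new ArrayList<>(matches);
    }

    /**
     * Policy 2: keep the longest match (ties go to the right-most one),
     * drop everything that overlaps it, and repeat on what is left.
     */
    static List<Span> longestDominantRight(List<Span> matches) {
        List<Span> candidates = new ArrayList<>(matches);
        candidates.sort((a, b) -> {
            int byLength = Integer.compare(b.length(), a.length()); // longer first
            return byLength != 0 ? byLength : Integer.compare(b.start, a.start); // then right-most
        });
        List<Span> kept = new ArrayList<>();
        for (Span candidate : candidates) {
            // A candidate survives only if it overlaps nothing already kept.
            if (kept.stream().noneMatch(candidate::overlaps)) {
                kept.add(candidate);
            }
        }
        kept.sort(Comparator.comparingInt((Span s) -> s.start)); // back to text order
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical matches over the tokens: new york city hall
        List<Span> matches = List.of(
                new Span(0, 2, "new-york"),
                new Span(0, 3, "new-york-city"),
                new Span(2, 4, "city-hall"),
                new Span(1, 2, "york"));
        System.out.println("all: " + all(matches).size() + " spans");
        for (Span s : longestDominantRight(matches)) {
            System.out.println("kept: " + s.id + " [" + s.start + "," + s.end + ")");
        }
    }
}
```

With a plain synonym filter you get roughly one fixed behavior; the point of a pluggable policy is that a caller can pick either of the above (or something else) per request.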
> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> -------------------------------------------------------------------
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Let's move the CompletionTokenStream from the suggest module into the
> analysis module, renaming it ConcatenateGraphTokenStream. See comments in
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has
> several uses:
> * for the suggest module
> * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
> * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from
> the suggest module as it's marked experimental and it only seems to be public
> now perhaps due to some technicality (it has package level constructors).