[
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koorosh Vakhshoori updated SOLR-7136:
-------------------------------------
Attachment: SOLR-7136.patch
AutoPhaseFiniteStateDiagram.pdf
I am uploading a new implementation of AutoPhrasing, developed in coordination
with Ted. This version adds several features on top of the previous code:
- The phrase detection algorithm has been refactored as a finite-state machine
(FSM) that takes one term as input for each transition. The FSM diagram is
attached (AutoPhaseFiniteStateDiagram.pdf).
- The new code correctly keeps track of the start and end offsets in all cases.
- The code now records the PositionLength attribute, which will be handy for
the highlighter once the highlighter is fixed (SOLR-3390).
- There is a new argument ‘emitAmbiguousPhrases’. When it is set to false, only
the auto-phrase that matches the longest sequence of terms is emitted. For
example, if autophrases.txt contains both ‘New York City’ and ‘New York’ and
the text is ‘New York City is a great place to live’, only ‘New York City’ is
emitted. My use case required this, and I expect others will want it too.
- Rather than applying AutoPhrasing at index time, you can now detect phrases
at query time by setting ‘quotePhrase’ to true. This is a major enhancement:
nothing special is needed at index time, because the queryParser simply
double-quotes the detected phrase and runs the search as a phrase query.
Another advantage is that you can update the autophrases.txt file on the fly
without re-indexing.
- Updated the queryParser so that it does not touch any term inside a quoted
string, since doing so would interfere with the user’s intent. For example, in
the query ‘we are going to “New York airport”’, the phrase “New York airport”
is left untouched.
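
To make the query-time behavior concrete, here is a minimal, self-contained
sketch of longest-match phrase quoting that leaves user-quoted strings alone.
It is not the code in the patch: it uses a plain scan rather than the FSM, does
not depend on Lucene or Solr, and the class and method names
(AutoPhraseQuerySketch, quotePhrases, longestMatch) are hypothetical. It only
covers the case where the longest phrase wins, as with emitAmbiguousPhrases set
to false.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical names, no Lucene/Solr classes involved.
final class AutoPhraseQuerySketch {
    // A token is either a user-quoted string or a run of non-whitespace characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Length in tokens of the longest phrase starting at pos, or 0 if none matches. */
    static int longestMatch(List<String> tokens, int pos, List<String> phrases) {
        int best = 0;
        for (String phrase : phrases) {
            String[] words = phrase.trim().split("\\s+");
            if (words.length <= best || pos + words.length > tokens.size()) {
                continue; // cannot beat the current best, or runs past the end
            }
            boolean ok = true;
            for (int k = 0; k < words.length && ok; k++) {
                ok = tokens.get(pos + k).equalsIgnoreCase(words[k]);
            }
            if (ok) {
                best = words.length;
            }
        }
        return best;
    }

    /** Wraps each detected phrase in double quotes; user-quoted tokens pass through. */
    static String quotePhrases(String query, List<String> phrases) {
        List<String> tokens = tokenize(query);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            // Tokens that begin with a double quote came from the user: never rewrite them.
            int n = token.startsWith("\"") ? 0 : longestMatch(tokens, i, phrases);
            if (n > 0) {
                out.add("\"" + String.join(" ", tokens.subList(i, i + n)) + "\"");
                i += n;
            } else {
                out.add(token);
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("New York City", "New York");
        System.out.println(quotePhrases("New York City is a great place to live", phrases));
        System.out.println(quotePhrases("we are going to \"New York airport\"", phrases));
    }
}
```

In this sketch the phrase list stands in for autophrases.txt, and whitespace in
the query is normalized to single spaces. Matching is case-insensitive here,
which is an assumption and may differ from how the patch is configured.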
Side note: as for comparing the SOLR-4381 patch with this one, in my opinion
they are complementary, not competing. I experimented with chaining
AutoPhrasing and Query-time Synonym as a queryParser. They work well together:
one detects the phrases and the other expands the query to its synonyms.
However, one issue I found involves acronyms in the synonym list. For example,
DC stands for ‘Direct Current’. If the indexed text contains DC, searching for
‘Current’ does not match DC, because the indexed document has not expanded the
term to ‘Direct Current’.
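
The acronym issue can be reproduced with a toy inverted index. The sketch below
is only an illustration under simplifying assumptions (whitespace tokenization,
lowercase terms, a hypothetical AcronymSynonymSketch class); it does not use
Solr's synonym filter. It shows that a document containing DC is not found by
‘Current’ unless the acronym was expanded when the document was indexed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy index, not Solr's analysis chain.
final class AcronymSynonymSketch {

    /**
     * Builds term -> document ids. Terms found in indexTimeSynonyms are also
     * indexed under each word of their expansion (pass an empty map to skip this).
     */
    static Map<String, Set<Integer>> buildIndex(List<String> docs,
                                                Map<String, String> indexTimeSynonyms) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase(Locale.ROOT).split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                String expansion = indexTimeSynonyms.get(term);
                if (expansion != null) {
                    for (String extra : expansion.split("\\s+")) {
                        index.computeIfAbsent(extra, t -> new TreeSet<>()).add(id);
                    }
                }
            }
        }
        return index;
    }

    /** Single-term lookup; returns an empty set when the term is not indexed. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the DC motor hums");
        Map<String, String> synonyms = Collections.singletonMap("dc", "direct current");

        // No index-time expansion: 'Current' finds nothing, 'DC' finds document 0.
        Map<String, Set<Integer>> plain = buildIndex(docs, Collections.emptyMap());
        System.out.println(search(plain, "Current"));
        System.out.println(search(plain, "DC"));

        // With index-time expansion: 'Current' now finds document 0.
        Map<String, Set<Integer>> expanded = buildIndex(docs, synonyms);
        System.out.println(search(expanded, "Current"));
    }
}
```

Query-time expansion alone cannot help here, since the query term ‘Current’
has no synonym entry of its own; only the acronym does.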
> Add an AutoPhrasing TokenFilter
> -------------------------------
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
> Issue Type: New Feature
> Reporter: Ted Sullivan
> Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch,
> SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases
> that represent a single entity to be tokenized in a singular fashion. Adds
> support for ManagedResources and Query parser auto-phrasing support given
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing
> multi-term synonym problem in Lucene Solr has been documented online.
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)