[jira] [Updated] (SOLR-7136) Add an AutoPhrasing TokenFilter

Koorosh Vakhshoori (JIRA) Thu, 19 Nov 2015 12:15:01 -0800

     [ https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Koorosh Vakhshoori updated SOLR-7136:
-------------------------------------
    Attachment: SOLR-7136.patch
                AutoPhaseFiniteStateDiagram.pdf

I am uploading a new implementation of AutoPhrasing, in coordination with 
Ted. This version adds several features on top of the previous code:
- The phrase detection algorithm is refactored as a finite-state machine (FSM). 
Each transition of the FSM takes one term as input. The FSM diagram is 
attached.
- The new code correctly keeps track of the start and end offsets in all cases.
- The code now records the PositionLength attribute, which will be useful to 
the highlighter once the highlighter is fixed (SOLR-3390).
- There is a new argument, ‘emitAmbiguousPhrases’. When it is set to false, 
only the auto-phrase that matches the longest sequence of terms is emitted. For 
example, if the autophrases.txt file contains both ‘New York City’ and ‘New 
York’ and the text is ‘New York City is a great place to live’, only ‘New York 
City’ is emitted. My use case required this, and I expect others will want it 
too.
- Rather than applying AutoPhrasing at index time, you can now detect phrases 
at query time by setting ‘quotePhrase’ to true. This is a major enhancement: 
nothing special is needed at index time, because the queryParser simply 
double-quotes the detected phrase and runs the search as a phrase query. 
Another advantage is that you can update the autophrases.txt file on the fly, 
with no need to re-index.
- The queryParser is updated so that it does not touch any term inside a quoted 
string, since doing so would interfere with the user’s intent. For example, in 
the query ‘we are going to “New York airport”’ the phrase “New York airport” is 
left untouched.
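
To make the longest-match and query-time quoting behaviour above concrete, 
here is a minimal Java sketch. It is not the code in the patch: it uses a plain 
sliding-window lookup instead of the FSM, and the class name AutoPhraseSketch 
and the methods detect and quotePhrases are hypothetical names chosen for 
illustration. It also assumes that, with emitAmbiguousPhrases set to true, 
every listed phrase starting at a position is emitted, and that user quotes are 
plain ASCII double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the classes from SOLR-7136.patch.
class AutoPhraseSketch {
    // Stand-in for the entries of autophrases.txt, lower-cased.
    private final Set<String> phrases = new HashSet<>();
    private int maxTerms = 0;

    AutoPhraseSketch(String... entries) {
        for (String e : entries) {
            String[] terms = e.trim().toLowerCase(Locale.ROOT).split("\\s+");
            phrases.add(String.join(" ", terms));
            maxTerms = Math.max(maxTerms, terms.length);
        }
    }

    // True if the len tokens starting at position i form a listed phrase.
    private boolean matches(List<String> tokens, int i, int len) {
        String window = String.join(" ", tokens.subList(i, i + len));
        return phrases.contains(window.toLowerCase(Locale.ROOT));
    }

    // With emitAmbiguousPhrases == false, only the longest phrase at each
    // position is emitted and the scan resumes after it (assumed behaviour).
    List<String> detect(List<String> tokens, boolean emitAmbiguousPhrases) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int longest = 0;
            for (int len = Math.min(maxTerms, tokens.size() - i); len >= 1; len--) {
                if (!matches(tokens, i, len)) continue;
                if (longest == 0) longest = len;
                out.add(String.join(" ", tokens.subList(i, i + len)));
                if (!emitAmbiguousPhrases) break;
            }
            i += (longest > 0 && !emitAmbiguousPhrases) ? longest : 1;
        }
        return out;
    }

    // Wraps the longest detected phrase in double quotes, leaving anything
    // the user already quoted untouched.
    String quotePhrases(String query) {
        List<String> tokens = Arrays.asList(query.trim().split("\\s+"));
        List<String> out = new ArrayList<>();
        boolean inUserQuote = false;
        int i = 0;
        while (i < tokens.size()) {
            String t = tokens.get(i);
            long quoteChars = t.chars().filter(c -> c == '"').count();
            int len = 0;
            if (!inUserQuote && quoteChars == 0) {
                for (int l = Math.min(maxTerms, tokens.size() - i); l >= 1 && len == 0; l--) {
                    if (matches(tokens, i, l)) len = l;
                }
            }
            if (len > 0) {
                out.add('"' + String.join(" ", tokens.subList(i, i + len)) + '"');
                i += len;
            } else {
                out.add(t);
                // An odd number of quote characters opens or closes a user quote.
                if (quoteChars % 2 == 1) inUserQuote = !inUserQuote;
                i++;
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        AutoPhraseSketch s = new AutoPhraseSketch("New York City", "New York");
        List<String> text = Arrays.asList("New York City is a great place to live".split(" "));
        System.out.println(s.detect(text, false));
        System.out.println(s.detect(text, true));
        System.out.println(s.quotePhrases("we are going to \"New York airport\""));
    }
}
```

The sketch matches on whitespace-separated terms only, so punctuation attached 
to a term would prevent a match; a real token filter works on the analyzer's 
token stream and also has to maintain offsets and position lengths.
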
Side note: comparing the SOLR-4381 patch with this one, in my opinion they are 
complementary, not competing. I experimented with chaining AutoPhrasing and 
Query-time Synonym as a queryParser. They work well together: one detects the 
phrases and the other expands the query to its synonyms. However, I found one 
issue around acronyms in the synonym list. For example, DC stands for ‘Direct 
Current’. If the indexed text contains DC, searching for ‘Current’ does not 
match DC, since the term in the indexed document has not been expanded to 
‘Direct Current’.
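
The acronym gap can be illustrated with a small Java sketch. This is a toy 
model under stated assumptions, not Solr code: the class QueryTimeSynonymGap, 
its synonym map and the term-set "index" are all hypothetical. It assumes 
synonyms are applied only to the query, and only when the whole query string is 
an entry in the synonym list.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of query-time-only synonym expansion.
class QueryTimeSynonymGap {
    // Assumed synonym list: the acronym and its expansion map to each other.
    static final Map<String, List<String>> SYNONYMS = Map.of(
            "direct current", List.of("dc"),
            "dc", List.of("direct current"));

    // Expands a query string to itself plus any listed synonyms.
    static Set<String> expand(String query) {
        String key = query.trim().toLowerCase(Locale.ROOT);
        Set<String> out = new LinkedHashSet<>();
        out.add(key);
        out.addAll(SYNONYMS.getOrDefault(key, List.of()));
        return out;
    }

    // The index holds the document's terms as written; nothing was expanded
    // at index time.
    static boolean matches(Set<String> indexedTerms, String query) {
        for (String candidate : expand(query)) {
            if (indexedTerms.contains(candidate)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("the", "dc", "motor", "stalled");
        System.out.println(matches(indexed, "Direct Current")); // query expands to include "dc"
        System.out.println(matches(indexed, "Current"));        // no synonym entry for "current"
    }
}
```

In this model the query ‘Direct Current’ reaches the document through its 
synonym ‘dc’, while ‘Current’ alone has no synonym entry and the index never 
contained the expanded form, so nothing matches.
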


> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch, 
> SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases 
> that represent a single entity to be tokenized in a singular fashion. Adds 
> support for ManagedResources and Query parser auto-phrasing support given 
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long standing 
> multi-term synonym problem in Lucene Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
