[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

Michael McCandless (JIRA) Sun, 26 Jul 2015 14:43:22 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

New patch: it fixes all nocommits, folds in all the nice test cases from 
LUCENE-6582 (thanks [~ianribas]!), and fixes some offset bugs.

I think it's finally ready.  This issue absorbs LUCENE-6638.

I also wrote a fun test method ({{toDot(TokenStream)}}) that converts a 
{{TokenStream}} to a dot file, which you can then render with Graphviz.  E.g. 
here's the un-flattened expansion for various synonyms of usa:

!usa.png!

and the corresponding flattened version:

!usa_flat.png!

(red arcs are the inserted synonym tokens)
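
To give a rough idea of what such a conversion involves, here is a minimal, self-contained sketch in Python. It is not the {{toDot(TokenStream)}} method from the patch and does not use any Lucene API: the {{Token}} tuple, the {{to_dot}} function and the example tokens (a single "usa" synonym over "united states of america") are assumptions made up for illustration. Each token becomes an arc from its start position to its start position plus its position length, and synonym arcs are colored red.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for one token of a token stream."""
    term: str
    pos_inc: int           # position increment relative to the previous token
    pos_len: int           # number of positions this token spans
    synonym: bool = False  # True for tokens inserted by synonym expansion


def to_dot(tokens: List[Token]) -> str:
    """Render a token graph as Graphviz dot text (illustrative sketch only)."""
    lines = ["digraph tokens {", "  rankdir=LR;"]
    pos = -1  # positions start before the first token, as increments are relative
    for tok in tokens:
        pos += tok.pos_inc
        attrs = f'label="{tok.term}"'
        if tok.synonym:
            attrs += ", color=red"
        # One arc per token: from its start node to its end node.
        lines.append(f"  {pos} -> {pos + tok.pos_len} [{attrs}];")
    lines.append("}")
    return "\n".join(lines)


# Assumed example: "usa" inserted as a synonym spanning four input tokens.
EXAMPLE = [
    Token("usa", 1, 4, synonym=True),
    Token("united", 0, 1),
    Token("states", 1, 1),
    Token("of", 1, 1),
    Token("america", 1, 1),
]

print(to_dot(EXAMPLE))
```

Saving the printed text to a file and running it through Graphviz (for example {{dot -Tpng}}) should produce a picture along the lines of the un-flattened graph above.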

With {{SynonymGraphFilter}}, multi-token synonyms can finally be represented 
correctly in the token stream.  If you apply synonyms at query time, using 
either {{TermAutomatonQuery}} or some other approach (e.g. expanding all paths 
and making an OR of {{PhraseQuery}}), the correct results should be returned.  
Index-time synonyms will still be incorrect (they fail to match some phrase 
queries and incorrectly match others) because we don't index the 
{{PositionLengthAttribute}}.
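
As a rough illustration of the "expand all paths" idea and of why dropping the position length is lossy, here is a small Python sketch. It is a toy model, not Lucene code: the {{(term, pos_inc, pos_len)}} tuples, the {{enumerate_paths}} function and the "dns" example are assumptions chosen for illustration. Each enumerated path corresponds to one phrase that a query-time expansion would OR together; passing {{keep_pos_len=False}} mimics an index that treats every token as spanning one position.

```python
from collections import defaultdict
from typing import Iterator, List, Tuple

# (term, position increment, position length); a hypothetical token model.
TokenTuple = Tuple[str, int, int]


def enumerate_paths(tokens: List[TokenTuple], keep_pos_len: bool = True) -> List[str]:
    """Return every start-to-end path through the token graph as a phrase."""
    arcs = defaultdict(list)  # start node -> [(term, end node), ...]
    pos, end = -1, 0
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        span = pos_len if keep_pos_len else 1  # dropping posLength forces span 1
        arcs[pos].append((term, pos + span))
        end = max(end, pos + span)

    def walk(node: int) -> Iterator[List[str]]:
        if node == end:
            yield []
            return
        for term, nxt in arcs[node]:
            for rest in walk(nxt):
                yield [term] + rest

    return sorted(" ".join(path) for path in walk(0))


# Assumed example: input "domain name system is fragile", with "dns" added
# as a synonym that spans the first three positions.
TOKENS = [
    ("dns", 1, 3),
    ("domain", 0, 1),
    ("name", 1, 1),
    ("system", 1, 1),
    ("is", 1, 1),
    ("fragile", 1, 1),
]

# With position lengths kept, the graph is correct:
print(enumerate_paths(TOKENS))
# ['dns is fragile', 'domain name system is fragile']

# With position lengths dropped, "dns is fragile" is lost and the bogus
# phrase "dns name system is fragile" appears:
print(enumerate_paths(TOKENS, keep_pos_len=False))
# ['dns name system is fragile', 'domain name system is fragile']
```

In this toy model, the second result shows both failure modes at once: a phrase that should match no longer does, and a phrase that should not match now does.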


> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, usa.png, 
> usa_flat.png
>
>
> Spinoff from LUCENE-6582.
>
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
>
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
>
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
>
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
>
> This new syn filter still cannot consume an arbitrary graph.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
