[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

Michael McCandless (JIRA) Tue, 28 Jul 2015 01:09:13 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

New patch with Rob's idea: I made the new SynonymGraphFilter and
SausageFilter package private, and replaced the old SynonymFilter with
these two filters.

But TestSynonymMapFilter (the existing unit test) fails, because there
are some changes in behavior with the new filter:

  * Syn output order is different: with the new syn filter, the syn
    comes out before the original token.  This is necessary to ensure
    offsets never go backwards...

  * When there are more output tokens for a syn than input tokens,
    the new syn filter creates new positions for the extra tokens,
    but the old one didn't.

  * The new syn filter does more captureState() calls

I think we need to keep the old behavior available, maybe using a
Version constant or a separate class (SynFilterPre53,
LegacySynFilter) or something?


> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter) that produces correct graphs (it does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
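
To make the "enumerate all paths and make union of PhraseQuery" idea
concrete, here is a minimal sketch in plain Java with no Lucene
dependency.  The Token class, the enumeratePhrases method and the
"usa" token graph are all hypothetical and hand-built for illustration;
they are not taken from the patch and not captured from the filter's
real output.  The only thing borrowed from Lucene is the meaning of
positionIncrement and positionLength: each token is treated as an edge
from its position to position + posLength.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenGraphPaths {

    /** Hypothetical stand-in for one token: term plus posInc/posLength. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLength;

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    /** A token viewed as a graph edge that ends at node {@code to}. */
    private static final class Edge {
        final String term;
        final int to;

        Edge(String term, int to) {
            this.term = term;
            this.to = to;
        }
    }

    /** Returns every start-to-end path through the token graph as a phrase. */
    static List<String> enumeratePhrases(List<Token> tokens) {
        List<String> phrases = new ArrayList<>();
        if (tokens.isEmpty()) {
            return phrases;
        }
        Map<Integer, List<Edge>> edges = new TreeMap<>();
        int pos = -1; // the first token's posInc of 1 puts it at position 0
        int end = 0;
        for (Token t : tokens) {
            pos += t.posInc;
            int to = pos + t.posLength;
            edges.computeIfAbsent(pos, k -> new ArrayList<>()).add(new Edge(t.term, to));
            end = Math.max(end, to);
        }
        walk(edges, 0, end, new ArrayList<>(), phrases);
        return phrases;
    }

    /** Depth-first walk; records a phrase each time the end node is reached. */
    private static void walk(Map<Integer, List<Edge>> edges, int node, int end,
                             List<String> path, List<String> out) {
        if (node == end) {
            out.add(String.join(" ", path));
            return;
        }
        for (Edge e : edges.getOrDefault(node, Collections.<Edge>emptyList())) {
            path.add(e.term);
            walk(edges, e.to, end, path, out);
            path.remove(path.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Assumed graph for "usa capital" with usa <-> united states of america;
        // the syn is listed before the original token, which spans 4 positions.
        List<Token> tokens = Arrays.asList(
            new Token("united", 1, 1),
            new Token("usa", 0, 4),
            new Token("states", 1, 1),
            new Token("of", 1, 1),
            new Token("america", 1, 1),
            new Token("capital", 1, 1));
        for (String phrase : enumeratePhrases(tokens)) {
            System.out.println(phrase);
        }
        // prints:
        // united states of america capital
        // usa capital
    }
}
```

Each returned phrase would then become one PhraseQuery, with the
queries combined as optional (SHOULD) clauses of a BooleanQuery.  Note
that the number of paths can grow quickly when several multi-token
synonyms appear in one query, which is one reason TermAutomatonQuery
may be the better fit.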



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
