[jira] [Commented] (LUCENE-8497) Rethink multi-term analysis handling

Erick Erickson (JIRA) Wed, 17 Oct 2018 19:45:33 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654550#comment-16654550
 ]


Erick Erickson commented on LUCENE-8497:
----------------------------------------

For crying out loud, I wrote that 11 years ago and you expect me to remember 
why ;)

OK, I'll get serious now. MultiTermAwareComponent originated as a way to apply 
some filters to wildcard queries; it all started with LowerCaseFilter. I got 
really tired of explaining to users that "Sol*" didn't find "solr" because 
terms with wildcards were unanalyzed. As long as that behavior is retained, 
that test can be removed as far as I'm concerned. It's pretty out of date, and 
it only verifies a few of the filters that implement that interface anyway.
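
To make the "Sol*" case concrete, here is a toy sketch in plain Java. It is not 
Lucene code: the "index" is just a list of terms assumed to have been 
lowercased at index time, a trailing * is treated as a prefix match, and the 
class and method names (WildcardNormalizationSketch, matches) are made up for 
illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy model, not Lucene code: indexed terms are assumed to be lowercased
// already, and a trailing '*' in the pattern is treated as a prefix match.
class WildcardNormalizationSketch {

    // Hypothetical helper: true if any indexed term matches the pattern.
    static boolean matches(List<String> indexedTerms, String pattern, boolean normalize) {
        // Only lowercasing is applied to the wildcard term; nothing else is analyzed.
        String p = normalize ? pattern.toLowerCase(Locale.ROOT) : pattern;
        boolean isPrefix = p.endsWith("*");
        String text = isPrefix ? p.substring(0, p.length() - 1) : p;
        for (String term : indexedTerms) {
            boolean hit = isPrefix ? term.startsWith(text) : term.equals(text);
            if (hit) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> indexed = Arrays.asList("solr", "lucene");
        System.out.println(matches(indexed, "Sol*", false)); // false: wildcard term left unanalyzed
        System.out.println(matches(indexed, "Sol*", true));  // true: wildcard term lowercased first
    }
}
```

The second call is the behavior I care about keeping: the wildcard term goes 
through the same lowercasing the indexed terms did, and nothing more.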

So remove it if you see fit. A more effective test of the behavior I care about 
would check that every filter implementing that interface works properly 
with, say, wildcards in the search term.

> Rethink multi-term analysis handling
> ------------------------------------
>
>                 Key: LUCENE-8497
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8497
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8497.patch, LUCENE-8497.patch
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The current framework for handling term normalisation works via instanceof 
> checks for MultiTermAwareComponent and casts.  MultiTermAwareComponent itself 
> deals in AbstractAnalysisComponents, and so callers need to cast to the 
> correct component type before use, which is ripe for misuse.
> We should re-organise all this to be type-safe and usable without casts.  One 
> possibility is to add `normalize` methods to CharFilterFactory and 
> TokenFilterFactory that mirror their existing `create` methods.  The default 
> implementation would return the input unchanged, while filters that should 
> apply at normalization time can delegate to `create`.
> Related to this, we should deprecate and remove LowerCaseTokenizer, which 
> combines tokenization and normalization in a way that will break this API.
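
The `normalize` idea in the description above can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not the 
attached patch or Lucene's API: the types here (SketchFilterFactory, 
SketchLowerCaseFactory, SketchTruncateFactory, NormalizeSketch) are 
hypothetical and operate on plain Strings rather than token streams, so the 
example has no dependencies.

```java
import java.util.Locale;

// Hypothetical stand-in for a filter factory; Strings replace token streams.
interface SketchFilterFactory {
    // Full analysis-time behavior.
    String create(String input);

    // Normalization-time behavior: by default, return the input unchanged.
    default String normalize(String input) {
        return input;
    }
}

// A filter that should apply at normalization time delegates to create().
class SketchLowerCaseFactory implements SketchFilterFactory {
    @Override
    public String create(String input) {
        return input.toLowerCase(Locale.ROOT);
    }

    @Override
    public String normalize(String input) {
        return create(input);
    }
}

// A filter that should not touch wildcard terms keeps the default normalize().
class SketchTruncateFactory implements SketchFilterFactory {
    @Override
    public String create(String input) {
        return input.length() > 3 ? input.substring(0, 3) : input;
    }
}

class NormalizeSketch {
    // Runs a term through every factory's normalize(); no instanceof checks or casts.
    static String normalizeAll(String term, SketchFilterFactory... chain) {
        String result = term;
        for (SketchFilterFactory factory : chain) {
            result = factory.normalize(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String out = normalizeAll("Solr*", new SketchLowerCaseFactory(), new SketchTruncateFactory());
        System.out.println(out); // prints "solr*"
    }
}
```

The caller never needs to know which factories take part in normalization: 
each one either overrides `normalize` or inherits the pass-through default.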



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

