[jira] [Commented] (LUCENE-8240) Support different analysis per field instance

Adrien Grand (JIRA) Thu, 05 Apr 2018 06:09:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426899#comment-16426899
 ]


Adrien Grand commented on LUCENE-8240:
--------------------------------------

Ok, I understand your use-case now. I'm not sure I'm up to making it easy to do 
this kind of thing. For instance, knowing the text content and the analyzer is 
not enough to know how a field got analyzed: you'd also need to know what sub 
field name was provided. I'm wondering whether you may be able to do what you 
want with the current API by creating a Tokenizer wrapper that sets the current 
sub field name in a custom attribute, and then having a custom synonym filter 
that applies different synonyms depending on the current field, which it can 
read thanks to the custom attribute?
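
To make the suggestion above concrete, here is a minimal sketch of its shape in plain Java. It deliberately does not use Lucene classes: TokenState, SubFieldTaggingTokenizer and PerSubFieldSynonymFilter are hypothetical stand-ins for a token stream with a custom attribute, the Tokenizer wrapper, and the custom synonym filter. The whitespace tokenization and the one-to-one synonym maps are assumptions made only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for per-token state; NOT a Lucene class.
final class TokenState {
    final String term;      // the token text
    final String subField;  // plays the role of the "custom attribute"

    TokenState(String term, String subField) {
        this.term = term;
        this.subField = subField;
    }
}

// Stand-in for the Tokenizer wrapper: it tags every token it emits
// with the sub field name it was created for.
final class SubFieldTaggingTokenizer {
    private final String subField;

    SubFieldTaggingTokenizer(String subField) {
        this.subField = subField;
    }

    List<TokenState> tokenize(String text) {
        List<TokenState> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(new TokenState(t, subField));
            }
        }
        return out;
    }
}

// Stand-in for the custom synonym filter: it reads the sub field name
// from each token and picks the synonym map registered for it.
final class PerSubFieldSynonymFilter {
    private final Map<String, Map<String, String>> synonymsBySubField;

    PerSubFieldSynonymFilter(Map<String, Map<String, String>> synonymsBySubField) {
        this.synonymsBySubField = synonymsBySubField;
    }

    List<String> apply(List<TokenState> tokens) {
        List<String> out = new ArrayList<>();
        for (TokenState tok : tokens) {
            Map<String, String> synonyms =
                synonymsBySubField.getOrDefault(tok.subField, Collections.emptyMap());
            // Tokens without a synonym for this sub field pass through unchanged.
            out.add(synonyms.getOrDefault(tok.term, tok.term));
        }
        return out;
    }
}
```

In a real implementation the sub field name would presumably live in an attribute shared through the token stream rather than being copied onto every token; the sketch only shows that the filter needs nothing beyond that one extra piece of state.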

> Support different analysis per field instance
> ---------------------------------------------
>
>                 Key: LUCENE-8240
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: SubFieldAnalyzer.java
>
>
> The simplest change for this would be to make 
> TokenStreamComponents.setReader() public. Another alternative would be to 
> provide a SubFieldAnalyzer along the lines of what is attached, although for 
> reasons given below I think this implementation is a little hacky and would 
> ideally be supported in a different way before making *that* part of a public 
> Lucene API.
> Exposing this method would allow a third-party extension to access it in 
> order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer 
> (attached, for reference) that applies different analysis to different 
> instances of a field. This supports a big "catch-all" field whose instances 
> receive different (index-time) text processing. We implement that by 
> creating a TokenStreamComponents that wraps separate per-subfield components 
> and switches among them when setReader() is called.
> Why setReader()? This is the only part of the API where we can inject this 
> notion of subfields. setReader() is called with a Reader for each field 
> instance, and we supply a special Reader that identifies its subfield.
> This is a bit hacky; ideally subfields would be first-class citizens in the 
> Analyzer API, so e.g. there would be methods like 
> Analyzer.createComponents(String fieldName, String subFieldName), etc. 
> However this seems like a pretty big change for an experimental feature, so 
> it seems like an OK tradeoff to live with the Reader-per-subfield hack for 
> now.
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis 
> package in order to call TokenStreamComponents.setReader (on a separate 
> instance) and satisfy Java's access rules, which is awkward.
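
The Reader-per-subfield approach described in the issue can be modelled roughly as follows. This is a simplified sketch in plain Java, not the attached SubFieldAnalyzer and not Lucene code: SubFieldReader and SwitchingComponents are hypothetical names, each per-subfield analysis is reduced to a function applied to whitespace-separated tokens, and the fallback for untagged readers is an assumption.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical Reader that identifies the subfield it belongs to.
final class SubFieldReader extends FilterReader {
    final String subField;

    SubFieldReader(String subField, Reader in) {
        super(in);
        this.subField = subField;
    }
}

// Stand-in for the wrapping components: the per-subfield analysis is
// chosen at the moment setReader() is called, based on the Reader's type.
final class SwitchingComponents {
    private final Map<String, Function<String, String>> perSubField;
    private final Function<String, String> fallback;
    private Function<String, String> current;
    private Reader reader;

    SwitchingComponents(Map<String, Function<String, String>> perSubField,
                        Function<String, String> fallback) {
        this.perSubField = perSubField;
        this.fallback = fallback;
        this.current = fallback;
    }

    void setReader(Reader r) {
        reader = r;
        if (r instanceof SubFieldReader) {
            String name = ((SubFieldReader) r).subField;
            current = perSubField.getOrDefault(name, fallback);
        } else {
            current = fallback; // untagged readers get the default analysis
        }
    }

    // Reads the current Reader fully and applies the selected analysis per token.
    List<String> tokens() {
        if (reader == null) {
            throw new IllegalStateException("setReader() must be called first");
        }
        StringBuilder text = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (String t : text.toString().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(current.apply(t));
            }
        }
        return out;
    }
}
```

The point of the model is that setReader() is the only hook that sees each field instance, which is why the subfield identity has to travel inside the Reader itself.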


