Re: [jira] [Commented] (LUCENE-8240) Support different analysis per field instance

Michael Sokolov Thu, 05 Apr 2018 06:15:07 -0700

Ok, that was actually my first implementation. It was a lot messier. I'll
follow up with details when I get back to a keyboard.


On Thu, Apr 5, 2018, 9:09 AM Adrien Grand (JIRA) <[email protected]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426899#comment-16426899
> ]
>
> Adrien Grand commented on LUCENE-8240:
> --------------------------------------
>
> Ok, I understand your use-case now. I'm not sure I'm up for making it easy
> to do this kind of thing. For instance, knowing the text content and the
> analyzer is not enough to know how a field got analyzed; you'd also need to
> know what sub field name was provided. I'm wondering whether you may be able
> to do what you want with the current API by creating a Tokenizer wrapper
> that sets the current sub field name in a custom attribute, and then having a
> custom synonym filter that applies different synonyms depending on the
> current field, which it can read thanks to the custom attribute?
>
> > Support different analysis per field instance
> > ---------------------------------------------
> >
> >                 Key: LUCENE-8240
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
> >             Project: Lucene - Core
> >          Issue Type: Wish
> >          Components: modules/analysis
> >            Reporter: Mike Sokolov
> >            Priority: Major
> >         Attachments: SubFieldAnalyzer.java
> >
> >
> > The simplest change for this would be to make
> > TokenStreamComponents.setReader() public. Another alternative would be to
> > provide a SubFieldAnalyzer along the lines of what is attached, although
> > for reasons given below I think this implementation is a little hacky and
> > would ideally be supported in a different way before making *that* part of
> > a public Lucene API.
> >
> > Exposing this method would allow a third-party extension to access it in
> > order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer
> > (attached, for reference) that applies different analysis to different
> > instances of a field. This supports a big "catch-all" field whose
> > instances receive different (index-time) text processing. The way we
> > implement that is by creating a TokenStreamComponents that wraps separate
> > per-subfield components and switches among them when setReader() is called.
> >
> > Why setReader()? This is the only part of the API where we can inject
> > this notion of subfields. setReader() is called with a Reader for each
> > field instance, and we supply a special Reader that identifies its subfield.
> >
> > This is a bit hacky; ideally subfields would be first-class citizens in
> > the Analyzer API, so e.g. there would be methods like
> > Analyzer.createComponents(String fieldName, String subFieldName), etc.
> > However, this seems like a pretty big change for an experimental feature, so
> > it seems like an OK tradeoff to live with the Reader-per-subfield hack for
> > now.
> >
> > Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis
> > package in order to call TokenStreamComponents.setReader (on a separate
> > instance) and propitiate Java's code-hiding rules, which is awkward.
>

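For readers without the attachment, the Reader-per-subfield idea described in the quoted issue can be sketched roughly as below. This is a plain-Java model only: it does not use Lucene, it is not the attached SubFieldAnalyzer, and every name in it (SubFieldReader, SubFieldComponents, SubFieldSketch, the "title" and "sku" subfields) is hypothetical. It only shows the dispatch pattern: a Reader carries its subfield name, and setReader() selects the matching per-subfield analysis.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical: a Reader that also identifies the subfield its text belongs to.
class SubFieldReader extends StringReader {
    final String subField;

    SubFieldReader(String subField, String text) {
        super(text);
        this.subField = subField;
    }
}

// Hypothetical stand-in for the wrapping components; this is NOT Lucene's
// TokenStreamComponents, just a model of "switch on setReader()".
class SubFieldComponents {
    private final Map<String, Function<String, List<String>>> perSubField;
    private final Function<String, List<String>> fallback;
    private Function<String, List<String>> current;
    private Reader input;

    SubFieldComponents(Map<String, Function<String, List<String>>> perSubField,
                       Function<String, List<String>> fallback) {
        this.perSubField = perSubField;
        this.fallback = fallback;
        this.current = fallback;
    }

    // Called once per field instance; the reader type decides which analysis runs.
    void setReader(Reader reader) {
        input = reader;
        if (reader instanceof SubFieldReader) {
            String name = ((SubFieldReader) reader).subField;
            current = perSubField.getOrDefault(name, fallback);
        } else {
            current = fallback; // ordinary readers get the default analysis
        }
    }

    // Reads the whole input and applies the analysis chosen in setReader().
    List<String> tokens() {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[256];
        try {
            for (int n; (n = input.read(buf)) != -1; ) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current.apply(text.toString());
    }
}

class SubFieldSketch {
    // Splits on whitespace and lowercases each word.
    static List<String> lowercaseWords(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Emits the whole value as a single, unmodified token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    static SubFieldComponents build() {
        Map<String, Function<String, List<String>>> map = new HashMap<>();
        map.put("title", SubFieldSketch::lowercaseWords);
        map.put("sku", SubFieldSketch::keyword);
        return new SubFieldComponents(map, SubFieldSketch::lowercaseWords);
    }

    public static void main(String[] args) {
        SubFieldComponents components = build();
        components.setReader(new SubFieldReader("title", "Hello World"));
        System.out.println(components.tokens());
        components.setReader(new SubFieldReader("sku", "AB-12 X"));
        System.out.println(components.tokens());
    }
}
```

In this model, two instances of the same catch-all field are analyzed differently purely because of the Reader they arrive in, which is the property the issue relies on. In real Lucene code the per-subfield entries would presumably be TokenStreamComponents rather than functions, and the visibility of setReader() is exactly the obstacle the issue describes.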