OK, that was actually my first implementation; it was a lot messier. I'll follow up with details when I get back to a keyboard.

On Thu, Apr 5, 2018, 9:09 AM Adrien Grand (JIRA) <[email protected]> wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426899#comment-16426899 ]
>
> Adrien Grand commented on LUCENE-8240:
> --------------------------------------
>
> Ok, I understand your use-case now. I'm not sure I'm up to making it easy
> to do this kind of things, for instance knowing the text content and the
> analyzer is not enough to know how a field got analyzed, you'd also need to
> know what sub field name was provided. I'm wondering that you may be able
> to do what you want with the current API by creating a Tokenizer wrapper
> that sets the current sub field name in a custom attribute, and then have a
> custom synonym filter that applies different synonyms depending on the
> current field, which it can read thanks to the custom attribute?
>
> > Support different analysis per field instance
> > ---------------------------------------------
> >
> >                 Key: LUCENE-8240
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-8240
> >             Project: Lucene - Core
> >          Issue Type: Wish
> >          Components: modules/analysis
> >            Reporter: Mike Sokolov
> >            Priority: Major
> >         Attachments: SubFieldAnalyzer.java
> >
> >
> > The simplest change for this would be to make
> > TokenStreamComponents.setReader() public. Another alternative would be to
> > provide a SubFieldAnalyzer along the lines of what is attached, although
> > for reasons given below I think this implementation is a little hacky and
> > would ideally be supported in a different way before making *that* part of
> > a public Lucene API.
> >
> > Exposing this method would allow a third-party extension to access it in
> > order to wrap TokenStreamComponents. My use case is a SubFieldAnalyzer
> > (attached, for reference) that applies different analysis to different
> > instances of a field. This supports a big "catch-all" field that has
> > different (index-time) text processing. The way we implement that is by
> > creating a TokenStreamComponents that wraps separate per-subfield
> > components and switches among them when setReader() is called.
> >
> > Why setReader()? This is the only part of the API where we can inject
> > this notion of subfields. setReader() is called with a Reader for each
> > field instance, and we supply a special Reader that identifies its subfield.
> >
> > This is a bit hacky – ideally subfields would be first-class citizens in
> > the Analyzer API, so eg there would be methods like
> > Analyzer.createComponents(String fieldName, String subFieldName), etc.
> > However this seems like a pretty big change for an experimental feature, so
> > it seems like an OK tradeoff to live with the Reader-per-subfield hack for
> > now.
> >
> > Currently SubFieldAnalyzer has to live in org.apache.lucene.analysis
> > package in order to call TokenStreamComponents.setReader (on a separate
> > instance) and propitiate java's code-hiding rules, which is awkward.
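
For anyone skimming the thread, here is a rough sketch of the Reader-per-subfield idea from the issue description: a Reader that carries its subfield name, and a wrapper that switches analysis chains when setReader() is called. This is not the attached SubFieldAnalyzer.java and it uses no Lucene classes. SubFieldSketch, SubFieldReader, SwitchingComponents, and the "title"/"code" subfields are all hypothetical names, and each analysis chain is reduced to a plain String transform so that only the dispatch is shown.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustration only: plain-Java stand-ins, no Lucene types involved.
final class SubFieldSketch {

    /** Reader that carries the name of the subfield its text belongs to. */
    static final class SubFieldReader extends Reader {
        final String subField;
        private final Reader in;

        SubFieldReader(String subField, String text) {
            this.subField = subField;
            this.in = new StringReader(text);
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            return in.read(buf, off, len);
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /** Stand-in for the wrapping components; a "chain" is just a String transform. */
    static final class SwitchingComponents {
        private final Map<String, UnaryOperator<String>> chains;
        private final UnaryOperator<String> fallback;
        private UnaryOperator<String> current;
        private Reader reader;

        SwitchingComponents(Map<String, UnaryOperator<String>> chains,
                            UnaryOperator<String> fallback) {
            this.chains = chains;
            this.fallback = fallback;
            this.current = fallback;
        }

        /** Called once per field instance; the subfield is detected here. */
        void setReader(Reader reader) {
            this.reader = reader;
            if (reader instanceof SubFieldReader) {
                String name = ((SubFieldReader) reader).subField;
                current = chains.getOrDefault(name, fallback);
            } else {
                current = fallback; // ordinary Reader: no subfield information
            }
        }

        /** Drains the current reader and applies the chain chosen in setReader(). */
        String analyze() {
            StringBuilder text = new StringBuilder();
            char[] buf = new char[256];
            try {
                int n;
                while ((n = reader.read(buf, 0, buf.length)) != -1) {
                    text.append(buf, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return current.apply(text.toString());
        }
    }

    /** Builds a switcher with two made-up subfields, "title" and "code". */
    static SwitchingComponents example() {
        Map<String, UnaryOperator<String>> chains = new HashMap<>();
        chains.put("title", s -> s.toLowerCase(Locale.ROOT));
        chains.put("code", s -> s.replace('_', ' '));
        return new SwitchingComponents(chains, s -> s);
    }
}
```

Calling setReader() with a SubFieldReader tagged "title" selects the lowercasing chain, one tagged "code" selects the underscore-splitting chain, and a plain StringReader falls back to the identity transform. In real Lucene code the chains would be per-subfield TokenStreamComponents rather than String transforms.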
