[
https://issues.apache.org/jira/browse/LUCENE-7355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7355:
---------------------------------
Attachment: LUCENE-7355.patch
bq. it appears you accidentally included other WIP
Sorry, I probably generated the patch against the wrong base commit, hence the
unrelated changes.
bq. Why create a StringTokenStream; isn't KeywordTokenizer fine? Oh I see
that's in another module... kinda seems like a generic utility that should be
in core to me IMO.
I'd be fine with having KeywordTokenizer in core too. Let's discuss it in
another issue and then potentially cut over to it if it makes it to core?
bq. An easy optimization is to check if initReaderForNormalization returns the
input StringReader. If so, simply set filteredText to text.
The way #normalize works is indeed not very efficient at the moment. In
addition to this, it does not cache its analysis chain like we do for
#tokenStream. But it's probably ok since this method should not be called as
intensively as #tokenStream? (we can still improve in the future if this proves
to be a bottleneck)
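
To make the suggested shortcut concrete, here is a minimal, self-contained
sketch. It does not use Lucene: the `initReader` argument is a hypothetical
stand-in for initReaderForNormalization, and `NormalizeShortcut`, `filterText`
and `LowerCaseReader` are names invented for this illustration, not part of the
patch.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

public class NormalizeShortcut {

    /** Hypothetical char filter that lowercases everything it reads. */
    public static final class LowerCaseReader extends FilterReader {
        public LowerCaseReader(Reader in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            int n = super.read(cbuf, off, len);
            for (int i = off; i < off + n; i++) {
                cbuf[i] = Character.toLowerCase(cbuf[i]);
            }
            return n;
        }
    }

    /**
     * Runs the text through initReader (an assumed stand-in for
     * initReaderForNormalization). If the hook hands back the very same
     * reader, no char filter was applied and the input string is reused.
     */
    public static String filterText(String text, UnaryOperator<Reader> initReader) {
        Reader original = new StringReader(text);
        Reader filtered = initReader.apply(original);
        if (filtered == original) {
            return text; // shortcut: nothing to read back
        }
        // A wrapper was installed, so drain it into a new string.
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[256];
        try (Reader r = filtered) {
            int read;
            while ((read = r.read(buffer, 0, buffer.length)) != -1) {
                sb.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "Foo*";
        // Identity hook: the same String instance comes back.
        System.out.println(filterText(input, r -> r) == input);
        // Wrapping hook: the text is read through the filter.
        System.out.println(filterText(input, LowerCaseReader::new));
    }
}
```

The identity check only saves the copy through the reader; it does not address
the missing caching of the analysis chain mentioned above.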
bq. It's a shame to call createComponents just to get the AttributeFactory
Agreed, this one annoys me too. I initially wanted to add a method but this is
a pity since this information is already available. That said, maybe the method
approach is better since borrowing the attribute factory from the regular
analysis chain makes us close the token stream before it has been consumed,
which some analysis chains might not like. I updated the patch.
bq. I suppose a separate issue might be for Solr to do this when someone
configures a custom Analyzer.
Solr already solves this problem in a different way by having a different
analyzer for multi-term queries which is computed using
MultiTermAwareComponent. I agree it would be nice for it to switch to
Analyzer#normalize, but this has a drawback: it would either require dropping
support for configuring a custom multi-term analyzer, or the integration would
be a bit weird, i.e. it would have to use Analyzer.tokenStream on the
multi-term analyzer if one is configured, or fall back to Analyzer.normalize
on the default analyzer if no multi-term analyzer is configured, which might
be controversial.
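
The fallback described above can be sketched roughly as follows. This is only
an illustration under loud assumptions: both analysis paths are modelled as
plain String-to-String functions, and `MultiTermNormalizer` is a hypothetical
class, not Solr or Lucene code.

```java
import java.util.Locale;
import java.util.Objects;
import java.util.function.UnaryOperator;

public final class MultiTermNormalizer {

    // Stand-in for running tokenStream on a custom multi-term analyzer.
    // May be null when no such analyzer is configured.
    private final UnaryOperator<String> multiTermAnalysis;

    // Stand-in for calling normalize on the default analyzer.
    private final UnaryOperator<String> defaultNormalize;

    public MultiTermNormalizer(UnaryOperator<String> multiTermAnalysis,
                               UnaryOperator<String> defaultNormalize) {
        this.multiTermAnalysis = multiTermAnalysis;
        this.defaultNormalize = Objects.requireNonNull(defaultNormalize);
    }

    /** Prefers the custom multi-term path, else falls back to the default one. */
    public String normalize(String term) {
        UnaryOperator<String> chain =
            multiTermAnalysis != null ? multiTermAnalysis : defaultNormalize;
        return chain.apply(term);
    }

    public static void main(String[] args) {
        UnaryOperator<String> lowercase = s -> s.toLowerCase(Locale.ROOT);
        // No custom multi-term analyzer configured: the default path is used.
        System.out.println(new MultiTermNormalizer(null, lowercase).normalize("Foo"));
        // Custom multi-term analyzer configured: it wins over the default path.
        System.out.println(new MultiTermNormalizer(String::trim, lowercase).normalize(" Foo "));
    }
}
```

The two branches can produce different results for the same term, which is the
"a bit weird" part of this integration.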
> Leverage MultiTermAwareComponent in query parsers
> -------------------------------------------------
>
> Key: LUCENE-7355
> URL: https://issues.apache.org/jira/browse/LUCENE-7355
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7355.patch, LUCENE-7355.patch, LUCENE-7355.patch,
> LUCENE-7355.patch, LUCENE-7355.patch, LUCENE-7355.patch
>
>
> MultiTermAwareComponent is designed to make it possible to do the right thing
> in query parsers when it comes to analysis of multi-term queries. However,
> since query parsers just take an analyzer and since analyzers do not
> propagate the information about what to do for multi-term analysis, query
> parsers cannot do the right thing out of the box.