[
https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341693#comment-16341693
]
Hoss Man commented on SOLR-11917:
---------------------------------
h2. *S2.2*: Support configuring fieldTypes with many arbitrarily named analyzers
h3. *S2.2G*: Goal
The initial goal I considered was based solely on SOLR-5053:
* Advanced users may want to use arbitrary analyzers at query time – either by
"name" when parsing a query, or based on special logic in custom plugins.
But when considered in conjunction with some of the other usecases discussed
here, I realized there was a lot of potential for general purpose improvements
to the "plumbing" of IndexSchema to support multiple analyzers...
* Multilingual "fields" _might_ benefit from having "analyzer per lang"
configurable on a field type (see *S2.1.STRAW1* above)
* It would allow us to define an arbitrary "docValues" analyzer for TextFields
(see *S1.3* below)
h3. *S2.2BAD*: Bad Ideas – aka: NOT-Suggested Approach(es)
For the sake of completeness, let's set aside for a moment the broader benefit
of a robust solution to supporting arbitrary analyzers for TextField, and
consider how we might tackle the "user wants to pick an analyzer by name at
query time" with the least amount of development work (ie: "rapid prototype")
...
*Bad Idea #1:*
* subclass of TextField
* at query time, getFieldQuery() looks for an "ftanalyzer" local param from
QParser,
** if exists: use that to find some other FieldType whose Analyzer we use in
place of our own
* NOTE: see *S2.1.HITCH* above about SolrQueryParserBase/QueryBuilder in
previous topic – it also applies here except in simple cases
** Although in a super special case situation like this, maybe a limitation of
"only works with 'field' QParser" would be fine? ... but in that case, see Bad
Idea #2...
*Bad Idea #2:*
* assume this usecase/situation is so niche, that we don't need general query
parser support
* assume we can tell people they *must* use a specific QParser to do it
* No need for a special TextField subclass – instead create a new
'ForceAnalyzerQParser'
** use it just like the "field" QParser, but it also takes in a
"forceFieldType" localparam
** use the "forceFieldType" to pick a fieldType by name, and ask it for
getFieldQuery(...) using the original field name
* Example:
{noformat}
# query the 'title' field for "how now brown cow" using the analyzer from the
'nostopwords' <fieldType/>
q={!forceanalyzer f=title forceFieldType='nostopwords'}how+now+brown+cow
{noformat}
h3. *S2.2A*: Suggested Approach – aka: Good Idea
AFAICT, this is what it would take to support arbitrary analyzers on TextFields
"the right way"...
* FieldType method deprecation/replacement
** deprecate/replace these overly specific methods in FieldType...
{code:java}
public final void setQueryAnalyzer(Analyzer analyzer)
public void setIsExplicitQueryAnalyzer(boolean isExplicitQueryAnalyzer)
public boolean isExplicitQueryAnalyzer()
//
public final void setIndexAnalyzer(Analyzer analyzer)
public void setIsExplicitAnalyzer(boolean explicitAnalyzer)
public boolean isExplicitAnalyzer()
{code}
** with methods like these...
{code:java}
/** For setting the (explicitly) configured analyzers, key is type,
can throw an exception if a type is not allowed/supported,
but must ultimately call super for getExplicitAnalyzers() to work
Default impl calls changeExplicitAnalyzers() for validation
*/
public final void setExplicitAnalyzers(Map<String,Analyzer> analyzers) throws SolrException
/** exactly what was configured/passed to setExplicitAnalyzers */
public final Map<String,Analyzer> getExplicitAnalyzers()
/** can create & track any implicit/synthetic analyzers needed based on what's
explicitly configured.
can throw an exception if an explicitly configured type is not
allowed/supported.
See below for discussion of default impl
*/
public void changeExplicitAnalyzers(Map<String,Analyzer> analyzers) throws SolrException
/* returns either an explicit analyzer configured with that type,
or may return an alternate (or synthetic created) analyzer deemed suitable
for that purpose,
or null if nothing suitable.
default impl only supports "index" and "query"
*/
public Analyzer getAnalyzer(String type) throws SolrException
{code}
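A minimal sketch of how the default implementations of these methods might fit together. Everything in it is an assumption for illustration: {{SketchAnalyzer}} and {{SketchFieldType}} are hypothetical stand-ins (not Solr or Lucene classes), {{IllegalArgumentException}} stands in for {{SolrException}}, and the fallback order in {{getAnalyzer}} (explicit type first, "query" falling back to "index", then the untyped analyzer) is just one plausible default.

{code:java}
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for org.apache.lucene.analysis.Analyzer. */
interface SketchAnalyzer {
  List<String> tokens(String text);
}

/** Hypothetical stand-in for FieldType; only the analyzer plumbing is sketched. */
class SketchFieldType {
  private Map<String, SketchAnalyzer> explicit = Collections.emptyMap();

  /** Validates via changeExplicitAnalyzers(), then records exactly what was passed in. */
  public final void setExplicitAnalyzers(Map<String, SketchAnalyzer> analyzers) {
    changeExplicitAnalyzers(analyzers);
    explicit = Collections.unmodifiableMap(new LinkedHashMap<>(analyzers));
  }

  /** Exactly what was configured, keyed by type (null key = no type attribute). */
  public final Map<String, SketchAnalyzer> getExplicitAnalyzers() {
    return explicit;
  }

  /** Assumed default validation: only untyped (null), "index" and "query" are accepted. */
  public void changeExplicitAnalyzers(Map<String, SketchAnalyzer> analyzers) {
    for (String type : analyzers.keySet()) {
      if (type != null && !type.equals("index") && !type.equals("query")) {
        throw new IllegalArgumentException("Unsupported analyzer type: " + type);
      }
    }
  }

  /** Assumed default lookup: only "index" and "query" are supported, anything else is null. */
  public SketchAnalyzer getAnalyzer(String type) {
    if (!"index".equals(type) && !"query".equals(type)) {
      return null;
    }
    SketchAnalyzer found = explicit.get(type);
    if (found == null && "query".equals(type)) {
      found = explicit.get("index"); // no explicit query analyzer: reuse the index one
    }
    if (found == null) {
      found = explicit.get(null); // finally, the untyped analyzer (if any)
    }
    return found;
  }
}
{code}

A subclass that wants to accept extra types (ie: "docValues") would only need to override the validation and lookup methods; the explicit map itself never changes shape.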
* {{FieldTypePluginLoader.create}} refactoring...
** instead of looking for specific analyzers, call {{readAnalyzer}} on all
{{"./analyzer"}} XML nodes
*** if no {{type}} attribute, use {{type=null}}
** build up {{Map<String,Analyzer>}} using everything found
*** if, as the Map is being built, any key already exists, fail w/exception
about the field type having multiple analyzers with the same type.
** delegate to {{FieldType.setExplicitAnalyzers()}}
*** all the existing special case handling in this method should be refactored
into {{FieldType.changeExplicitAnalyzers() ...}}
**** {{HasImplicitIndexAnalyzer}}
**** query Analyzer is / is not null
**** (index) Analyzer is/ is not null
**** {{constructMultiTermAnalyzer}}
*** ALTHOUGH: the _really_ special case stuff should actually be refactored
into the specific classes:
**** {{TextField}} should override {{setExplicitAnalyzers()}} and call
{{constructMultiTermAnalyzer}} after {{super.setExplicitAnalyzers()}}
**** {{PreAnalyzedField}} (the only class that implements
{{HasImplicitIndexAnalyzer}}) should override {{setExplicitAnalyzers()}} and do
its own ignoring/logging about explicit analyzers before delegating to super
**** Deprecate/Keep the {{HasImplicitIndexAnalyzer}} check in
{{FieldType.changeExplicitAnalyzers()}} for backcompat (in case custom field
types use it) on stable branches, remove the interface completely from master
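The "collect every analyzer node, fail on duplicate types" step described above could be sketched roughly as follows. This is an illustration only: {{AnalyzerNodeCollector}} is a hypothetical helper (not the real {{FieldTypePluginLoader}}), it uses the JDK DOM parser, and it maps each type to the raw XML element rather than calling {{readAnalyzer}}, which is out of scope here.

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: groups the direct analyzer children of a fieldType by type. */
class AnalyzerNodeCollector {

  /** Key is the 'type' attribute, or null when the analyzer has no type attribute. */
  static Map<String, Element> collect(String fieldTypeXml) {
    Element fieldType;
    try {
      fieldType = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(fieldTypeXml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse fieldType XML", e);
    }

    Map<String, Element> byType = new LinkedHashMap<>();
    NodeList children = fieldType.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node node = children.item(i);
      if (node.getNodeType() != Node.ELEMENT_NODE || !"analyzer".equals(node.getNodeName())) {
        continue; // only direct "./analyzer" children are of interest
      }
      Element analyzer = (Element) node;
      String type = analyzer.hasAttribute("type") ? analyzer.getAttribute("type") : null;
      if (byType.containsKey(type)) {
        throw new IllegalStateException(
            "fieldType has multiple analyzers with the same type: " + type);
      }
      byType.put(type, analyzer);
    }
    return byType;
  }
}
{code}

The resulting map (after converting each element to an Analyzer) is what would be handed to {{FieldType.setExplicitAnalyzers()}}.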
* {{FieldTypeDefinition}}
** deprecate/replace these methods...
{code:java}
public AnalyzerDefinition getIndexAnalyzer()
public void setIndexAnalyzer(AnalyzerDefinition indexAnalyzer)
public AnalyzerDefinition getQueryAnalyzer()
public void setQueryAnalyzer(AnalyzerDefinition queryAnalyzer)
public AnalyzerDefinition getMultiTermAnalyzer()
public void setMultiTermAnalyzer(AnalyzerDefinition multiTermAnalyzer)
{code}
** with these methods...
{code:java}
public Map<String,AnalyzerDefinition> getAnalyzers()
public void setAnalyzers(Map<String,AnalyzerDefinition> analyzers)
{code}
** this _should_ be really straightforward, since these methods only deal
with explicit analyzer configurations, and can just mutate a Map directly – or
even be removed completely since the class is only used internally by other
methods we'll be modifying below...
* API Structure – aka: {{SchemaRequest}} & {{SchemaResponse}} &
{{FieldType.getNamedPropertyValues()}} & {{FieldTypeXmlAdapter}}
** *this is where things get really, _really_, messy*
** The current Schema API structure for field types with analyzer(s) looks
like this...
{code:java}
{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"class":"solr.KeywordAnalyzer" }
}
{code}
** Moving forward, in an ideal world where we support arbitrary analyzers, it
should probably look something like...
{code:java}
{
"name":"myNewTextField",
"class":"solr.TextField",
"analyzers":[
{ "type":"index", // in the simplest case: '"type":null' or no type key at all
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
{ "type":"query",
"class":"solr.KeywordAnalyzer" },
{ "type":"what ever",
"class":"solr.WhateverAnalyzer" } ]
}
{code}
** How do we get from here to there?
*** Changing these classes to all read/write in the new format would be fairly
easy
**** Because most of these code paths really only deal with "explicit(ly
configured)" Analyzers the code changes are really simple and straightforward
– the heavy lifting is all done by other method replacements discussed above.
*** Changing the necessary classes to be able to *read* either format
(depending on what keys are found) would also be fairly simple
*** The hard part is knowing when to _*write*_ in the old format for
backcompat.
*** Under the new "V2" API model, we _could_ potentially start versioning
stuff ... but are we ready/willing to start saying this new {{fieldType}}
request/response format is {{"/v3"}} (or {{"/v2.1"}} ) ?
**** SOLR-11183 already eliminated the {{/v2}} path and replaced it with
{{/api}} .. is bringing it back (or using {{/api2.1}} ) worthwhile?
**** Hell: Do the _internal_ "V2" method APIs even give us enough info about
the path to make this approach even viable?
*** FWIW: We could, as an ugly shim/hack, ignore the "ideal" API structure,
and instead extend the current {{fooAnalyzer}} syntax such that any FieldType
attribute that ends in "Analyzer" is treated as an analyzer type...
{code:java}
{
"name":"myNewTextField",
"class":"solr.TextField",
"indexAnalyzer":{
"tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory",
"delimiter":"/" }},
"queryAnalyzer":{
"class":"solr.KeywordAnalyzer" },
"what everAnalyzer":{
"class":"solr.WhateverAnalyzer" }
}
{code}
...obviously we'd need to include special handling for the existing
single-analyzer case of {{"analyzer"}} as the attribute name – but that would
be trivial
**** *NOTE:* I feel gross even suggesting this syntax, but it is something
that would be easily achievable
**** In reality, if we did decide to "version" the response structure, this
kind of "shim" structure (and the code to read/write it) might be needed anyway
(example: in case a client requesting the "old" format hits a Solr where there
are some FieldTypes w/arbitrary analyzers)
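To illustrate why *reading* either structure is the easy part, here is a rough sketch that normalizes the parsed JSON of a field type into one map of analyzer specs keyed by type. It is an assumption-laden illustration: {{AnalyzerSpecReader}} is a hypothetical helper, the input is assumed to already be plain {{Map}}/{{List}} objects, and it accepts the "ideal" {{analyzers}} list, the existing/shim {{fooAnalyzer}} keys, and the untyped {{analyzer}} key all at once.

{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reads analyzer specs from either JSON structure. */
class AnalyzerSpecReader {

  /** Key is the analyzer type (null = untyped), value is the spec without its "type" key. */
  @SuppressWarnings("unchecked")
  static Map<String, Map<String, Object>> read(Map<String, Object> fieldTypeSpec) {
    Map<String, Map<String, Object>> byType = new LinkedHashMap<>();

    // "ideal" structure: "analyzers":[ {"type":"index", ...}, ... ]
    Object list = fieldTypeSpec.get("analyzers");
    if (list instanceof List) {
      for (Object item : (List<Object>) list) {
        Map<String, Object> spec = new LinkedHashMap<>((Map<String, Object>) item);
        String type = (String) spec.remove("type"); // null when absent
        put(byType, type, spec);
      }
    }

    // existing/shim structure: "analyzer", "indexAnalyzer", "queryAnalyzer", "fooAnalyzer"...
    for (Map.Entry<String, Object> entry : fieldTypeSpec.entrySet()) {
      String key = entry.getKey();
      if (key.equals("analyzer")) {
        put(byType, null, (Map<String, Object>) entry.getValue());
      } else if (key.endsWith("Analyzer")) {
        String type = key.substring(0, key.length() - "Analyzer".length());
        put(byType, type, (Map<String, Object>) entry.getValue());
      }
    }
    return byType;
  }

  private static void put(Map<String, Map<String, Object>> byType,
                          String type, Map<String, Object> spec) {
    if (byType.containsKey(type)) {
      throw new IllegalArgumentException("Multiple analyzers with the same type: " + type);
    }
    byType.put(type, spec);
  }
}
{code}

Nothing in this sketch helps with the hard part (deciding which structure to *write*), which still depends on how the API gets versioned.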
With the above changes, it would be a fairly simple matter to make methods like
{{TextField.getFieldQuery}} ask the QParser if there is an {{analyzer}}
localparam, and if so use it instead of the {{query}} analyzer when parsing
queries. As before: the *S2.1.HITCH* still applies and would also need to be
"fixed" for this to work with non-trivial QParsers.
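The lookup itself could be as small as the following sketch. It is illustrative only: {{QueryAnalyzerChooser}} is a hypothetical helper, local params are modeled as a plain {{Map<String,String>}} rather than Solr's params classes, and failing on an unknown name (instead of silently falling back) is an assumed design choice.

{code:java}
import java.util.Map;

/** Hypothetical helper: picks the analyzer to use when parsing a query. */
class QueryAnalyzerChooser {

  /** Returns the type named by the 'analyzer' local param, defaulting to "query". */
  static String analyzerType(Map<String, String> localParams) {
    String requested = localParams.get("analyzer");
    return (requested == null || requested.isEmpty()) ? "query" : requested;
  }

  /** Looks up the chosen type; an unknown name is treated as a user error. */
  static <A> A choose(Map<String, A> analyzersByType, Map<String, String> localParams) {
    String type = analyzerType(localParams);
    A analyzer = analyzersByType.get(type);
    if (analyzer == null) {
      throw new IllegalArgumentException("No analyzer of type '" + type + "' is configured");
    }
    return analyzer;
  }
}
{code}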
h3. *S2.2E*: Example Usage
In addition to how this might simplify *S2.1.STRAW1* (above) and *S1.3* (below)
this would also make the expert level "I want to be able to have lots of
arbitrary analyzers I pick between arbitrarily at query time..." usecase very
clean...
{code:xml}
<!--
For the 'i want to do crazy things' users, a new 'allowAnyAnalyzer="true"'
option could be added to TextField so that it doesn't fail on unexpected
analyzer types.
Then methods like TextField.getFieldQuery could look for an 'analyzer' local
param so the user could do things like:
q={!field f=crazy analyzer=silliest}elephant
NOTE:
- failing on unexpected analyzer types by default is important to catch typos
  like 'type="qeury"'
- Since this is a really niche case, an alternative to 'allowAnyAnalyzer="true"'
  would probably be to create a "DynamicQueryAnalyzerTextField extends TextField"
  w/ this special behavior instead
-->
<field name="crazy" type="text_crazy" />
<fieldType name="text_crazy" class="solr.TextField"
indexed="true" multiValued="true" allowAnyAnalyzer="true">
<analyzer type="index">
...
</analyzer>
<analyzer type="query">
...
</analyzer>
<analyzer type="silly">
...
</analyzer>
<analyzer type="silliest">
...
</analyzer>
</fieldType>
{code}
----
h2. *S1.3*: TextField supports a type="docValues" analyzer
h3. *S1.3G*: Goal
{{TextField}} should support {{docValues="true"}} and it should be possible to
configure an arbitrary analyzer (ie: {{<analyzer type="docValues" .../>}} ) for
determining what data gets put in the docValues of a text field for
sorting/faceting – independently of the index/query analyzers.
For back compatibility, a TextField with {{docValues="true"}} but no
{{type="docValues"}} analyzer should use the {{type="index"}} analyzer.
h3. *S1.3E*: Example Usage
Building DocValues at index time to get same behavior as FieldCache w/o the
query RAM/time cost...
{code:xml}
<!--
Faceting (or Sorting) on these 2 features_* fields (should) behave
identically, except that...
- features_fc would require uninversion of indexed terms (FieldCache)
- features_dv would use SortedSetDocValues made up of the same terms
(as created by the 'index' analyzer at indexing time)
-->
<field name="features_fc" type="text_general" />
<field name="features_dv" type="text_general" docValues="true"/>
<fieldType name="text_general" class="solr.TextField"
indexed="true">
<analyzer type="index">
...
</analyzer>
<analyzer type="query">
...
</analyzer>
</fieldType>
{code}
Special Faceting/Sorting behavior, independent of search behavior...
{code:xml}
<!--
Stemming when searching, but we also want 'clean' terms for
'Tag Cloud' Faceting...
-->
<fieldType name="text_keywords" class="solr.TextField"
indexed="true" docValues="true">
<analyzer> <!-- query and index -->
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="docValues">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>
{code}
{code:xml}
<!--
Lowercase sorting on the whole string, independent of
whatever search based analysis is done
-->
<fieldType name="text_" class="solr.TextField"
indexed="true" docValues="true">
<analyzer> <!-- query and index -->
...
</analyzer>
<analyzer type="docValues">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
{code}
h3. *S1.3A*: Suggested Approach
* 90% of the work required here would be the bulk of the work from *S2.2*:
Supporting the configuration of a "docValues" analyzer on TextField
** It would be silly to attempt this *S1.3* idea w/o also tackling the
generalized idea in *S2.2A* (ie: we should stop adding special magic named
analyzers to the logic in IndexSchema – the "plumbing" should be generic, and
the FieldTypes should know what analyzer names they care about)
* Once the plumbing is in place for IndexSchema to allow arbitrary named
analyzers, the remaining work would be fairly trivial:
** Make TextField expect an analyzer named "docValues", implicitly defaulting
to the effective "index" analyzer if there is no explicit "docValues" analyzer
** Refactor the core bit of logic from *S1.2A* (TermDocValuesTextField's
"TokenStream -> SortedSetDocValuesField") down to TextField
*** Either deprecate TermDocValuesTextField, or leave it as syntactic sugar
for TextField w/ {{docValues="true"}}
** Refactor SortableTextField (*S1.1A*) so that it's just syntactic sugar for
TextField with:
*** {{docValues="true"}}
*** a "docValues" analyzer using KeywordTokenizer + a new "Truncation" filter
(to limit the values to the configured # of characters)
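The "docValues" fallback and the SortableTextField-as-sugar idea can be sketched without any Lucene classes. All names here are hypothetical: {{DocValuesSketch}} is not a Solr class, analyzers are modeled as plain functions from text to terms, and the sorted, de-duplicated {{TreeSet}} only approximates what a SortedSetDocValues field would hold.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical sketch of the "docValues" analyzer plumbing on a text field. */
class DocValuesSketch {

  /** Uses the explicit "docValues" analyzer, implicitly defaulting to "index". */
  static Function<String, List<String>> docValuesAnalyzer(
      Map<String, Function<String, List<String>>> analyzers) {
    Function<String, List<String>> explicit = analyzers.get("docValues");
    return explicit != null ? explicit : analyzers.get("index");
  }

  /** Unique, sorted terms for one field value (a rough analogue of a sorted set of terms). */
  static SortedSet<String> docValuesTerms(
      Map<String, Function<String, List<String>>> analyzers, String value) {
    return new TreeSet<>(docValuesAnalyzer(analyzers).apply(value));
  }

  /** Whole-string analyzer: lowercase + trim + truncate to maxChars (assumed filter chain). */
  static Function<String, List<String>> sortableAnalyzer(int maxChars) {
    return text -> {
      String term = text.toLowerCase(Locale.ROOT).trim();
      if (term.length() > maxChars) {
        term = term.substring(0, maxChars);
      }
      return Collections.singletonList(term);
    };
  }
}
{code}

With only an "index" analyzer configured the docValues terms match the indexed terms; adding {{sortableAnalyzer(n)}} under the "docValues" key yields a single truncated value per input, independent of the search analysis.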
> A Potential Roadmap for robust multi-analyzer TextFields w/various options
> for configuring docValues
> ----------------------------------------------------------------------------------------------------
>
> Key: SOLR-11917
> URL: https://issues.apache.org/jira/browse/SOLR-11917
> Project: Solr
> Issue Type: Wish
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
> Assignee: Hoss Man
> Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter
> field types" in Solr. In particular to think about:
> # How to simplify some of the "special things" people have to know about
> Solr behavior when creating their schemas
> # How to reduce the number of situations where users have to copy/clone one
> "logical field" into multiple "schema fields" in order to meet diff use cases
> The main result of this thought exercise is a handful of usecases/goals that
> people seem to have - many of which are already tracked in existing jiras -
> along with a high level design/roadmap of potential solutions for these goals
> that can be implemented incrementally to leverage some common changes (and
> what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader
> community discussion, and as a central linkage point for the related jiras.
> (details to follow in a very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on
> implementing every aspect of these ideas :)