[jira] [Commented] (SOLR-11917) A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring docValues

Hoss Man (JIRA) Fri, 26 Jan 2018 14:23:48 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341693#comment-16341693
 ]


Hoss Man commented on SOLR-11917:
---------------------------------

h2. *S2.2*: Support configuring fieldTypes with many arbitrarily named analyzers
h3. *S2.2G*: Goal

The initial goal I considered was based solely on SOLR-5053:
 * Advanced users may want to use arbitrary analyzers at query time – either by 
"name" when parsing a query, or based on special logic in custom plugins.

But when considered in conjunction with some of the other usecases discussed 
here, I realized there was a lot of potential for general purpose improvements 
to the "plumbing" of IndexSchema to support multiple analyzers...
 * Multilingual "fields" _might_ benefit from having "analyzer per lang" 
configurable on a field type (see *S2.1.STRAW1* above)
 * It would allow us to define arbitrary "docValues" analyzer for TextFields 
(see *S1.3* below)

h3. *S2.2BAD*: Bad Ideas – aka: NOT-Suggested Approach(es)

For the sake of completeness, let's set aside for a moment the broader benefit 
of a robust solution to supporting arbitrary analyzers for TextField, and 
consider how we might tackle the "user wants to pick an analyzer by name at 
query time" with the least amount of development work (ie: "rapid prototype") 
...

*Bad Idea #1:*
 * subclass of TextField
 * at query time, getFieldQuery() looks for an "ftanalyzer" local param from 
QParser,
 ** if exists: use that to find some other FieldType whose Analyzer we use in 
place of our own
 * NOTE: see *S2.1.HITCH* above about SolrQueryParserBase/QueryBuilder in the 
previous topic – it also applies here, except in simple cases
 ** Although in a super special case situation like this, maybe a limitation of 
"only works with 'field' QParser" would be fine? ... but in that case, see Bad 
Idea #2...

*Bad Idea #2:*
 * assume this usecase/situation is so niche, that we don't need general query 
parser support
 * assume we can tell people they *must* use a specific QParser to do it
 * No need for a special TextField subclass – instead create a new 
'ForceAnalyzerQParser'
 ** use it just like the "field" QParser, but it also takes in a 
"forceFieldType" localparam
 ** use the "forceFieldType" to pick a fieldType by name, and ask it for 
getFieldQuery(...) using the original field name
 * Example:
{noformat}
# query the 'title' field for "how now brown cow" using the analyzer from the 'nostopwords' <fieldType/>
q={!forceanalyzer f=title forceFieldType='nostopwords'}how+now+brown+cow
{noformat}

h3. *S2.2A*: Suggested Approach – aka: Good Idea

AFAICT, this is what it would take to support arbitrary analyzers on TextFields 
"the right way"...
 * FieldType method deprecation/replacement
 ** deprecate/replace these overly specific methods in FieldType...
{code:java}
public final void setQueryAnalyzer(Analyzer analyzer)
public void setIsExplicitQueryAnalyzer(boolean isExplicitQueryAnalyzer)
public boolean isExplicitQueryAnalyzer()
//
public final void setIndexAnalyzer(Analyzer analyzer)
public void setIsExplicitAnalyzer(boolean explicitAnalyzer)
public boolean isExplicitAnalyzer()
{code}

 ** with methods like these...
{code:java}
/** For setting the (explicitly) configured analyzers, key is type,
    can throw an exception if a type is not allowed/supported,
    but must ultimately call super for getExplicitAnalyzers() to work

    Default impl calls changeExplicitAnalyzers() for validation
  */
public final void setExplicitAnalyzers(Map<String,Analyzer> analyzers) throws SolrException
/** exactly what was configured/passed to setExplicitAnalyzers */
public final Map<String,Analyzer> getExplicitAnalyzers()
/** can create & track any implicit/synthetic analyzers needed based on
    what's explicitly configured.
    can throw an exception if an explicitly configured type is not allowed/supported.
    See below for discussion of default impl
  */
public void changeExplicitAnalyzers(Map<String,Analyzer> analyzers) throws SolrException
/** returns either an explicit analyzer configured with that type,
    or may return an alternate (or synthetically created) analyzer deemed
    suitable for that purpose,
    or null if nothing suitable.
    default impl only supports "index" and "query"
  */
public Analyzer getAnalyzer(String type) throws SolrException
{code}
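
A minimal sketch of how the default implementations of the methods above might fit together, assuming the existing fallback behavior is kept (an untyped analyzer serves as "index", and "index" serves as "query" when no explicit "query" analyzer exists). {{FieldTypeSketch}} and the stand-in {{Analyzer}} interface are hypothetical names used so the example compiles without Solr or Lucene; {{IllegalArgumentException}} stands in for {{SolrException}}.

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Stand-in for org.apache.lucene.analysis.Analyzer so the sketch compiles on its own. */
interface Analyzer {}

/** Hypothetical model of the proposed FieldType plumbing; not existing Solr code. */
class FieldTypeSketch {
  private Map<String, Analyzer> explicit = Collections.emptyMap();
  private final Map<String, Analyzer> effective = new HashMap<>();

  /** Validates via changeExplicitAnalyzers(), then remembers exactly what was passed in. */
  public final void setExplicitAnalyzers(Map<String, Analyzer> analyzers) {
    Map<String, Analyzer> copy = new HashMap<>(analyzers); // HashMap tolerates the null key
    changeExplicitAnalyzers(copy);
    explicit = Collections.unmodifiableMap(copy);
  }

  public final Map<String, Analyzer> getExplicitAnalyzers() {
    return explicit;
  }

  /** Assumed default: only null (untyped), "index" and "query" are accepted. */
  public void changeExplicitAnalyzers(Map<String, Analyzer> analyzers) {
    for (String type : analyzers.keySet()) {
      if (type != null && !type.equals("index") && !type.equals("query")) {
        throw new IllegalArgumentException("Unsupported analyzer type: " + type);
      }
    }
    effective.clear();
    Analyzer untyped = analyzers.get(null);
    Analyzer index = analyzers.getOrDefault("index", untyped);
    Analyzer query = analyzers.getOrDefault("query", index);
    if (index != null) effective.put("index", index);
    if (query != null) effective.put("query", query);
  }

  /** Returns the effective analyzer for the type, or null if nothing suitable. */
  public Analyzer getAnalyzer(String type) {
    return effective.get(type);
  }
}
{code}

In this sketch a subclass would accept extra types (such as "docValues") by overriding {{changeExplicitAnalyzers()}}.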

 * {{FieldTypePluginLoader.create}} refactoring...
 ** instead of looking for specific analyzers, call {{readAnalyzer}} on all 
{{"./analyzer"}} XML nodes
 *** if no {{type}} attribute, use {{type=null}}
 ** build up {{Map<String,Analyzer>}} using everything found
 *** if, as the Map is being built, any key already exists, fail w/exception 
about the field type having multiple analyzers with the same type.
 ** delegate to {{FieldType.setExplicitAnalyzers()}}
 *** all the existing special case handling in this method should be refactored 
into {{FieldType.changeExplicitAnalyzers() ...}}
 **** {{HasImplicitIndexAnalyzer}}
 **** query Analyzer is / is not null
 **** (index) Analyzer is/ is not null
 **** {{constructMultiTermAnalyzer}}
 *** ALTHOUGH: the _really_ special case stuff should actually be refactored 
into the specific classes:
 **** {{TextField}} should override {{setExplicitAnalyzers()}} and call 
{{constructMultiTermAnalyzer}} after {{super.setExplicitAnalyzers()}}
 **** {{PreAnalyzedField}} (the only class that implements 
{{HasImplicitIndexAnalyzer}}) should override {{setExplicitAnalyzers()}} and do 
its own ignoring/logging about explicit analyzers before delegating to super
 **** Deprecate/Keep the {{HasImplicitIndexAnalyzer}} check in 
{{FieldType.changeExplicitAnalyzers()}} for backcompat (in case custom field 
types use it) on stable branches, remove the interface completely from master
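
The map-building step described above can be sketched as follows, using only the JDK's DOM parser. {{AnalyzerNodeCollector}} is a hypothetical helper: it keeps the raw {{<analyzer>}} elements where the real loader would call {{readAnalyzer}}, and it uses {{IllegalStateException}} where Solr would presumably throw a {{SolrException}}.

{code:java}
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper; the real FieldTypePluginLoader would call readAnalyzer on each node. */
class AnalyzerNodeCollector {

  /** Maps each direct analyzer child to its type; a missing type attribute becomes a null key. */
  static Map<String, Element> collect(Element fieldType) {
    Map<String, Element> byType = new HashMap<>(); // HashMap permits the null key
    NodeList children = fieldType.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() != Node.ELEMENT_NODE || !"analyzer".equals(child.getNodeName())) {
        continue;
      }
      Element analyzer = (Element) child;
      String type = analyzer.hasAttribute("type") ? analyzer.getAttribute("type") : null;
      if (byType.containsKey(type)) {
        throw new IllegalStateException("fieldType '" + fieldType.getAttribute("name")
            + "' has multiple analyzers with type=" + type);
      }
      byType.put(type, analyzer);
    }
    return byType;
  }

  /** Parses an XML snippet and returns its root element. */
  static Element parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalArgumentException("Unparseable XML", e);
    }
  }
}
{code}
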
 * {{FieldTypeDefinition}}
 ** deprecate/replace these methods...
{code:java}
public AnalyzerDefinition getIndexAnalyzer()
public void setIndexAnalyzer(AnalyzerDefinition indexAnalyzer)
public AnalyzerDefinition getQueryAnalyzer()
public void setQueryAnalyzer(AnalyzerDefinition queryAnalyzer)
public AnalyzerDefinition getMultiTermAnalyzer()
public void setMultiTermAnalyzer(AnalyzerDefinition multiTermAnalyzer)
{code}

 ** with these methods...
{code:java}
public Map<String,AnalyzerDefinition> getAnalyzers()
public void setAnalyzers(Map<String,AnalyzerDefinition>)
{code}

 ** this _should_ be really straightforward, since these methods only deal 
with explicit analyzer configurations, and can just mutate a Map directly – or 
even be removed completely since the class is only used internally by other 
methods we'll be modifying below...
 * API Structure – aka: {{SchemaRequest}} & {{SchemaResponse}} & 
{{FieldType.getNamedPropertyValues()}} & {{FieldTypeXmlAdapter}}
 ** *this is where things get really, _really_, messy*
 ** The current Schema API structure for field types with analyzer(s) looks 
like this...
{code:java}
{
     "name":"myNewTextField",
     "class":"solr.TextField",
     "indexAnalyzer":{
        "tokenizer":{
           "class":"solr.PathHierarchyTokenizerFactory", 
           "delimiter":"/" }},
     "queryAnalyzer":{
        "class":"solr.KeywordAnalyzer" }
}
{code}

 ** Moving forward, in an ideal world where we support arbitrary analyzers, it 
should probably look something like...
{code:java}
{
     "name":"myNewTextField",
     "class":"solr.TextField",
     "analyzers":[
       {  "type":"index",    // in the simplest case: '"type":null' or no type key at all
          "tokenizer":{
             "class":"solr.PathHierarchyTokenizerFactory", 
             "delimiter":"/" }},
       { "type":"query",
         "class":"solr.KeywordAnalyzer" },
       { "type":"what ever",
         "class":"solr.WhateverAnalyzer" } ]
}
{code}

 ** How do we get from here to there?
 *** Changing these classes to all read/write in the new format would be fairly 
easy
 **** Because most of these code paths really only deal with "explicit(ly 
configured)" Analyzers, the code changes are really simple and straightforward 
– the heavy lifting is all done by the other method replacements discussed above.
 *** Changing the necessary classes to be able to *read* either format 
(depending on what keys are found) would also be fairly simple
 *** The hard part is knowing when to _*write*_ in the old format for 
backcompat.
 *** Under the new "V2" API model, we _could_ potentially start versioning 
stuff ... but are we ready/willing to start saying this new {{fieldType}} 
request/response format is {{"/v3"}} (or {{"/v2.1"}} ) ?
 **** SOLR-11183 already eliminated the {{/v2}} path and replaced it with 
{{/api}} .. is bringing it back (or using {{/api2.1}} ) worthwhile?
 **** Hell: Do the _internal_ "V2" method APIs even give us enough info about 
the path to make this approach even viable?
 *** FWIW: We could, as an ugly shim/hack, ignore the "ideal" API structure, 
and instead extend the current {{fooAnalyzer}} syntax such that any FieldType 
attribute that ends in "Analyzer" is treated as an analyzer type...
{code:java}
{
     "name":"myNewTextField",
     "class":"solr.TextField",
     "indexAnalyzer":{
        "tokenizer":{
           "class":"solr.PathHierarchyTokenizerFactory", 
           "delimiter":"/" }},
     "queryAnalyzer":{
        "class":"solr.KeywordAnalyzer" },
     "what everAnalyzer":{
        "class":"solr.WhateverAnalyzer" }
}
{code}
...obviously we'd need to include special handling for the existing 
single-analyzer case of {{"analyzer"}} as the attribute name – but that would 
be trivial

 **** *NOTE:* I feel gross even suggesting this syntax, but it is something 
that would be easily achievable
 **** In reality, if we did decide to "version" the response structure, this 
kind of "shim" structure (and the code to read/write it) might be needed anyway 
(example: in case a client requesting the "old" format hits a Solr where there 
are some FieldTypes w/arbitrary analyzers)
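
A rough sketch of the "read either format" step, assuming the JSON has already been parsed into plain {{java.util}} maps and lists. {{AnalyzerJsonReader}} is a hypothetical name, not an existing Solr class. It accepts the proposed {{"analyzers"}} list as well as the current/shim {{fooAnalyzer}} keys and normalizes both to a map keyed by analyzer type.

{code:java}
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader that normalizes both JSON shapes to a type -> definition map. */
class AnalyzerJsonReader {
  private static final String SUFFIX = "Analyzer";

  @SuppressWarnings("unchecked")
  static Map<String, Map<String, Object>> read(Map<String, Object> fieldType) {
    Map<String, Map<String, Object>> byType = new HashMap<>();
    Object list = fieldType.get("analyzers");
    if (list instanceof List) {
      // Proposed format: a list of analyzer objects, each with an optional "type" key.
      for (Object o : (List<Object>) list) {
        Map<String, Object> def = new LinkedHashMap<>((Map<String, Object>) o);
        put(byType, (String) def.remove("type"), def);
      }
      return byType;
    }
    // Current format and the shim: "analyzer" is untyped, "fooAnalyzer" has type "foo".
    for (Map.Entry<String, Object> e : fieldType.entrySet()) {
      String key = e.getKey();
      if (key.equals("analyzer")) {
        put(byType, null, (Map<String, Object>) e.getValue());
      } else if (key.endsWith(SUFFIX) && key.length() > SUFFIX.length()) {
        String type = key.substring(0, key.length() - SUFFIX.length());
        put(byType, type, (Map<String, Object>) e.getValue());
      }
    }
    return byType;
  }

  private static void put(Map<String, Map<String, Object>> byType, String type,
                          Map<String, Object> def) {
    if (byType.containsKey(type)) {
      throw new IllegalArgumentException("duplicate analyzer type: " + type);
    }
    byType.put(type, def);
  }
}
{code}

Note that naive suffix stripping turns {{multiTermAnalyzer}} into the type "multiTerm", so a real implementation would need to decide how that maps onto the type name used in the XML.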

With the above changes, it would be a fairly simple matter to make methods like 
{{TextField.getFieldQuery}} ask the QParser if there is an {{analyzer}} 
localparam, and if so use it instead of the {{query}} analyzer when parsing 
queries. As before: the *S2.1.HITCH* still applies and would also need to be 
"fixed" for this to work with non-trivial QParsers.
h3. *S2.2E*: Example Usage

In addition to how this might simplify *S2.1.STRAW1* (above) and *S1.3* (below) 
this would also make the expert level "I want to be able to have lots of 
arbitrary analyzers I pick between arbitrarily at query time..." usecase very 
clean...
{code:xml}
<!--
  For the 'i want to do crazy things' users, a new 'allowAnyAnalyzer="true"'
  option could be added to TextField so that it doesn't fail on unexpected
  analyzer types.
     
  Then methods like TextField.getFieldQuery could look for an 'analyzer' local
  param so the user could do things like:

    q={!field f=crazy analyzer=silliest}elephant

  NOTE:
   - failing on unexpected analyzer types by default is important to catch typos
     - like 'type="qeury"'
   - Since this is a really niche case, an alternative to 'allowAnyAnalyzer="true"'
     would probably be to create a "DynamicQueryAnalyzerTextField extends TextField"
     w/ this special behavior instead
-->
<field name="crazy" type="text_crazy" />
<fieldType name="text_crazy" class="solr.TextField"
           indexed="true" multiValued="true" allowAnyAnalyzer="true">
  <analyzer type="index">
   ...
  </analyzer>
  <analyzer type="query">
   ...
  </analyzer>
  <analyzer type="silly">
   ...
  </analyzer>
  <analyzer type="silliest">
   ...
  </analyzer>
</fieldType>
{code}
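
The behavior described in the comment above might look roughly like this in isolation. {{DynamicAnalyzerChooser}} is a hypothetical class, generic over the analyzer type so it runs without Lucene. It assumes that "index", "query" and "multiterm" are the only types accepted by default, and that a missing {{analyzer}} local param is passed in as {{null}}.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of a TextField that optionally accepts arbitrary analyzer types. */
class DynamicAnalyzerChooser<A> {
  private static final Set<String> KNOWN =
      new HashSet<>(Arrays.asList("index", "query", "multiterm"));
  private final Map<String, A> analyzers;

  /** Fails on unexpected types (e.g. the typo "qeury") unless allowAnyAnalyzer is set. */
  DynamicAnalyzerChooser(Map<String, A> configured, boolean allowAnyAnalyzer) {
    for (String type : configured.keySet()) {
      if (type != null && !KNOWN.contains(type) && !allowAnyAnalyzer) {
        throw new IllegalArgumentException("Unexpected analyzer type: " + type);
      }
    }
    this.analyzers = new HashMap<>(configured);
  }

  /** requested is the value of the 'analyzer' local param, or null when it is absent. */
  A forQuery(String requested) {
    if (requested == null) {
      return analyzers.get("query");
    }
    A chosen = analyzers.get(requested);
    if (chosen == null) {
      throw new IllegalArgumentException("No such analyzer: " + requested);
    }
    return chosen;
  }
}
{code}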
----
h2. *S1.3*: TextField supports a type="docValues" analyzer
h3. *S1.3G*: Goal

{{TextField}} should support {{docValues="true"}} and it should be possible to 
configure an arbitrary analyzer (ie: {{<analyzer type="docValues" .../>}} ) for 
determining what data gets put in the docValues of a text field for 
sorting/faceting – independently from the index/query analyzer.

For back compatibility, a TextField with {{docValues="true"}} but no 
{{type="docValues"}} analyzer should use the {{type="index"}} analyzer.
h3. *S1.3E*: Example Usage

Building DocValues at index time to get same behavior as FieldCache w/o the 
query RAM/time cost...
{code:xml}
<!--
  Faceting (or Sorting) on these 2 features_* fields (should) behave
  identically, except that...
   - features_fc would require uninversion of indexed terms (FieldCache)
   - features_dv would use SortedSetDocValues made up of the same terms
     (as created by the 'index' analyzer at indexing time)
  -->
<field name="features_fc" type="text_general" />
<field name="features_dv" type="text_general" docValues="true"/>
<fieldType name="text_general" class="solr.TextField"
           indexed="true">
  <analyzer type="index">
   ...
  </analyzer>
  <analyzer type="query">
   ...
  </analyzer>
</fieldType>
{code}
Special Faceting/Sorting behavior, independent of search behavior...
{code:xml}
<!--
  Stemming when searching, but we also want 'clean' terms for
  'Tag Cloud' Faceting...
  -->
<fieldType name="text_keywords" class="solr.TextField"
           indexed="true" docValues="true">
  <analyzer> <!-- query and index -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="docValues">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
{code}
{code:xml}
<!--
  Lowercase sorting on the whole string, independent of
  whatever search based analysis is done
  -->
<fieldType name="text_" class="solr.TextField"
           indexed="true" docValues="true">
  <analyzer> <!-- query and index -->
    ...
  </analyzer>
  <analyzer type="docValues">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
{code}
h3. *S1.3A*: Suggested Approach
 * 90% of the work required here would be the bulk of the work from *S2.2*: 
Supporting the configuration of a "docValues" analyzer on TextField
 ** It would be silly to attempt this *S1.3* idea w/o also tackling the 
generalized idea in *S2.2A* (ie: we should stop adding special magic named 
analyzers to the logic in IndexSchema – the "plumbing" should be generic, and 
the FieldTypes should know what analyzer names they care about)
 * Once the plumbing is in place for IndexSchema to allow arbitrary named 
analyzers, the remaining work would be fairly trivial:
 ** Make TextField expect an analyzer named "docValues", implicitly defaulting 
to the effective "index" analyzer if there is no explicit "docValues" analyzer
 ** Refactor the core bit of logic from *S1.2A* (TermDocValuesTextField's 
"TokenStream -> SortedSetDocValuesField") down to TextField
 *** Either deprecate TermDocValuesTextField, or leave it as syntactic sugar 
for TextField w/ {{docValues="true"}}
 ** Refactor SortableTextField (*S1.1A*) so that it's just syntactic sugar for 
TextField with:
 *** {{docValues="true"}}
 *** a "docValues" analyzer using KeywordTokenizer + a new "Truncation" filter 
(to limit the values to the configured # of characters)
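
As a toy illustration of the last two steps, here is a sketch in which an "analyzer" is reduced to a function from raw text to a list of terms. All names are hypothetical, and the real work would go through Lucene TokenStreams. It shows the implicit fallback from "docValues" to "index", and a KeywordTokenizer-plus-truncation stand-in that emits the whole value as one term, cut to a maximum number of chars.

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model: an "analyzer" is reduced to a function from raw text to terms. */
class DocValuesAnalyzerSketch {

  /** Explicit "docValues" analyzer if configured, else the effective "index" analyzer. */
  static Function<String, List<String>> docValuesAnalyzer(
      Map<String, Function<String, List<String>>> byType) {
    Function<String, List<String>> explicit = byType.get("docValues");
    return explicit != null ? explicit : byType.get("index");
  }

  /** Models KeywordTokenizer plus a truncation filter: one term, at most maxChars long. */
  static Function<String, List<String>> keywordTruncated(int maxChars) {
    return text -> Collections.singletonList(
        text.length() <= maxChars ? text : text.substring(0, maxChars));
  }
}
{code}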

> A Potential Roadmap for robust multi-analyzer TextFields w/various options 
> for configuring docValues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11917
>                 URL: https://issues.apache.org/jira/browse/SOLR-11917
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter 
> field types" in Solr. In particular to think about:
>  # How to simplify some of the "special things" people have to know about 
> Solr behavior when creating their schemas
>  # How to reduce the number of situations where users have to copy/clone one 
> "logical field" into multiple "schema fields" in order to meet diff use cases
> The main result of this thought exercise is a handful of usecases/goals that 
> people seem to have - many of which are already tracked in existing jiras - 
> along with a high level design/roadmap of potential solutions for these goals 
> that can be implemented incrementally to leverage some common changes (and 
> what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader 
> community discussion, and as a central linkage point for the related jiras. 
> (details to follow in a very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on 
> implementing every aspect of these ideas :)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
