[jira] [Commented] (SOLR-11917) A Potential Roadmap for robust multi-analyzer TextFields w/various options for configuring docValues

Hoss Man (JIRA) Fri, 26 Jan 2018 14:21:30 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-11917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341687#comment-16341687
 ]


Hoss Man commented on SOLR-11917:
---------------------------------

h1. Some Concrete Thoughts On *S*olutions

*NOTE:* While there is a one-to-one correspondence in the naming/numbering of 
the *U*secases listed above and the proposed *S*olutions listed below, I have 
ordered the *S*olutions in the way that I think makes the most sense from an 
"explaining how to achieve things" standpoint.
----
h2. *S1.1*: A 'SortableTextField' that builds docValues using the original text input
h3. *S1.1G*: Goal

A new SortableTextField subclass would be added that would functionally work 
the same as TextField except:
 * {{docValues="true|false"}} could be configured, with the default being "true"
 * The docValues would contain (a prefix of) the original input values (just 
like StrField) for sorting (or faceting)
 ** By default, to protect users from excessively large docValues, only the 
first 1024 characters of each field value would be used – but this could be 
overridden with configuration.
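
The prefix rule above can be sketched in plain Java. This is only an illustration, not the code from any patch: the class and method names are hypothetical, "characters" is assumed to mean UTF-16 chars, and treating a negative limit as "no limit" is likewise an assumption.

```java
/** Hypothetical helper (not Solr code) illustrating the docValues prefix rule. */
class DocValuesPrefixSketch {
    /** Default limit, taken from the 1024 figure in the proposal. */
    static final int DEFAULT_MAX_CHARS = 1024;

    /**
     * Returns the value to store in docValues: the input itself when it is
     * short enough, otherwise its first maxChars chars (UTF-16 code units).
     * Treating a negative maxChars as "no limit" is an assumption of this sketch.
     */
    static String docValuesPrefix(String value, int maxChars) {
        if (maxChars < 0 || value.length() <= maxChars) {
            return value;
        }
        return value.substring(0, maxChars);
    }

    public static void main(String[] args) {
        StringBuilder longTitle = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            longTitle.append('x');
        }
        // Short values are kept whole: prints "Solr In Action"
        System.out.println(docValuesPrefix("Solr In Action", DEFAULT_MAX_CHARS));
        // Long values are cut to the limit: prints 1024
        System.out.println(docValuesPrefix(longTitle.toString(), DEFAULT_MAX_CHARS).length());
    }
}
```

Note that a plain substring can split a surrogate pair at the cut point; a real implementation would need to decide how to handle that.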

h3. *S1.1E*: Example Usage

Consider the following sample configuration:
{code:java}
<field name="title" type="text_sortable"
       indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_sortable" class="solr.SortableTextField">
  <analyzer type="index">
   ...
  </analyzer>
  <analyzer type="query">
   ...
  </analyzer>
</fieldType>
{code}
Given a document with a title of "Solr In Action"

Users could:
 * Search for individual (indexed) terms in the "title" field: {{q=title:solr}}
 * Sort documents by title ( {{sort=title asc}} ) such that this document's 
sort value would be "Solr In Action"

If another document had a "title" value that was longer than 1024 chars, then 
the docValues would be built using only the first 1024 characters of the value 
(unless the user modified the configuration).

NOTE: This would be functionally equivalent to the following existing 
configuration - including the on disk index segments - except that the on disk 
DocValues would refer directly to the "title" field, reducing the total number 
of "field infos" in the index (which has a small impact on segment housekeeping 
and merge times) and end users would not need to sort on an alternate 
"title_string" field name - the original "title" field name would always be 
used directly.
{code:java}
<field name="title" type="text"
       indexed="true" docValues="false" stored="true" multiValued="false"/>
<field name="title_string" type="string"
       indexed="false" docValues="true" stored="false" multiValued="false"/>
<copyField source="title" dest="title_string" maxChars="1024" />
{code}
h3. *S1.1A*: Suggested Approach (SOLR-11916)

While experimenting with a quick POC for this idea, I actually wound up 
building a {{SortableTextField}} that is feature complete. See patch in 
SOLR-11916.

NOTE: If/when *S1.3A* is implemented, this SortableTextField could be 
refactored to be syntactic sugar for TextField w/ some added defaults – see 
below.
----
h2. *S1.2*: A 'TermDocValuesTextField' that builds docValues using the post-analysis terms
h3. *S1.2G*: Goal

A new TermDocValuesTextField subclass would be added that would functionally 
work the same as TextField except:
 * {{docValues="true|false"}} could be configured, with the default being "true"
 * Instances of fields using this type would support faceting (or sorting), 
using DocValues built from the terms produced by the "index" analyzer
 ** NOTE: Sorting on this type of field would only make sense in some special 
circumstances depending on the analyzer used (e.g. KeywordTokenizer)

h3. *S1.2E*: Example Usage

Consider the following sample configuration
{code:java}
<field name="keywords" type="text_facet"
       indexed="true" docValues="true" stored="true" multiValued="true"/>
<fieldType name="text_facet" class="solr.TermDocValuesTextField">
  <analyzer>
   <tokenizer class="solr.WhitespaceTokenizerFactory" rule="unicode"/>
   ...
  </analyzer>
</fieldType>

<field name="author" type="text_lc_sort"
       indexed="true" docValues="true" stored="true" multiValued="false"/>
<fieldType name="text_lc_sort" class="solr.TermDocValuesTextField">
  <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}
Given a document with an author of "Grainger, Trey" and a keywords value of 
"book lucene solr"

Users could:
 * Search for individual (indexed) terms in the "keywords" field: 
{{q=keywords:book}}
 * Facet on the keywords field ( {{facet.field=keywords}} ) such that if this 
were the only document in the index, the facet counts would be "book=1, 
lucene=1, solr=1"
 * Sort documents by author ( {{sort=author asc}} ) such that this document's 
sort value would be "grainger, trey"

NOTE: This should be functionally equivalent to users faceting on a "keywords" 
TextField (or sorting on an "author" TextField using KeywordTokenizer) today, 
except that the facet/sort values would come from DocValues (written at 
indexing time), and not the FieldCache (built on the fly at query time and held 
solely in RAM).
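
To make the expected facet and sort values concrete, here is a small plain-Java sketch. It does not use Lucene or Solr: the class and method names are hypothetical, a whitespace split stands in for the whitespace tokenizer, and lowercasing the whole input stands in for KeywordTokenizer plus LowerCaseFilter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of the post-analysis terms that would feed docValues. */
class TermDocValuesExampleSketch {
    /** Stand-in for a whitespace tokenizer: the unique whitespace-separated terms. */
    static SortedSet<String> whitespaceTerms(String value) {
        SortedSet<String> terms = new TreeSet<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Stand-in for KeywordTokenizer + LowerCaseFilter: one lowercased term. */
    static String keywordLowercaseTerm(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Counts, per term, the number of documents whose value contains it. */
    static SortedMap<String, Integer> facetCounts(List<String> keywordValues) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (String value : keywordValues) {
            for (String term : whitespaceTerms(value)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {book=1, lucene=1, solr=1}
        System.out.println(facetCounts(Arrays.asList("book lucene solr")));
        // prints grainger, trey
        System.out.println(keywordLowercaseTerm("Grainger, Trey"));
    }
}
```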
h3. *S1.2A*: Suggested Approach
 * Add a new TermDocValuesTextField subclass of TextField
 * if docValues="true":
 ** Augment the configured "index" analyzer to record each resulting token from 
the stream in a Set
 ** When indexing, pre-analyze/buffer the token stream and use the recorded Set 
of tokens to build additional SortedSetDocValuesField instances in the 
underlying indexed document
 * OPTIMIZATION?: We may be able to avoid the pre-analysis/buffering of the 
TokenStream and instead hook into the low level indexing code with a callback 
to generate the SortedSetDocValuesField instances on the fly as the 
DocumentsWriter reads from the (original) TokenStream ... needs 
experimentation/refactoring once we have some tests.
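
The "record each token in a Set while pre-analyzing" idea can be sketched without any Lucene classes. In this illustration plain strings stand in for tokens, an Iterator stands in for the TokenStream, and all names are hypothetical; it is not the proposed implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical stand-in for an analyzer wrapper that records the tokens it emits. */
class RecordingTokensSketch implements Iterator<String> {
    private final Iterator<String> delegate;
    private final SortedSet<String> recorded = new TreeSet<>();

    RecordingTokensSketch(Iterator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean hasNext() {
        return delegate.hasNext();
    }

    @Override
    public String next() {
        String token = delegate.next();
        recorded.add(token); // remember every distinct term seen so far
        return token;
    }

    /** The unique terms seen so far; these would become the docValues entries. */
    SortedSet<String> recordedTerms() {
        return recorded;
    }

    /** Pre-analysis step: drain the stream into a buffer so the Set is complete. */
    List<String> bufferAll() {
        List<String> buffered = new ArrayList<>();
        while (hasNext()) {
            buffered.add(next());
        }
        return buffered;
    }

    public static void main(String[] args) {
        RecordingTokensSketch stream = new RecordingTokensSketch(
                Arrays.asList("solr", "book", "solr", "lucene").iterator());
        List<String> forInvertedIndex = stream.bufferAll();
        // All four tokens are still available for normal indexing: prints 4
        System.out.println(forInvertedIndex.size());
        // Only the distinct terms feed docValues: prints [book, lucene, solr]
        System.out.println(stream.recordedTerms());
    }
}
```

The buffering is what the OPTIMIZATION item would try to avoid: here the whole stream is consumed once up front so the Set is complete before the document is handed to the indexer.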

NOTE: If/when *S1.3A* is implemented, this TermDocValuesTextField could be 
refactored to be syntactic sugar for TextField w/ some added defaults – see 
below.

 

 

> A Potential Roadmap for robust multi-analyzer TextFields w/various options 
> for configuring docValues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11917
>                 URL: https://issues.apache.org/jira/browse/SOLR-11917
>             Project: Solr
>          Issue Type: Wish
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>
> A while back, I was tasked at my day job to brainstorm & design some "smarter 
> field types" in Solr. In particular to think about:
>  # How to simplify some of the "special things" people have to know about 
> Solr behavior when creating their schemas
>  # How to reduce the number of situations where users have to copy/clone one 
> "logical field" into multiple "schema fields" in order to meet different use cases
> The main result of this thought exercise is a handful of usecases/goals that 
> people seem to have - many of which are already tracked in existing jiras - 
> along with a high level design/roadmap of potential solutions for these goals 
> that can be implemented incrementally to leverage some common changes (and 
> what those changes might look like).
> My intention is to use this jira as a place to share these ideas for broader 
> community discussion, and as a central linkage point for the related jiras. 
> (details to follow in a very looooooong comment)
> ----
> NOTE: I am not (at this point) personally committing to following through on 
> implementing every aspect of these ideas :)


