[ 
https://issues.apache.org/jira/browse/SOLR-9526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551468#comment-15551468
 ] 

Jan Høydahl commented on SOLR-9526:
-----------------------------------

This is the approach that ES will take in 5.x too, see 
https://www.elastic.co/blog/strings-are-dead-long-live-strings
When auto-guessing, they will index a field, say "city", as full-text and also 
add a string/keyword copy as "city.keyword". This can be changed by modifying 
the mappings.

Instead of the "exclude" params, perhaps we should have a way to cut off the 
string copy at e.g. 256 chars. After all, when would you need longer facet 
values?

Also, it is unfortunate to split your "schema" across the schema file and a 
solrconfig URP. Take the example where you want to use a data-driven schema but 
want to lock a few key fields up front by issuing {{add-field}} commands. With 
Hoss' suggestion this would work fine if you lock e.g. {{<field name="city" 
type="string" />}}, but what if you want to force it into e.g. a Norwegian 
text type with {{<field name="city" type="text_no" />}}? Then the 
CloneFieldUpdateProcessorFactory would still run, creating the {{city_str}} 
copy. That would be confusing.
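
A rough sketch of that scenario, assuming the clone step is configured with a 
catch-all name regex and a {{_str}} suffix (both are illustrative assumptions, 
not taken from an actual patch):

{code:xml}
<!-- Schema: field locked up front as Norwegian text -->
<field name="city" type="text_no" />

<!-- solrconfig.xml: hypothetical clone step in the update chain -->
<processor class="solr.CloneFieldUpdateProcessorFactory">
  <lst name="source">
    <!-- Assumed selector: match every incoming field by name -->
    <str name="fieldRegex">.*</str>
  </lst>
  <lst name="dest">
    <str name="pattern">^(.*)$</str>
    <str name="replacement">$1_str</str>
  </lst>
</processor>
{code}

With a purely name-based selector like this, {{city}} would be cloned into 
{{city_str}} no matter which type the schema assigns to {{city}}.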

So I'm wondering whether it would be better to integrate this feature more 
tightly with {{AddSchemaFieldsUpdateProcessorFactory}}, so that when an unknown 
field name with String content comes in, we create a text_general field for it, 
but we also create a copyField in the schema for it, e.g. {{<copyField 
source="city" dest="city_str" cutoff="256"/>}}. This means we'd add a cutoff 
feature to today's copyField, but otherwise we already have what we need. 
Sample UPF:

{code:xml}
    <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
      <str name="defaultFieldType">text_general</str>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.String</str>
        <str name="fieldType">text_general</str>
        <lst name="copyField">
          <str name="pattern">^(.*)$</str>
          <str name="replacement">$1_str</str>
          <int name="cutoff">256</int>
        </lst>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Boolean</str>
        <str name="fieldType">booleans</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.util.Date</str>
        <str name="fieldType">tdates</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Long</str>
        <str name="valueClass">java.lang.Integer</str>
        <str name="fieldType">tlongs</str>
      </lst>
      <lst name="typeMapping">
        <str name="valueClass">java.lang.Number</str>
        <str name="fieldType">tdoubles</str>
      </lst>
    </processor>
{code}
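
For illustration, once an unknown {{city}} field with String content arrives, 
the managed schema might contain entries along these lines. This is only a 
sketch under assumptions: the {{city_str}} field definition and its {{string}} 
type are made up for the example, and copyField's existing {{maxChars}} 
attribute is used as a stand-in for the proposed {{cutoff}}:

{code:xml}
<!-- Created for the unknown field, per the String typeMapping -->
<field name="city" type="text_general" />
<!-- Assumed destination field for the untokenized copy -->
<field name="city_str" type="string" />
<!-- maxChars caps how many characters are copied to the destination -->
<copyField source="city" dest="city_str" maxChars="256" />
{code}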

The result will be that users can configure fields up front without our logic 
messing it up, and they can also change ONLY the schema later if they wish to 
remove the {{copyField}} again. Then our defaults would not mess it up either. 
Users will only need to relate to the Schema API!

> data_driven configs defaults to "strings" for unmapped fields, makes most 
> fields containing "textual content" unsearchable, breaks tutorial examples
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9526
>                 URL: https://issues.apache.org/jira/browse/SOLR-9526
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>
> James Pritchett pointed out on the solr-user list that this sample query from 
> the quick start tutorial matched no docs (even though the tutorial text says 
> "The above request returns only one document")...
> http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:foundation
> The root problem seems to be that the add-unknown-fields-to-the-schema chain 
> in data_driven_schema_configs is configured with...
> {code}
> <str name="defaultFieldType">strings</str>
> {code}
> ...and the "strings" type uses StrField and is not tokenized.
> ----
> Original thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201609.mbox/%3ccac-n2zrpsspfnk43agecspchc5b-0ff25xlfnzogyuvyg2d...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
