[
https://issues.apache.org/jira/browse/SOLR-9526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551468#comment-15551468
]
Jan Høydahl commented on SOLR-9526:
-----------------------------------
This is the approach that ES will take in 5.x too, see
https://www.elastic.co/blog/strings-are-dead-long-live-strings
When auto-guessing, they will index a field, say "city", as full-text and also
add a string/keyword copy as "city.keyword". This can be changed by modifying
mappings.
Instead of the "exclude" params, perhaps we should have a way to cut off the
string copy at e.g. 256 chars. When would you ever need longer facet values?
Also, it is unfortunate to split your "schema" across the schema file and a
solrconfig URP. Take the example where you want to use a data-driven schema but
lock a few key fields up front by issuing {{add-field}} commands. With
Hoss' suggestion this would work fine if you lock e.g. {{<field name="city"
type="string" />}}, but what if you want to force it to be e.g. Norwegian
text with {{<field name="city" type="text_no" />}}? Then the
CloneFieldUpdateProcessorFactory would still run, creating the {{city_str}}
copy. That would be confusing.
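To illustrate the split, here is a rough sketch of the two pieces living in two different files. The shape of the clone processor is my assumption of how such a step could be configured (a {{fieldRegex}} source selector plus a {{pattern}}/{{replacement}} destination); it is not copied from the actual proposal, and the {{indexed}}/{{stored}} attributes are illustrative only.
{code:xml}
<!-- Schema: field locked up front, e.g. via an add-field command.
     Schema field definitions name the field type with the "type" attribute. -->
<field name="city" type="text_no" indexed="true" stored="true"/>

<!-- solrconfig.xml: assumed clone step in the update chain.
     As configured here it selects fields by name regex only, so it still
     copies "city" to "city_str" although "city" was declared on purpose. -->
<processor class="solr.CloneFieldUpdateProcessorFactory">
  <lst name="source">
    <str name="fieldRegex">.*</str>
  </lst>
  <lst name="dest">
    <str name="pattern">^(.*)$</str>
    <str name="replacement">$1_str</str>
  </lst>
</processor>
{code}
Anyone reading only the schema would not expect {{city_str}} to appear, which is the confusion described above.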
So I'm wondering whether it would be better to integrate this feature more tightly with
{{AddSchemaFieldsUpdateProcessorFactory}}, so that when an unknown field name
with String content comes in, we create a text_general field for it, but we
also create a copyField in the schema for it, e.g. {{<copyField source="city"
dest="city_str" cutoff="256"/>}}. This means we'd add a cutoff feature to
today's copyField, but we have the rest of what we need. Sample URP:
{code:xml}
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">text_general</str>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.String</str>
    <str name="fieldType">text_general</str>
    <lst name="copyField">
      <str name="pattern">^(.*)$</str>
      <str name="replacement">$1_str</str>
      <int name="cutoff">256</int>
    </lst>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
</processor>
{code}
The result will be that users can configure fields up front without our logic
interfering, and they can later change ONLY the schema if they wish to
remove the {{copyField}} again. Then our defaults would not interfere either.
Users will only need to relate to the schema API!
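As a rough sketch of the outcome, this is what the managed schema might contain after a document with an unknown String field {{city}} has passed through the sample processor. The {{cutoff}} attribute is the proposed addition and is not valid in current schemas (the closest existing {{copyField}} attribute is {{maxChars}}); the {{city_str}} definition and its {{strings}} type are my assumptions.
{code:xml}
<!-- Created from the String value via the java.lang.String typeMapping -->
<field name="city" type="text_general"/>

<!-- Assumed destination field for the generated copy (untokenized, for faceting) -->
<field name="city_str" type="strings"/>

<!-- "cutoff" is the proposed attribute; it does not exist on copyField today -->
<copyField source="city" dest="city_str" cutoff="256"/>
{code}
Removing the copy later would then be a schema-only change: delete the {{copyField}} (and, if wanted, {{city_str}}) without touching solrconfig.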
> data_driven configs defaults to "strings" for unmapped fields, makes most
> fields containing "textual content" unsearchable, breaks tutorial examples
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-9526
> URL: https://issues.apache.org/jira/browse/SOLR-9526
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Hoss Man
>
> James Pritchett pointed out on the solr-user list that this sample query from
> the quick start tutorial matched no docs (even though the tutorial text says
> "The above request returns only one document")...
> http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:foundation
> The root problem seems to be that the add-unknown-fields-to-the-schema chain
> in data_driven_schema_configs is configured with...
> {code}
> <str name="defaultFieldType">strings</str>
> {code}
> ...and the "strings" type uses StrField and is not tokenized.
> ----
> Original thread:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201609.mbox/%3ccac-n2zrpsspfnk43agecspchc5b-0ff25xlfnzogyuvyg2d...@mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)