[
https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jan Høydahl updated SOLR-1979:
------------------------------
Attachment: SOLR-1979.patch
Fixed threshold so that Tika distance 0.1 gives certainty 0.5 and distance 0.02
gives certainty 0.9. The default threshold of 0.5 now works pretty well, at
least for the tests...
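For illustration only, the two calibration points quoted above are consistent with a straight line, certainty = 1 - 5 * distance. The patch may well use a different curve. The class and method names below are hypothetical and not taken from the patch.

```java
// A minimal sketch, assuming a linear mapping through the two quoted points:
// distance 0.1 -> certainty 0.5 and distance 0.02 -> certainty 0.9.
// CertaintySketch, certainty() and accept() are hypothetical names.
class CertaintySketch {

    // Convert a Tika language distance into a certainty clamped to [0, 1].
    static double certainty(double distance) {
        double c = 1.0 - 5.0 * distance;
        return Math.max(0.0, Math.min(1.0, c));
    }

    // Accept the detected language only if certainty reaches the threshold.
    static boolean accept(double distance, double threshold) {
        return certainty(distance) >= threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.5; // the default threshold mentioned above
        for (double d : new double[] {0.02, 0.1, 0.15}) {
            System.out.println("distance=" + d
                    + " certainty=" + certainty(d)
                    + " accepted=" + accept(d, threshold));
        }
    }
}
```

Under this assumed mapping, any distance above 0.1 falls below the default threshold of 0.5 and would be rejected.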
*New parameters:*
Field name mapping is now configurable with a user-defined pattern. To map
ABC_title to title_<lang>, set:
{code}
&langid.map.pattern=ABC_(.*)
&langid.map.replace=$1_{lang}
{code}
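A rough sketch of how such a pattern/replace pair could be applied, assuming standard java.util.regex semantics and a literal {lang} placeholder. The helper name mapField is hypothetical. The actual processor code in the patch may differ.

```java
import java.util.regex.Pattern;

// A minimal sketch, not the patch's implementation.
// FieldMapSketch and mapField() are hypothetical names.
class FieldMapSketch {

    // Substitute the detected language for {lang}, then apply the regex.
    // Field names that do not match the pattern are returned unchanged.
    static String mapField(String fieldName, String pattern, String replace, String lang) {
        String replacement = replace.replace("{lang}", lang);
        return Pattern.compile(pattern).matcher(fieldName).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Same values as langid.map.pattern and langid.map.replace above.
        System.out.println(mapField("ABC_title", "ABC_(.*)", "$1_{lang}", "en"));
        System.out.println(mapField("body", "ABC_(.*)", "$1_{lang}", "en"));
    }
}
```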
A parameter to map multiple detected languages to the same field name. E.g. to
map Japanese, Korean and Chinese texts to a field *_cjk, do:
{code}langid.map.lcmap=ja:cjk zh:cjk ko:cjk{code}
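As a sketch of how a space-separated lcmap value could be interpreted: each from:to pair maps a detected language code to the code used in the field name, and unmapped codes pass through unchanged. The example uses ISO 639-1 codes (ja for Japanese). The class and method names are hypothetical, and the pass-through behaviour is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal sketch, not the patch's implementation.
// LcmapSketch, parse() and fieldSuffix() are hypothetical names.
class LcmapSketch {

    // Parse "ja:cjk zh:cjk ko:cjk" into {ja=cjk, zh=cjk, ko=cjk}.
    static Map<String, String> parse(String spec) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : spec.trim().split("\\s+")) {
            String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                map.put(kv[0], kv[1]);
            }
        }
        return map;
    }

    // Assumption: a language without an lcmap entry keeps its own code.
    static String fieldSuffix(Map<String, String> lcmap, String detected) {
        return lcmap.getOrDefault(detected, detected);
    }

    public static void main(String[] args) {
        Map<String, String> lcmap = parse("ja:cjk zh:cjk ko:cjk");
        System.out.println("title_" + fieldSuffix(lcmap, "zh"));
        System.out.println("title_" + fieldSuffix(lcmap, "en"));
    }
}
```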
Turn off validation of field names against schema (useful if you want to rename
or delete fields later in the UpdateChain):
{code}&langid.enforceSchema=false{code}
*Other changes:*
Removed the default for langField, i.e. if langField is not specified, the
detected language will not be written anywhere. A typical minimal config that
only detects the language and writes it to a field is now:
{code}
<processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
  <defaults>
    <str name="langid.fl">title,subject,text,keywords</str>
    <str name="langid.langField">language_s</str>
  </defaults>
</processor>
{code}
Also added multiple other languages to the tests.
> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
> Key: SOLR-1979
> URL: https://issues.apache.org/jira/browse/SOLR-1979
> Project: Solr
> Issue Type: New Feature
> Components: update
> Reporter: Jan Høydahl
> Assignee: Jan Høydahl
> Priority: Minor
> Labels: UpdateProcessor
> Fix For: 3.4
>
> Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch,
> SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> Language identification from document fields, and mapping of field names to
> language-specific fields based on detected language.
> Wrap the Tika LanguageIdentifier in an UpdateProcessor.