[
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375144#comment-16375144
]
Uwe Schindler edited comment on LUCENE-8183 at 2/24/18 12:04 AM:
-----------------------------------------------------------------
[~rwesten]: I was not aware that this was my dictionary file! The names in your
example (under "Environment" in your report) did not look like the example
listed here: https://github.com/uschindler/german-decompounder
was (Author: thetaphi):
[~rwesten]: I was not aware that this was my dictionary file! The names in your
example did not look like the example listed here:
https://github.com/uschindler/german-decompounder
> HyphenationCompoundWordTokenFilter creates overlapping tokens with
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
> hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
> dictionary="lang/wordlist_de.txt"
> onlyLongestMatch="true"/>
>
> Reporter: Rupert Westenthaler
> Assignee: Uwe Schindler
> Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if
> onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft' that is present in the
> dictionary
>
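The expectation described in the report can be sketched in a few lines of Python. This is only a model of the behaviour the reporter expects, not the Lucene implementation: it ignores hyphenation points entirely and matches dictionary entries as plain substrings, and the function name `expected_subwords` is hypothetical.

```python
def expected_subwords(token, dictionary):
    """Return the subwords expected in addition to the original token.

    Assumption: dictionary entries are matched as plain substrings; the real
    filter only considers candidates that start and end at hyphenation points.
    """
    # Collect every occurrence of a dictionary entry as a (start, end) span.
    spans = set()
    for entry in dictionary:
        if not entry:
            continue
        start = token.find(entry)
        while start != -1:
            spans.add((start, start + len(entry)))
            start = token.find(entry, start + 1)

    # Keep a span only if no strictly longer span fully contains it.
    kept = [
        (s, e)
        for (s, e) in spans
        if not any(
            s2 <= s and e <= e2 and (e2 - s2) > (e - s) for (s2, e2) in spans
        )
    ]

    # A span covering the whole token would only duplicate the original token.
    full = (0, len(token))
    return [token[s:e] for (s, e) in sorted(kept) if (s, e) != full]


print(expected_subwords("gesellschaft", {"gesellschaft", "schaft"}))
print(expected_subwords("fussballpumpe", {"fussball", "ball", "fuss", "pumpe"}))
```

For the example in the report, this model emits no extra tokens: 'schaft' is dropped because it lies inside the longer match 'gesellschaft', and that match is dropped because it equals the original token. The containment check runs before the duplicate is removed; in the opposite order 'schaft' would survive.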