[
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374886#comment-16374886
]
Matthias Krueger commented on LUCENE-8183:
------------------------------------------
[~rwesten] Quick question regarding your patch: What's the reasoning behind not
decomposing terms that are part of the dictionary at all?
The {{onlyLongestMatch}} flag currently affects whether all matches or only the
longest match should be returned *per* *start* character (in
DictionaryCompoundWordTokenFilter) or *per* hyphenation *start* point (in
HyphenationCompoundWordTokenFilter).
Example:
Dictionary {{"Schaft", "Wirt", "Wirtschaft", "Wissen", "Wissenschaft"}} for
input "Wirtschaftswissenschaft" will return the original input plus tokens
"Wirtschaft", "schaft", "wissenschaft", "schaft" but not "Wirt" or "Wissen".
"schaft" is still returned (even twice) because it's the longest token starting
at the respective position.
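
To make the "longest match per start character" behaviour concrete, here is a minimal, self-contained sketch in plain Java. It is a simplified model and not the Lucene implementation: the class and method names are made up, the min/max subword size limits are ignored, matching is done on the lower-cased input, and the original input token is not re-emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of per-start-character matching; not Lucene code.
public class LongestMatchPerStart {

    // Returns dictionary matches in order of their start character.
    // With onlyLongestMatch, at most one token (the longest) is kept per start.
    static List<String> decompose(String input, Set<String> dictionary,
                                  boolean onlyLongestMatch) {
        String term = input.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            String longest = null;
            for (int end = start + 1; end <= term.length(); end++) {
                String candidate = term.substring(start, end);
                if (!dictionary.contains(candidate)) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // candidates grow, so the last hit is the longest
                } else {
                    tokens.add(candidate);
                }
            }
            if (longest != null) {
                tokens.add(longest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("schaft", "wirt", "wirtschaft", "wissen", "wissenschaft");
        // prints [wirtschaft, schaft, wissenschaft, schaft]
        System.out.println(decompose("Wirtschaftswissenschaft", dict, true));
        // prints [wirt, wirtschaft, schaft, wissen, wissenschaft, schaft]
        System.out.println(decompose("Wirtschaftswissenschaft", dict, false));
    }
}
```

In this model "schaft" shows up twice with {{onlyLongestMatch}} enabled because nothing longer starts at either of its two start positions.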
I like the idea of restricting this further to only the longest terms that
*touch* a certain hyphenation point. This would exclude "schaft" in the example
above (as "Wirtschaft" and "wissenschaft" are two longer terms encompassing the
respective hyphenation point). On the other hand, there might be examples where
you still want to include the "overlapping" tokens. For "Fußballpumpe" and
dictionary {{"Ball", "Ballpumpe", "Pumpe", "Fuß", "Fußball"}} you would get the
tokens "Fußball" and "pumpe" but not "Ballpumpe", as "Ball" has already been
consumed as part of "Fußball". Also, I'm not sure whether your change improves
the situation for languages other than German.
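
The trade-off can be illustrated with a greedy, non-overlapping variant: take the longest match at the current position, then continue after its end. This is only a hypothetical model of the stricter behaviour discussed above, not the code in the attached patch, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical greedy model: matched characters are never reused. Not Lucene code.
public class GreedyNonOverlapping {

    static List<String> decompose(String input, Set<String> dictionary) {
        String term = input.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < term.length()) {
            int bestEnd = -1;
            for (int end = start + 1; end <= term.length(); end++) {
                if (dictionary.contains(term.substring(start, end))) {
                    bestEnd = end; // remember the longest match from this start
                }
            }
            if (bestEnd < 0) {
                start++; // nothing starts here (e.g. a linking "s"), move on
            } else {
                tokens.add(term.substring(start, bestEnd));
                start = bestEnd; // skip everything the match covered
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> german = Set.of("schaft", "wirt", "wirtschaft", "wissen", "wissenschaft");
        // prints [wirtschaft, wissenschaft]
        System.out.println(decompose("Wirtschaftswissenschaft", german));

        // "\u00df" is the letter sharp s, escaped to stay independent of source encoding
        Set<String> football = Set.of("ball", "ballpumpe", "pumpe", "fu\u00df", "fu\u00dfball");
        // prints the tokens for Fussball (with sharp s) and pumpe; "ballpumpe" is lost
        System.out.println(decompose("Fu\u00dfballpumpe", football));
    }
}
```

The first call drops both "schaft" tokens, while the second shows the cost: "ballpumpe" can never be emitted once "ball" has been consumed.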
Regarding point 1: The current algorithm always returns the term itself again
if it's part of the dictionary. I guess this could be changed if we don't
check against {{this.maxSubwordSize}} but against
{{Math.min(this.maxSubwordSize, termAtt.length()-1)}}.
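
A small sketch of the effect of that bound, again as a simplified stand-alone model rather than the filter's actual loop. The class name, the helper {{subwords}} and the value 15 for the maximum subword size are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of the length bound; not Lucene code.
public class SubwordLengthBound {

    // Collects every dictionary match of at most maxLen characters.
    static List<String> subwords(String term, Set<String> dictionary, int maxLen) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < term.length(); start++) {
            for (int len = 1; len <= maxLen && start + len <= term.length(); len++) {
                String candidate = term.substring(start, start + len);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("gesellschaft", "schaft");
        String term = "gesellschaft";
        int maxSubwordSize = 15; // assumed value for illustration

        // Bound by maxSubwordSize only: the term itself is returned again.
        // prints [gesellschaft, schaft]
        System.out.println(subwords(term, dict, maxSubwordSize));

        // Bound by min(maxSubwordSize, length - 1): the full term can no longer match.
        // prints [schaft]
        System.out.println(subwords(term, dict, Math.min(maxSubwordSize, term.length() - 1)));
    }
}
```

Note that this only addresses the duplicated original term; "schaft" is still emitted and would need the stricter matching discussed earlier.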
Perhaps this kind of adjustment should rather be done in a TokenFilter
similar to RemoveDuplicatesTokenFilter instead of complicating the
decompounding algorithm?
> HyphenationCompoundWordTokenFilter creates overlapping tokens with
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
> hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
> dictionary="lang/wordlist_de.txt"
> onlyLongestMatch="true"/>
>
> Reporter: Rupert Westenthaler
> Assignee: Uwe Schindler
> Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if
> onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache Offo
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the
> dictionary
>