[
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374916#comment-16374916
]
Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------
I was not aware that this is the intended behaviour.
{quote}For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß",
"Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as
"Ball" has already been considered part of Fußball. Also, not sure if your
change also improves the situation for languages other than German.{quote}
That's a good point. Maybe one should still consider parts that are not enclosed
by a token that was already decomposed. So for {{Fußballpumpe}}: {{ball}}
would be ignored as {{Fußball}} is already present, but {{ballpumpe}} would
still be added as a token. Finally, {{pumpe}} is ignored as {{ballpumpe}} is
present.
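To make the "not enclosed" rule concrete, here is a minimal sketch of how I read it. It is not the filter's actual code: the class {{EnclosedMatchSketch}} and its {{decompose}} method are hypothetical names, it uses plain substring lookup instead of hyphenation points, and it assumes the input is already lowercased (as the {{LowerCaseFilterFactory}} in the configuration would do).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
final class EnclosedMatchSketch {

    /**
     * Returns the dictionary parts of 'word' that are not enclosed by a
     * part that was already accepted. Assumption: every substring is a
     * candidate; the real filter only looks at hyphenation points.
     */
    static List<String> decompose(String word, Set<String> dictionary) {
        // Collect every dictionary hit as a [start, end) offset pair.
        List<int[]> candidates = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int end = start + 1; end <= word.length(); end++) {
                if (dictionary.contains(word.substring(start, end))) {
                    candidates.add(new int[] {start, end});
                }
            }
        }
        // Left to right; at the same start offset the longer hit comes first.
        candidates.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(b[1], a[1]));

        // Keep a hit only if no previously kept hit fully covers it.
        List<int[]> kept = new ArrayList<>();
        for (int[] candidate : candidates) {
            boolean enclosed = false;
            for (int[] k : kept) {
                if (k[0] <= candidate[0] && candidate[1] <= k[1]) {
                    enclosed = true;
                    break;
                }
            }
            if (!enclosed) {
                kept.add(candidate);
            }
        }

        List<String> tokens = new ArrayList<>();
        for (int[] m : kept) {
            tokens.add(word.substring(m[0], m[1]));
        }
        return tokens;
    }
}
```

With the dictionary {{ball}}, {{ballpumpe}}, {{pumpe}}, {{fuß}}, {{fußball}} and the input {{fußballpumpe}}, this keeps {{fußball}} and {{ballpumpe}} and drops {{fuß}}, {{ball}} and {{pumpe}}, because each of those lies inside a part that was kept.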
This reminds me of {{ALL}}, {{NO_SUB}} and {{LONGEST_DOMINANT_RIGHT}} as
supported by the [Solr Text
Tagger|https://github.com/OpenSextant/SolrTextTagger#the-tagger-request-time-parameters-are].
{quote}
Perhaps these kind of adjustments should rather be done in a TokenFilter
similar to RemoveDuplicatesTokenFilter instead of complicating the
decompounding algorithm?
{quote}
I am aware of this possibility. In fact, I do use the
{{RemoveDuplicatesTokenFilter}} to remove those tokens. My question was simply
why they are added in the first place.
> HyphenationCompoundWordTokenFilter creates overlapping tokens with
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-8183
> URL: https://issues.apache.org/jira/browse/LUCENE-8183
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 6.6
> Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
> hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
> dictionary="lang/wordlist_de.txt"
> onlyLongestMatch="true"/>
>
> Reporter: Rupert Westenthaler
> Assignee: Uwe Schindler
> Priority: Major
> Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if
> onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
> Hyphenator: {{de_DR.xml}} //from Apache OFFO
> onlyLongestMatch: true
>
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
> # the 2nd 'gesellschaft', as it duplicates the original token
> # the 'schaft', as it is a sub-token of 'gesellschaft' that is present in the
> dictionary
>