[jira] [Commented] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Rupert Westenthaler (JIRA) Fri, 23 Feb 2018 12:06:23 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374916#comment-16374916
 ]


Rupert Westenthaler commented on LUCENE-8183:
---------------------------------------------

I was not aware that this is the intended behaviour.

{quote}For "Fußballpumpe" and dictionary "Ball", "Ballpumpe", "Pumpe", "Fuß", 
"Fußball" you would get the tokens "Fußball" and "pumpe" but not "Ballpumpe" as 
"Ball" has already been considered part of Fußball. Also, not sure if your 
change also improves the situation for languages other than German.{quote}

That's a good point. Maybe one should still consider parts that are not enclosed 
by a token that was already decomposed. So for {{Fußballpumpe}}: {{ball}} 
would be ignored as {{Fußball}} is already present, but {{ballpumpe}} would 
still be added as a token. Finally, {{pumpe}} is ignored as {{ballpumpe}} is 
present.
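
The rule described above can be sketched as follows. This is only an illustration in Python, not the Lucene implementation: {{find_candidates}}, {{drop_enclosed}} and {{decompose}} are hypothetical names, and candidates are found by plain substring lookup against a lowercased dictionary rather than at hyphenation points as the real filter does.

```python
def find_candidates(word, dictionary):
    """Return (start, end) spans of every dictionary entry found in word.

    Assumption: plain substring lookup; the real filter only looks at
    hyphenation points.
    """
    spans = []
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            if word[start:end] in dictionary:
                spans.append((start, end))
    return spans


def drop_enclosed(spans):
    """Keep only spans that are not enclosed by an already kept span."""
    # Left to right; at the same start offset the longest span comes first.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    kept = []
    for start, end in ordered:
        enclosed = any(k_start <= start and end <= k_end
                       for k_start, k_end in kept)
        if not enclosed:
            kept.append((start, end))
    return kept


def decompose(word, dictionary):
    """Return the sub-tokens of word that survive the enclosure rule."""
    word = word.lower()
    return [word[s:e] for s, e in drop_enclosed(find_candidates(word, dictionary))]


dictionary = {"ball", "ballpumpe", "pumpe", "fuß", "fußball"}
print(decompose("Fußballpumpe", dictionary))  # ['fußball', 'ballpumpe']
```

With this rule {{fuß}} and {{ball}} are dropped because {{fußball}} encloses them, and {{pumpe}} is dropped because {{ballpumpe}} encloses it, while the two overlapping tokens {{fußball}} and {{ballpumpe}} are both kept.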

This reminds me of the {{ALL}}, {{NO_SUB}} and {{LONGEST_DOMINANT_RIGHT}} options 
supported by the [Solr Text 
Tagger|https://github.com/OpenSextant/SolrTextTagger#the-tagger-request-time-parameters-are].
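
For comparison, here is a rough sketch of how {{LONGEST_DOMINANT_RIGHT}} would treat the {{Fußballpumpe}} example, going by the description in the Text Tagger documentation: pick the longest tag (the rightmost one on a tie), drop everything that overlaps it, and repeat. The span list is written out by hand and the function names are hypothetical; this is not the tagger's code.

```python
def overlaps(a, b):
    """True if two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def longest_dominant_right(spans):
    """Repeatedly keep the longest span (rightmost on a tie) and drop overlaps."""
    remaining = list(spans)
    kept = []
    while remaining:
        # Longest first; on equal length prefer the larger start offset.
        best = max(remaining, key=lambda s: (s[1] - s[0], s[0]))
        kept.append(best)
        remaining = [s for s in remaining if not overlaps(s, best)]
    return sorted(kept)


word = "fußballpumpe"
# Hand-written spans for fuß, fußball, ball, ballpumpe, pumpe.
spans = [(0, 3), (0, 7), (3, 7), (3, 12), (7, 12)]
print([word[s:e] for s, e in longest_dominant_right(spans)])  # ['fuß', 'ballpumpe']
```

Under these assumptions the result is {{fuß}} and {{ballpumpe}}: no overlapping tokens are emitted at all, in contrast to a {{NO_SUB}}-like rule, which only drops tokens that are completely enclosed by another one.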

{quote}
Perhaps these kind of adjustments should rather be done in a TokenFilter 
similar to RemoveDuplicatesTokenFilter instead of complicating the 
decompounding algorithm?
{quote}
I am aware of this possibility. In fact I do use the 
{{RemoveDuplicatesTokenFilter}} to remove those tokens. My point was just to 
ask why they are added in the first place.

> HyphenationCompoundWordTokenFilter creates overlapping tokens with 
> onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if 
> onlyLongestMatch is enabled. 
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache Offo
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft' that is present in the 
> dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

