[ 
https://issues.apache.org/jira/browse/LUCENE-8863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867709#comment-16867709
 ] 

Tomoko Uchida commented on LUCENE-8863:
---------------------------------------

One small thing:
I think the issue title should be changed to a more descriptive one. (It's hard 
for me to imagine that new Dictionary constructors were added in an issue named 
"Improve handling of edge cases in Kuromoji's DIctionaryBuilder"...)

> Improve handling of edge cases in Kuromoji's DIctionaryBuilder
> --------------------------------------------------------------
>
>                 Key: LUCENE-8863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Assignee: Mike Sokolov
>            Priority: Major
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> While building a custom Kuromoji system dictionary, I discovered a few issues.
> First, the dictionary encoding has room for 13-bit (left and right) ids, but 
> really only supports 12 bits since this was all that was needed for the 
> IPADIC dictionary that ships with Kuromoji. The good news is we can easily 
> add support for the 13th bit by fixing the bit-twiddling math.
> Second, the dictionary builder has a number of assertions that help uncover 
> problems in the input (like these overlarge ids), but assertions aren't 
> enabled by default, so an unsuspecting new user gets no benefit from them. 
> We should upgrade them to "real" exceptions.
> Finally, we want to handle the case of empty base forms differently. Kuromoji 
> does stemming by substituting a base form for a word when there is a base 
> form in the dictionary. Missing base forms are expected to be supplied as 
> {{*}}, but if a dictionary provides an empty string base form, we would end 
> up stripping that token completely. Since there is no possible meaning for an 
> empty base form (and the dictionary builder already treats {{*}} and empty 
> strings as equivalent in a number of other cases), I think we should simply 
> ignore empty base forms (rather than replacing words with empty strings when 
> tokenizing!).
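
For readers following along, here is a minimal Java sketch of the three points in the description above. It is not the Lucene code: the class name IdPackingSketch, its method names, and the layout (a 13-bit id sharing a 16-bit short with 3 flag bits) are assumptions made for illustration. It shows how an unmasked read can lose the 13th bit to sign extension, how a range check can throw a real exception instead of relying on an assertion, and how an empty base form can be treated like "*".

```java
public final class IdPackingSketch {
    // Assumed layout: the upper 13 bits hold the id, the lower 3 bits hold flags.
    static final int FLAG_BITS = 3;
    static final int MAX_ID = (1 << 13) - 1; // 8191

    // Range checks throw real exceptions, so they run even without -ea.
    static short pack(int id, int flags) {
        if (id < 0 || id > MAX_ID) {
            throw new IllegalArgumentException("id out of range [0, " + MAX_ID + "]: " + id);
        }
        if (flags < 0 || flags >= (1 << FLAG_BITS)) {
            throw new IllegalArgumentException("flags out of range: " + flags);
        }
        return (short) (id << FLAG_BITS | flags);
    }

    // Naive read: the short is sign-extended to int before the shift,
    // so any id that uses the 13th bit comes back wrong.
    static int unpackIdNaive(short packed) {
        return packed >>> FLAG_BITS;
    }

    // Masked read: clear the sign-extended upper bits first.
    static int unpackId(short packed) {
        return (packed & 0xffff) >>> FLAG_BITS;
    }

    // Treat null, "" and "*" alike as "no base form" (returns null),
    // so an empty base form never replaces the surface token.
    static String baseFormOrNull(String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || baseForm.equals("*")) {
            return null;
        }
        return baseForm;
    }
}
```

With this layout, ids up to 4095 survive the naive read because the packed short stays non-negative, which is consistent with only 12 bits working in practice; 4096 and above need the masked read.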



