[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

Tomoko Uchida (JIRA) Thu, 30 May 2019 23:31:10 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852706#comment-16852706
 ]


Tomoko Uchida commented on LUCENE-8816:
---------------------------------------

Related to the comment from [~rcmuir],
{quote}This has changed, so I think it makes sense to look at how to really 
support other japanese dictionaries compatible with the apache license. It 
might mean representing some data differently in the worst case because we need 
more bits.
{quote}
I will try to describe my current rough thoughts.

As for Japanese dictionaries, I think we should restrict the supported 
dictionaries to "[MeCab IPADIC|https://taku910.github.io/mecab/#download]" 
and "[UniDic|https://unidic.ninjal.ac.jp/download#unidic_bccwj]". Both are 
distributed under OSS licenses that are compatible with the ASF's policy. 
Although a few well-known extensions of them, as well as user-customized 
variants, are already used with Kuromoji in the wild, we need not (and 
actually cannot) give "official" support or a strict compatibility policy for 
those variants.

Nevertheless, it would be good for users if we made small adjustments at some 
points in the implementation so that well-known variants of mecab-ipadic or 
unidic can be built. More concretely, for example: "relaxing the maximum 
string length allowed for inputs from X to Y when option Z is given, with 
possible performance degradation or an unoptimized data representation".
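
To make the example above a bit more concrete, here is a minimal sketch of such an opt-in relaxation. It is only an illustration under assumptions: {{EntryLengthPolicy}} and all of its methods are hypothetical names that do not exist in the Kuromoji module, the limits passed in stand in for the unspecified X and Y, calling {{relaxedTo}} plays the role of "option Z", and length is measured in UTF-16 code units simply because that is what {{String.length()}} returns.

```java
// Hypothetical sketch: none of these names exist in Lucene's Kuromoji module.
// It only illustrates "a strict default limit X that can be relaxed to Y on request".
final class EntryLengthPolicy {
  private final int maxLength;

  private EntryLengthPolicy(int maxLength) {
    if (maxLength <= 0) {
      throw new IllegalArgumentException("maxLength must be positive: " + maxLength);
    }
    this.maxLength = maxLength;
  }

  /** Strict policy using the default limit (the "X" in the text; value is the caller's choice). */
  static EntryLengthPolicy strict(int defaultLimit) {
    return new EntryLengthPolicy(defaultLimit);
  }

  /** Opt-in relaxation (the "option Z"): returns a new policy with a larger limit "Y". */
  EntryLengthPolicy relaxedTo(int relaxedLimit) {
    if (relaxedLimit < maxLength) {
      throw new IllegalArgumentException(
          "relaxed limit " + relaxedLimit + " is below the current limit " + maxLength);
    }
    return new EntryLengthPolicy(relaxedLimit);
  }

  int maxLength() {
    return maxLength;
  }

  /** True if the surface form fits within the limit (UTF-16 code units, an assumption). */
  boolean accepts(String surface) {
    return surface.length() <= maxLength;
  }

  /** Throws instead of silently truncating, so builder failures stay visible. */
  void validate(String surface) {
    if (!accepts(surface)) {
      throw new IllegalArgumentException(
          "entry too long: " + surface.length() + " > " + maxLength);
    }
  }
}
```

A builder could then keep {{EntryLengthPolicy.strict(x)}} as its default and switch to {{strict(x).relaxedTo(y)}} only when the user asks for it, so the existing compact representation stays the default path. Whether a larger limit actually requires a different data representation depends on the current encoding, which I have not checked yet.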

Does that make sense to you? Currently I have no idea about the performance or 
bit-level data size tuning that has already been done, so please correct me 
if I have missed the point.

Anyway, I'd like to start by understanding the exact meaning of the current 
assertions/restrictions in the builder class, and then discuss with you how we 
should/can change them, a little later :)

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled 
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for 
> many years. While it has slowly become obsolete, well-maintained and/or 
> extended dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Some attempts/projects/efforts have 
> been made in Japan to use them with Kuromoji.
> However, the current architecture (a jar with the dictionary bundled) is 
> essentially incompatible with the idea of "switching the system dictionary", 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is actually a 
> natural idea, and I feel that it's a good time to rethink the current 
> architecture.
> Also, this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so at this moment I have no idea whether 
> it is easy or difficult.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

