[
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849778#comment-16849778
]
Jim Ferenczi commented on LUCENE-8816:
--------------------------------------
We discussed this when we added the Korean module and said that we could have a
separate module to handle "mecab-like" tokenization and one module per
dictionary (ipadic, mecab-ko-dic, ...). There are some assertions in the
JapaneseTokenizer that check invariants of the ipadic (leftId == rightId,
for instance), but I guess we could move them into the dictionary module. This
could be a nice cleanup if the goal is to handle multiple mecab dictionaries
(in different languages).
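As a rough illustration of that split, here is a minimal sketch in which the
dictionary module, not the tokenizer, owns its invariant checks. All names
here (DictEntry, MorphDictionary, InMemoryDictionary, IpadicLikeDictionary,
UnconstrainedDictionary) are hypothetical and are not existing Lucene classes;
the in-memory map merely stands in for an encoded dictionary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One dictionary entry (hypothetical type); fields follow the leftId/rightId/cost terms. */
final class DictEntry {
    final String surface;
    final int leftId;
    final int rightId;
    final int wordCost;

    DictEntry(String surface, int leftId, int rightId, int wordCost) {
        this.surface = surface;
        this.leftId = leftId;
        this.rightId = rightId;
        this.wordCost = wordCost;
    }
}

/** Hypothetical contract that a shared "mecab-like" tokenizer module would depend on. */
interface MorphDictionary {
    String name();

    List<DictEntry> lookup(String surface);

    /** Dictionary-specific invariants live here instead of in the tokenizer. */
    void validate(DictEntry entry);
}

/** Shared in-memory storage; subclasses only decide which invariants apply. */
abstract class InMemoryDictionary implements MorphDictionary {
    private final Map<String, List<DictEntry>> entries = new HashMap<>();

    void add(DictEntry entry) {
        validate(entry); // reject entries that break this dictionary's invariants
        entries.computeIfAbsent(entry.surface, k -> new ArrayList<>()).add(entry);
    }

    @Override
    public List<DictEntry> lookup(String surface) {
        return entries.getOrDefault(surface, Collections.emptyList());
    }
}

/** Enforces the leftId == rightId invariant mentioned above for the ipadic. */
final class IpadicLikeDictionary extends InMemoryDictionary {
    @Override
    public String name() {
        return "ipadic-like";
    }

    @Override
    public void validate(DictEntry entry) {
        if (entry.leftId != entry.rightId) {
            throw new IllegalArgumentException(
                "leftId != rightId for entry '" + entry.surface + "'");
        }
    }
}

/** A dictionary module that imposes no such invariant. */
final class UnconstrainedDictionary extends InMemoryDictionary {
    @Override
    public String name() {
        return "unconstrained";
    }

    @Override
    public void validate(DictEntry entry) {
        // intentionally empty: this dictionary accepts any leftId/rightId pair
    }
}
```

With this shape the tokenizer would only see MorphDictionary, so an entry with
differing ids is rejected by IpadicLikeDictionary but accepted by
UnconstrainedDictionary without any change to the tokenization code.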
{quote}
While it has slowly become obsolete, well-maintained and/or extended
dictionaries have emerged in recent years (e.g.
[mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
[UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, several
attempts/projects/efforts have been made in Japan.
{quote}
While allowing more flexibility would be nice, I wonder if there are that many
different dictionaries. If the ipadic is obsolete, we could also adapt the main
distribution (kuromoji) to use the UniDic instead. Even if we handle multiple
dictionaries, we'll still need to provide a way for users to add custom entries.
MeCab has an option to compute the leftId, rightId and cost automatically from
a partial user entry, so I wonder if this could help users avoid
reimplementing a dictionary from scratch?
> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
>
> I was inspired by this mailing-list thread.
>
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for
> many years.
> While it has slowly become obsolete, well-maintained and/or extended
> dictionaries have emerged in recent years (e.g.
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, several
> attempts/projects/efforts have been made in Japan.
> However, the current architecture (a jar with the dictionary bundled in) is
> essentially incompatible with the idea of switching the system dictionary,
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the
> encoded dictionary (language model) have been decoupled (as in MeCab, the
> origin of Kuromoji, or lucene-gosen). So decoupling them is actually a
> natural idea, and I feel that it's a good time to rethink the current
> architecture.
> This would also be good for advanced users who have customized/re-trained
> their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer from the encoded system dictionary.
> * Implement a dynamic dictionary loading mechanism.
> * Provide a developer-oriented dictionary build tool.
> Non-goals:
> * Provide a learner or language model (that is up to users and should be
> outside the scope).
> I have not dug into the code yet, so I have no idea whether this is easy or
> difficult at this point.