[
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859964#comment-16859964
]
Tomoko Uchida commented on LUCENE-8817:
---------------------------------------
Hi [~cm],
thanks for your comment! I used the term "mecab" without any deep thought in my
previous comment. I respect your perspective and agree that we should not
use "mecab" in naming, except for the code which handles the dictionary "MeCab
IPADIC".
I also like the idea to use "viterbi" for shared tokenizer code. Meanwhile,
"statistical" sounds a little bit too general to me for describing the
analyzers' functionality. I just thought about using "morphologic" or "morph"
in the module name instead of "mecab", but there is already a "morfologik"
module, so it would be confusing...
There is another idea: how about using "kuromoji" in the top-level module name
for both the Japanese and Korean analyzers, and changing the current module names
"kuromoji" and "nori" to "kuromoji-ja" and "kuromoji-ko"? They are just module
names for internal use and not used in any exposed package or class or method
names (as far as I know). And they are not used in user configuration files (as
far as I know).
To clarify, my revised proposal would look like this. (I also changed
"tools" to "dict-tools" for clarity.)
{code:java}
analysis
└── kuromoji
    ├── common (module: analyzers-kuromoji-common)
    │   ├── build.xml
    │   └── src
    ├── ja (module: analyzers-kuromoji-ja)
    │   ├── build.xml
    │   └── src
    ├── ko (module: analyzers-kuromoji-ko)
    │   ├── build.xml
    │   └── src
    └── dict-tools (module: analyzers-kuromoji-dict-tools)
        ├── build.xml
        └── src
{code}
It looks natural to me, if we pursue the integration of the two analyzers. Does
the change sound too aggressive (especially for Korean analyzer users)? I'd
love to hear comments from others. :)
> Combine Nori and Kuromoji DictionaryBuilder
> -------------------------------------------
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Namgyu Kim
> Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently, the Nori and Kuromoji analyzers use the same dictionary structure
> (MeCab).
> If we combine their DictionaryBuilders, we can reduce the code size.
> However, some of this work may be language-dependent
> (like the HEADER string in BinaryDictionary and CharacterDefinition, methods in
> BinaryDictionaryWriter, ...).
> On the other hand, there are many overlapping classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the
> same system dictionary generator.
> It may take some time because some work is involved.
> The work will be based on the latest master, and if LUCENE-8816 is
> finished first, I will pull the latest code and proceed.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)