[
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849967#comment-16849967
]
Namgyu Kim commented on LUCENE-8816:
------------------------------------
Hi everybody,
Thank you for opening the issue, [~tomoko]!
To be honest, when I first brought up a custom system dictionary, I did
not see the big picture.
Anyway, here is the structure I have in mind.
1. As Tomoko said, make a developer-oriented dictionary build tool
The "ant regenerate" target in the build.xml that I checked runs the
following steps.
1) Compile the code (compile-tools)
2) Download the jar file (download-dict)
3) Save Noun.proper.csv diffs (patch-dict)
4) Run DictionaryBuilder and make dat files (build-dict)
This is fine if the user only builds the system dictionary. (Of course, there
is still the problem of having to modify the classpath.)
(ex) ant build-dict ipadic /home/my/path/customDicIn(custom-dic input)
/home/my/path/customDicOutput(dat output) utf-8 false
However, if the user needs to download a dictionary from a server, they have to
modify the build.xml.
As far as I know, the URL is hard-coded.
Of course, the user can make it work by modifying ivy.xml and build.xml.
But in my personal opinion, users should not have to touch Lucene's internal
code (not even the build script).
Users may be afraid to change it or feel reluctant to use it (especially
those who have never used Apache Ant).
However, I think this may differ from person to person.
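As a rough sketch of what the entry point of such a build tool could accept, the five positional arguments from the build-dict example above can be parsed without touching build.xml. The class name DictBuildOptions and its field names are hypothetical, and reading the last argument ("false") as a "normalize entries" flag is an assumption; only the input and output directories are labeled in the example itself.

```java
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical holder for the five positional arguments of the build-dict example. */
final class DictBuildOptions {
    static final String USAGE =
        "usage: <format> <inputDir> <outputDir> <encoding> <normalizeEntries>";

    final String format;            // e.g. "ipadic"
    final Path inputDir;            // custom dictionary input (CSV files)
    final Path outputDir;           // where the dat files are written
    final Charset encoding;         // encoding of the input files
    final boolean normalizeEntries; // assumption: meaning of the trailing "false"

    private DictBuildOptions(String format, Path inputDir, Path outputDir,
                             Charset encoding, boolean normalizeEntries) {
        this.format = format;
        this.inputDir = inputDir;
        this.outputDir = outputDir;
        this.encoding = encoding;
        this.normalizeEntries = normalizeEntries;
    }

    /** Parses exactly five arguments; anything else is rejected with the usage text. */
    static DictBuildOptions parse(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(USAGE);
        }
        return new DictBuildOptions(
            args[0].toLowerCase(Locale.ROOT),
            Paths.get(args[1]),
            Paths.get(args[2]),
            Charset.forName(args[3]),
            Boolean.parseBoolean(args[4]));
    }

    public static void main(String[] args) {
        if (args.length != 5) {
            System.err.println(USAGE);
            return;
        }
        DictBuildOptions o = parse(args);
        System.out.println(o.format + ": " + o.inputDir + " -> " + o.outputDir);
    }
}
```

A real tool would hand these values to the dictionary builder; the point here is only that every location is a parameter, so nothing in the build script needs editing.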
2. Version Control
I actually think this is the biggest problem.
As I mentioned in the email,
whenever the Lucene version goes up, users have to rebuild their system
dictionary unconditionally and put it in the jar.
This is because the current process is:
1) Build the dictionary as in item 1.
2) Move the user's system dictionary dat files to
resources/org.apache.lucene.analysis.ja.dict
3) ant jar
Because of step 3), the user always has to rebuild the kuromoji module or patch
the kuromoji jar.
Users may find this irritating when a version upgrade contains no changes to
the kuromoji module.
This problem could be solved easily if the system dictionary alone could be
passed as a parameter to JapaneseTokenizer.
(Of course, expert-level javadoc would be required.)
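To illustrate the idea, here is a minimal sketch of how the dictionary location could become a parameter. SystemDictionaryLocator is a hypothetical class, not an existing Lucene API: given an external directory it reads the dat files from there, otherwise it falls back to the files bundled on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical: decides where the system dictionary files are read from. */
final class SystemDictionaryLocator {
    // null means "use the dictionary files bundled in the jar"
    private final Path externalDir;

    SystemDictionaryLocator(Path externalDir) {
        this.externalDir = externalDir;
    }

    boolean isExternal() {
        return externalDir != null;
    }

    /** Opens one dictionary file, from the external directory if one was given. */
    InputStream open(String fileName) {
        try {
            if (externalDir != null) {
                return Files.newInputStream(externalDir.resolve(fileName));
            }
            // Fallback: look the file up next to this class on the classpath.
            InputStream in = SystemDictionaryLocator.class.getResourceAsStream(fileName);
            if (in == null) {
                throw new IOException("bundled dictionary file not found: " + fileName);
            }
            return in;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = args.length > 0 ? Paths.get(args[0]) : null;
        SystemDictionaryLocator locator = new SystemDictionaryLocator(dir);
        System.out.println(locator.isExternal() ? "external: " + dir : "bundled dictionary");
    }
}
```

With something along these lines, a Lucene upgrade would not force users to rebuild the jar: the externally built dat files stay where they are and only the directory is passed in.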
> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
>
> I was inspired by this mailing-list thread.
>
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many
> years. While it has slowly become obsolete, well-maintained and/or extended
> dictionaries have risen up in recent years (e.g.
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some
> attempts/projects/efforts have been made in Japan.
> However, the current architecture - a dictionary-bundled jar - is essentially
> incompatible with the idea of "switching the system dictionary", and
> developers have difficulty doing so.
> Traditionally, the morphological analysis engine (viterbi logic) and the
> encoded dictionary (language model) had been decoupled (like MeCab, the
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a
> natural idea, and I feel that it's good time to re-think the current
> architecture.
> Also this would be good for advanced users who have customized/re-trained
> their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer itself and encoded system dictionary.
> * Implement dynamic dictionary load mechanism.
> * Provide developer-oriented dictionary build tool.
> Non-goals:
> * Provide learner or language model (it's up to users and should be outside
> the scope).
> I have not dived into the code yet, so I have no idea whether it is easy or
> difficult at this moment.