Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
> Anyway, in my personal opinion, Lucene does not need to consider whether the system dictionary status is good or not. Please don't get me wrong, but I don't think so. Creating a customized or re-trained system dictionary still needs deep knowledge about language and machine-learning. Even among

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Oh, I think my explanation was not enough. Sorry... I mentioned the following steps:

1. Modify your dictionary file and rebuild.
   1-1) Install MeCab
   1-2) Install MeCab Dictionary
   1-3) Modify your dictionary file
   1-4) Pack it into a tar.gz

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Tomoko Uchida
Hi, The system dictionary is not a mere "word collection"; it includes a machine-learned language model that was carefully trained by researchers. If you want to replace the system dictionary, you have to start by re-training the model. This needs expert knowledge so I do not recommend to just mo

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Trejkaz
On Sun, 26 May 2019 at 23:49, Namgyu Kim wrote:
> I think so about that approach.
> It's not user-friendly and it is not good for the user. I think it's better to get the parameters in JapaneseTokenizer.
>
> What do you think about this?

A way to override the system dictionary would be useful

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
> I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?)

Yes, that's right. The "regenerate" task runs commands in the following order:
1) Compile the code (compile-tools)
2) Download the jar file (download-dict)
3) Save Noun.proper.csv d

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Michael Sokolov
Thanks, Namgyu. I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?) and I can replace the existing one on the classpath with jar surgery for now. Not a very user-friendly approach, but it will enable me to run some experiments and

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Namgyu Kim
Sorry for the wrong information, Mike. Tomoko is right; I checked it incorrectly. The user dictionary is independent of the system dictionary. If you supply user entries, JapaneseTokenizer builds two FSTs, one for the built-in dictionary and one for the user dictionary, and they are queried separately.
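
To make the "two separate lookups" idea concrete, here is a minimal toy sketch in plain Java. It is not Lucene's code: the class name `TwoDictionaryLookup`, the two maps standing in for the FSTs, the romanized placeholder entries, and the choice to collect candidates from both sources are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: two independent dictionaries, each queried on its own.
// Plain maps stand in for the FSTs; this is NOT how Lucene implements them.
public class TwoDictionaryLookup {
    // Hypothetical "system" entries (romanized placeholders, not real dictionary data).
    static final Map<String, String> SYSTEM = new HashMap<>();
    // Hypothetical "user" entries supplied at runtime.
    static final Map<String, String> USER = new HashMap<>();

    static {
        SYSTEM.put("kansai", "noun");
        SYSTEM.put("kokusai", "noun");
        SYSTEM.put("kuukou", "noun");
        USER.put("kansaikokusaikuukou", "custom-noun");
    }

    // Returns every entry of dict that is a prefix of text starting at offset,
    // shortest first, each tagged with the name of the dictionary it came from.
    static List<String> prefixMatches(Map<String, String> dict, String text,
                                      int offset, String source) {
        List<String> matches = new ArrayList<>();
        for (int end = offset + 1; end <= text.length(); end++) {
            String candidate = text.substring(offset, end);
            if (dict.containsKey(candidate)) {
                matches.add(source + ":" + candidate);
            }
        }
        return matches;
    }

    // Queries the user dictionary and the system dictionary separately.
    // Changing USER never touches SYSTEM, and vice versa.
    static List<String> candidates(String text, int offset) {
        List<String> out = new ArrayList<>();
        out.addAll(prefixMatches(USER, text, offset, "user"));
        out.addAll(prefixMatches(SYSTEM, text, offset, "system"));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(candidates("kansaikokusaikuukou", 0));
    }
}
```

The point of the sketch is only that adding or removing user entries leaves the system entries untouched, which is why a user dictionary cannot be used to delete or re-weight words in the built-in one.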