Hi Thushara,

Please use the lucene-gosen mailing list for lucene-gosen questions:
http://groups.google.com/group/lucene-gosen

Thanks,
koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/

(12/03/03 6:41), Thushara Wijeratna wrote:
> I'm testing lucene-gosen for Japanese tokenization and wondering what the
> differences are between the two jars provided. (ipadic / chaisen)?
> In my preliminary testing, I'm not seeing any difference in tokenization in
> these two jars. (the jar with no dictionary did not work, I assume I need
> to make available a custom dictionary - header.sen which I did not try)
>
> I tried to tokenize this phrase:
>
> ゴルフが大好きなあなた。
> アメリカにあるベスト・ゴルフコース情報が満載のイエローページ・ジャパンでは、オンラインまたはガイド・ブックからもあらゆる情報が簡単に入手できます。
> 詳しい情報は
>
> which google translates as
>
> You love golf. Best golf course information in the United States is in the
> Yellow Pages Japan is full of, any information can be obtained easily from
> online or book guide. For more information
>
> I'm getting identical tokenization from both jars, namely :
>
> ゴルフ / Golf
> 大好き / I love
> あなた / You
> アメリカ / America
> ベスト / best
> ゴルフコース / Golf course
> 情報 / information
> 満載 / save
> イエロ / Hierro
> ページ / page
> ジャパン / Japan
> オンライン / online
> ガイド / guide
> ブック / book
> あらゆる / all
> 情報 / information
> 簡単 / simple
> 入手 / obtaining
> できる / able to
> 詳しい / detailed
> 情報 / information
>
> Note: translations based on Google Translate
>
> Any pointers you can provide as to the difference of the two methods of
> tokenizing would be highly appreciated.
>
> thx,
> thushara