mocobeta commented on a change in pull request #270:
URL: https://github.com/apache/solr/pull/270#discussion_r694839187



##########
File path: solr/solr-ref-guide/src/language-analysis.adoc
##########
@@ -2419,6 +2422,130 @@ Example:
 ====
 --
 
+=== Korean
+
+The Korean (nori) analyzer integrates Lucene's nori analysis module into Solr.
+It uses the https://bitbucket.org/eunjeon/mecab-ko-dic[mecab-ko-dic] dictionary to perform morphological analysis of Korean texts.
+
+The dictionary was built with http://taku910.github.io/mecab/[MeCab] and defines a feature format adapted to the Korean language.
+
+Nori also has a user dictionary feature that allows overriding the statistical model with your own entries for segmentation, part-of-speech tags, and readings, without the need to specify weights.
+
+*Example*:
+
+[.dynamic-tabs]
+--
+[example.tab-pane#byname-lang-korean]
+====
+[.tab-label]*With name*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer name="korean" decompoundMode="discard" 
outputUnknownUnigrams="false"/>
+    <filter name="koreanPartOfSpeechStop" />
+    <filter name="koreanReadingForm" />
+    <filter name="lowercase" />
+  </analyzer>
+</fieldType>
+----
+====
+
+[example.tab-pane#byclass-lang-korean]
+====
+[.tab-label]*With class name (legacy)*
+[source,xml]
+----
+<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" 
outputUnknownUnigrams="false"/>
+    <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
+    <filter class="solr.KoreanReadingFormFilterFactory" />
+    <filter class="solr.LowerCaseFilterFactory" />
+  </analyzer>
+</fieldType>
+----
+====
+--
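+
+The user dictionary mentioned above is enabled through the tokenizer's `userDictionary` argument (described below).
+The following is a sketch only: the file name `lang/userdict_ko.txt` is a placeholder for a file you supply in the configset.
+
+[source,xml]
+----
+<fieldType name="text_ko_userdict" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <!-- lang/userdict_ko.txt is a hypothetical user-supplied file in the configset -->
+    <tokenizer name="korean"
+               userDictionary="lang/userdict_ko.txt"
+               userDictionaryEncoding="UTF-8"
+               decompoundMode="discard"
+               outputUnknownUnigrams="false"/>
+    <filter name="koreanPartOfSpeechStop" />
+    <filter name="koreanReadingForm" />
+    <filter name="lowercase" />
+  </analyzer>
+</fieldType>
+----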
+
+
+==== Korean Tokenizer
+
+*Factory class*: `solr.KoreanTokenizerFactory`
+
+*SPI name*: `korean`
+
+*Arguments*:
+
+`userDictionary`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Path to a user-supplied dictionary to add custom nouns or compound terms to the default dictionary.
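++
+A minimal sketch of such a file, assuming the nori user dictionary format (one entry per line; a compound may be followed by its space-separated segmentation) and purely illustrative entries, where `세종시` would be segmented into `세종` and `시`:
++
+[source,text]
+----
+c++
+C샤프
+세종
+세종시 세종 시
+----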
+
+`userDictionaryEncoding`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Character encoding of the user dictionary.
+
+`decompoundMode`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `discard`
+|===
++
+Defines how to handle compound tokens. The options are:
+
+* `none`: No decomposition for tokens.
+* `discard`: Tokens are decomposed and the original form is discarded.
+* `mixed`: Tokens are decomposed and the original form is retained.
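+
+As an illustration of the three modes, take a hypothetical compound token `AB` that the dictionary segments into `A` and `B` (placeholder names, not real dictionary entries):
+
+[source,text]
+----
+none:    AB
+discard: A B
+mixed:   AB A B
+----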
+
+`outputUnknownUnigrams`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: `true`

Review comment:
       I think the default value is `false`?
       https://github.com/apache/lucene/blob/83ba5d859c377c6882947253ce0c6435153a1139/lucene/analysis/nori/src/java/org/apache/lucene/analysis/ko/KoreanTokenizerFactory.java#L96




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


