On 2013-07-24, Yihui Xie wrote:

> As far as I know, there are no kernings and ligatures in Chinese. All
> Chinese characters are "independent" and of exactly the same width, so
> it is OK to calculate the string length by simply counting the number
> of characters.
> This post might help for the Unicode ranges:
> http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode

I did look up all the CJK characters for a somewhat similar problem in
Docutils and came up with the following list:

    # Unicode unifies under the term CJK (Chinese, Japanese, Korean) the
    # scripts Han, Bopomofo, Hiragana, Katakana, Hangul, and Yi. These
    # scripts use ideographs that do not require spaces between words.
    # Sources for determination of the "CJK" property are the `Unicode
    # standard Chapter 11 East Asian Scripts`__ describing the CJK unification
    # and the Unicode data file ``Scripts.txt``.
    # __ http://unicode.org/versions/Unicode4.0.0/ch11.pdf
    cjk_characters = (
        u'\u02EA\u02EB'    # Bopomofo modifier letters
        u'\u1100-\u11FF'   # 1100..11FF; Hangul Jamo
        u'\u2E80-\u4DBF'   # 2E80..2EFF; CJK Radicals Supplement
                           # 2F00..2FDF; Kangxi Radicals
                           # 2FF0..2FFF; Ideographic Description Characters
                           # 3000..303F; CJK Symbols and Punctuation
                           # 3040..309F; Hiragana
                           # 30A0..30FF; Katakana
                           # 3100..312F; Bopomofo
                           # 3130..318F; Hangul Compatibility Jamo
                           # 3190..319F; Kanbun
                           # 31A0..31BF; Bopomofo Extended
                           # 31C0..31EF; CJK Strokes
                           # 31F0..31FF; Katakana Phonetic Extensions
                           # 3200..32FF; Enclosed CJK Letters and Months
                           # 3300..33FF; CJK Compatibility
                           # 3400..4DBF; CJK Unified Ideographs Extension A
        u'\u4E00-\uA4CF'   # 4E00..9FFF; CJK Unified Ideographs
                           # A000..A48F; Yi Syllables
                           # A490..A4CF; Yi Radicals
        u'\uA960-\uA97F'   # A960..A97F; Hangul Jamo Extended-A
        u'\uAC00-\uD7FF'   # AC00..D7AF; Hangul Syllables
                           # D7B0..D7FF; Hangul Jamo Extended-B
        u'\uF900-\uFAFF'   # F900..FAFF; CJK Compatibility Ideographs
        u'\uFE30-\uFE4F'   # FE30..FE4F; CJK Compatibility Forms
        u'\uFF00-\uFFEF'   # FF00..FFEF; Halfwidth and Fullwidth Forms
        u'\U0001B000'      # KATAKANA LETTER ARCHAIC E
        u'\U0001B001'      # HIRAGANA LETTER ARCHAIC YE
        u'\U0001F200'      # SQUARE HIRAGANA HOKA
        u'\U00020000-\U0002FA1F'
                           # 20000..2A6DF; CJK Unified Ideographs Extension B
                           # 2A700..2B73F; CJK Unified Ideographs Extension C
                           # 2B740..2B81F; CJK Unified Ideographs Extension D
                           # 2F800..2FA1F; CJK Compatibility Ideographs Supplement
        )

With a regular expression, wrapping could be allowed wherever a position
borders any of the specified characters.

Hope this helps.

Günter