On 2013-07-24, Yihui Xie wrote:
> As far as I know, there are no kernings and ligatures in Chinese. All
> Chinese characters are "independent" and of exactly the same width, so
> it is OK to calculate the string length by simply counting the number
> of characters.

> This post might help for the Unicode ranges:
> http://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode

I did look up all the CJK characters for a somewhat similar problem in
Docutils and came up with the following list:


    # Unicode unifies under the term CJK (Chinese, Japanese, Korean) the
    # scripts Han, Bopomofo, Hiragana, Katakana, Hangul, and Yi. These
    # scripts use ideographs that do not require spaces between words. 
    
    # Sources for determination of the "CJK" property are the `Unicode
    # standard Chapter 11 East Asian Scripts`__ describing the CJK unification
    # and the Unicode data file ``Scripts.txt``.
    # __ http://unicode.org/versions/Unicode4.0.0/ch11.pdf
    cjk_characters = (
        u'\u02EA\u02EB'    # Bopomofo modifier letters
        u'\u1100-\u11FF'   # 1100..11FF; Hangul Jamo
        u'\u2E80-\u4DBF'   # 2E80..2EFF; CJK Radicals Supplement
                           # 2F00..2FDF; Kangxi Radicals
                           # 2FF0..2FFF; Ideographic Description Characters
                           # 3000..303F; CJK Symbols and Punctuation
                           # 3040..309F; Hiragana
                           # 30A0..30FF; Katakana
                           # 3100..312F; Bopomofo
                           # 3130..318F; Hangul Compatibility Jamo
                           # 3190..319F; Kanbun
                           # 31A0..31BF; Bopomofo Extended
                           # 31C0..31EF; CJK Strokes
                           # 31F0..31FF; Katakana Phonetic Extensions
                           # 3200..32FF; Enclosed CJK Letters and Months
                           # 3300..33FF; CJK Compatibility
                           # 3400..4DBF; CJK Unified Ideographs Extension A
        u'\u4E00-\uA4CF'   # 4E00..9FFF; CJK Unified Ideographs
                           # A000..A48F; Yi Syllables
                           # A490..A4CF; Yi Radicals
        u'\uA960-\uA97F'   # A960..A97F; Hangul Jamo Extended-A
        u'\uAC00-\uD7FF'   # AC00..D7AF; Hangul Syllables
                           # D7B0..D7FF; Hangul Jamo Extended-B
        u'\uF900-\uFAFF'   # F900..FAFF; CJK Compatibility Ideographs
        u'\uFE30-\uFE4F'   # FE30..FE4F; CJK Compatibility Forms
        u'\uFF00-\uFFEF'   # FF00..FFEF; Halfwidth and Fullwidth Forms
        u'\U0001B000'      # KATAKANA LETTER ARCHAIC E
        u'\U0001B001'      # HIRAGANA LETTER ARCHAIC YE
        u'\U0001F200'      # SQUARE HIRAGANA HOKA
        u'\U00020000-\U0002FA1F' # 20000..2A6DF; CJK Unified Ideographs Extension B
                                 # 2A700..2B73F; CJK Unified Ideographs Extension C
                                 # 2B740..2B81F; CJK Unified Ideographs Extension D
                                 # 2F800..2FA1F; CJK Compatibility Ideographs Supplement
        )

With a regular expression, wrapping could be allowed at any position that
borders one of the specified characters.
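
A minimal sketch of such a regular expression, assuming Python 3 and, to keep
it short, only a subset of the ranges listed above. The names `cjk_sample`,
`cjk_border` and `break_opportunities` are made up for this example and are
not part of Docutils:

```python
import re

# Abbreviated subset of the ranges from the list above (an assumption made
# for brevity); substitute the full ``cjk_characters`` string in real use.
cjk_sample = (
    u'\u2E80-\u4DBF'   # radicals, kana, Bopomofo, ..., Extension A
    u'\u4E00-\uA4CF'   # CJK Unified Ideographs, Yi
    u'\uAC00-\uD7FF'   # Hangul Syllables, Hangul Jamo Extended-B
    u'\uF900-\uFAFF'   # CJK Compatibility Ideographs
)

# Zero-width match at every position that has one of the listed
# characters directly before or directly after it.
cjk_border = re.compile(u'(?<=[%s])|(?=[%s])' % (cjk_sample, cjk_sample))


def break_opportunities(text):
    """Return the indices inside `text` where a line break could be allowed."""
    return [m.start() for m in cjk_border.finditer(text)
            if 0 < m.start() < len(text)]  # skip start and end of the string
```

The returned indices only mark candidate positions. A real wrapping routine
would still have to combine them with the usual whitespace breaks and with
rules about punctuation that must not start or end a line.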

Hope this helps.

Günter
