[Daniel Shahaf]
> The current patch's docstring implies the LF byte is necessarily part
> of a line terminator, which is true for UTF-8/16/32 but not
> necessarily true in arbitrary encodings.

Nitpick: It is true in UTF-8, but not in UTF-16 or UTF-32.  There are about
70 characters in the BMP whose UTF-16LE (and UTF-32LE) encoding begins with
the byte 0x0A:

    $ grep '^..0A;' /usr/share/misc/unicode.gz | head
    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
    010A;LATIN CAPITAL LETTER C WITH DOT ABOVE;Lu;0;L;0043 0307;;;;N;LATIN CAPITAL LETTER C DOT;;;010B;
    020A;LATIN CAPITAL LETTER I WITH INVERTED BREVE;Lu;0;L;0049 0311;;;;N;;;;020B;
    030A;COMBINING RING ABOVE;Mn;230;NSM;;;;;N;NON-SPACING RING ABOVE;;;;
    040A;CYRILLIC CAPITAL LETTER NJE;Lu;0;L;;;;;N;;;;045A;
    050A;CYRILLIC CAPITAL LETTER KOMI NJE;Lu;0;L;;;;;N;;;;050B;
    060A;ARABIC-INDIC PER TEN THOUSAND SIGN;Po;0;ET;;;;;N;;;;;
    070A;SYRIAC CONTRACTION;Po;0;AL;;;;;N;;;;;
    080A;SAMARITAN LETTER KAAF;Lo;0;R;;;;;N;;;;;
    090A;DEVANAGARI LETTER UU;Lo;0;L;;;;;N;;;;;
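
For anyone who wants to check this without the Unicode data file, here is a
minimal Python sketch (not part of the patch under discussion). The helper
name contains_lf_byte is made up for illustration. It encodes U+010A in the
three encodings and shows that a naive byte-level split on 0x0A cuts a
UTF-16LE string that contains no LINE FEED character at all. UTF-8 is not
affected, because every byte of a multi-byte UTF-8 sequence has its high bit
set.

```python
def contains_lf_byte(ch: str, encoding: str) -> bool:
    """Return True if the encoded form of ch contains the byte 0x0A."""
    return b"\x0a" in ch.encode(encoding)


# U+010A LATIN CAPITAL LETTER C WITH DOT ABOVE, taken from the listing above.
ch = "\u010a"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    encoded = ch.encode(encoding)
    print(encoding, encoded.hex(), contains_lf_byte(ch, encoding))

# A string with no line feed in it, encoded as UTF-16LE.
data = "a\u010ab".encode("utf-16-le")

# Splitting on the raw LF byte still breaks it into two pieces,
# and the cut lands in the middle of a 16-bit code unit.
pieces = data.split(b"\n")
print(len(pieces), pieces)
```

So a docstring that treats the LF byte as a line terminator is only safe if
it also says the data is in an ASCII-compatible encoding such as UTF-8.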
