[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen Sun, 18 Sep 2011 15:45:40 -0700

Tom Christiansen <tchr...@perl.com> added the comment:

"Terry J. Reedy" <rep...@bugs.python.org> wrote
   on Thu, 08 Sep 2011 18:56:11 -0000:


>On 9/8/2011 4:32 AM, Ezio Melotti wrote:

>> So to summarize a bit, there are different possible levels of strictness:
>>    1) all the possible encodable values, including the ones > 10FFFF;
>>    2) values in range 0..10FFFF;
>>    3) values in range 0..10FFFF except surrogates (aka scalar values);
>>    4) values in range 0..10FFFF except surrogates and noncharacters;

>> and this is what is currently available in Python:
>>    1) not available, probably it will never be;
>>    2) available through the 'surrogatepass' error handler;
>>    3) default behavior (i.e. with the 'strict' error handler);
>>    4) currently not available.

>> Now, assume that we don't care about option 1 and want to implement the 
>> missing option 4 (which I'm still not 100% sure about).  The possible 
>> options are:
>>    * add a new codec (actually one for each UTF encoding);
>>    * add a new error handler that explicitly disallows noncharacters;
>>    * change the meaning of 'strict' to match option 4;

> If 'strict' meant option 4, then 'scalarpass' could mean option 3. 
> 'surrogatepass' would then mean 'pass surrogates also, in addition to 
> non-char scalars'.
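
For reference, the strictness levels listed above can be probed directly from Python. This is only a minimal sketch, assuming a current Python 3 interpreter; the helper name try_encode is made up for illustration and is not part of any library.

```python
def try_encode(text, errors="strict"):
    """Return the UTF-8 bytes for text, or None if the codec refuses it."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

# Level 1: code points above 10FFFF cannot even be constructed.
try:
    chr(0x110000)
    beyond_unicode_ok = True
except ValueError:
    beyond_unicode_ok = False

# Level 3: 'strict' refuses a lone surrogate.
surrogate_strict = try_encode("\ud800")
# Level 2: 'surrogatepass' lets the same surrogate through.
surrogate_pass = try_encode("\ud800", "surrogatepass")
# Level 4 is not enforced: 'strict' encodes a noncharacter without complaint.
nonchar_strict = try_encode("\ufdd0")

print(beyond_unicode_ok, surrogate_strict, surrogate_pass, nonchar_strict)
```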

I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
to reject both surrogates *and* noncharacters. Here's something from the
published Unicode Standard's p.24 about noncharacter code points:

    • Noncharacter code points are reserved for internal use, such as for 
      sentinel values. They should never be interchanged. They do, however,
      have well-formed representations in Unicode encoding forms and survive
      conversions between encoding forms. This allows sentinel values to be
      preserved internally across Unicode encoding forms, even though they are
      not designed to be used in open interchange.

And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:

    C2 A process shall not interpret a noncharacter code point as an 
       abstract character.

        • The noncharacter code points may be used internally, such as for 
          sentinel values or delimiters, but should not be exchanged publicly.

I'd have to check the fine print, but I am pretty sure that "shall not" 
is an imperative form.  We have understood that to mean that a conforming
process *must*not* do that.  It's because of that wording that in Perl,
{en,de}code() with any of the "UTF-{8,16,32}" encodings, including the
LE/BE versions as appropriate, will neither produce nor accept a
noncharacter code point like FDD0 or FFFE.
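
To make the "option 4" behavior concrete on the Python side, here is a hedged sketch of an encoder wrapper that refuses noncharacters in addition to what 'strict' already refuses. The names is_noncharacter and encode_for_interchange are hypothetical; nothing like them exists in the standard library, and this is not a proposal for how the codecs themselves should be changed.

```python
def is_noncharacter(cp):
    """True for the 66 Unicode noncharacter code points."""
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def encode_for_interchange(text, encoding="utf-8"):
    """Encode text, refusing surrogates (via 'strict') and noncharacters."""
    for index, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise ValueError(
                "noncharacter U+%04X at index %d" % (ord(ch), index))
    return text.encode(encoding, "strict")

# Sanity check: there are exactly 66 noncharacters in 0..10FFFF.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

Raising ValueError keeps the sketch simple; a real codec would more likely raise UnicodeEncodeError with position information.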

Do you think we may perhaps have misread that conformance clause?

Using Perl's special, loose-fitting "utf8" encoding, you can get it to handle
noncharacter code points and even surrogates, but you have to suppress
certain things to make that happen quietly.  You can only do this with
"utf8", not any of the UTF-16 or UTF-32 flavors.  There we give you no 
choice, so you must be strict.  I agree this is not fully orthogonal.

Note that this is the normal thing that people do:

    binmode(STDOUT, ":utf8");

which is the *loose* version.  The strict one is "utf8-strict" or "UTF-8":

    open(my $fh, "< :encoding(UTF-8)", $pathname)

So it is a bit too easy to get the loose one.  We felt we had to do this
because we were already using the loose definition (and allowing up to
chr(2**32) etc) when the Unicode Consortium made clear what sorts of
things must not be accepted, or perhaps, before we made ourselves clear
on this.  This will have been back in 2003, when I wasn't paying very
close attention.

I think that just like Perl, Python has a legacy of the original loose
definition.  So some way to accommodate that legacy while still allowing
for a conformant application should be devised.  My concern with Python
is that people tend to make their own manual calls to encode/decode a lot
more often than they do in Perl.  That means that if you only catch it
on a stream encoding, you'll miss it, because they will use binary I/O
and bypass the check.
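
A small Python sketch of that concern, assuming Python 3: bytes read through binary I/O and decoded by hand go straight through the 'strict' handler, noncharacter and all, so any check has to sit at the decode call itself. The function decode_from_interchange is a hypothetical helper, not an existing API.

```python
import io

def has_noncharacter(text):
    """True if text contains any of the 66 Unicode noncharacters."""
    return any(0xFDD0 <= ord(ch) <= 0xFDEF or (ord(ch) & 0xFFFE) == 0xFFFE
               for ch in text)

def decode_from_interchange(data, encoding="utf-8"):
    """Hypothetical strict decoder that also refuses noncharacters."""
    text = data.decode(encoding, "strict")
    if has_noncharacter(text):
        raise ValueError("noncharacter code point in decoded input")
    return text

# Binary I/O followed by a manual decode: no stream layer is involved.
raw = io.BytesIO(b"ok \xef\xb7\x90")   # UTF-8 for "ok " followed by U+FDD0
data = raw.read()

plain = data.decode("utf-8")           # succeeds; U+FDD0 comes through

try:
    decode_from_interchange(data)
    rejected = False
except ValueError:
    rejected = True

print(repr(plain), rejected)
```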

--tom

    Below I show a bit of how this works in Perl.  Currently the builtin
    utf8 encoding is controlled somewhat differently from how the Encode
    module's encode/decode functions are.  Yes, this is not my idea of good.

    This shows that noncharacters and surrogates do not survive the
    encoding/decoding process for UTF-16:

        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFDD0)))' | uniquote -v
        \N{REPLACEMENT CHARACTER}
        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFFFE)))' | uniquote -v
        \N{REPLACEMENT CHARACTER}
        % perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xD800)))' | uniquote -v
        UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
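
For comparison, the same round trip can be tried in Python. This is a sketch that assumes Python 3.4 or later, where the UTF-16 codec refuses lone surrogates under 'strict'; round_trip is an illustrative helper name.

```python
def round_trip(text, encoding):
    """Encode and then decode text with the default 'strict' handler."""
    return text.encode(encoding).decode(encoding)

# The noncharacter U+FDD0 survives the UTF-16 round trip unchanged.
nonchar_survives = round_trip("\ufdd0", "utf-16") == "\ufdd0"

# A lone surrogate is refused at the encoding step.
try:
    round_trip("\ud800", "utf-16")
    surrogate_survives = True
except UnicodeEncodeError:
    surrogate_survives = False

print(nonchar_survives, surrogate_survives)
```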

    If you pass a third argument to encode/decode, you can tell it what to
    do on error; an argument of 1 raises an exception.  Not supplying a
    third argument gets the "default" behavior, which varies by encoding.
    (The careful programmer is apt to want to pass in an appropriate
     bit mask of things like DIE_ON_ERR, WARN_ON_ERR, RETURN_ON_ERR,
     LEAVE_SRC, PERLQQ, HTMLCREF, or XMLCREF.)
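
Python's nearest counterpart to that third argument is the errors parameter of str.encode and bytes.decode. The sketch below uses only standard handler names; it is a loose analogy to the Perl CHECK values, not a one-to-one mapping.

```python
text = "caf\u00e9"   # "café": the last character has no ASCII mapping

# Substitute a placeholder for anything that cannot be encoded.
replaced = text.encode("ascii", "replace")
# Emit a backslash escape instead.
escaped = text.encode("ascii", "backslashreplace")
# Emit a numeric character reference instead.
xml_ref = text.encode("ascii", "xmlcharrefreplace")

# The default handler, 'strict', raises an exception.
try:
    text.encode("ascii", "strict")
    raised = False
except UnicodeEncodeError:
    raised = True

print(replaced, escaped, xml_ref, raised)
```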

    With "utf8" vs "UTF-8" using encode(), the default behavior is to swap in 
    the Unicode replacement character for things that don't map to the given 
    encoding, as you saw above with UTF-16:

        % perl -C0 -MEncode -wle 'print encode("utf8", chr(0xFDD0))' | uniquote -v
        \N{U+FDD0}
        % perl -C0 -MEncode -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

        % perl -C0 -MEncode= -wle 'print encode("utf8", chr(0xD800))' | uniquote -v
        \N{U+D800}
        % perl -C0 -MEncode= -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

        % perl -C0 -MEncode=:all -wle 'print encode("utf8", chr(0x100_0000))' | uniquote -v
        \N{U+1000000}
        % perl -C0 -MEncode=:all -wle 'print encode("UTF-8", chr(0x100_0000))' | uniquote -v
        \N{REPLACEMENT CHARACTER}

    With the builtin "utf8" encoding, which does *not* go through the
    Encode module, you instead control all this through lexical
    warnings/exceptions categories.   By default, you get a warning if
    you try to use noncharacter, surrogate, or nonunicode code points
    even on a loose utf8 stream (which is what -CS gets you):

        % perl -CS -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
        Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
        \N{U+FDD0}
        \N{U+D800}
        \N{U+1000000}

    Notice I didn't ask for warnings there, but I still got them.  The
    next command promotes all utf8 warnings into exceptions, thus dying
    on the first one it finds:

        % perl -CS -Mwarnings=FATAL,utf8 -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.

    You can control these separately.  For example, these all die of an
    exception:

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xFDD0)'   
        Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xD800)'   
        Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0x100_0000)' 
        Code point 0x1000000 is not Unicode, may not be portable at -e line 1.

    While these do not:

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "nonchar";     print chr(0xFDD0)'     | uniquote
        \N{U+FDD0}
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "surrogate";   print chr(0xD800)'     | uniquote
        \N{U+D800}
        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "non_unicode"; print chr(0x100_0000)' | uniquote
        \N{U+1000000}

        % perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings qw(nonchar surrogate non_unicode);
                        print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
        \N{U+FDD0}
        \N{U+D800}
        \N{U+1000000}
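
A rough Python analogue of that warn-by-default, fatal-on-request, silence-per-category model can be sketched with the standard warnings module. Everything named here (NoncharacterWarning, checked, is_fatal, is_quiet) is hypothetical; Python has no such built-in category, and this only illustrates the control flow.

```python
import warnings

class NoncharacterWarning(UnicodeWarning):
    """Hypothetical category, loosely modeled on Perl's "nonchar"."""

def checked(text):
    """Return text unchanged, warning about any noncharacter it contains."""
    for ch in text:
        cp = ord(ch)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            warnings.warn(
                "noncharacter U+%04X is illegal for open interchange" % cp,
                NoncharacterWarning, stacklevel=2)
    return text

def is_fatal(text):
    """Like -Mwarnings=FATAL: the warning becomes an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", NoncharacterWarning)
        try:
            checked(text)
        except NoncharacterWarning:
            return True
    return False

def is_quiet(text):
    """Like 'no warnings "nonchar"': the category is silenced."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", NoncharacterWarning)
        return checked(text) == text

print(is_fatal("\ufdd0"), is_fatal("abc"), is_quiet("\ufdd0"))
```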

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________