Tom Christiansen <tchr...@perl.com> added the comment:

I now see that the Unicode BOM FAQ has a lot of good material on things that have come up here lately regarding surrogates and other illegal characters, and on what may appear in data streams.
I quote a few of these from http://unicode.org/faq/utf_bom.html below:

Q: How do I convert an unpaired UTF-16 surrogate to UTF-8?
A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. By representing such an *unpaired* surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream. Therefore a converter *must* treat this as an error.

Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?
A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that encoding form conversion always results in a valid data stream.

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix shell scripts.

Q: What should I do with U+FEFF in the middle of a file?
A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining semantics, since it cannot be confused with a BOM. When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.

Q: Why wouldn't I always use a protocol that requires a BOM?
A: Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor permitted*. Any U+FEFF would be interpreted as a ZWNBSP. Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content but not be binary-equal (where one is prefaced by a BOM).
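Just to convince myself of the signature behavior, here is a quick Python 3 illustration (nothing beyond the standard codecs module and the stock 'utf-8' and 'utf-8-sig' codecs; the asserts merely record the results I see):

    import codecs

    text = "#!/usr/bin/env python"

    # Plain UTF-8 never writes a BOM; 'utf-8-sig' prepends one as a signature.
    assert not text.encode("utf-8").startswith(codecs.BOM_UTF8)
    assert text.encode("utf-8-sig").startswith(codecs.BOM_UTF8)

    # Decoding with 'utf-8-sig' strips a leading BOM; plain 'utf-8' does not,
    # so the line would no longer start with "#!".
    with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
    assert with_bom.decode("utf-8-sig") == text
    assert with_bom.decode("utf-8") == "\ufeff" + text

    # A U+FEFF that is not at the start is plain content (ZWNBSP), never a BOM.
    assert "a\ufeffb".encode("utf-8").decode("utf-8-sig") == "a\ufeffb"

In other words, the BOM is purely a signature for UTF-8: it says nothing about byte order, and a decoder that does not know to look for it will simply hand it back to you as a leading U+FEFF in the text.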
Somewhat frustratingly, I am now almost more confused than ever by the last two sentences of this next answer:

Q: What is a UTF?
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept. Each UTF is reversible, thus every UTF supports *lossless round tripping*: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF mapping *must also* map all code points that are not valid Unicode characters to unique byte sequences. These invalid code points are the 66 *noncharacters* (including FFFE and FFFF), as well as unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at the top are quite clear that it is illegal to have unpaired surrogates in a UTF stream, so I don't understand what it is saying about a UTF "must also" mapping all code points that aren't valid Unicode characters to "unique byte sequences" to ensure round tripping. At first reading, I'd almost say those appear to contradict each other. I must just be being boneheaded, though. It's very early morning yet, and maybe it will become clearer upon a fifth or sixth reading. Maybe it has to do with replacement characters? No, that can't be right. Muddle muddle. Sigh.

Important material is also found in http://www.unicode.org/faq/basic_q.html:

Q: Are surrogate characters the same as supplementary characters?
A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate code points are reserved for use, in pairs, in representing supplementary code points in UTF-16. There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate code point).

Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation. Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
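If it helps to see what one actual implementation does, here is a minimal Python 3 sketch (only the built-in str.encode/bytes.decode and the standard 'surrogatepass' error handler) showing the difference between a noncharacter and an unpaired surrogate:

    # U+FFFF is a noncharacter, but it is still a valid code point:
    # it converts to UTF-8 and round-trips without complaint.
    nonchar = "\uffff"
    assert nonchar.encode("utf-8").decode("utf-8") == nonchar

    # U+D800 is an unpaired surrogate: a strict, conformant converter
    # must treat it as an error.
    lone = "\ud800"
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        pass
    else:
        raise AssertionError("a lone surrogate should not encode strictly")

    # Only the deliberately non-strict 'surrogatepass' handler lets it
    # through, producing the ill-formed 3-byte sequence the FAQ warns about.
    ill_formed = lone.encode("utf-8", "surrogatepass")
    assert ill_formed == b"\xed\xa0\x80"
    assert ill_formed.decode("utf-8", "surrogatepass") == lone

    # A supplementary character is perfectly legal: in UTF-16 it becomes a
    # surrogate *pair*, which is not the same as a lone surrogate code point.
    assert "\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"

So at least as CPython reads it, the noncharacters round-trip through the ordinary strict codecs, while an unpaired surrogate only gets through when you explicitly ask for non-conformant behavior; perhaps that is the distinction the "must also" sentence is driving at.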
And in reference to UTF-16 being slower by code point than by code unit:

Q: How about using UTF-32 interfaces in my APIs?
A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the required functionality at the high levels. If its [sic] ever necessary to locate the nᵗʰ character, indexing by character can be implemented as a high level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run, for example, accessing UTF-16 storage as characters instead of code units resulted in a 10× degradation. While there are some interesting optimizations that can be performed, it will always be slower on average. Therefore locating other boundaries, such as grapheme, word, line or sentence boundaries, proceeds directly from the code unit index, not indirectly via an intermediate character code index.

I am somewhat amused by this summary:

Q: What does Unicode conformance require?
A: Chapter 3, Conformance discusses this in detail. Here's a very informal version:

  * Unicode characters don't fit in 8 bits; deal with it.
  * 2 [sic] Byte order is only an issue in I/O.
  * If you don't know, assume big-endian.
  * Loose surrogates have no meaning.
  * Neither do U+FFFE and U+FFFF.
  * Leave the unassigned codepoints alone.
  * It's OK to be ignorant about a character, but not plain wrong.
  * Subsets are strictly up to you.
  * Canonical equivalence matters.
  * Don't garble what you don't understand.
  * Process UTF-* by the book.
  * Ignore illegal encodings.
  * Right-to-left scripts have to go by bidi rules.

And I don't know what I think about this, except that there sure are a lot of screw-ups out there if it is truly as easy as they would have you believe:

  Given that any industrial-strength text and internationalization support API has to be able to handle sequences of characters, it makes little difference whether the string is internally represented by a sequence of [...] code units, or by a sequence of code-points [...]. Both UTF-16 and UTF-8 are designed to make working with substrings easy, by the fact that the sequence of code units for a given code point is unique.

Take this all with a grain of salt, since I found various typos in these FAQs, and occasionally also language that seems to reflect an older nomenclature than is now seen in the current published Unicode Standard, meaning 6.0.0. Probably best, then, to take only general directives from their FAQs and leave language-lawyering to the formal printed Standard, insofar as that is possible — which sometimes it is not, because they do make mistakes from time to time, and even less frequently, correct these. :)

--tom