New submission from Wim:

Encoding a (well-formed) Unicode string containing a non-BMP character, using 
the xmlcharrefreplace error handler, will produce two XML entities for 
surrogate codepoints instead of one entity for the actual character.

Here's a transcript (Python 2.7.3, x86_64):


  >>> b = '\xf0\x9f\x92\x9d'
  >>> u = b.decode('utf8')
  >>> u
  u'\U0001f49d'
  >>> u.encode('ascii', errors='xmlcharrefreplace')
  '&#55357;&#56477;'
  >>> ( u'\U0001f49d' ).encode('ascii', errors='xmlcharrefreplace')
  '&#55357;&#56477;'
  >>> list(u)
  [u'\ud83d', u'\udc9d']
  >>> u.encode('utf8', errors='xmlcharrefreplace')
  '\xf0\x9f\x92\x9d'

The UTF-8 bytestring is correctly decoded, and the repr shows one single 
Unicode character. Encoding using xmlcharrefreplace produces two XML 
entities, which is wrong[1]: a single non-BMP character should be represented 
in XML as a single entity reference, in this case presumably '&#128157;'.
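For reference, the arithmetic behind the expected value can be sketched as 
follows. This is only an illustration of the standard UTF-16 surrogate-pair 
formula, not the codec's implementation; the helper names combine_surrogates 
and xmlcharref are hypothetical.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def xmlcharref(code_point):
    """Format a code point as a decimal numeric character reference."""
    return "&#%d;" % code_point


# The two code units reported by list(u) in the transcript.
code_point = combine_surrogates(0xD83D, 0xDC9D)
reference = xmlcharref(code_point)
print("%s -> %s" % (hex(code_point), reference))  # 0x1f49d -> &#128157;
```

Formatting the two surrogates separately, as the error handler currently 
does, gives '&#55357;&#56477;' instead.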

As the list(u) output shows, I'm using a narrow build (so the unicode strings 
are represented internally in UTF-16, I guess). Converting the string back to 
UTF-8 does the right thing, and emits a single UTF-8 sequence representing the 
supplementary-plane codepoint.
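The relationship between the two encodings can be checked independently of 
the build type. The sketch below assumes only the standard utf-16-be and 
utf-8 codecs and the struct module; it looks at the encoded bytes rather than 
at the interpreter's internal string representation.

```python
import struct

ch = u"\U0001f49d"

# UTF-16 (big-endian, no BOM): the character takes two 16-bit code units.
utf16 = ch.encode("utf-16-be")
code_units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)

# UTF-8: the same character is one four-byte sequence.
utf8 = ch.encode("utf-8")

print([hex(unit) for unit in code_units])
print(len(utf8))
```

The code units match the surrogates that list(u) reports on a narrow build, 
and the UTF-8 bytes match the original bytestring b.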

(FWIW, the backslashreplace error handler also emits a surrogate pair, but I 
don't know if there is a complete specification for what that handler does, so 
it's possible that it's not wrong.)

[1] http://www.w3.org/International/questions/qa-escapes#bytheway

----------
components: Library (Lib), Unicode
messages: 169886
nosy: ezio.melotti, wiml
priority: normal
severity: normal
status: open
title: encode(..., 'xmlcharrefreplace') produces entities for surrogate pairs
type: behavior
versions: Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue15866>
_______________________________________