unicode string alteration

BAvant Garde Thu, 12 Aug 2010 08:47:05 -0700

HELP!!!
I need help with a unicode issue that has me stumped. I must be doing 
something wrong, because I don't believe this condition would have slipped 
through testing.


Wherever the string u'\udbff\udc00' occurs, u'\U0010fc00' (that is, 
unichr(1113088)) is substituted and the file loses 1 character, so that all 
trailing characters are shifted out of position. No other corrupt strings have 
been detected.
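
For reference, the substituted value matches the standard UTF-16 surrogate-pair formula: u'\udbff' is a high (lead) surrogate, u'\udc00' is a low (trail) surrogate, and combining the two gives exactly 1113088 (0x10FC00). A minimal sketch of that arithmetic follows. It is written for Python 3 rather than the Python 2 sessions in this post, and the function name combine_surrogates is made up for illustration.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 high/low surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xDBFF, 0xDC00)
print(code_point, hex(code_point))  # 1113088 0x10fc00
```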
    
The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04 where 
the maximum ord # is 1114111 (wide Python build).
    
Using Python 2.5.4 on Windows-ME, where the maximum ord # is 65535 (narrow 
Python build), the string u'\U0010fc00' also occurs and it "seems" that the 
substitution takes place, but no characters are lost and file sizes are ok. 
Note that ord(u'\U0010fc00') causes the following error: 
    "TypeError: ord() expected a character, but string of length 2 found"
The condition is otherwise invisible in 2.5.4 and is handled internally without 
any apparent effect on processing, with characters u'\udbff' and u'\udc00' each 
being separately accessible. 
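
The ord() error is consistent with how a narrow build stores characters above 65535, namely as two UTF-16 code units. The sketch below is Python 3, where the narrow/wide distinction no longer exists, so it encodes to UTF-16 to expose the code units. utf16_units is a hypothetical helper name, not something from the standard library.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


units = utf16_units("\U0010fc00")
print([hex(u) for u in units])        # ['0xdbff', '0xdc00']
print(len("\U0010fc00"), len(units))  # 1 2
```

One character on a wide build corresponds to two units on a narrow build, which would explain why u'\udbff' and u'\udc00' stay separately accessible there.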

The first part of the attachment repeats this email but also has examples and 
illustrates other related oddities.
    
Any help would be greatly appreciated.
Bruce

Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "copyright", "credits" or "license()" for more information.

IDLE 2.6.5      
>>> ===================================== RESTART =====================================
>>> 

The script output in the next section, with data = map(unichr, xrange(56317, 56323)),
reveals a condition where a unicode string is corrupted when it is created.

In Python 2.5.4/win32 (ME) the system substitutes U0010fc00 but is otherwise ok,
as the system handles these events internally without any other known effect.

NOTE:
In Python 2.6.5/linux2 (Ubuntu 10.04) the system may not substitute immediately,
but it returns different lengths depending on how the length is requested:
len(u'\udbff\udc00') returns a shorter length than u2 = u'\udbff\udc00'; len(u2).
In those cases the variable reference method (len(u2)) is more accurate,
even though the variable value and the string value are identical.
I have evidence of this string causing a file to lose 1 character each time
the string occurs on write to file. As a test I reversed the string
sequence and used the same commands to write to file, and everything was fine.
I think the problem is confined to strings "u'\udbff\udc00'" & "u'\U0010fc00'".
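
If the substitution follows the surrogate-pair rule (an assumption, not something established in this post), then any high surrogate (0xD800-0xDBFF) directly followed by a low surrogate (0xDC00-0xDFFF) would be a candidate, not only this one string. A small Python 3 sketch for locating such adjacent pairs in test data; find_surrogate_pairs is an illustrative name:

```python
def find_surrogate_pairs(text):
    """Return indexes where a high surrogate is directly followed by a low one."""
    return [
        i
        for i in range(len(text) - 1)
        if "\ud800" <= text[i] <= "\udbff" and "\udc00" <= text[i + 1] <= "\udfff"
    ]


data = "".join(map(chr, range(56317, 56323)))  # same ordinals as the test data
print(find_surrogate_pairs(data))        # [2]
print(find_surrogate_pairs(data[::-1]))  # []
```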

Any help would be greatly appreciated.
Bruce
================================================================================
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2

u = map(unichr, xrange(56317, 56323))
[u'\udbfd', u'\udbfe', u'\udbff', u'\udc00', u'\udc01', u'\udc02'] .. length 6, ords(56317-56322)

  #-------------------------------------------------------------------------#
  # EXHIBIT 1:                                                              #
  # The following commands were used to write to file and then to read back #
  #   to illustrate the effect of the substitute string on a file's size at #
  #   least in Python 2.6.5 linux2(ubuntu 10.04), maybe other combinations. #
  #                                                                         #
  #     import codecs                                                       #
  #     u = map(unichr, xrange(56317, 56323))                               #
  #     dw = ''.join(u)                                                     #
  #     f = codecs.open(fn, 'wb', 'utf8'); f.write(dw); f.close()           #
  #     f = codecs.open(fn, 'rb', 'utf8'); dr = f.read(); f.close()         #
  #                                                                         #
  #     u.reverse()                                                         #
  #     dw = ''.join(u); u.reverse()                                        #
  #     f = codecs.open(fn, 'wb', 'utf8'); f.write(dw); f.close()           #
  #     f = codecs.open(fn, 'rb', 'utf8'); dr = f.read(); f.close()         #
  #-------------------------------------------------------------------------#

u'\udbfd\udbfe\udbff\udc00\udc01\udc02' .. .writelength 6, ords(56317-56322)
u'\udbfd\udbfe\U0010fc00\udc01\udc02' .. .read.length 5, ords(56317-1113088)

...(same data but in reverse order)...
u'\udc02\udc01\udc00\udbff\udbfe\udbfd' .. .writelength 6, ords(56317-56322)
u'\udc02\udc01\udc00\udbff\udbfe\udbfd' .. .read.length 6, ords(56317-56322)
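
The forward/reverse difference is what one would expect if some step in the write/read path treats a high surrogate immediately followed by a low surrogate as a single character. Which step does that in Python 2.6.5 is not established here, so treat this as an assumption. The Python 3 sketch below mimics that rule on plain integers; merge_surrogate_pairs is an illustrative name, not a library function.

```python
def merge_surrogate_pairs(units):
    """Merge each high surrogate directly followed by a low surrogate
    into one code point; leave every other value unchanged."""
    out = []
    i = 0
    while i < len(units):
        cur = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= cur <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((cur - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2  # the pair collapses into a single element
        else:
            out.append(cur)
            i += 1
    return out


forward = list(range(56317, 56323))
backward = forward[::-1]
merged_forward = merge_surrogate_pairs(forward)
merged_backward = merge_surrogate_pairs(backward)
print(len(merged_forward), merged_forward)  # 5 [56317, 56318, 1113088, 56321, 56322]
print(len(merged_backward))                 # 6
```

In the reversed data the low surrogates come first, so no high surrogate is ever directly followed by a low one and nothing is merged.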

  #-------------------------------------------------------------------------#
  # EXHIBIT 2a:                                                             #
  #       sizes obtained using len(variablename)...len(u2) or len(u3).      #
  #                           (always accurate)                             #
  #                                                                         #
  # (<--) indicates where u'\udbff\udc00' is/was part of the string.        #
  #-------------------------------------------------------------------------#

u2 = u'\udbfd\udbfe'       len(u2) returns 2
u3 = u'\udbfd\udbfe\udbff' len(u3) returns 3

u2 = u'\udbfe\udbff'       len(u2) returns 2
u3 = u'\udbfe\udbff\udc00' len(u3) returns 3<--

u2 = u'\udbff\udc00'       len(u2) returns 2<--
u3 = u'\udbff\udc00\udc01' len(u3) returns 3<--

u2 = u'\udc00\udc01'       len(u2) returns 2
u3 = u'\udc00\udc01\udc02' len(u3) returns 3

  #-------------------------------------------------------------------------#
  # EXHIBIT 2b:                                                             #
  # sizes obtained using len('string')...("u'\U0010fc00'" in Python 2.5.4). #
  #       (This is where size discrepancies show up in Python 2.6.5)        #
  #                                                                         #
  # (<--) indicates where u'\udbff\udc00' is/was part of the string.        #
  #-------------------------------------------------------------------------#

len(u'\udbfd\udbfe')       returns 2   
len(u'\udbfd\udbfe\udbff') returns 3   

len(u'\udbfe\udbff')       returns 2   
len(u'\udbfe\udbff\udc00') returns 2<--

len(u'\udbff\udc00')       returns 1<--
len(u'\udbff\udc00\udc01') returns 2<--

len(u'\udc00\udc01')       returns 2   
len(u'\udc00\udc01\udc02') returns 3   

.. for `shellview` (Cut & Paste) commands, press <q>
>>>
