[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-06 Thread zy

New submission from zy:

Let s = '\xff\n'.
The expected result of s.decode('gb2312', 'ignore') is u"\n", but in 2.6.6 it 
is u"".
  s can be replaced with chr(m) + chr(n), where m is in the range 128~255 and 
n is in the range 0~127.
  In these cases, decoding chr(n) on its own can never interfere with later 
parts of the string, if there are any, since chr(n) does not start a multibyte 
sequence.
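
The expected behavior can be sketched in pure Python. This is only an illustration of the proposal, not the codec's actual implementation: the helper name decode_ignore_resync is hypothetical, it is written in Python 3 syntax (bytes literals) rather than the 2.6 syntax used above, and it assumes that every gb2312 multibyte sequence is exactly two bytes long.

```python
def decode_ignore_resync(data, encoding="gb2312"):
    """Hypothetical helper: decode `data`, dropping only the first byte
    of each invalid sequence, as proposed in this report."""
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # ASCII byte: it never starts a multibyte sequence.
            out.append(chr(data[i]))
            i += 1
            continue
        # Assumption: a non-ASCII lead byte starts a two-byte sequence.
        pair = data[i:i + 2]
        try:
            out.append(pair.decode(encoding, "strict"))
            i += 2
        except UnicodeDecodeError:
            # Invalid or truncated sequence: skip the lead byte only,
            # so the following byte gets a chance to be decoded.
            i += 1
    return "".join(out)


result = decode_ignore_resync(b"\xff\n")
print(repr(result))
```

With this sketch the '\n' survives, because only the '\xff' is dropped before decoding resumes.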

--
components: Unicode
messages: 135268
nosy: cdqzzy
priority: normal
severity: normal
status: open
title: Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')
type: behavior
versions: Python 2.6

___
Python tracker 
<http://bugs.python.org/issue12016>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread zy

zy added the comment:

> So the correct result for b'\xff\n'.decode('gb2312', 'replace') is u'?\n'?

I think so. This behavior does not discard potentially useful information and 
has no side effects on later decoding, and even if the '\n' is indeed 
redundant, an output of u'?\n' is unlikely to cause confusion.

That said, I have no knowledge of how the codec is implemented. If changing the 
behavior would hurt performance, maybe the change should not be made.
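
For illustration, the 'replace' variant of this behavior can be approximated with a custom error handler. This is a sketch, not the built-in handler: the name "replace_first_byte" is made up for this example, the code uses Python 3 syntax, and it assumes the '?' in the quoted result stands for the replacement character U+FFFD.

```python
import codecs


def replace_first_byte(exc):
    """Hypothetical error handler: emit one replacement character and
    resume decoding right after the first offending byte."""
    if isinstance(exc, UnicodeDecodeError):
        # Resume at start + 1 regardless of how many bytes the codec
        # reported as invalid, so a trailing ASCII byte is kept.
        return ("\ufffd", exc.start + 1)
    raise exc


# Register under a made-up name; it does not override 'replace'.
codecs.register_error("replace_first_byte", replace_first_byte)

decoded = b"\xff\n".decode("gb2312", "replace_first_byte")
print(repr(decoded))
```

Here the invalid '\xff' becomes a single replacement character and the newline that follows it is preserved.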

--




[issue12016] Wrong behavior for '\xff\n'.decode('gb2312', 'ignore')

2011-05-07 Thread zy

zy added the comment:

I do not have documentation on this subject. However, I found that GNU iconv(1) 
behaves the way I proposed. My reading of its source code suggests that 
iconv(1) treats all encodings the same way, which I think should also be true 
for Python.

As for security concerns, I do not think the change to the decoding function 
itself would introduce any security vulnerabilities. If a security issue did 
arise from the proposed change, it would be due to improper code outside of 
Python, which is beyond Python's control. In any case, the proposed change is 
unlikely to introduce a new vulnerability, since all it does in effect is keep 
a few ASCII characters in the output instead of removing them. In the 
WordPress issue, if we suppose that WordPress were written in Python and that 
the attacker used gb2312-encoded strings instead of gbk, then my proposed 
change would by chance fix the issue, since the backslash would be retained 
when the string is decoded.
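
A rough sketch of the backslash point, in Python 3 syntax. It assumes the WordPress issue in question is the familiar multibyte escaping problem in which a lead byte absorbs an escaping backslash. The payload bytes and the handler name "skip_first_byte" are invented for this example and are not taken from the WordPress report.

```python
import codecs


def skip_first_byte(exc):
    """Hypothetical error handler: drop only the first offending byte."""
    if isinstance(exc, UnicodeDecodeError):
        return ("", exc.start + 1)
    raise exc


codecs.register_error("skip_first_byte", skip_first_byte)

# Made-up input: a stray lead byte followed by an escaped quote.
payload = b"\xbf\\' OR 1=1"

# In gbk, 0xbf 0x5c is a valid two-byte character, so the backslash
# is consumed as a trail byte and no longer escapes the quote.
as_gbk = payload.decode("gbk")

# In gb2312, 0xbf 0x5c is invalid. Dropping only 0xbf keeps the
# backslash in the output.
as_gb2312 = payload.decode("gb2312", "skip_first_byte")

print("\\" in as_gbk)
print("\\" in as_gb2312)
```

Under the proposed behavior the gb2312 result still contains the backslash, whereas dropping both bytes of the invalid pair would remove it.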

--
