[issue2980] Pickle stream for unicode object may contain non-ASCII characters.
Dan Dibagh <[EMAIL PROTECTED]> added the comment:

Your reasoning shows a lack of understanding of how Python is actually used, from a programmer's point of view. Why do you think that "noticing" a problem is the same thing as entering it as a Python bug report? In practice there are several steps between noticing a problem in a Python program and entering it as a bug report in the Python development system. It is very difficult to see why any of these steps would happen automatically. Believe me, people have had real problems due to this bug. They have just selected other solutions than reporting it. You are yourself reluctant to seek out the roots of this problem and fix it. Why should other people behave differently and report it?

A not-so-uncommon "fix" to pickle problems out there is to not use pickle at all. There are Python programmers who give the advice to avoid pickle since "it's too shaky". It is a solution, but is it the solution you desire?

The capability to serialize stuff into ASCII strings isn't just an implementation detail that happens to be nice for human readability. It is a feature people need for technical reasons. If the data is ASCII, it can be dealt with in any ASCII-compatible context, which might be network protocols, file formats or database interfaces. There is the real use. Programs depend on it to work properly. The solution of changing the documentation is, in practice, breaking compatibility (which programming language designers normally try to avoid, or do in a very controlled manner). How is a documentation fix going to help all the code out there written with the assumption that pickle protocol 0 is always ASCII? Is there a better solution around than changing pickle to meet actual expectations? Well, nobody has reported it as a bug in 8 years. How long do you think that code will stay around based on the ASCII assumption? 8 years? 16 years? 24 years? Maybe all the time in the world for this to become an issue again and again and again?

It is difficult to grasp why there is "no way to fix it now". From a programmer's point of view an obvious "fix" is to ditch pickle and use something that delivers a consistent result, rather than spending hours debugging. When I try to see it from the Python library developer's point of view, I see code implemented in C which produces a result with reasonable performance. It is perfectly possible to write code which implements the expected result with reasonable performance. What is the problem?

Perhaps it is the raw-unicode-escape encoding that should be fixed? I failed to find exact information about what raw-unicode-escape means. In particular, where is the information which states that raw-unicode-escape is always an 8-bit format? The closest I've come is PEP 100 and PEP 263 (which I notice are written by you guys), which describe how to decode raw unicode escape strings from Python source and how to define encoding formats for Python source code. The sole original purpose of both unicode-escape and raw-unicode-escape appears to be representing unicode strings in Python source code as u'' and ur'' strings respectively. It is clear that the decoding of a raw unicode escaped or unicode escaped string depends on the actual encoding of the Python source, but where is the logic that says that when something is _encoded_ into a raw unicode escaped string, the target source must be in some 8-bit encoding? Especially considering that the default Python source encoding is ASCII.
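Concretely, the difference in question can be seen by encoding a single character with both codecs (a Python 2 session; the outputs are shown as the interpreter would repr them):

>>> u"\u0080".encode("unicode-escape")       # escapes to pure ASCII
'\\x80'
>>> u"\u0080".encode("raw-unicode-escape")   # passes through a raw non-ASCII byte
'\x80'
>>> u"\u0100".encode("raw-unicode-escape")   # escaping starts only above U+00FF
'\\u0100'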
For unicode-escape, writing the encoded result into Python source makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

(executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

  File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

Huh? For someone who trusts the Standard Encodings section of the Python Library Reference, this isn't what one would expect. If the documentation states "Produce a string that is suitable as raw Unicode literal in Python source code", then why isn't it suitable?
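This is also exactly how the non-ASCII bytes end up in pickle streams, since protocol 0 raw-unicode-escapes unicode objects (per the comments in pickletools.py). A minimal Python 2 sketch; the exact memo opcodes can differ between the pickle and cPickle modules:

>>> import pickle
>>> pickle.dumps(u"\u0080", 0)     # protocol 0, documented as ASCII
'V\x80\np0\n.'
>>> any(ord(c) > 127 for c in pickle.dumps(u"\u0080", 0))
True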
[issue2980] Pickle stream for unicode object may contain non-ASCII characters.
Dan Dibagh <[EMAIL PROTECTED]> added the comment:

I am well aware why my example produces an error from a technical standpoint. What I'm getting at is the decision to implement PyUnicode_EncodeRawUnicodeEscape the way it is. Probably there is nothing wrong with it, but how am I supposed to know? I read the PEP, which serves as a specification of raw unicode escape (at least for the decoding bit), and the reference documentation. Then I read the source, trying to map between the behavior specified in the documentation and the implementation in the source code. When it comes to the part which causes the problem with non-ASCII characters, it is difficult to follow. Or in other words: what is the high-level reason why the codec won't escape \x80 in my test program?

To use a real-world term: an interface specification, in this case the pickle documentation, is the contract between the consumer of the library and the provider of the library. If it states "ASCII", ASCII is expected. If it doesn't state "for debugging only", it will be used for non-debugging purposes. There isn't much you can do about it without breaking the contract.

What makes you think that the problem cannot be fixed without changing the existing pickle format 0? Note that base64 is "a" common way to deal with binary data in ASCII streams, rather than "the" common way. (But why should I care when my data is already ASCII?)
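Returning to the question above about why the codec won't escape \x80: a simplified Python model of what the encoder in Objects/unicodeobject.c appears to do is sketched below. This is an approximation for illustration only (BMP-only, ignoring the \UXXXXXXXX case on wide builds), not the actual implementation:

def raw_unicode_escape_model(u):
    # Escape only code points above U+00FF; everything below is copied
    # through unchanged ("/* Copy everything else as-is */"), which is why
    # 0x80-0xFF come out as raw non-ASCII bytes instead of being mapped
    # to '\xhh' the way unicodeescape_string does.
    out = []
    for ch in u:
        if ord(ch) >= 0x100:
            out.append('\\u%04x' % ord(ch))
        else:
            out.append(chr(ord(ch)))
    return ''.join(out)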
[issue2980] Pickle stream for unicode object may contain non-ASCII characters.
Dan Dibagh <[EMAIL PROTECTED]> added the comment:

> Which PEP specifically? PEP 263 only mentions the unicode-escape
> encoding in its problem statement, i.e. as a pre-existing thing.
> It doesn't specify it, nor does it give a rationale for why it behaves
> the way it does.

PEP 100 and PEP 263. What I looked for was a description of the functional intention and a technical definition of raw unicode escape. The term "raw" tends to have different meanings depending on the context in which it appears. PEP 263 is of interest for the overall understanding of the intention of raw unicode escape. If raw unicode escape is meant to convert from Python source into unicode strings, then the decoding of raw unicode escaped strings depends on the source code encoding. Then perhaps it would give an idea what the encoding part is supposed to do... PEP 100 is of interest for the technical description; its "Unicode Constructors" section serves as the definition.

> What code are you looking at, and where do you find it difficult to
> follow it? Maybe you get confused between the "unicode-escape" codec,
> and the "raw-unicode-escape" codec, also.

Since it is the issue of non-ASCII characters in pickle output that I am looking at, it is raw-unicode-escape that is in focus. For the decoding bit the distinction between unicode-escape and raw-unicode-escape is very clear. I am looking at the function PyUnicode_EncodeRawUnicodeEscape in Objects/unicodeobject.c. At the point of the comment "/* Copy everything else as-is */", given the perceived intention of the encoding type, I try to figure out why there isn't a "/* Map non-printable US ASCII to '\xhh' */" section like in the unicodeescape_string function. The background in older Pythons you explained is essentially what I guessed.

> The raw-unicode-escape codec? It was designed to support parsing of
> Python 2.0 source code, and of "raw" unicode strings (ur"") in
> particular. In Python 2.0, you only needed to escape characters above
> U+0100; Latin-1 characters didn't need escaping. Python, itself, only
> relied on the decoding direction. That the codec chooses not to escape
> Latin-1 characters on encoding is an arbitrary choice (I guess); it's
> still symmetric with decoding.

I suppose you mean symmetric with decoding as long as you stick to the Latin-1 character set, as raw unicode escaping isn't a one-to-one mapping. When PEP 263 came into the picture, wouldn't it have made sense to change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output, or perhaps output conforming to the current default encoding? Given the intention of raw unicode escape, encoding something with it means producing Python source code. But it is in Latin-1, while the rest of Python has moved on to use ASCII by default, or whatever is configured in the source. I tried to shed light on that problem in my previous example.

> Even though the choice was arbitrary, you shouldn't change it now,
> because people may rely on how this codec works.

> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle,
> even the old pickle readers of Python 2.0..2.6 might break if the
> format changes in 2.7 - we would have to go back and check that they
> don't break (although I do believe that they would work fine).

Then let me ask: How far-reaching is the aim to maintain compatibility with programs which depend on Python internals?
Even if the internal thing is a bug, and the thing which depends on the bug is also a bug? Maybe it is a provoking question, so let me explain. The question applies to some extent to the workings of the codec, but it is really the pickle problem I am thinking of.

In the case of older Python releases, it is just a matter of testing, just as you say. It is boring and perhaps tedious, but there is nothing special which prevents it from being done. If there are many versions, there ought to be a way to write a program which does it automatically.

In the case of those who have implemented their own pickle readers, the source and the comments in pickletools.py clearly state that unicode strings are raw unicode escaped in format 0. Now, raw unicode escape isn't a canonical format. The letter A can be represented either as \u0041 or as itself, A. If a hypothetical implementor gets the idea that characters in the range 0-255 cannot be represented by \u00xx sequences, then the fact that pickle replaces \ with \u005c and \n with \u000a should give a hint that he is wrong. So if characters in the range 128-255 get escaped with \u00xx, any pickle reader should handle it. I've tried to come up with some sensible way to write a pickle implementation
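That claim can be checked against the stock reader directly (a Python 2 session; the streams below are hand-written protocol 0 pickles):

>>> import pickle
>>> pickle.loads('V\x80\np0\n.')      # today's output: a raw non-ASCII byte
u'\x80'
>>> pickle.loads('V\\u0080\np0\n.')   # the same value, fully escaped
u'\x80'

Both spellings load to the same object, which is consistent with the argument that a writer escaping the 128-255 range would remain readable by existing protocol 0 readers.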