[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

2008-10-21 Thread Dan Dibagh

Dan Dibagh <[EMAIL PROTECTED]> added the comment:

Your reasoning shows a lack of understanding of how Python is actually
used from a programmer's point of view.

Why do you think that "noticing" a problem is the same thing as entering
it as a Python bug report? In practice there are several steps between
noticing a problem in a Python program and entering it as a bug report
in the Python development system, and it is very difficult to see why
any of these steps would happen automatically. Believe me, people have
had real problems due to this bug. They have simply chosen solutions
other than reporting it.

You are yourself reluctant to seek out the root of this problem and fix
it. Why should other people behave differently and report it? A not so
uncommon "fix" for pickle problems out there is not to use pickle at
all. There are Python programmers who give the advice to avoid pickle
since "it's too shaky". That is a solution, but is it the solution you
desire?

The capability to serialize data into ASCII strings isn't just an
implementation detail that happens to be nice for human readability. It
is a feature people need for technical reasons. If the data is ASCII, it
can be handled in any ASCII-compatible context: network protocols, file
formats, database interfaces. That is the real use; programs depend on
it to work properly.
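
To make the point concrete, here is a minimal Python 2 session (my own
illustration of the report) showing a non-ASCII byte leaking straight
into a protocol 0 stream:

>>> import pickle
>>> s = pickle.dumps(u"\u0080", 0)
>>> [c for c in s if ord(c) > 127]
['\x80']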

The proposed solution of changing the documentation in practice breaks
compatibility (something programming language designers normally try to
avoid, or do in a very controlled manner). How is a documentation fix
going to help all the code out there written with the assumption that
pickle protocol 0 is always ASCII? Is there a better solution available
than changing pickle to meet actual expectations?
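
For instance, code along the following lines exists in the wild; the
function name is my own invention, but the pattern of treating protocol
0 output as text is exactly what the documentation invites:

import pickle

def pickle_for_ascii_channel(obj):
    # Protocol 0 is documented as ASCII, so the result should pass
    # through any ASCII-clean transport unchanged.
    s = pickle.dumps(obj, 0)
    return s.decode("ascii")  # UnicodeDecodeError for u"\u0080"

A documentation fix does nothing for such code; it merely reclassifies
it from correct to broken.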

Well, nobody has reported it as a bug in 8 years. How long do you think
code based on the ASCII assumption will stay around? 8 years? 16 years?
24 years? Maybe all the time in the world for this to become an issue
again and again and again?

It is difficult to grasp why there is "no way to fix it now". From a
programmer's point of view an obvious "fix" is to ditch pickle and use
something that delivers a consistent result, rather than spending hours
debugging. When I try to see it from the Python library developer's
point of view, I see code implemented in C which produces a result with
reasonable performance. It is perfectly possible to write code which
implements the expected result with reasonable performance. What is the
problem?

Perhaps it is the raw-unicode-escape encoding that should be fixed? I
have failed to find exact information about what raw-unicode-escape
means. In particular, where is the information which states that
raw-unicode-escape is always an 8-bit format? The closest I've come is
PEP 100 and PEP 263 (which I notice were written by you guys), which
describe how to decode raw unicode escape strings from Python source
and how to define encoding declarations for Python source code. The
sole original purpose of both unicode-escape and raw-unicode-escape
appears to be representing unicode strings in Python source code, as
u'' and ur'' strings respectively. It is clear that the decoding of a
raw-unicode-escaped or unicode-escaped string depends on the actual
encoding of the Python source, but where is the logic that says that
when something is _encoded_ into a raw unicode escape string, the
target source must be in some 8-bit encoding? Especially considering
that the default Python source encoding is ASCII. For unicode-escape
this makes sense:

>>> f = file("test.py", "wb")
>>> f.write('s = u"%s"\n' % u"\u0080".encode("unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)

But for raw-unicode-escape the outcome is a different thing:

>>> f = file("test.py", "wb")
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py

  File "test.py", line 1
SyntaxError: Non-ASCII character '\x80' in file test.py on line 1, but
no encoding declared; see http://www.python.org/peps/pep-0263.html for
details

Huh? For someone who trusts the Standard Encodings section of the
Python Library Reference this isn't what one would expect. If the
documentation states "Produce a string that is suitable as raw Unicode
literal in Python source code", then why isn't it suitable?
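
Interestingly, the second example starts working if the generated file
declares a Latin-1 source encoding, which suggests (my inference; the
documentation says nothing of the kind) that the encoder silently
assumes a Latin-1 target:

>>> f = file("test.py", "wb")
>>> f.write('# -*- coding: latin-1 -*-\n')
>>> f.write('s = ur"%s"\n' % u"\u0080".encode("raw-unicode-escape"))
>>> f.close()
>>> ^Z

python test.py (executes silently without errors)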





[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

2008-10-21 Thread Dan Dibagh

Dan Dibagh <[EMAIL PROTECTED]> added the comment:

I am well aware of why my example produces an error from a technical
standpoint. What I'm getting at is the decision to implement
PyUnicode_EncodeRawUnicodeEscape the way it is. Probably there is
nothing wrong with it, but how am I supposed to know? I read the PEPs,
which serve as a specification of raw unicode escape (at least for the
decoding part), and the reference documentation. Then I read the
source, trying to map between the behavior specified in the
documentation and the implementation in the source code. When it comes
to the part which causes the problem with non-ASCII characters, it is
difficult to follow.

Or in other words: what is the high-level reason why the codec won't
escape \x80 in my test program?
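
For comparison, this is the behavior I am asking about: the codec
escapes characters from U+0100 upward but copies everything below that
through untouched:

>>> u"\u0080".encode("raw-unicode-escape")
'\x80'
>>> u"\u0100".encode("raw-unicode-escape")
'\\u0100'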

To use a real-world term: an interface specification, in this case the
pickle documentation, is the contract between the consumer of the
library and the provider of the library. If it states "ASCII", ASCII is
expected. If it doesn't state "for debugging only", it will be used for
non-debugging purposes. There isn't much you can do about that without
breaking the contract.

What makes you think that the problem cannot be fixed without changing
the existing pickle format 0?
 
Note that base64 is "a" common way to deal with binary data in ASCII
streams, rather than "the" common way. (But why should I care, when my
data is already supposed to be ASCII?)
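
If the suggestion amounts to something like the following sketch of
mine, then yes, the wrapped stream is ASCII, but only by paying for an
extra encoding layer that protocol 0 already promises to make
unnecessary:

>>> import pickle
>>> wrapped = pickle.dumps(u"\u0080", 0).encode("base64")
>>> max(ord(c) for c in wrapped) < 128
True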




[issue2980] Pickle stream for unicode object may contain non-ASCII characters.

2008-10-24 Thread Dan Dibagh

Dan Dibagh <[EMAIL PROTECTED]> added the comment:

> Which PEP specifically? PEP 263 only mentions the unicode-escape
> encoding in its problem statement, i.e. as a pre-existing thing.
> It doesn't specify it, nor does it give a rationale for why it behaves
> the way it does.

PEP 100 and PEP 263. What I looked for was a description of the
functional intention and a technical definition of raw unicode escape.
The term "raw" tends to have different meanings depending on the context
in which it appears. PEP 263 is of interest for the overall
understanding of the intention of raw unicode escape. If raw unicode
escape exists to convert Python source into unicode strings, then the
decoding of raw-unicode-escape strings depends on the source code
encoding; perhaps that would give an idea of what the encoding part is
supposed to do... PEP 100 is of interest for the technical description.
Its "Unicode Constructors" section serves as the definition.

> What code are you looking at, and where do you find it difficult to
> follow it? Maybe you get confused between the "unicode-escape" codec,
> and the "raw-unicode-escape" codec, also.

Since the issue I am looking at is non-ASCII characters in pickle
output, raw-unicode-escape is the codec in focus. For the decoding
part, the distinction between unicode-escape and raw-unicode-escape is
very clear.

I am looking at the function PyUnicode_EncodeRawUnicodeEscape in
Objects/unicodeobject.c. At the point of the comment "/* Copy everything
else as-is */", given the perceived intention of the encoding, I try to
figure out why there isn't a "/* Map non-printable US ASCII to
'\xhh' */" section like the one in the unicodeescape_string function.
The background in older Pythons that you explained is essentially what
I guessed.
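
For discussion, my reading of that C loop amounts to the following
pure-Python paraphrase (an approximation of mine, not a verbatim
translation of Objects/unicodeobject.c):

def raw_unicode_escape(u):
    # Paraphrase of PyUnicode_EncodeRawUnicodeEscape as I read it.
    chunks = []
    for ch in u:
        code = ord(ch)
        if code >= 0x10000:
            chunks.append('\\U%08x' % code)  # non-BMP characters
        elif code >= 0x100:
            chunks.append('\\u%04x' % code)  # map to '\uxxxx'
        else:
            # This is the "/* Copy everything else as-is */" branch.
            # It passes \x80-\xff through raw, which is exactly where
            # an ASCII-safe variant would escape to '\u00xx' instead.
            chunks.append(chr(code))
    return ''.join(chunks)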

> The raw-unicode-escape codec? It was designed to support parsing of
> Python 2.0 source code, and of "raw" unicode strings (ur"") in
> particular. In Python 2.0, you only needed to escape characters above
> U+0100; Latin-1 characters didn't need escaping. Python, itself, only
> relied on the decoding direction. That the codec chooses not to escape
> Latin-1 characters on encoding is an arbitrary choice (I guess); it's
> still symmetric with decoding.

I suppose you mean symmetric with decoding as long as you stick to the
Latin-1 character set, since raw unicode escaping isn't a one-to-one
mapping.
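
To spell out the non-canonical part, two distinct byte strings decode
to the same unicode string, while the encoder only round-trips because
its output happens to stay within Latin-1:

>>> "A".decode("raw-unicode-escape") == "\\u0041".decode("raw-unicode-escape")
True
>>> u"\u0080".encode("raw-unicode-escape").decode("raw-unicode-escape")
u'\x80'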

When PEP 263 came into the picture, wouldn't it have made sense to
change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output, or
perhaps output conforming to the current default encoding? Given the
intention of raw unicode escape, encoding something with it means
producing Python source code. But its output is Latin-1, while the rest
of Python has moved on to ASCII by default, or to whatever encoding is
declared in the source. I tried to shine a light on that problem in my
previous example.

> Even though the choice was arbitrary, you shouldn't change it now,
> because people may rely on how this codec works.

> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle, 
> even the old pickle readers of Python 2.0..2.6 might break if the
> format changes in 2.7 - we would have to go back and check that they don't
> break (although I do believe that they would work fine).

Then let me ask: how far-reaching is the aim to maintain compatibility
with programs which depend on Python internals? Even if the internal
thing is a bug, and the thing which depends on the bug is also a bug?
Maybe it is a provocative question; let me explain. The question applies
to some extent to the workings of the codec, but it is really the
pickle problem I am thinking of. In the case of older Python releases,
it is just a matter of testing, as you say. It is boring and perhaps
tedious, but there is nothing special which prevents it from being
done. If there are many versions, there ought to be a way to write a
program which tests them automatically.

In the case of those who have implemented their own pickle readers, the
source and the comments in pickletools.py clearly state that unicode
strings are raw-unicode-escaped in format 0. Now, raw unicode escape
isn't a canonical format. The letter A can be represented either as
\u0041 or as itself, A. If a hypothetical implementor gets the idea
that characters in the range 0-255 cannot be represented by \u00xx
sequences, then the fact that pickle replaces \ with \u005c and \n with
\u000a should give a hint that he is wrong. So if characters in the
range 128-255 get escaped with \u00xx, any pickle reader should handle
it. I've tried to come up with some sensible way to write a pickle
implementation