Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:
 > Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
 > > "Martin v. Löwis" writes:
 > > 
 > >  > The term "UCS-2" is a character set that can encode only 65536
 > >  > characters; it thus refers to Unicode 1.1. According to the Unicode
 > >  > Consortium's FAQ, the term UCS-2 should be avoided these days.
 > > 
 > > So what do you propose we call the Python implementation?
 > 
 > A technically correct description would be to say that Python uses either
 > 16-bit code units or 32-bit code units; for brevity, these can be called
 > narrow and wide code units.

I agree that's technically correct.  Unfortunately, it's also useless
to anybody who doesn't already know more about Unicode than anybody
should have to know.

 > > and therefore is not UTF-16 conforming.
 > 
 > I disagree. Python does "conform" to "UTF-16"

I'm sure the codecs do.  But the Unicode standard doesn't care about
the parts of the process, it cares about what it does as a whole.
Python's internal coding does not conform to UTF-16, and that internal
coding can, under certain conditions, escape to the outside world as
invalid "Unicode" output.

 > > AFAIK this was not supposed to change in Python 3; indexing and
 > > slicing go by code unit (isomorphic to UCS-n), not character, and due
 > > to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
 > > can produce output that conforms to Unicode not at all (as a user
 > > option, of course, but it's still non-conformant).
 > 
 > What behavior specifically do you consider non-conforming, and what
 > specific specification do you think it is not conforming to? For
 > example, it *is* fully conforming with UTF-8.

Oh,

f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
f.write(chr(int('dc80',16)))
f.close()

for one.  That produces a non-UTF-8 file in a 32-bit-code-unit build.
You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.
Nevertheless, the program is able to produce output from internal
"Unicode" strings that does not conform to Unicode at all.  A Unicode-
conforming Python implementation would error at the chr() call, or
perhaps would not provide surrogateescape error handlers.
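
The same effect can be reproduced in memory, without writing to /tmp. This is a minimal sketch, assuming a current CPython 3; the variable names are illustrative only.

```python
# A lone low surrogate, as produced by chr(int('dc80', 16)) above.
s = chr(0xDC80)

# With the surrogateescape handler, the encoder emits the raw byte 0x80.
data = s.encode('utf-8', errors='surrogateescape')

# A strict UTF-8 decoder rejects that byte sequence.
try:
    data.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(data, is_valid_utf8)
```

The output bytes are whatever the escaped surrogate stood for, so nothing guarantees they form well-formed UTF-8.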

It is, of course, possible to write Python programs that conform (and
easier than in any other language I know), but Python itself does not
conform to post-1.1 Unicode standards.  Too bad for the standards:
"Although practicality beats purity."

The point is that internal code is *not* UTF-16 (or -32), but it *is*
isomorphic to UCS-2 (or -4).  *That is very useful information to
users*, it's not a technical detail of interest only to Unicode geeks.
It means that if you stick to defined characters in the BMP when
giving Python input, then slicing and indexing unicode (Python 2) or
str (Python 3) objects gives only valid output even in builds with
16-bit code units.  OTOH, invalid processing (involving functions like
'chr' or input using surrogateescape codecs) can lead to invalid
output even in builds with 32-bit code units.

IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
they need to know about the limitations of their Python vis-a-vis full
conformance, at least with respect to the string manipulation functions.
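
To make the slicing claim concrete, here is a rough emulation of what a 16-bit build stores. It assumes Python 3.3 or later, where len() and slicing on str already work on code points, so the narrow behaviour has to be simulated on a list of UTF-16 code units. The helper name to_code_units is hypothetical.

```python
import struct

def to_code_units(s):
    """Return the 16-bit code units a narrow build would store for s."""
    raw = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(raw) // 2), raw))

bmp = to_code_units('caf\u00e9')          # BMP only: one unit per character
astral = to_code_units('a\U00010000b')    # U+10000 needs a surrogate pair

# Slicing between the two halves of the pair leaves a lone high surrogate.
head = astral[:2]
print(len(bmp), len(astral), [hex(u) for u in head])
```

With BMP-only input every slice is valid; with the non-BMP string, a cut at index 2 strands half of a pair.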
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4

2010-11-21 Thread R. David Murray
On Sat, 20 Nov 2010 23:52:45 -0800, Glenn Linderman wrote:
> Sadly, cgi.py input handling seems to depend on the email module, 
> thought to be fixed for 3.2, but it is not clear if that has been 
> achieved, or if the surrogate encode workaround is sufficient for this.  
> More testing needed, but I don't have such a test case developed yet.

Indeed, this should theoretically be fixable now.  The email module
is now perfectly capable of both consuming and producing binary data.
The user of the module doesn't need to care how this was achieved unless
they want to do processing of non-RFC conformant data.

I want to look at the CGI issue, but I'm not sure when I'll get to it.

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] Mercurial Schedule

2010-11-21 Thread Jesus Cea
What is the impact on the buildbot architecture? Do the slaves need to do
anything? At least they need to have Mercurial installed, I guess.

What, as a buildslave manager, must I do to ready my server for the
migration?

- -- 
Jesus Cea Avion _/_/  _/_/_/_/_/_/
[email protected] - http://www.jcea.es/ _/_/_/_/  _/_/_/_/  _/_/
jabber / xmpp:[email protected] _/_/_/_/  _/_/_/_/_/
.  _/_/  _/_/_/_/  _/_/  _/_/
"Things are not so easy"  _/_/  _/_/_/_/  _/_/_/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/_/_/_/  _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread R. David Murray
On Sun, 21 Nov 2010 21:55:12 +0900, "Stephen J. Turnbull" wrote:
> "Martin v. Löwis" writes:
>  > Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
>  > > "Martin v. Löwis" writes:
>  > >
>  > >  > The term "UCS-2" is a character set that can encode only 65536
>  > >  > characters; it thus refers to Unicode 1.1. According to the Unicode
>  > >  > Consortium's FAQ, the term UCS-2 should be avoided these days.
>  > >
>  > > So what do you propose we call the Python implementation?
>  >
>  > A technically correct description would be to say that Python uses either
>  > 16-bit code units or 32-bit code units; for brevity, these can be called
>  > narrow and wide code units.
> 
> I agree that's technically correct.  Unfortunately, it's also useless
> to anybody who doesn't already know more about Unicode than anybody
> should have to know.

[...]
 
> The point is that internal code is *not* UTF-16 (or -32), but it *is*
> isomorphic to UCS-2 (or -4).  *That is very useful information to
> users*, it's not a technical detail of interest only to Unicode geeks.
> It means that if you stick to defined characters in the BMP when
> giving Python input, then slicing and indexing unicode (Python 2) or
> str (Python 3) objects gives only valid output even in builds with
> 16-bit code units.  OTOH, invalid processing (involving functions like
> 'chr' or input using surrogateescape codecs) can lead to invalid
> output even in builds with 32-bit code units.
> 
> IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
> they need to know about the limitations of their Python vis-a-vis full
> conformance, at least with respect to the string manipulation functions.

I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
"UCS-2" and "UCS-4" convey almost no information to me, and the bits I
have heard about them on this list have only confused me.  On the other
hand, I understand that 'narrow' means that fewer bytes are used for
each internal character, meaning that some unicode characters need to
be represented by more than one string element, and thus that slicing
strings containing such characters on a narrow build causes problems.
Now, you could tell me the same information using the terms 'UCS-2'
and 'UCS-4' instead of 'narrow' and 'wide', but to my ear 'narrow'
and 'wide' convey a better gut level feeling for what is going on than
'UCS-2' and 'UCS-4' do.  And it avoids any question of whether or not
Python's internal representation actually conforms to whatever standard
it is that UCS refers to, a point on which there seems to be some
dissension.

Having written the above, I googled for UCS-2 and got the Wikipedia
article on UTF16/UCS-2 [1].  Scanning that article, I do not see anything
that would clue me in to the problems of slicing strings in a Python
narrow build.  Indeed, reading that article with my limited unicode
knowledge, if I were told Python used UCS-2, I would assume that non-BMP
characters could not be processed by a Python narrow build.

--
R. David Murray  www.bitdance.com

[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2


Re: [Python-Dev] Mercurial Schedule

2010-11-21 Thread Georg Brandl
Am 21.11.2010 18:27, schrieb Jesus Cea:
> What is the impact on the buildbot architecture? Do the slaves need to do
> anything? At least they need to have Mercurial installed, I guess.
> 
> What, as a buildslave manager, must I do to ready my server for the
> migration?

Apart from having Mercurial installed and "hg" in the PATH (that will
be important for Windows I assume), I don't think anything else is required.
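
A buildslave manager could check the PATH requirement with a few lines of Python. This is only a sketch, assuming Python 3.3 or later for shutil.which; the helper name find_tool is hypothetical.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)

location = find_tool('hg')
print('hg found at %s' % location if location else 'hg is not on PATH')
```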

Georg



Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Raymond Hettinger

On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
> 
> I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
> "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> have heard about them on this list have only confused me. 

From the user's point of view, it doesn't much matter which encoding is
used internally.  

Neither UTF-16 nor UCS-2 is exactly correct anyway.  The former encodes
the entire range of unicode characters in a variable-length code
(a character is usually 2 bytes but is sometimes 4 bytes long).  The latter
encodes only a subset of unicode (the Basic Multilingual Plane) in a
fixed-length code of 2 bytes per character.

What we use internally looks like UTF-16, but a character encoded with
4 bytes is treated as two 2-byte characters (hence the subject of this
thread).  Our hybrid internal coding lets us handle the entire
range of unicode while getting speed and simplicity by doing len()
and slicing with a surrogate pair being treated as two separate
characters.

For the "wide" build, the entire range of unicode is encoded at
4 bytes per character and slicing/len operate correctly since
every character is the same length.   This used to be called UCS-4
and is now UTF-32.

So, with "wide" builds there isn't much confusion (except perhaps
unfamiliar terminology).   The real issue seems to be that for 
"narrow" builds, none of the usual encoding names is exactly correct.  

From a user's point of view, the actual encoding or encoding name
doesn't matter much.  They just need to be able to predict the relevant
behaviors (memory consumption and len/slicing behavior).

For the narrow build, that behavior is:
- Characters in the BMP consume 2 bytes and count as one char
  for purposes of len and slicing.
- Characters above the BMP consume 4 bytes and count as
  two distinct chars for purposes of len and slicing.

For wide builds, all characters are 4 bytes and count as a single
char for len and slicing.
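
The summary above can be tabulated per character. This is a sketch under the assumption of Python 3.3 or later, where ord() accepts any single code point; the helper names narrow_units and describe are hypothetical, and the byte counts follow the figures given in this message, not a measurement of any interpreter.

```python
def narrow_units(ch):
    """Number of 16-bit code units a narrow build uses for one character."""
    return 1 if ord(ch) <= 0xFFFF else 2

def describe(ch):
    """Return (narrow len, narrow bytes, wide len, wide bytes) for one character."""
    n = narrow_units(ch)
    return (n, 2 * n, 1, 4)

for ch in ('A', '\u20ac', '\U0001D11E'):
    print('U+%04X' % ord(ch), describe(ch))
```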

Hope this helps,


Raymond


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Martin v. Löwis
>  > I disagree. Python does "conform" to "UTF-16"
> 
> I'm sure the codecs do.  But the Unicode standard doesn't care about
> the parts of the process, it cares about what it does as a whole.

Chapter and verse?

> Python's internal coding does not conform to UTF-16, and that internal
> coding can, under certain conditions, escape to the outside world as
> invalid "Unicode" output.

I'm fairly certain there are provisions in the Unicode standard for such
behavior (taking into account "certain conditions").

>  > What behavior specifically do you consider non-conforming, and what
>  > specific specification do you think it is not conforming to? For
>  > example, it *is* fully conforming with UTF-8.
> 
> Oh,
> 
> f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
> f.write(chr(int('dc80',16)))
> f.close()
> 
> for one.  That produces a non-UTF-8 file

Right. You are using an API that does not promise to create UTF-8, and
hence isn't UTF-8. The Unicode standard certainly allows implementations
to use character encoding schemes other than UTF-8; this one being
"UTF-8 with surrogate escapes", which is different from "UTF-8" (IANA
MIBEnum 106).
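
A short sketch of what "UTF-8 with surrogate escapes" does in practice, assuming a current CPython 3. The byte string is an invented example of a Latin-1 file name.

```python
raw = b'caf\xe9.txt'          # Latin-1 bytes, not valid UTF-8

# Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
name = raw.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler restores the original bytes exactly.
round_trip = name.encode('utf-8', errors='surrogateescape')
print(ascii(name), round_trip == raw)
```

The round trip is the point of the handler: it preserves arbitrary input bytes, which is exactly why its output need not be UTF-8.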

> You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.

See above :-)

> Nevertheless, the program is able to produce output from internal
> "Unicode" strings that does not conform to Unicode at all.

*Any* Unicode implementation will do that, since they all have to
support legacy encodings in some form. This is certainly conforming to
the Unicode standard, and in fact one of the primary Unicode design
principles.

> A Unicode-
> conforming Python implementation would error at the chr() call, or
> perhaps would not provide surrogateescape error handlers.

Chapter and verse?

> "Although practicality beats purity."

The Unicode standard itself is based on practicality. It wouldn't
have received the success it did if it was based on purity only
(and indeed, was often rejected in cases where it put purity over
practicality, e.g. with the Hangul syllables).

Regards,
Martin


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread R. David Murray
On Sun, 21 Nov 2010 10:17:57 -0800, Raymond Hettinger wrote:
> On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
> > I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
> > "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> > have heard about them on this list have only confused me.

[...]

> From a user's point of view, the actual encoding or encoding name
> doesn't matter much.  They just need to be able to predict the relevant
> behaviors (memory consumption and len/slicing behavior).
> 
> For the narrow build, that behavior is:
> - Characters in the BMP consume 2 bytes and count as one char
>   for purposes of len and slicing.
> - Characters above the BMP consume 4 bytes and count as
>   two distinct chars for purposes of len and slicing.
> 
> For wide builds, all characters are 4 bytes and count as a single
> char for len and slicing.
> 
> Hope this helps,

Thank you, that nicely summarizes and confirms what I thought I knew about
wide versus narrow build.  And as I said, using the names UCS-2/UCS-4
would only *confuse* that understanding, not clarify it.

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Alexander Belopolsky
On Fri, Nov 19, 2010 at 4:43 PM, "Martin v. Löwis"  wrote:
>> In my opinion, the question is more why it was not fixed in Python2. I
>> suppose that the answer is something ugly like "backward compatibility"
>> or "historical reasons" :-)
>
> No, there was a deliberate decision to not support that, see
>
> http://www.python.org/dev/peps/pep-0261/
>
> There had been a long discussion on this specific detail when PEP 261
> was written, and in the end, an explicit, deliberate, considered
> decision was made to raise a ValueError.
>

Yes, the existence of PEP 261 was one of the reasons I was surprised
that a change like this was made without deliberation.   Personally,
I've never used chr() or ord() other than at the Python command
prompt.  Processing text one character at a time is just too slow in
Python.  So for my own use cases, the change is quite welcome.  I also
find that bytes() items being int in 3.x more or less removes the
need for ord().  On the other hand any 2.x program that uses unichr()
and ord() is very likely to exhibit subtly buggy behavior when ported
to 3.x.  I don't think len(chr(i)) = 2 is likely to cause problems,
but map(ord, s) not being an iterator over code points is likely to
break naive programs.   This is especially true because as far as I
can tell there is no easy way to iterate over code points in a Python
string on a narrow build.
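
One way such an iteration could be written by hand is sketched below. It works on an iterable of 16-bit code units (what ord() would return for each element of a narrow-build string) and is not an existing standard-library facility; the name code_points is hypothetical.

```python
def code_points(units):
    """Yield code points from an iterable of 16-bit code units,
    combining each valid surrogate pair into one code point."""
    pending = None                      # high surrogate waiting for its partner
    for u in units:
        if pending is not None:
            if 0xDC00 <= u <= 0xDFFF:
                yield 0x10000 + ((pending - 0xD800) << 10) + (u - 0xDC00)
                pending = None
                continue
            yield pending               # unpaired high surrogate, passed through
            pending = None
        if 0xD800 <= u <= 0xDBFF:
            pending = u
        else:
            yield u
    if pending is not None:
        yield pending

print([hex(cp) for cp in code_points([0x61, 0xD834, 0xDD1E, 0x62])])
```

Unpaired surrogates are passed through unchanged here; a stricter variant could raise an error instead.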


Re: [Python-Dev] [Python-checkins] r86633 - in python/branches/py3k: Doc/library/inspect.rst Doc/whatsnew/3.2.rst Lib/inspect.py Lib/test/test_inspect.py Misc/NEWS

2010-11-21 Thread Éric Araujo
> Author: nick.coghlan
> New Revision: 86633
> 
> Issue #10220: Add inspect.getgeneratorstate(). Initial patch by Rodolpho 
> Eckhardt
> 
> Modified: python/branches/py3k/Doc/library/inspect.rst
> ==
> --- python/branches/py3k/Doc/library/inspect.rst  (original)
> +++ python/branches/py3k/Doc/library/inspect.rst  Sun Nov 21 04:44:04 2010
> @@ -620,3 +620,25 @@
> # in which case the descriptor itself will
> # have to do
> pass
> +
> +Current State of a Generator
> +
> +
> +When implementing coroutine schedulers and for other advanced uses of
> +generators, it is useful to determine whether a generator is currently
> +executing, is waiting to start or resume or execution, or has already
> +terminated. func:`getgeneratorstate` allows the current state of a
> +generator to be determined easily.
> +
> +.. function:: getgeneratorstate(generator)
> +
> +Get current state of a generator-iterator.
> +
> +Possible states are:
> +  GEN_CREATED: Waiting to start execution.
> +  GEN_RUNNING: Currently being executed by the interpreter.
> +  GEN_SUSPENDED: Currently suspended at a yield expression.
> +  GEN_CLOSED: Execution has completed.

I wonder if those shouldn’t be marked up as :data: or something to make
them indexed.
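
For readers who have not seen the new function, a small sketch of the four states, assuming Python 3.2 or later. The generator counter and the list seen are illustrative names.

```python
import inspect

def counter():
    # Observing the state from inside the body sees the running generator.
    seen.append(inspect.getgeneratorstate(gen))
    yield 1

seen = []
gen = counter()
seen.append(inspect.getgeneratorstate(gen))   # before the first next()
next(gen)
seen.append(inspect.getgeneratorstate(gen))   # paused at the yield
gen.close()
seen.append(inspect.getgeneratorstate(gen))   # after close()
print(seen)
```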



Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4

2010-11-21 Thread Glenn Linderman

On 11/21/2010 9:18 AM, R. David Murray wrote:
> I want to look at the CGI issue, but I'm not sure when I'll get to it.


Actually, since this code was working before 3.x, and if email.parser
can now accept binary streams, maybe the only thing wrong is that it is
presently getting a text stream instead.  That is something cgi.py or the
application program would have to switch.  Then maybe some testing would
show that it works correctly, or maybe UTF-8 would have to be specified
as the encoding to use for the text parts.


Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4

2010-11-21 Thread R. David Murray
On Sun, 21 Nov 2010 19:59:54 -0800, Glenn Linderman wrote:
> On 11/21/2010 9:18 AM, R. David Murray wrote:
> > I want to look at the CGI issue, but I'm not sure when I'll get to it.
> 
> Actually, since this code was working before 3.x, and if email.parser 
> can now accept binary streams, it seems like maybe the only thing that 
> might be wrong is that presently it is getting a text stream instead, so 
> that is something cgi.py or the application program would have to 
> switch, and then maybe some testing would discover correctness, or maybe 
> a specification of UTF-8 as the encoding to use for the text parts would 
> have to be done.

Well, given the bytes/string split in Python3, code definitely has to
be changed to make this work, since you have to explicitly call bytes
processing routines (message_from_bytes, message_from_binary_file,
BytesFeedParser, etc.) to parse binary data, and likewise use
BytesGenerator to emit binary data.
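
A minimal sketch of that bytes-in, bytes-out path, assuming Python 3.2 or later. The message content is invented for illustration.

```python
import io
from email import message_from_bytes
from email.generator import BytesGenerator

raw = (b"Subject: test\n"
       b"Content-Type: text/plain; charset=utf-8\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"caf\xc3\xa9\n")

msg = message_from_bytes(raw)              # parse binary input

out = io.BytesIO()
BytesGenerator(out).flatten(msg)           # emit binary output
print(msg['Subject'], out.getvalue() == raw)
```

The str-based counterparts (message_from_string, Generator) would need the data decoded first, which is the step that cannot be done safely for arbitrary 8-bit input.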

--
R. David Murray  www.bitdance.com


Re: [Python-Dev] Bug week-end on the 20th-21st?

2010-11-21 Thread Brian Curtin
On Mon, Oct 25, 2010 at 15:04, Antoine Pitrou  wrote:

> On Mon, 25 Oct 2010 11:32:42 -0400
> "R. David Murray"  wrote:
> > On Mon, 25 Oct 2010 12:22:24 -0200, Rodrigo Bernardo Pimentel wrote:
> > >> Am 23.10.2010 19:08, schrieb Antoine Pitrou:
> > >>> The first 3.2 beta is scheduled by Georg for November 13th.
> > >>> What would you think of scheduling a bug week-end one week later,
> > >>> that is on November 20th and 21st? We would need enough core
> > >>> developers to be available on #python-dev.
> > >
> > >FWIW, I'm +1, and I'll try to get the Sao Paulo users group to
> > >participate.
> >
> > I think this is a great idea (both Antoine's initial suggestion and the
> > idea of getting users groups to participate).
> >
> > I'll be around and able to participate that weekend except for evening
> > US Eastern time.
>
> Ok, so 20th-21st of November it shall be!
>
> Regards
>
> Antoine.


Although a few time zones are still celebrating Bug Weekend, it looks like
at least 76 bugs got closed out [0]. Some of those happened thanks to a
number of first time contributors. Thanks to everyone for their efforts!

[0]
http://bugs.python.org/issue?%40columns=title&%40columns=id&activity=from+2010-11-20+to+2010-11-22&%40columns=activity&%40sort=activity&%40group=priority&status=2&%40columns=status&%40pagesize=50&%40startwith=0&%40action=search


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
"Martin v. Löwis" writes:

 > Chapter and verse?

Unicode 5.0, Chapter 3, verse C9:

When a process generates a code unit sequence which purports to be
in a Unicode character encoding form, it shall not emit ill-formed
code sequences.

I think anything called "UTF-8 something" is likely to be taken to
"purport".  Furthermore, users don't necessarily see which error
handlers are being used.  A user who specifies "utf8" as the output
codec is likely to be rather surprised if non-UTF-8 is emitted because
the app specified surrogateescape.  Eg, consider a script which munges
file descriptions into reasonable-length file names on Unix.  Yes,
technically the non-Unicode output is the app's fault, but I expect
many users will put some blame on Python.

I am in full agreement with you about the technicalities, but I am
looking for ways to clue in users that (a) the technicalities matter,
and (b) that Python does a *very* good job of making things as safe as
possible without becoming unable to handle bytes.  I think "wide"
vs. "narrow" fails at both.  It focuses on storage issues, which of
course are important, but at the cost of ignoring the fact that for
users of non-BMP characters 32-bit code units are much safer.  Users
who need non-BMP characters are relatively few, and at least at the
present time most are painfully aware of the need to care for
technicalities.  I expect them to be pleasantly surprised by how easy
it is to get reasonably safe behavior even from a 16-bit build.

 > > Python's internal coding does not conform to UTF-16, and that internal
 > > coding can, under certain conditions, escape to the outside world as
 > > invalid "Unicode" output.
 > 
 > I'm fairly certain there are provisions in the Unicode standard for such
 > behavior (taking into account "certain conditions").

Sure.  There's nothing in the Unicode standard that says you have to
conform to it unless you claim to conform to it.

So it is valid to say that Python's Unicode codecs without
surrogateescape do conform.  The point is that Python does not, even
if all of the input is valid Unicode, because of the provision of
surrogateescape and the lack of Unicode conformance-checking for
certain internal functionality like chr() and slicing.

You can say "we don't make any such claim", but IMO the distinction in
question is too fine a point for most users, and requires a very large
amount of Unicode knowledge (not to mention standards geekiness) to
even understand the precise statement.

"Unicode support" to users should mean that Python does the right
thing, not that if you look hard enough in the documentation you will
discover that Python doesn't claim to do the right thing even though
in practice it mostly does.  IMO, "UCS-2" is a pretty good description
of what the user can leave up to Python in perfect safety.  RDM's
reply worries me a little, but I'll reply to his message separately.

 > *Any* Unicode implementation will do that, since they all have to
 > support legacy encodings in some form. This is certainly conforming to
 > the Unicode standard, and in fact one of the primary Unicode design
 > principles.

No.  Support for legacy encodings takes you outside of the realm of
Unicode conformance by definition.  Their names tell you that,
however.  "UTF-8 with surrogate escapes" on the other hand is an
entirely different kettle of fish.  It pretends to be UTF-8, but
isn't.  I think that users who give Python valid input should be able
to expect valid output, but they can't.

Chapter 3, verse C7:

When a process purports not to modify the interpretation of a
valid coded character sequence, it shall make no change to that
coded character sequence other than the possible replacement of
character sequences by their canonical-equivalent sequences, or
the deletion of *noncharacter* code points.

Sure, you can tell users the truth: "Python may modify your Unicode
characters if you slice or index Unicode strings.  It may even
silently turn them into invalid codes which will eventually raise
Errors."  Then you are conformant, but why would anyone want to use
such a program?

If you tell them "UCS-2[sic] Python is safe to use with *no* extra
care if you use only UCS-2 [or BMP] characters", suddenly Python looks
very nice indeed again.  "UCS-4" Python is even better; all you have
to do is to avoid surrogateescape codecs.  However, you're still
vulnerable to hard-to-diagnose errors at the output stage in case of
program bugs, because not enough checking of values is done by Python
itself.
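
The missing value check can be added by the application before output. This is a sketch, assuming Python 3.3 or later, where a str is a sequence of code points; both helper names are hypothetical.

```python
def surrogate_positions(s):
    """Return the indexes of code points in the surrogate range U+D800..U+DFFF."""
    return [i for i, ch in enumerate(s) if 0xD800 <= ord(ch) <= 0xDFFF]

def checked_utf8(s):
    """Encode s as strict UTF-8, reporting where any surrogates are."""
    bad = surrogate_positions(s)
    if bad:
        raise ValueError('surrogate code points at indexes %r' % bad)
    return s.encode('utf-8')

print(surrogate_positions('a\udc80b'))
```

Running the check where the string is produced, not where it is finally written, makes the offending step easier to locate.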

 > > A Unicode-conforming Python implementation would error at the
 > > chr() call, or perhaps would not provide surrogateescape error
 > > handlers.
 > 
 > Chapter and verse?

Chapter 3, verse C9 again.

 > > "Although practicality beats purity."
 > 
 > The Unicode standard itself is based on practicality. It wouldn't
 > have received the success it did if it was based on purity only
 > (and indeed, was often rejected in ca

[Python-Dev] is this a bug? no environment variables

2010-11-21 Thread Glenn Linderman
In reviewing my notes from my experimentations with CGIHTTPServer 
(Python2.6) and then http.server (Python 3.2a4), I note one behavior I 
haven't reported as a bug, nor do I know where to start to figure it 
out, other than experimentally.


The experiment: launching CGIHTTPServer without environment variables, 
by the simple expedient of using a batch file to unset all the existing 
environment variables, and then launching Python2.6 with CGIHTTPServer.


So it failed early: random.py fails at line 110 (Python 2.6).

I suppose it is possible that some environment variables are used by 
Python directly (but I can't seem to find a documented list of them) 
although I would expect that usage to be optional, with fall-back 
defaults when they don't exist.  I suppose it is even possible that some 
Windows APIs might depend on some environment variables, but I expected 
that the registry had replaced such usage completely, by now, with the 
environment variables mostly being a convenience tool for batch files, 
or for optional, temporary alteration of particular settings.


If anyone knows of documentation listing what environment variables are 
required by Python on Windows, I would appreciate a pointer, searches 
and doc browsing having not turned it up.


I'll attempt to recreate the test situation later this week with Python 
3.2a4, if no one responds, but the only debug technique I can think of 
is to slowly remove environment variables until I find the minimum set 
required to run http.server successfully for my tests with CGI files.
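
The "slowly remove environment variables" search could be automated roughly as follows. This is a sketch only: it assumes a modern Python 3 (subprocess.run needs 3.5 or later), it assumes that adding a variable never makes a working environment fail, and both function names are hypothetical. The probe shown imports random in a child interpreter, since that is where the failure was observed; a real probe would start the server and request a CGI page.

```python
import subprocess
import sys

def minimal_env(env, works):
    """Greedily drop variables from env while works(candidate) stays true."""
    kept = dict(env)
    for name in sorted(env):
        trial = {k: v for k, v in kept.items() if k != name}
        if works(trial):
            kept = trial            # the variable was not needed
    return kept

def imports_random(env):
    """Hypothetical probe: can a child Python import random with this env?"""
    result = subprocess.run([sys.executable, '-c', 'import random'],
                            env=env, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling minimal_env(os.environ, imports_random) would then return the subset of the current environment that the probe still needs.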
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] len(chr(i)) = 2?

2010-11-21 Thread Stephen J. Turnbull
R. David Murray writes:

 > I'm sorry, but I have to disagree.  As a relative unicode ignoramus,
 > "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
 > have heard about them on this list have only confused me.

OK, point taken.

 > On the other hand, I understand that 'narrow' means that fewer
 > bytes are used for each internal character, meaning that some
 > unicode characters need to be represented by more than one string
 > element, and thus that slicing strings containing such characters
 > on a narrow build causes problems.  Now, you could tell me the same
 > information using the terms 'UCS-2' and 'UCS-4' instead of 'narrow'
 > and 'wide', but to my ear 'narrow' and 'wide' convey a better gut
 > level feeling for what is going on than 'UCS-2' and 'UCS-4' do.

I think that is probably conditioned by your long experience with
Python's Unicode features, specifically the knowledge that Python's
Unicode strings are not arrays of characters, which often is referred
to on this list.

My guess is that very few newbies would know that, and it is not
implied by "narrow".  For example, both Emacs (for sure) and Perl
(IIUC) index strings of variable-width character by characters (at
great expense of performance in Emacs, at least), not as code units.

 > And it avoids any question of whether or not Python's internal
 > representation actually conforms to whatever standard it is that
 > UCS refers to, a point on which there seems to be some dissension.

UCS-2 refers to ISO 10646, Annex 1 IIRC.[1]  Anyway, it's somewhere in
ISO 10646.  I don't think there's actually dissension on conformance
to UCS-2, as that's very easy to achieve.  Rather, Guido explicitly
pronounced that Python processes arrays of code units, not
characters.  My point is that if you pretend that Python is processing
*characters* according to UCS-2 rules for characters, you'll always
come to the same conclusion about what Python will do as if you use
the technically correct terminology of code units.  (At least for the
BMP and UTF-16 private areas.  There will necessarily be some
confusion about surrogates, since in UCS-2 they are characters while
in UTF-16 they're merely "code points", and the Unicode characters
they represent can't be represented at all in UCS-2.)

 > Indeed, reading that article with my limited unicode knowledge, if
 > I were told Python used UCS-2, I would assume that non-BMP
 > characters could not be processed by a Python narrow build.

Actually, I'm almost happy with that.

That is, the precise formulation is "could not be processed *safely
without extra care* by a Python narrow build."  Specifically, AFAIK if
you range check characters that have been indexed out of a string, or
are located at slice boundaries, or produced by chr() or a
surrogateescape input codec, you're safe.  But practically speaking
few apps will actually do those checks and therefore they are unsafe:
processing non-BMP characters can easily lead to show-stopping
Exceptions.  It's very analogous to the kind of show-stopping "bad
character in a header" exception that plagued Mailman for so long, and
had to be fixed on a case-by-case basis.  But the restriction to BMP
characters is much more reasonable (at least for now) than RFC 822's
restriction to ASCII!
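
The slice-boundary check mentioned above might look like this. It is a sketch that models a narrow-build string as a list of 16-bit code units; the function names are hypothetical.

```python
def is_high(u):
    """True for a high (leading) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    """True for a low (trailing) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF

def safe_cut(units, i):
    """Move cut position i left by one if it would split a surrogate pair."""
    if 0 < i < len(units) and is_high(units[i - 1]) and is_low(units[i]):
        return i - 1
    return i

units = [0x61, 0xD834, 0xDD1E, 0x62]   # 'a', U+1D11E as a pair, 'b'
cut = safe_cut(units, 2)
print(units[:cut], units[cut:])
```

Moving the cut left keeps the pair together in the right-hand piece; moving it right would be an equally valid policy.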

But evidently you take it much more stringently.  So the question is,
"what fraction of developers who think as you do would therefore be
put off from using Python to build their applications?"  If most would
say "OK, we'll stick with BMP for now and use UCS-4 or some hack to
deal with extended characters later -- it can't really be true that
it's absolutely impossible to use non-BMP characters," I don't mind
that misunderstanding.

OTOH, yes, it would be bad if the use of "UCS-2" were to imply to more
than a couple of developers that 16-bit builds of Python can't handle
UTF-16 *at all*.


Footnotes: 
[1]  It simply says "we have a subset of the Unicode character set all
of whose code points can be represented in 16 bits, excluding 0x."
It goes on to define a private area, reserved for use by applications
that will never be standardized, and it says that if you don't know
what a code point in the character area is, don't change it (you can
delete it, however).  ISTR that a later Amendment added 0xFFFE to the
short-list of non-characters.

The surrogate area was taken out of the private area, so a UCS-2
application will simply consider each surrogate to be an unknown
character and pass it through unchanged -- unless it deletes it, or
inserts other characters between the code points of a surrogate pair.
And that's why UCS-2 isn't UTF-16 conforming -- which is basically why
Python isn't either.
