Re: [Python-Dev] len(chr(i)) = 2?
"Martin v. Löwis" writes:
> On 20.11.2010 05:11, Stephen J. Turnbull wrote:
> > "Martin v. Löwis" writes:
> >
> > > The term "UCS-2" is a character set that can encode only 65536
> > > characters; it thus refers to Unicode 1.1. According to the Unicode
> > > Consortium's FAQ, the term UCS-2 should be avoided these days.
> >
> > So what do you propose we call the Python implementation?
>
> A technically correct description would be to say that Python uses either
> 16-bit code units or 32-bit code units; for brevity, these can be called
> narrow and wide code units.
I agree that's technically correct. Unfortunately, it's also useless
to anybody who doesn't already know more about Unicode than anybody
should have to know.
> > and therefore is not UTF-16 conforming.
>
> I disagree. Python does "conform" to "UTF-16"
I'm sure the codecs do. But the Unicode standard doesn't care about
the parts of the process, it cares about what it does as a whole.
Python's internal coding does not conform to UTF-16, and that internal
coding can, under certain conditions, escape to the outside world as
invalid "Unicode" output.
> > AFAIK this was not supposed to change in Python 3; indexing and
> > slicing go by code unit (isomorphic to UCS-n), not character, and due
> > to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
> > can produce output that conforms to Unicode not at all (as a user
> > option, of course, but it's still non-conformant).
>
> What behavior specifically do you consider non-conforming, and what
> specific specification do you think it is not conforming to? For
> example, it *is* fully conforming with UTF-8.
Oh,
f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
f.write(chr(int('dc80',16)))
f.close()
for one. That produces a non-UTF-8 file in a 32-bit-code-unit build.
You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.
Nevertheless, the program is able to produce output from internal
"Unicode" strings that does not conform to Unicode at all. A Unicode-
conforming Python implementation would error at the chr() call, or
perhaps would not provide surrogateescape error handlers.
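The file-writing snippet above can be reduced to a small in-memory sketch. It assumes a Python 3 interpreter and uses only str.encode and bytes.decode; the variable names are illustrative and not from the original message.

```python
# A lone low surrogate, as produced by chr(int('dc80', 16)) above.
lone = chr(0xDC80)

# With the surrogateescape handler the encoder emits the raw byte 0x80.
data = lone.encode("utf-8", errors="surrogateescape")
print(data)

# A strict UTF-8 decoder rejects that byte sequence.
try:
    data.decode("utf-8")
    decodes_strictly = True
except UnicodeDecodeError:
    decodes_strictly = False

# Without the handler, the strict encoder refuses the surrogate instead.
try:
    lone.encode("utf-8")
    encodes_strictly = True
except UnicodeEncodeError:
    encodes_strictly = False

print(decodes_strictly, encodes_strictly)
```

The sketch only shows that the ill-formed output comes from the error handler, not from the default strict codec.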
It is, of course, possible to write Python programs that conform (and
easier than in any other language I know), but Python itself does not
conform to post-1.1 Unicode standards. Too bad for the standards:
"Although practicality beats purity."
The point is that internal code is *not* UTF-16 (or -32), but it *is*
isomorphic to UCS-2 (or -4). *That is very useful information to
users*, it's not a technical detail of interest only to Unicode geeks.
It means that if you stick to defined characters in the BMP when
giving Python input, then slicing and indexing unicode (Python 2) or
str (Python 3) objects gives only valid output even in builds with
16-bit code units. OTOH, invalid processing (involving functions like
'chr' or input using surrogateescape codecs) can lead to invalid
output even in builds with 32-bit code units.
IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
they need to know about the limitations of their Python vis-a-vis full
conformance, at least with respect to the string manipulation functions.
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe:
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4
On Sat, 20 Nov 2010 23:52:45 -0800, Glenn Linderman wrote:
> Sadly, cgi.py input handling seems to depend on the email module,
> thought to be fixed for 3.2, but it is not clear if that has been
> achieved, or if the surrogate encode workaround is sufficient for this.
> More testing needed, but I don't have such a test case developed yet.
Indeed, this should theoretically be fixable now. The email module is
now perfectly capable of both consuming and producing binary data. The
user of the module doesn't need to care how this was achieved unless
they want to do processing of non-RFC conformant data.
I want to look at the CGI issue, but I'm not sure when I'll get to it.
--
R. David Murray www.bitdance.com
Re: [Python-Dev] Mercurial Schedule
What is the impact on the buildbot architecture? Must the slaves do
anything? At least they need to have Mercurial installed, I guess.
What, as a buildslave manager, must I do to ready my server for the
migration?
--
Jesus Cea Avion
[email protected] - http://www.jcea.es/
Re: [Python-Dev] len(chr(i)) = 2?
On Sun, 21 Nov 2010 21:55:12 +0900, "Stephen J. Turnbull" wrote:
> "Martin v. Löwis" writes:
> > On 20.11.2010 05:11, Stephen J. Turnbull wrote:
> > > "Martin v. Löwis" writes:
> > >
> > > > The term "UCS-2" is a character set that can encode only 65536
> > > > characters; it thus refers to Unicode 1.1. According to the Unicode
> > > > Consortium's FAQ, the term UCS-2 should be avoided these days.
> > >
> > > So what do you propose we call the Python implementation?
> >
> > A technically correct description would be to say that Python uses either
> > 16-bit code units or 32-bit code units; for brevity, these can be called
> > narrow and wide code units.
>
> I agree that's technically correct. Unfortunately, it's also useless
> to anybody who doesn't already know more about Unicode than anybody
> should have to know.
[...]
> The point is that internal code is *not* UTF-16 (or -32), but it *is*
> isomorphic to UCS-2 (or -4). *That is very useful information to
> users*, it's not a technical detail of interest only to Unicode geeks.
> It means that if you stick to defined characters in the BMP when
> giving Python input, then slicing and indexing unicode (Python 2) or
> str (Python 3) objects gives only valid output even in builds with
> 16-bit code units. OTOH, invalid processing (involving functions like
> 'chr' or input using surrogateescape codecs) can lead to invalid
> output even in builds with 32-bit code units.
>
> IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
> they need to know about the limitations of their Python vis-a-vis full
> conformance, at least with respect to the string manipulation functions.
I'm sorry, but I have to disagree. As a relative unicode ignoramus,
"UCS-2" and "UCS-4" convey almost no information to me, and the bits I
have heard about them on this list have only confused me.
On the other hand, I understand that 'narrow' means that fewer bytes
are used for each internal character, meaning that some unicode
characters need to be represented by more than one string element, and
thus that slicing strings containing such characters on a narrow build
causes problems. Now, you could tell me the same information using the
terms 'UCS-2' and 'UCS-4' instead of 'narrow' and 'wide', but to my ear
'narrow' and 'wide' convey a better gut level feeling for what is going
on than 'UCS-2' and 'UCS-4' do.
And it avoids any question of whether or not Python's internal
representation actually conforms to whatever standard it is that UCS
refers to, a point on which there seems to be some dissension.
Having written the above, I googled for UCS-2 and got the Wikipedia
article on UTF16/UCS-2 [1]. Scanning that article, I do not see
anything that would clue me in to the problems of slicing strings in a
Python narrow build. Indeed, reading that article with my limited
unicode knowledge, if I were told Python used UCS-2, I would assume
that non-BMP characters could not be processed by a Python narrow
build.
--
R. David Murray www.bitdance.com
[1] http://en.wikipedia.org/wiki/UTF-16/UCS-2
Re: [Python-Dev] Mercurial Schedule
On 21.11.2010 18:27, Jesus Cea wrote:
> What is the impact on the buildbot architecture? Must the slaves do
> anything? At least they need to have Mercurial installed, I guess.
>
> What, as a buildslave manager, must I do to ready my server for the
> migration?
Apart from having Mercurial installed and "hg" in the PATH (that will
be important for Windows, I assume), I don't think anything else is
required.
Georg
Re: [Python-Dev] len(chr(i)) = 2?
On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
>
> I'm sorry, but I have to disagree. As a relative unicode ignoramus,
> "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> have heard about them on this list have only confused me.
From the user's point of view, it doesn't much matter which encoding
is used internally.
Neither UTF-16 nor UCS-2 is exactly correct anyway. The former encodes
the entire range of Unicode characters in a variable-length code (a
character is usually 2 bytes but is sometimes 4 bytes long). The latter
encodes only a subset of Unicode (the Basic Multilingual Plane) in a
fixed-length code of 2 bytes per character.
What we use internally looks like UTF-16, but a character encoded with
4 bytes is treated as two 2-byte characters (hence the subject of this
thread). Our hybrid internal coding lets us handle the entire range of
Unicode while getting speed and simplicity by doing len() and slicing
with a surrogate pair being treated as two separate characters.
For the "wide" build, the entire range of Unicode is encoded at 4 bytes
per character and slicing/len operate correctly since every character
is the same length. This used to be called UCS-4 and is now UTF-32.
So, with "wide" builds there isn't much confusion (except perhaps
unfamiliar terminology). The real issue seems to be that for "narrow"
builds, none of the usual encoding names is exactly correct.
From a user's point of view, the actual encoding or encoding name
doesn't matter much. They just need to be able to predict the relevant
behaviors (memory consumption and len/slicing behavior).
For the narrow build, that behavior is:
- Characters in the BMP consume 2 bytes and count as one char for
  purposes of len and slicing.
- Characters above the BMP consume 4 bytes and count as two distinct
  chars for purposes of len and slicing.
For wide builds, all characters are 4 bytes and count as a single char
for len and slicing.
Hope this helps,
Raymond
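The narrow/wide counting rule summarized above can be sketched as follows. This assumes CPython 3.3 or later, where (per PEP 393) narrow builds no longer exist and len() counts code points; the helper narrow_len is a hypothetical name that simulates the narrow-build count by counting UTF-16 code units.

```python
def narrow_len(s):
    """Count UTF-16 code units, i.e. what len(s) reported on a narrow build."""
    # Each UTF-16 code unit occupies exactly two bytes in utf-16-le.
    return len(s.encode("utf-16-le")) // 2

bmp_char = "\u00e9"          # inside the BMP
astral_char = "\U0001F600"   # above the BMP

# (simulated narrow-build length, length on a current interpreter)
print(narrow_len(bmp_char), len(bmp_char))
print(narrow_len(astral_char), len(astral_char))
```

The second line of output differs between the two counts, which is the behavior the thread subject refers to.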
Re: [Python-Dev] len(chr(i)) = 2?
> > I disagree. Python does "conform" to "UTF-16"
>
> I'm sure the codecs do. But the Unicode standard doesn't care about
> the parts of the process, it cares about what it does as a whole.
Chapter and verse?
> Python's internal coding does not conform to UTF-16, and that internal
> coding can, under certain conditions, escape to the outside world as
> invalid "Unicode" output.
I'm fairly certain there are provisions in the Unicode standard for such
behavior (taking into account "certain conditions").
> > What behavior specifically do you consider non-conforming, and what
> > specific specification do you think it is not conforming to? For
> > example, it *is* fully conforming with UTF-8.
>
> Oh,
>
> f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
> f.write(chr(int('dc80',16)))
> f.close()
>
> for one. That produces a non-UTF-8 file
Right. You are using an API that does not promise to create UTF-8, and
hence isn't UTF-8. The Unicode standard certainly allows implementations
to use character encoding schemes other than UTF-8; this one being
"UTF-8 with surrogate escapes", which is different from "UTF-8" (IANA
MIBEnum 106).
> You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.
See above :-)
> Nevertheless, the program is able to produce output from internal
> "Unicode" strings that does not conform to Unicode at all.
*Any* Unicode implementation will do that, since they all have to
support legacy encodings in some form. This is certainly conforming to
the Unicode standard, and in fact one of the primary Unicode design
principles.
> A Unicode-
> conforming Python implementation would error at the chr() call, or
> perhaps would not provide surrogateescape error handlers.
Chapter and verse?
> "Although practicality beats purity."
The Unicode standard itself is based on practicality. It wouldn't
have received the success it did if it was based on purity only
(and indeed, was often rejected in cases where it put purity over
practicality, e.g. with the Hangul syllables).
Regards,
Martin
Re: [Python-Dev] len(chr(i)) = 2?
On Sun, 21 Nov 2010 10:17:57 -0800, Raymond Hettinger wrote:
> On Nov 21, 2010, at 9:38 AM, R. David Murray wrote:
> > I'm sorry, but I have to disagree. As a relative unicode ignoramus,
> > "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> > have heard about them on this list have only confused me.
[...]
> From a user's point of view, the actual encoding or encoding name
> doesn't matter much. They just need to be able to predict the relevant
> behaviors (memory consumption and len/slicing behavior).
>
> For the narrow build, that behavior is:
> - Characters in the BMP consume 2 bytes and count as one char
>   for purposes of len and slicing.
> - Characters above the BMP consume 4 bytes and count as
>   two distinct chars for purposes of len and slicing.
>
> For wide builds, all characters are 4 bytes and count as a single
> char for len and slicing.
>
> Hope this helps,
Thank you, that nicely summarizes and confirms what I thought I knew
about wide versus narrow build.
And as I said, using the names UCS-2/UCS-4 would only *confuse* that
understanding, not clarify it.
--
R. David Murray www.bitdance.com
Re: [Python-Dev] len(chr(i)) = 2?
On Fri, Nov 19, 2010 at 4:43 PM, "Martin v. Löwis" wrote:
>> In my opinion, the question is more why it was not fixed in Python 2. I
>> suppose that the answer is something ugly like "backward compatibility"
>> or "historical reasons" :-)
>
> No, there was a deliberate decision to not support that, see
>
> http://www.python.org/dev/peps/pep-0261/
>
> There had been a long discussion on this specific detail when PEP 261
> was written, and in the end, an explicit, deliberate, considered
> decision was made to raise a ValueError.
>
Yes, the existence of PEP 261 was one of the reasons I was surprised
that a change like this was made without deliberation.
Personally, I've never used chr() or ord() other than at the Python
command prompt. Processing text one character at a time is just too
slow in Python. So for my own use cases, the change is quite welcome.
I also find that bytes() items being int in 3.x more or less removes
the need for ord().
On the other hand, any 2.x program that uses unichr() and ord() is very
likely to exhibit subtly buggy behavior when ported to 3.x. I don't
think len(chr(i)) = 2 is likely to cause problems, but map(ord, s) not
being an iterator over code points is likely to break naive programs.
This is especially true because, as far as I can tell, there is no easy
way to iterate over code points in a Python string on a narrow build.
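One way such an iteration could be written is sketched below. The helper code_points is hypothetical (it is not a standard library function); it takes UTF-16 code units as integers, which is what map(ord, s) yielded on a narrow build, and recombines surrogate pairs. Lone surrogates are passed through unchanged, which is an assumption of this sketch rather than a recommendation.

```python
def code_points(units):
    """Yield code points from an iterable of UTF-16 code units (ints)."""
    pending = None  # a high surrogate waiting for its low partner
    for u in units:
        if pending is not None:
            if 0xDC00 <= u <= 0xDFFF:
                # Combine the high/low pair into one supplementary code point.
                yield 0x10000 + ((pending - 0xD800) << 10) + (u - 0xDC00)
                pending = None
                continue
            # Unpaired high surrogate: emit it as-is and fall through.
            yield pending
            pending = None
        if 0xD800 <= u <= 0xDBFF:
            pending = u
        else:
            yield u
    if pending is not None:
        yield pending

# 'A', U+1F600 as a surrogate pair, 'B'
units = [0x41, 0xD83D, 0xDE00, 0x42]
print([hex(cp) for cp in code_points(units)])
```

On interpreters where len() already counts code points, map(ord, s) gives the same result directly and no helper is needed.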
Re: [Python-Dev] [Python-checkins] r86633 - in python/branches/py3k: Doc/library/inspect.rst Doc/whatsnew/3.2.rst Lib/inspect.py Lib/test/test_inspect.py Misc/NEWS
> Author: nick.coghlan
> New Revision: 86633
>
> Issue #10220: Add inspect.getgeneratorstate(). Initial patch by Rodolpho
> Eckhardt
>
> Modified: python/branches/py3k/Doc/library/inspect.rst
> ==
> --- python/branches/py3k/Doc/library/inspect.rst (original)
> +++ python/branches/py3k/Doc/library/inspect.rst Sun Nov 21 04:44:04 2010
> @@ -620,3 +620,25 @@
>  # in which case the descriptor itself will
>  # have to do
>  pass
> +
> +Current State of a Generator
> +
> +
> +When implementing coroutine schedulers and for other advanced uses of
> +generators, it is useful to determine whether a generator is currently
> +executing, is waiting to start or resume or execution, or has already
> +terminated. func:`getgeneratorstate` allows the current state of a
> +generator to be determined easily.
> +
> +.. function:: getgeneratorstate(generator)
> +
> +Get current state of a generator-iterator.
> +
> +Possible states are:
> + GEN_CREATED: Waiting to start execution.
> + GEN_RUNNING: Currently being executed by the interpreter.
> + GEN_SUSPENDED: Currently suspended at a yield expression.
> + GEN_CLOSED: Execution has completed.
I wonder if those shouldn't be marked up as :data: or something to make
them indexed.
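The four states listed in the patch can be observed with a short sketch. It assumes Python 3.2 or later, where inspect.getgeneratorstate and the GEN_* constants are available; the generator functions counter and self_inspecting are made up for illustration.

```python
import inspect

def counter():
    yield 1
    yield 2

gen = counter()
states = [inspect.getgeneratorstate(gen)]      # not started yet
next(gen)
states.append(inspect.getgeneratorstate(gen))  # paused at the first yield
gen.close()
states.append(inspect.getgeneratorstate(gen))  # finished
print(states)

def self_inspecting():
    # While this body runs, the generator object "me" is executing.
    yield inspect.getgeneratorstate(me)

me = self_inspecting()
running_state = next(me)
print(running_state)
```

A scheduler could use the same call to decide whether a generator can still be resumed.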
Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4
On 11/21/2010 9:18 AM, R. David Murray wrote:
> I want to look at the CGI issue, but I'm not sure when I'll get to it.
Actually, since this code was working before 3.x, and if email.parser
can now accept binary streams, it seems like maybe the only thing that
might be wrong is that presently it is getting a text stream instead, so
that is something cgi.py or the application program would have to
switch, and then maybe some testing would discover correctness, or maybe
a specification of UTF-8 as the encoding to use for the text parts would
have to be done.
Re: [Python-Dev] Web servers, bytes, str, documentation, Python 3.2a4
On Sun, 21 Nov 2010 19:59:54 -0800, Glenn Linderman wrote:
> On 11/21/2010 9:18 AM, R. David Murray wrote:
> > I want to look at the CGI issue, but I'm not sure when I'll get to it.
>
> Actually, since this code was working before 3.x, and if email.parser
> can now accept binary streams, it seems like maybe the only thing that
> might be wrong is that presently it is getting a text stream instead, so
> that is something cgi.py or the application program would have to
> switch, and then maybe some testing would discover correctness, or maybe
> a specification of UTF-8 as the encoding to use for the text parts would
> have to be done.
Well, given the bytes/string split in Python 3, code definitely has to
be changed to make this work, since you have to explicitly call bytes
processing routines (message_from_bytes, message_from_binary_file,
BytesFeedParser, etc) to parse binary data, and likewise use
BytesGenerator to emit binary data.
--
R. David Murray www.bitdance.com
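A minimal sketch of that bytes-in, bytes-out path, assuming Python 3.2 or later with the default email policy. The message content and the example.com address are invented for illustration; only email.message_from_bytes and email.generator.BytesGenerator are taken from the message above.

```python
import email
from email.generator import BytesGenerator
from io import BytesIO

# A message whose body contains a raw non-ASCII byte and no declared charset.
raw = (b"From: sender@example.com\n"
       b"Subject: test\n"
       b"\n"
       b"caf\xe9\n")

# Parse binary input explicitly with the bytes API.
msg = email.message_from_bytes(raw)

# Emit binary output explicitly with BytesGenerator.
out = BytesIO()
BytesGenerator(out).flatten(msg)
serialized = out.getvalue()

print(msg["Subject"])
print(serialized)
```

Using message_from_string and Generator instead would require the input to be text already, which is the mismatch discussed for cgi.py.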
Re: [Python-Dev] Bug week-end on the 20th-21st?
On Mon, Oct 25, 2010 at 15:04, Antoine Pitrou wrote:
> On Mon, 25 Oct 2010 11:32:42 -0400
> "R. David Murray" wrote:
> > On Mon, 25 Oct 2010 12:22:24 -0200, Rodrigo Bernardo Pimentel
> > <[email protected]> wrote:
> > >> On 23.10.2010 19:08, Antoine Pitrou wrote:
> > >>> The first 3.2 beta is scheduled by Georg for November 13th.
> > >>> What would you think of scheduling a bug week-end one week later,
> > >>> that is on November 20th and 21st? We would need enough core
> > >>> developers to be available on #python-dev.
> > >
> > >FWIW, I'm +1, and I'll try to get the Sao Paulo users group to
> > >participate.
> >
> > I think this is a great idea (both Antoine's initial suggestion and the
> > idea of getting users groups to participate).
> >
> > I'll be around and able to participate that weekend except for evening
> > US Eastern time.
>
> Ok, so 20th-21st of November it shall be!
>
> Regards
>
> Antoine.
Although a few time zones are still celebrating Bug Weekend, it looks
like at least 76 bugs got closed out [0]. Some of those happened thanks
to a number of first time contributors.
Thanks to everyone for their efforts!
[0] http://bugs.python.org/issue?%40columns=title&%40columns=id&activity=from+2010-11-20+to+2010-11-22&%40columns=activity&%40sort=activity&%40group=priority&status=2&%40columns=status&%40pagesize=50&%40startwith=0&%40action=search
Re: [Python-Dev] len(chr(i)) = 2?
"Martin v. Löwis" writes:
> Chapter and verse?
Unicode 5.0, Chapter 3, verse C9:
    When a process generates a code unit sequence which purports to be
    in a Unicode character encoding form, it shall not emit ill-formed
    code sequences.
I think anything called "UTF-8 something" is likely to be taken to
"purport". Furthermore, users don't necessarily see which error
handlers are being used. A user who specifies "utf8" as the output
codec is likely to be rather surprised if non-UTF-8 is emitted because
the app specified surrogateescape. Eg, consider a script which munges
file descriptions into reasonable-length file names on Unix. Yes,
technically the non-Unicode output is the app's fault, but I expect
many users will put some blame on Python.
I am in full agreement with you about the technicalities, but I am
looking for ways to clue in users that (a) the technicalities matter,
and (b) that Python does a *very* good job of making things as safe as
possible without becoming unable to handle bytes. I think "wide"
vs. "narrow" fails at both. It focuses on storage issues, which of
course are important, but at the cost of ignoring the fact that for
users of non-BMP characters 32-bit code units are much safer.
Users who need non-BMP characters are relatively few, and at least at
the present time most are painfully aware of the need to care for
technicalities. I expect them to be pleasantly surprised by how easy
it is to get reasonably safe behavior even from a 16-bit build.
> > Python's internal coding does not conform to UTF-16, and that internal
> > coding can, under certain conditions, escape to the outside world as
> > invalid "Unicode" output.
>
> I'm fairly certain there are provisions in the Unicode standard for such
> behavior (taking into account "certain conditions").
Sure. There's nothing in the Unicode standard that says you have to
conform to it unless you claim to conform to it. So it is valid to say
that Python's Unicode codecs without surrogateescape do conform.
The point is that Python does not, even if all of the input is valid
Unicode, because of the provision of surrogateescape and the lack of
Unicode conformance-checking for certain internal functionality like
chr() and slicing. You can say "we don't make any such claim", but
IMO the distinction in question is too fine a point for most users,
and requires a very large amount of Unicode knowledge (not to mention
standards geekiness) to even understand the precise statement.
"Unicode support" to users should mean that Python does the right
thing, not that if you look hard enough in the documentation you will
discover that Python doesn't claim to do the right thing even though
in practice it mostly does.
IMO, "UCS-2" is a pretty good description of what the user can leave
up to Python in perfect safety. RDM's reply worries me a little, but
I'll reply to his message separately.
> *Any* Unicode implementation will do that, since they all have to
> support legacy encodings in some form. This is certainly conforming to
> the Unicode standard, and in fact one of the primary Unicode design
> principles.
No. Support for legacy encodings takes you outside of the realm of
Unicode conformance by definition. Their names tell you that,
however. "UTF-8 with surrogate escapes" on the other hand is an
entirely different kettle of fish. It pretends to be UTF-8, but
isn't. I think that users who give Python valid input should be able
to expect valid output, but they can't.
Chapter 3, verse C7:
    When a process purports not to modify the interpretation of a
    valid coded character sequence, it shall make no change to that
    coded character sequence other than the possible replacement of
    character sequences by their canonical-equivalent sequences, or
    the deletion of *noncharacter* code points.
Sure, you can tell users the truth: "Python may modify your Unicode
characters if you slice or index Unicode strings. It may even
silently turn them into invalid codes which will eventually raise
Errors."
Then you are conformant, but why would anyone want to use such a
program?
If you tell them "UCS-2[sic] Python is safe to use with *no* extra
care if you use only UCS-2 [or BMP] characters", suddenly Python looks
very nice indeed again. "UCS-4" Python is even better; all you have to
do is to avoid surrogateescape codecs. However, you're still
vulnerable to hard-to-diagnose errors at the output stage in case of
program bugs, because not enough checking of values is done by Python
itself.
> > A Unicode-conforming Python implementation would error at the
> > chr() call, or perhaps would not provide surrogateescape error
> > handlers.
>
> Chapter and verse?
Chapter 3, verse C9 again.
> > "Although practicality beats purity."
>
> The Unicode standard itself is based on practicality. It wouldn't
> have received the success it did if it was based on purity only
> (and indeed, was often rejected in ca
[Python-Dev] is this a bug? no environment variables
In reviewing my notes from my experimentations with CGIHTTPServer
(Python2.6) and then http.server (Python 3.2a4), I note one behavior I
haven't reported as a bug, nor do I know where to start to figure it
out, other than experimentally.
The experiment: launching CGIHTTPServer without environment variables,
by the simple expedient of using a batch file to unset all the existing
environment variables, and then launching Python2.6 with CGIHTTPServer.
So it failed early: random.py fails at line 110 (Python 2.6).
I suppose it is possible that some environment variables are used by
Python directly (but I can't seem to find a documented list of them)
although I would expect that usage to be optional, with fall-back
defaults when they don't exist.
I suppose it is even possible that some Windows APIs might depend on
some environment variables, but I expected that the registry had
replaced such usage completely, by now, with the environment variables
mostly being a convenience tool for batch files, or for optional,
temporary alteration of particular settings.
If anyone knows of documentation listing what environment variables are
required by Python on Windows, I would appreciate a pointer, searches
and doc browsing having not turned it up.
I'll attempt to recreate the test situation later this week with Python
3.2a4, if no one responds, but the only debug technique I can think of
is to slowly remove environment variables until I find the minimum set
required to run http.server successfully for my tests with CGI files.
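The "slowly remove environment variables" technique could be automated roughly as below. This is only a sketch: minimal_env and python_starts are hypothetical helpers, the variable names in fake_env are placeholders, and the demonstration uses a fake predicate rather than a real server launch.

```python
import subprocess
import sys

def minimal_env(env, works):
    """Greedily drop variables from env while works(candidate) stays true."""
    needed = dict(env)
    for name in sorted(env):
        trial = {k: v for k, v in needed.items() if k != name}
        if works(trial):
            # The program still runs without this variable, so leave it out.
            needed = trial
    return needed

def python_starts(candidate_env):
    """Hypothetical check: can a child interpreter import random?"""
    result = subprocess.run(
        [sys.executable, "-c", "import random"], env=candidate_env
    )
    return result.returncode == 0

# Demonstration with a fake predicate; the names are placeholders only.
fake_env = {"NEEDED_VAR": "1", "OPTIONAL_A": "x", "OPTIONAL_B": "y"}
print(minimal_env(fake_env, lambda e: "NEEDED_VAR" in e))
```

For a real run, something like minimal_env(dict(os.environ), python_starts) would be the starting point, after importing os. The greedy pass finds a set from which no single variable can be removed, which is not necessarily the smallest possible set.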
Re: [Python-Dev] len(chr(i)) = 2?
R. David Murray writes:
> I'm sorry, but I have to disagree. As a relative unicode ignoramus,
> "UCS-2" and "UCS-4" convey almost no information to me, and the bits I
> have heard about them on this list have only confused me.
OK, point taken.
> On the other hand, I understand that 'narrow' means that fewer
> bytes are used for each internal character, meaning that some
> unicode characters need to be represented by more than one string
> element, and thus that slicing strings containing such characters
> on a narrow build causes problems. Now, you could tell me the same
> information using the terms 'UCS-2' and 'UCS-4' instead of 'narrow'
> and 'wide', but to my ear 'narrow' and 'wide' convey a better gut
> level feeling for what is going on than 'UCS-2' and 'UCS-4' do.
I think that is probably conditioned by your long experience with
Python's Unicode features, specifically the knowledge that Python's
Unicode strings are not arrays of characters, which often is referred
to on this list. My guess is that very few newbies would know that,
and it is not implied by "narrow". For example, both Emacs (for sure)
and Perl (IIUC) index strings of variable-width character by
characters (at great expense of performance in Emacs, at least), not
as code units.
> And it avoids any question of whether or not Python's internal
> representation actually conforms to whatever standard it is that
> UCS refers to, a point on which there seems to be some dissension.
UCS-2 refers to ISO 10646, Annex 1 IIRC.[1] Anyway, it's somewhere in
ISO 10646. I don't think there's actually dissension on conformance
to UCS-2, as that's very easy to achieve. Rather, Guido explicitly
pronounced that Python processes arrays of code units, not characters.
My point is that if you pretend that Python is processing *characters*
according to UCS-2 rules for characters, you'll always come to the
same conclusion about what Python will do as if you use the
technically correct terminology of code units.
(At least for the BMP and UTF-16 private areas. There will
necessarily be some confusion about surrogates, since in UCS-2 they
are characters while in UTF-16 they're merely "code points", and the
Unicode characters they represent can't be represented at all in
UCS-2.)
> Indeed, reading that article with my limited unicode knowledge, if
> I were told Python used UCS-2, I would assume that non-BMP
> characters could not be processed by a Python narrow build.
Actually, I'm almost happy with that. That is, the precise
formulation is "could not be processed *safely without extra care* by
a Python narrow build." Specifically, AFAIK if you range check
characters that have been indexed out of a string, or are located at
slice boundaries, or produced by chr() or a surrogateescape input
codec, you're safe. But practically speaking few apps will actually
do those checks and therefore they are unsafe: processing non-BMP
characters can easily lead to show-stopping Exceptions.
It's very analogous to the kind of show-stopping "bad character in a
header" exception that plagued Mailman for so long, and had to be
fixed on a case-by-case basis. But the restriction to BMP characters
is much more reasonable (at least for now) than RFC 822's restriction
to ASCII!
But evidently you take it much more stringently. So the question is,
"what fraction of developers who think as you do would therefore be
put off from using Python to build their applications?" If most would
say "OK, we'll stick with BMP for now and use UCS-4 or some hack to
deal with extended characters later -- it can't really be true that
it's absolutely impossible to use non-BMP characters," I don't mind
that misunderstanding. OTOH, yes, it would be bad if the use of
"UCS-2" were to imply to more than a couple of developers that 16-bit
builds of Python can't handle UTF-16 *at all*.
Footnotes:
[1] It simply says "we have a subset of the Unicode character set all
of whose code points can be represented in 16 bits, excluding 0x."
It goes on to define a private area, reserved for use by applications
that will never be standardized, and it says that if you don't know
what a code point in the character area is, don't change it (you can
delete it, however). ISTR that a later Amendment added 0xFFFE to the
short-list of non-characters. The surrogate area was taken out of the
private area, so a UCS-2 application will simply consider each
surrogate to be an unknown character and pass it through unchanged --
unless it deletes it, or inserts other characters between the code
points of a surrogate pair. And that's why UCS-2 isn't UTF-16
conforming -- which is basically why Python isn't either.
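The kind of range check mentioned above (for values produced by indexing, slicing, chr() or a surrogateescape input codec) might look like the sketch below. It assumes Python 3, where the strict UTF-8 encoder rejects surrogate code points; is_well_formed is a hypothetical helper name.

```python
def is_well_formed(s):
    """Return True if s can be encoded as strict UTF-8 (no lone surrogates)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

good = "abc\U0001F600"
escaped = b"\x80".decode("utf-8", errors="surrogateescape")  # yields '\udc80'
made_by_chr = chr(0xD800)

print(is_well_formed(good), is_well_formed(escaped), is_well_formed(made_by_chr))
```

Running such a check just before output turns a silent ill-formed write into an explicit decision by the application.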
