Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Angelico
On Thu, Jan 9, 2014 at 5:50 PM, Lennart Regebro  wrote:
> To be honest, you can define text as "A stream of bytes that are split
> up in lines separated by a linefeed", and do some basic text
> processing like that. Just very *basic*, but still. Replacing
> characters. Extracting certain lines etc.

You would have to define it as "A stream of bytes encoded in
{ASCII|Latin-1|CP-1252|UTF-8} that" etc etc. Otherwise, those bytes
might be EBCDIC, UTF-16, or anything else, and your code will fail.
And once you've demanded that, well, you're right back here with
clarifying encodings, so you may as well just pass encoding="ascii"
and do it honestly.
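
As a minimal sketch of what doing it honestly looks like (the sample bytes and the helper name read_ascii are illustrative assumptions, not from the thread): declaring the encoding makes non-ASCII input fail loudly at the read, rather than passing through unnoticed.

```python
import os
import tempfile


def read_ascii(path):
    """Read a file as text, insisting that it is pure ASCII."""
    with open(path, encoding="ascii") as f:
        return f.read()


def demo():
    # One pure-ASCII file and one containing a Latin-1 byte (0xE1).
    with tempfile.TemporaryDirectory() as d:
        good = os.path.join(d, "good.txt")
        bad = os.path.join(d, "bad.txt")
        with open(good, "wb") as f:
            f.write(b"key = value\n")
        with open(bad, "wb") as f:
            f.write(b"name = Kristj\xe1n\n")

        text = read_ascii(good)
        try:
            read_ascii(bad)
            failed = False
        except UnicodeDecodeError:
            # The declared encoding was wrong for this file.
            failed = True
        return text, failed
```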

ChrisA
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Lennart Regebro
On Thu, Jan 9, 2014 at 8:16 AM, Ben Finney  wrote:
> Nick Coghlan  writes:
>> Set the mode to "rb", process it as binary. Done.
>
> Which entails abandoning the stated goal of “just want to parse text
> files” :-)

Only if your definition of "text files" means it's unicode.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Mark Shannon

On 09/01/14 00:07, Ben Finney wrote:
> Kristján Valur Jónsson writes:
>
>> Believe it or not, sometimes you really don't care about encodings.
>> Sometimes you just want to parse text files.
>
> Files don't contain text, they contain bytes. Bytes only become text
> when filtered through the correct encoding.

I'm glad someone pointed this out.



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Ben Finney
> Sent: 9. janúar 2014 00:50
> To: [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> Kristján Valur Jónsson  writes:
> 
> > I didn't used to must.  Why must I must now?  Did the universe just
> > shift when I fired up python3?
> 
> In a sense, yes. The world of software has been shifting for decades, as a
> result of broader changes in how different segments of humanity have
> changed their interactions, and thereby changed their expectations of what
> computers can do with their data.

Do I speak Chinese to my grocer because China is a growing force in the world?  
Or start every discussion with my children with a negotiation on what language 
to use?
I get all the talk about Unicode, and interoperability and foreign languages 
and the world (I'm Icelandic, after all.)
The point I'm trying to make, and which I think you are missing is this:
A tool that I have been happily using on my own system, to my own ends (I'm not 
writing international spam posts or hosting a United Nations election, but 
parsing and writing config.ini files, say)
just became harder to use for that purpose.
I think I'm not the only one to realize this, otherwise, PEP460 wouldn't be 
there.

Anyway, I'll duck out now
*ducks*

K


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Paul Moore
On 9 January 2014 09:01, Mark Shannon  wrote:
> On 09/01/14 00:07, Ben Finney wrote:
>>
>> Kristján Valur Jónsson  writes:
>>
>>> Believe it or not, sometimes you really don't care about encodings.
>>> Sometimes you just want to parse text files.
>>
>>
>> Files don't contain text, they contain bytes. Bytes only become text
>> when filtered through the correct encoding.
>>
> I'm glad someone pointed this out.

Try working on Windows with Powershell as your default shell for a
while. You learn that message *very* fast. You end up with a mix of
CP1250 and UTF-16 files, and you can no longer even assume that a file
of "simple text" is in an ASCII-compatible encoding. After tools like
grep fail to work often enough, you get a really strong sense of why
knowing the encoding matters (and you feel this urge to rewrite all
the GNU tools in Python 3 ;-)). And that's on a single PC in an
English-speaking locale :-( (You also get this fun with the £ sign
being encoded differently in the console and the GUI). So it's not
just people that "use funny foreign languages" (apologies to 99% of
the globe for that :-)) who are affected. I assume Kristján knows all
this, given the "á" in his name :-)

But certainly just using open without specifying an encoding has
always served me fine in Python 3, in the sense that it does at least
as well as Python 2. So I think that if this discussion is to be of
any real benefit, a specific example is needed. I honestly don't think
I've ever encountered a case where "Sometimes [I] just want to parse
text files" and code that uses the default encoding (i.e., looks
pretty much identical to Python 2) has *failed* to do the job for me.

PEP460 is addressing a very specific use case, and certainly isn't for
"just parsing text files" - at least as I understand it.

Paul.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Stefan Ring
> just became harder to use for that purpose.

The entire discussion reminds me very much of the situation with file
names in OS X. Whenever I want to look at an old zip file or tarball
which happens to have been lying around on my hard drive for a decade
or more, I can't because OS X insists that file names be encoded in
UTF-8 and just throws errors if that requirement is not met. And
certainly I cannot be required to re-encode all files to the
then-favored encoding continually – although favors don’t change often
and I’m willing to bet that UTF-8 is here to stay, but it has already
happened twice in my active computer life (DOS -> latin-1 -> UTF-8).

Going back to the old tarballs, OS X is completely useless for
handling them as a result of their encoding decision, and I have to
move to a Linux machine which just does not care about encodings.

PS I was very relieved to find out that os.listdir() – just to pick one
file name-related function – will still return bytes if requested, as
it is not at all uncommon (at least for me) to have conflicting file
name encodings in different parts of a filesystem.
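
A small sketch of that behaviour, assuming a scratch directory created only for illustration (the helper name list_names_as_bytes is hypothetical): passing a bytes path to os.listdir() returns the names as raw bytes, so nothing has to be decoded.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    """Return the directory entries as bytes, without decoding them."""
    # A bytes argument makes os.listdir() return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


def demo():
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "plain.txt"), "w").close()
        return list_names_as_bytes(d)
```

os.fsdecode() and os.fsencode() convert between the two forms when a str name is needed for display.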


Re: [Python-Dev] [RELEASED] Python 3.4.0b2

2014-01-09 Thread Martin v. Löwis
On 08.01.14 16:03, Nick Coghlan wrote:
> On 9 January 2014 00:43, Bob Hanson  wrote:
>> When I read this comment of yours, Guido, I immediately started
>> wondering about this. You may well be right -- indeed, I have a
>> very old install (c.2007) which has not been updated (other than
>> one or three new MS "drivers").
>>
>> Perhaps the Python 3.4.0b2 MSI installer uses a new capability,
>> which, as you say, causes the installer to at least attempt to
>> upgrade...?
> 
> I believe the pip bootstrapping involves an MSI feature we haven't
> previously used (MvL would be able to confirm). If so, then MSI may be
> looking for a new version to interpret that new setting.

That's not true. The pip bootstrapping uses a custom action, and
we already have one that is similar (compile to pyc), although that
isn't run by default.

My guess is that it might try verifying signatures, and somehow tries
to obtain the CA certificates (although it's puzzling that it would get
them from akamai - perhaps MS is hosting the CA bundle there).

Regards,
Martin



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Stephen J. Turnbull
Paul Moore writes:

 > So I think that if this discussion is to be of any real benefit, a
 > specific example is needed. I honestly don't think I've ever
 > encountered a case where "Sometimes [I] just want to parse text
 > files" and code that uses the default encoding (i.e., looks pretty
 > much identical to Python 2) has *failed* to do the job for me.

I don't understand why it fails for Kristján, but I can tell you why
it failed for me:  Mac OS X "Snow Leopard" (at least on my box, and
perhaps due to my misconfiguration) doesn't set the locale variables
and for some reason the fallback for locale.getpreferredencoding() is
not UTF-8 (== sys.getfilesystemencoding()) nor some Japanese encoding
(Japanese is my system language), but US-ASCII!

Naturally, putting LANG=ja_JP.UTF-8 in my shell startup fixed that
once and for all, so as I say I don't understand why Kristján has a
problem.
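
A quick way to see which values are in play on a given machine, as a sketch (the function name report_encodings is an assumption; the two calls are the ones named above):

```python
import locale
import sys


def report_encodings():
    """Collect the default encodings discussed above."""
    return {
        # Locale-derived default that open() has traditionally used
        # in text mode when no encoding= argument is given.
        "preferred": locale.getpreferredencoding(False),
        # Used to convert file names between str and bytes.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(name, "=", value)
```

If the two disagree, or "preferred" reports an ASCII variant, the locale variables are the first thing to check.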



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson

> -Original Message-
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Stefan Ring
> Sent: 9. janúar 2014 09:32
> To: [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> > just became harder to use for that purpose.
> 
> The entire discussion reminds me very much of the situation with file names
> in OS X. Whenever I want to look at an old zip file or tarball which happens to
> have been lying around on my hard drive for a decade or more, I can't
> because OS X insists that file names be encoded in
> UTF-8 and just throws errors if that requirement is not met. And certainly I
> cannot be required to re-encode all files to the then-favored encoding
> continually – although favors don’t change often and I’m willing to bet that
> UTF-8 is here to stay, but it has already happened twice in my active
> computer life (DOS -> latin-1 -> UTF-8).

Well, yes.
Also, the problem I'm describing has to do with real world stuff.
This is the python 2 program:
with open(fn1) as f1:
    with open(fn2, 'w') as f2:
        f2.write(process_text(f1.read()))

Moving to python 3, I found that this quickly caused problems.  So, I 
explicitly added an encoding.  Better guess an encoding, something that is 
likely, e.g. cp1252
with open(fn1, encoding='cp1252') as f1:
    with open(fn2, 'w', encoding='cp1252') as f2:
        f2.write(process_text(f1.read()))

This mostly worked.  But then, with real world data, sometimes we found that 
even files we declared to be cp1252, sometimes contained invalid code points.  
Was the file really in cp1252?  Or did someone mess up somewhere?  Or simply 
take a small poet's leave with the specification? 
This is when it started to become annoying.  I mean, clearly something was 
broken at some point, or I don't know the exactly correct encoding of the file. 
  But this is not the place to correct that mistake.  I want my program to be 
robust towards such errors.  And these errors exist.

So, the third version was:
with open(fn1, 'rb') as f1:
    with open(fn2, 'wb') as f2:
        f2.write(process_bytes(f1.read()))

This works, but now I have a bytes object which is rather limited in what it 
can do.  Also, all string constants in my process_bytes() function have to 
be b'foo', rather than 'foo'.

Only much later did I learn about 'surrogateescape'.  How is a new user to 
python to know about it?  The final version would probably be this:
with open(fn1, encoding='cp1252', errors='surrogateescape') as f1:
    with open(fn2, 'w', encoding='cp1252', errors='surrogateescape') as f2:
        f2.write(process_text(f1.read()))

Will this always work?  I don't know.  I hope so.  But it seems very verbose 
when all you want to do is munge on some bytes.  And the 'surrogateescape' 
error handler is not something that a newcomer to the language, or someone 
coming from python2, is likely to automatically know about.

Could this be made simpler?  What If we had an encoding that combines 'ascii' 
and 'surrogateescape'?  Something that allows you to read ascii text with 
unknown high order bytes without this unneeded verbosity?  Something that would 
be immediately obvious to the newcomer?
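
For reference, a minimal sketch of what that combination does with the existing 'ascii' codec and the 'surrogateescape' error handler. The process_text below is a hypothetical stand-in, and munge is an illustrative name, not an existing API.

```python
def process_text(text):
    """Hypothetical stand-in: upper-case the key of a 'key=value' line."""
    key, sep, value = text.partition("=")
    return key.upper() + sep + value


def munge(raw):
    """Decode as ASCII, carrying unknown high bytes through unchanged."""
    # Bytes >= 0x80 become lone surrogates U+DC80..U+DCFF instead of errors.
    text = raw.decode("ascii", errors="surrogateescape")
    # On the way out, the same handler restores the original bytes.
    return process_text(text).encode("ascii", errors="surrogateescape")
```

open() accepts the same pair of arguments, encoding='ascii' and errors='surrogateescape', for file objects.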

K



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 11:29, "INADA Naoki"  wrote:
>
>
>> And I think everyone was well intentioned - and python3 covers most of the
>> bases, but working with binary data is not only a "wire-protocol programmer's"
>> problem.

If you're working with binary data, use the binary API offered by bytes,
bytearray and memoryview.

>> Needing a library to wrap bytesthing.format('ascii', 'surrogateescape')
>> or some such thing makes python3 less approachable for those who haven't
>> learned that yet - which was almost all of us at some point when we started
>> programming.
>
> Totally agree with you.

If you're on a relatively modern OS, everything should be UTF-8 and you
should be fine as a beginner.

When you start encountering malformed data, Python 3 should throw an error,
and provide an opportunity to learn more (by looking up the error message),
where Python 2 would silently corrupt the data stream.

Python 2 enshrined a data model eminently suitable for boundary code that
dealt with ASCII compatible binary protocols (like web frameworks) as the
default text model. Application code then needed to take special steps to
get correct behaviour for the full Unicode range. In essence, the Python 2
text model is the POSIX text model with Unicode support bolted on to the
side to make it at least *possible* to write correct application code.

This is completely backwards. Web applications vastly outnumber web
frameworks, and the same goes for every other domain: applications are
vastly more common than the libraries and frameworks that handle data
transformations at system boundaries on their behalf, so making the latter
easier to write at the expense of the former is a deeply flawed design
choice.

So Python 3 reverses the situation: the core text model is now more
appropriate for the central application code, *after* the boundary code has
cleaned up the murky details of wire protocols and file formats.

This is pretty easy to deal with for *new* Python 3 code, since you just
write things to deal with either bytes or text as appropriate.

However, there is some code written for Python 2 that relies more heavily
on the ability to treat ascii compatible binary data as both binary data
*and* as text. This is the use case that Python 3 treats as a more
specialised use case (perhaps benefitting from a specialised third party
type), whereas Python 2 supports it by default.

This is also the use case that relied most heavily on implicit encoding and
decoding, since that's the mechanism that allows the 8-bit and Unicode
paths to share string literals.
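
To make the boundary/application split concrete, here is a rough sketch under assumed names (parse_header and format_header are hypothetical, and the header format is only an illustration): bytes are decoded once on the way in, the application logic sees only str, and text is encoded once on the way out.

```python
def parse_header(raw_line, encoding="utf-8"):
    """Boundary code: turn wire bytes into text as early as possible."""
    name, _, value = raw_line.decode(encoding).partition(":")
    return name.strip().lower(), value.strip()


def format_header(name, value, encoding="utf-8"):
    """Boundary code: turn text back into wire bytes as late as possible."""
    return "{}: {}\r\n".format(name.title(), value).encode(encoding)
```

Everything between those two calls can use ordinary string literals and str methods, with no implicit conversions.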

Cheers,
Nick.



Re: [Python-Dev] [RELEASED] Python 3.4.0b2

2014-01-09 Thread Martin v. Löwis
On 06.01.14 17:26, Michael Urman wrote:
> Here's some more guesswork. Does it seem possible that msiexec is
> trying to verify the revocation status of the certificate used to sign
> the python .msi file? Per
> http://blogs.technet.com/b/pki/archive/2006/11/30/basic-crl-checking-with-certutil.aspx
> it looks like crl.microsoft.com is the host; this is hosted on akamai:
>crl.microsoft.com is an alias for crl.www.ms.akadns.net.
>crl.www.ms.akadns.net is an alias for a1363.g.akamai.net.

I think that could be close. The MSI file has two signatures in it: the
PSF code signing signature, and a Verisign timestamping signature.

For the PSF certificate, the CRL is at csc3-2010-crl.verisign.com,
which is (here) a CNAME for crl.ws.symantec.com.edgekey.net, which
in turn is a CNAME for e6845.ce.akamaiedge.net.

The timestamping signature has its CRL at ts-crl.ws.symantec.com,
which is a CNAME for crl.ws.symantec.com.edgekey.net again.

So the most plausible reason is indeed that it tries to download
CRLs, though not Microsoft ones, but Verisign/Symantec ones.

Regards,
Martin




Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Paul Moore
On 9 January 2014 10:15, Kristján Valur Jónsson  wrote:
> Also, the problem I'm describing has to do with real world stuff.
> This is the python 2 program:
> with open(fn1) as f1:
>     with open(fn2, 'w') as f2:
>         f2.write(process_text(f1.read()))
>
> Moving to python 3, I found that this quickly caused problems.

You don't say what problems, but I assume encoding/decoding errors. So
the files apparently weren't in the system encoding. OK, at that point
I'd probably say to heck with it and use latin-1. Assuming I was sure
that (a) I'd never hit a non-ascii compatible file (e.g., UTF16) and
(b) I didn't have a decent means of knowing the encoding.

One thing that genuinely is difficult is that because disk files don't
have any out-of-band data defining their encoding, it *can* be hard to
know what encoding to use in an environment where more than one
encoding is common. But this isn't really a Python issue - as I say,
I've hit it with GNU tools, and I've had to explain the issue to
colleagues using Java on many occasions. The key difference is that
with grep, people blame the file, whereas with Python people blame the
language :-) (Of course, with Java, people expect this sort of problem
so they blame the perverseness of the universe as a whole... ;-))

Paul.


Re: [Python-Dev] Changing Clinic's output

2014-01-09 Thread Serhiy Storchaka

On 07.01.14 22:51, Ethan Furman wrote:
> On 01/07/2014 12:39 PM, Serhiy Storchaka wrote:
>> * It clutters up hg log and hg blame results. Every time you
>> change clinic.py to generate different output, it
>> touches multiple lines in all files which use Argument Clinic and
>> clutters up their history.
>
> I think this is the reason to focus on -- the others seem like editor
> issues, or easily resolved by the second or third options.

AFAIK you don't write much C code. So perhaps the maintainability of C
sources is not too valuable for you.




Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Steven D'Aprano
On Thu, Jan 09, 2014 at 05:11:06PM +1000, Nick Coghlan wrote:
> On 9 January 2014 10:07, Ben Finney  wrote:

> > So, if what you want is to parse text and not get gibberish, you need to
> > *tell* Python what the encoding is. That's a brute fact of the world of
> > text in computing.
> 
> Set the mode to "rb", process it as binary. Done.

A nice point, but really, you lose a lot by doing so. Even simple things 
like the ability to write:

if word[0] == 'X'

instead you have to write things like:

if word[0:1] == b'X'
if chr(word[0]) == 'X'
if word[0] == ord('X')
if word[0] == 0x58

(pick the one that annoys you the least). And while bytes objects 
do have a surprising (to me) number of string-ish methods, like 
upper(), there are a few missing, like format() and isnumeric(). So it's 
not quite as straightforward as "done". If it were, we wouldn't need 
text strings :-)
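
The difference is easy to check interactively. A short sketch (the sample value of word is made up for illustration):

```python
word = b"Xylophone"

# Indexing a bytes object yields an int, not a length-1 bytes object.
first = word[0]
# Slicing keeps the bytes type, so it compares equal to a bytes literal.
first_slice = word[0:1]

# The spellings listed above, plus startswith(), which bytes also provides.
checks = [
    word[0:1] == b"X",
    chr(word[0]) == "X",
    word[0] == ord("X"),
    word[0] == 0x58,
    word.startswith(b"X"),
]
```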



-- 
Steven


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Antoine Pitrou
On Thu, 9 Jan 2014 10:15:08 +
Kristján Valur Jónsson  wrote:
> 
> Moving to python 3, I found that this quickly caused problems.  So, I 
> explicitly added an encoding.  Better guess an encoding, something that is 
> likely, e.g. cp1252
> with open(fn1, encoding='cp1252') as f1:
>     with open(fn2, 'w', encoding='cp1252') as f2:
>         f2.write(process_text(f1.read()))

If you don't "care" about the encoding, why don't you use latin1?
Things will roundtrip fine and work as well as under Python 2.
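
A sketch of that round trip (munge_latin1 and the replace() step are illustrative assumptions standing in for real processing):

```python
def munge_latin1(raw):
    """Map each byte to the code point of the same value, then map back."""
    text = raw.decode("latin-1")        # all 256 byte values are mapped
    text = text.replace("foo", "bar")   # ordinary str operations work
    return text.encode("latin-1")       # exact inverse for code points < 256


every_byte = bytes(range(256))
```

The encode step only fails if the processing itself introduces characters above U+00FF.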

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Antoine Pitrou
On Thu, 09 Jan 2014 03:54:13 +
MRAB  wrote:
> I'm thinking that the "i" format could be used for signed integers and
> the "u" for unsigned integers. The width would be the number of bytes.
> You would also need to have a way of specifying the endianness.
> 
> For example:
> 
>  >>> b'{:<2i}'.format(256)
> b'\x01\x00'
>  >>> b'{:>2i}'.format(256)
> b'\x00\x01'

The goal is not to add an alternative to the struct module. If you need
binary packing/unpacking, just use struct.
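
For comparison, a sketch of the same kind of packing with struct, where the endianness is given by the "<" or ">" prefix (the value 256 is taken from the example above; the format codes are struct's own, not the proposed "i"/"u" syntax):

```python
import struct

# "<" is little-endian, ">" is big-endian; "h" is a signed 16-bit integer.
little = struct.pack("<h", 256)
big = struct.pack(">h", 256)

# unpack() always returns a tuple, even for a single value.
(value,) = struct.unpack(">h", big)
```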

Regards

Antoine.




Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-09 Thread Antoine Pitrou
On Thu, 9 Jan 2014 17:09:10 +1000
Nick Coghlan  wrote:
> 
> There's also the fact that POSIX folks are used to "r" and "rb" being
> the same thing.

Which fails immediately under Windows :-)

Regards

Antoine.




Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Paul Moore [mailto:[email protected]]
> Sent: 9. janúar 2014 10:53
> To: Kristján Valur Jónsson
> Cc: Stefan Ring; [email protected]
> > Moving to python 3, I found that this quickly caused problems.
> 
> You don't say what problems, but I assume encoding/decoding errors. So the
> files apparently weren't in the system encoding. OK, at that point I'd
> probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd
> never hit a non-ascii compatible file (e.g., UTF16) and
> (b) I didn't have a decent means of knowing the encoding.
Right.  But even latin-1, or better, cp1252 (on windows) does not solve it 
because these have undefined
code points.  So you need 'surrogateescape' error handling as well.  Something 
that I didn't know at
the time, having just come from python 2 and knowing its Unicode model well.

> 
> One thing that genuinely is difficult is that because disk files don't have any
> out-of-band data defining their encoding, it *can* be hard to know what
> encoding to use in an environment where more than one encoding is
> common. But this isn't really a Python issue - as I say, I've hit it with GNU
> tools, and I've had to explain the issue to colleagues using Java on many
> occasions. The key difference is that with grep, people blame the file,
> whereas with Python people blame the language :-) (Of course, with Java,
> people expect this sort of problem so they blame the perverseness of the
> universe as a whole... ;-))

Which reminds me, can Python3 read text files with BOM automatically yet?

K



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Martin v. Löwis

> Right.  But even latin-1, or better, cp1252 (on windows) does not solve it 
> because these have undefined
> code points. 

That's not true. latin-1 does not have undefined code points.


Regards,
Martin



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Antoine Pitrou
> Sent: 9. janúar 2014 12:42
> To: [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> On Thu, 9 Jan 2014 10:15:08 +
> Kristján Valur Jónsson  wrote:
> >
> > Moving to python 3, I found that this quickly caused problems.  So, I
> > explicitly added an encoding.  Better guess an encoding, something that is
> > likely, e.g. cp1252
> > with open(fn1, encoding='cp1252') as f1:
> >     with open(fn2, 'w', encoding='cp1252') as f2:
> >         f2.write(process_text(f1.read()))
> 
> If you don't "care" about the encoding, why don't you use latin1?
> Things will roundtrip fine and work as well as under Python 2.

Because latin1 does not define all code points, giving you errors there.  Same 
with cp1252.
Which is why you need 'surrogateescape' in addition.

K



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Antoine Pitrou
On Thu, 9 Jan 2014 12:55:35 +
Kristján Valur Jónsson  wrote:
> > If you don't "care" about the encoding, why don't you use latin1?
> > Things will roundtrip fine and work as well as under Python 2.
> 
> Because latin1 does not define all code points, giving you errors there.

>>> b = bytes(range(256))
>>> b.decode('latin1')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

Not sure which errors you were getting?

Regards

Antoine.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Paul Moore
On 9 January 2014 13:00, Kristján Valur Jónsson  wrote:
>> You don't say what problems, but I assume encoding/decoding errors. So the
>> files apparently weren't in the system encoding. OK, at that point I'd
>> probably say to heck with it and use latin-1. Assuming I was sure that (a) I'd
>> never hit a non-ascii compatible file (e.g., UTF16) and
>> (b) I didn't have a decent means of knowing the encoding.
> Right.  But even latin-1, or better, cp1252 (on windows) does not solve it 
> because these have undefined
> code points.  So you need 'surrogateescape' error handling as well.  
> Something that I didn't know at
> the time, having just come from python 2 and knowing its Unicode model well.

>>> bin = bytes(range(256))
>>> bin
b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\
x1d\x1e\x1f 
!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\
x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x
9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb
8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4
\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\
xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> bin.decode('latin-1')
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x
1d\x1e\x1f 
!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x
80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9
c\x9d\x9e\x9f\xa0¡¢£\xa4¥\xa6\xa7\xa8\xa9ª«¬\xad\xae\xaf°±²\xb3\xb4µ\xb6·\xb8\xb9º»¼½\xbe¿\xc0\xc1\xc2\xc3ÄÅÆÇ\xc
8É\xca\xcb\xcc\xcd\xce\xcf\xd0Ñ\xd2\xd3\xd4\xd5Ö\xd7\xd8\xd9\xda\xdbÜ\xdd\xdeßàáâ\xe3äåæçèéêëìíîï\xf0ñòóô\xf5ö÷\x
f8ùúûü\xfd\xfeÿ'

No undefined bytes there. If you mean that latin-1 can't encode all of
the Unicode code points, then how did those code points get in there?
Presumably you put them in, and so you're not just playing with the
ASCII text parts. And you *do* need to understand encodings.

>> One thing that genuinely is difficult is that because disk files don't have any
>> out-of-band data defining their encoding, it *can* be hard to know what
>> encoding to use in an environment where more than one encoding is
>> common. But this isn't really a Python issue - as I say, I've hit it with GNU
>> tools, and I've had to explain the issue to colleagues using Java on many
>> occasions. The key difference is that with grep, people blame the file,
>> whereas with Python people blame the language :-) (Of course, with Java,
>> people expect this sort of problem so they blame the perverseness of the
>> universe as a whole... ;-))
>
> Which reminds me, can Python3 read text files with BOM automatically yet?

If by "automatically" you mean "reads the BOM and chooses an
appropriate encoding based on it" then I don't know, but I suspect
not. But unless you're worried about 2-byte encodings (see! you need
to understand encodings again!) latin-1 will still work.

It sounds to me like what you *really* want is something that
autodetects encodings on Windows in the same sort of way as other
Windows tools like Notepad does. That's a fair thing to want, but no,
Python doesn't provide it (nor did Python 2). I suspect that it would
be possible to write a codec to do this, though. Maybe there's even
one on PyPI.
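
For the BOM part of the question, a rough sketch of the manual approach using the constants in the codecs module. The function name decode_with_bom and the latin-1 fallback are assumptions for illustration; this only sniffs a BOM and does not guess encodings the way Notepad does.

```python
import codecs

# UTF-32 marks are checked first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(raw, fallback="latin-1"):
    """Decode bytes using the BOM if there is one, else a fallback encoding."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs read the BOM themselves and strip it.
            return raw.decode(encoding)
    return raw.decode(fallback)
```

When the encoding is already known to be UTF-8, passing encoding='utf-8-sig' to open() is enough to skip a leading BOM.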

Paul


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Antoine Pitrou
> Sent: 9. janúar 2014 13:18
> To: [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> On Thu, 9 Jan 2014 12:55:35 +
> Kristján Valur Jónsson  wrote:
> > > If you don't "care" about the encoding, why don't you use latin1?
> > > Things will roundtrip fine and work as well as under Python 2.
> >
> > Because latin1 does not define all code points, giving you errors there.
> 
> >>> b = bytes(range(256))
> >>> b.decode('latin1')
> '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12
> \x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-
> ./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijkl
> mnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x
> 8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9
> c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎ
> ÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

You are right.  I'm talking about "cp1252", which is the Windows version thereof:
>>> s = ''.join(chr(i) for i in range(256))
>>> s.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>

This definition is funny, because according to Wikipedia, it is a "superset" of
8859-1 (latin1).
See http://en.wikipedia.org/wiki/Cp1252
Also, see
http://en.wikipedia.org/wiki/Latin1

There is confusion there.  ISO 8859-1 does not in fact define the control
codes in the range 128 to 159, whereas the Unicode page Latin 1 does.
Strictly speaking, then, a Latin1 (or more specifically, ISO 8859-1) decoder
should error on these characters.
The 'latin1' codec therefore is not a true 8859-1 codec.
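
The session above is Python 2.7. A rough Python 3 rendering of the same experiment, as a sketch only: it lists the byte values that the cp1252 codec rejects in strict mode and then checks that the `surrogateescape` error handler still lets every byte value survive a decode/encode round trip:

```python
data = bytes(range(256))

# Collect the byte values that cp1252 refuses to decode in strict mode.
undefined = []
for value in data:
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

# With surrogateescape the undecodable bytes become lone surrogates
# (U+DC80..U+DCFF) and are turned back into the same bytes on encode.
text = data.decode("cp1252", errors="surrogateescape")
round_tripped = text.encode("cp1252", errors="surrogateescape")

print([hex(v) for v in undefined])
print(round_tripped == data)
```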

K


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Victor Stinner
2014/1/9 Kristján Valur Jónsson :
> This definition is funny, because according to Wikipedia, it is a "superset"
> of 8859-1 (latin1)

Bytes 0x80..0x9f are unassigned in ISO/IEC 8859-1... but are assigned
in (IANA's) ISO-8859-1.

Python implements the latter, ISO-8859-1.

Wikipedia says "This encoding is a superset of ISO 8859-1, but differs
from the IANA's ISO-8859-1".

Victor


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Python-Dev [mailto:python-dev-
> [email protected]] On Behalf Of Kristján Valur
> Jónsson
> Sent: 9. janúar 2014 13:37
> To: Antoine Pitrou; [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> This definition is funny, because according to Wikipedia, it is a "superset"
> of 8859-1 (latin1). See http://en.wikipedia.org/wiki/Cp1252
> Also, see
> http://en.wikipedia.org/wiki/Latin1
> 
> There is confusion there.  The iso8859-1 does in fact not define the control
> codes in range 128 to 159, whereas the Unicode page Latin 1 does.
> Strictly speaking, then, a Latin1 (or more specifically, ISO8859-1) decoder
> should error on these characters.
> The 'latin1' codec therefore is not a true 8859-1 codec.


See also:  http://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
for the latin-1 supplement, not to be confused with 8859-1.
The header of the 8859-1 page is telling:

"""
ISO/IEC 8859-1
From Wikipedia, the free encyclopedia
  (Redirected from Latin1)
For the Unicode block also called "Latin 1", see Latin-1 Supplement (Unicode 
block). For the character encoding commonly mislabeled as "ISO-8859-1", see 
Windows-1252.
"""

K 


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-09 Thread Daniel Holth
So the customer you're looking for is the person who cares a lot about
encodings, knows how to do Unicode correctly, and has noticed that
certain valid cases not limited to imperialist simpletons (dealing
with specific common things invented before 1996, dealing with mixed
encodings, doing what Nick describes as "ASCII compatible binary
protocols") are *more complicated to do correctly* in Python 3 because
Python 3 undeniably has more complicated though probably better
*Unicode* support. N.b. WSGI, email, url parsing etc. The same person
loves Python, all the other Python 3 features, and probably you
personally, but mostly does not write programs in the domains that
Python 3 makes easier. They emphatically do not want the Python 2
model especially not implicit coercion. They only want additional
tools for text or string processing in Python 3.


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-09 Thread Antoine Pitrou
On Thu, 9 Jan 2014 09:03:40 -0500
Daniel Holth  wrote:
> They emphatically do not want the Python 2
> model especially not implicit coercion. They only want additional
> tools for text or string processing in Python 3.

That's a good point. Now it's up to people who need those additional
tools to propose them. We can't second-guess everyone's needs.

Regards

Antoine.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Steven D'Aprano
On Thu, Jan 09, 2014 at 01:00:59PM +0000, Kristján Valur Jónsson wrote:

> Which reminds me, can Python3 read text files with BOM automatically yet?

I'm not sure what you mean by that. If you mean, can Python3 distinguish 
between UTF-16BE and UTF-16LE on the basis of a BOM, then it's been able 
to do that for a long time:

steve@orac:~$ hexdump sample-utf-16.txt
0000000 feff 0048 0065 006c 006c 006f 0020 0057
0000010 006f 0072 006c 0064 0021 000a 00a2 00a3
0000020 00a7 2022 00b6 00df 03c0 2248 2206 000a
0000030
steve@orac:~$ python3.1 -c "print(open('sample-utf-16.txt', encoding='utf-16').read())"
Hello World!
¢£§•¶ßπ≈∆


If you mean, "Will Python assume that the presence of bytes FEFF or FFFE
at the start of a file means that it is encoded in UTF-16?", then as 
far as I know, the answer is "No":

[steve@ando ~]$ python3.3 -c "print(open('sample-utf-16.txt').read())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.3/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I wouldn't want it to guess the encoding by default. See the Zen about 
ambiguity.
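
The two shell sessions above can be reproduced from within Python. The following is a sketch under the assumption that a throwaway temporary file is acceptable; it writes a UTF-16 file, confirms that a BOM was written, reads it back with an explicit encoding, and shows that an explicit UTF-8 read of the same file fails:

```python
import codecs
import os
import tempfile

text = "Hello World!\n"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # The 'utf-16' codec writes a BOM in the machine's native byte order.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(text)

    with open(path, "rb") as f:
        raw = f.read()
    has_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

    # Naming the encoding lets the codec use the BOM to pick the endianness.
    with open(path, encoding="utf-16", newline="") as f:
        decoded = f.read()

    # Nothing guesses UTF-16 from the BOM when another encoding is named.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True
finally:
    os.remove(path)

print(has_bom, decoded == text, utf8_failed)
```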


-- 
Steven


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson


> -Original Message-
> From: Victor Stinner [mailto:[email protected]]
> Sent: 9. janúar 2014 13:51
> To: Kristján Valur Jónsson
> Cc: Antoine Pitrou; [email protected]
> Subject: Re: [Python-Dev] Python3 "complexity"
> 
> 2014/1/9 Kristján Valur Jónsson :
> > This definition is funny, because according to Wikipedia, it is a
> > "superset" of 8859-1 (latin1)
> 
> Bytes 0x80..0x9f are unassigned in ISO/IEC 8859-1... but are assigned in
> (IANA's) ISO-8859-1.
> 
> Python implements the latter, ISO-8859-1.
> 
> Wikipedia says "This encoding is a superset of ISO 8859-1, but differs from
> the IANA's ISO-8859-1".
> 

Thanks.  That's entirely non-confusing :)
"ISO-8859-1 is the IANA preferred name for this standard when supplemented
with the C0 and C1 control codes from ISO/IEC 6429."

So anyway, yes, Python's "latin1" encoding does cover the entire 256 range.
But on Windows we use cp1252 instead, which does not; it defines useful
and common Windows characters in many of the control character slots.
Hence the need for "surrogateescape" to be able to roundtrip characters.

Again, this is non-obvious, and knowing from my experience with cp1252, I had
no way of guessing that the "subset", i.e. latin1, would indeed cover all the
range.  Two things then I have learned since my initial foray into parsing
ascii files with python3:  Surrogateescapes and "latin1 in python == IANA's
ISO-8859-1 which does indeed define the whole 8 bit range".

K


Re: [Python-Dev] Changing Clinic's output

2014-01-09 Thread Ethan Furman

On 01/09/2014 03:39 AM, Serhiy Storchaka wrote:
> 07.01.14 22:51, Ethan Furman wrote:
>
> AFAIK you don't write much C code. So perhaps C sources maintainability is not
> too valuable for you.

I don't write much C code yet, no, but C source maintainability is even more
important to me because of it.  Having to search several files for something
makes it more difficult for me to find what I need.

I have the same issues with Python code, too.  Back in my Windows days I had
some custom functions to make py code browsing much nicer in Vim; then I
changed jobs, forgot to grab the code, and now my py files are unfolded and
large.  So far just searching for what I'm looking for has worked well enough
that I haven't reimplemented my lost functions.  But that is an editor issue,
not a file issue.


--
~Ethan~


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Stephen J. Turnbull
Steven D'Aprano writes:

 > If it were, we wouldn't need text strings :-)

Speak for yourself, Kemosabe.  Red man need Unicode, full meal not
just a few bytes.



[Python-Dev] A test case for what's missing in Python 3 (Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)

2014-01-09 Thread Barry Warsaw
(Resending with an adjusted Subject and not through Gmane.  Apologies for
duplicates.)

On Jan 08, 2014, at 01:51 PM, Stephen J. Turnbull wrote:

>Benjamin Peterson writes:
>
> > I agree. This is a very important, much-requested feature for low-level
> > networking code.
>
>I hear it's much-requested, but is there any description of typical
>use cases?

The two unported libraries that are preventing me from switching Mailman 3 to
Python 3 are restish and storm.  For storm, there's a viable alternative in
SQLAlchemy though I haven't looked at how difficult it will be to port the
model layer (even though we once did use SA).

restish is tougher.  I've investigated flask, pecan, wsme, and a few others
that already have Python 3 support and none of them provide an API that I
consider as nice a fit as restish for our standalone WSGI-based REST admin
server.  That's not to denigrate those other projects, it's just that I think
restish hit the sweet spot, and porting Mailman 3 to some other framework so
far has proven unworkable (I've tried with each of them).

restish is plumbing so I think it's a good test case for Nick's observations
of a wire-protocol layer library, and it's obvious that it Just Works in
Python 2 but does not work at all in Python 3.  There have been at least 3
attempts to port restish to Python 3 and all of them get stuck in various
places where you actually *can't* decide whether some data structure should be
a bytes or str.  Make one choice and you get stuck over here, make the other
choice and you get stuck over there.

I've got two abandoned branches on github with (rather old) porting attempts,
and I know other developers have some branches as well.  Having given up on
trying to switch to a different framework, I'm starting over again with
restish (really, it's wonderful :).

I plan on keeping more detailed notes this time specifically so that I can
help contribute to this discussion.  If anybody wants to pitch in, both for
the specific purpose of porting the library, and for the more general insights
it could provide for this thread, please get in touch.

Cheers,
-Barry




Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 22:08, "Antoine Pitrou"  wrote:
>
> On Thu, 9 Jan 2014 09:03:40 -0500
> Daniel Holth  wrote:
> > They emphatically do not want the Python 2
> > model especially not implicit coercion. They only want additional
> > tools for text or string processing in Python 3.
>
> That's a good point. Now it's up to people who need those additional
> tools to propose them. We can't second-guess everyone's needs.

Note that I've tried to find prettier ways to write the standard library's
URL parsing code. In addition to the original alternatives I explored, I'm
currently experimenting with a generic function based approach with mixed
results.  I'm reserving judgement until I see how the completed conversion
looks, but currently it doesn't seem any simpler than my current higher
order function approach.

However, the implicit conversions are *critical* to sharing constants
between the two code paths in Python 2 without coercing bytes to str or
vice-versa (disabling the implicit coercion breaks Unicode handling), so
I'm still not sure the goal is achievable without creating a new type
*specifically* for that task.

Python 3 only code is generally much simpler - you can usually pick binary
or text and just support one of them, rather than trying to support both in
the same API.

Cheers,
Nick.



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 22:25, "Kristján Valur Jónsson"  wrote:
>
>
>
> > -Original Message-
> > From: Victor Stinner [mailto:[email protected]]
> > Sent: 9. janúar 2014 13:51
> > To: Kristján Valur Jónsson
> > Cc: Antoine Pitrou; [email protected]
> > Subject: Re: [Python-Dev] Python3 "complexity"
> >
> > 2014/1/9 Kristján Valur Jónsson :
> > > This definition is funny, because according to Wikipedia, it is a
> > > "superset" of 8859-1 (latin1)
> >
> > Bytes 0x80..0x9f are unassigned in ISO/IEC 8859-1... but are assigned in
> > (IANA's) ISO-8859-1.
> >
> > Python implements the latter, ISO-8859-1.
> >
> > Wikipedia says "This encoding is a superset of ISO 8859-1, but differs
> > from the IANA's ISO-8859-1".
> >
>
> Thanks.  That's entirely non-confusing :)
> "ISO-8859-1 is the IANA preferred name for this standard when
> supplemented with the C0 and C1 control codes from ISO/IEC 6429."
>
> So anyway, yes, Python's "latin1" encoding does cover the entire 256
> range.  But on Windows we use cp1252 instead, which does not; it
> defines useful and common Windows characters in many of the control
> character slots.
> Hence the need for "surrogateescape" to be able to roundtrip characters.
>
> Again, this is non-obvious, and knowing from my experience with cp1252, I
> had no way of guessing that the "subset", i.e. latin1, would indeed cover
> all the range.  Two things then I have learned since my initial foray into
> parsing ascii files with python3:  Surrogateescapes and "latin1 in python
> == IANA's ISO-8859-1 which does indeed define the whole 8 bit range".

http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
is currently linked from the Unicode HOWTO. However, I'd be happy to offer
it for direct inclusion to help make it more discoverable.

Cheers,
Nick.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Nick Coghlan
On 9 Jan 2014 06:43, "Antoine Pitrou"  wrote:
>
>
> Hi,
>
> With Victor's consent, I overhauled PEP 460 and made the feature set
> more restricted and consistent with the bytes/str separation.

+1

I was initially dubious about the idea, but the proposed semantics look
good to me.

We should probably include format_map for consistency with the str API.

>However, I
> also added bytearray into the mix, as bytearray objects should
> generally support the same operations as bytes (and they can be useful
> *especially* for network programming).

So we'd define the *format* string as mutable to get a mutable result out
of the formatting operations? This seems a little weird to me.

It also seems weird for a format method on a mutable type to *not* perform
in-place mutation.

On the other hand, I don't see another obvious way to control the output
type.
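
For reference, and assuming Python 3.5 or later: what eventually shipped (through PEP 461) was %-interpolation for bytes and bytearray, not the `.format()` method under discussion here. In that form the output type follows the type of the format object and the template itself is left untouched, as this sketch shows:

```python
# bytes template: the result is a new bytes object.
status = b"HTTP/1.1 %d %s\r\n" % (200, b"OK")

# bytearray template: the result is a new bytearray, and the template
# is not modified in place.
template = bytearray(b"Content-Length: %d\r\n")
header = template % (42,)

print(status)
print(type(header).__name__, bytes(header))
print(bytes(template))
```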

Cheers,
Nick.

>
> Regards
>
> Antoine.
>
>
>
> On Mon, 6 Jan 2014 14:24:50 +0100
> Victor Stinner  wrote:
> > Hi,
> >
> > bytes % args and bytes.format(args) are requested by Mercurial and
> > Twisted projects. The issue #3982 was stuck because nobody proposed a
> > complete definition of the "new" features. Here is a try as a PEP.
> >
> > The PEP is a draft with open questions. First, I'm not sure that both
> > bytes%args and bytes.format(args) are needed. The implementation of
> > .format() is more complex, so why not only adding bytes%args? Then,
> > the following points must be decided to define the complete list of
> > supported features (formatters):
>
>


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Antoine Pitrou
On Fri, 10 Jan 2014 05:26:04 +1000
Nick Coghlan  wrote:
> 
> We should probably include format_map for consistency with the str API.

Yes, you're right.

> >However, I
> > also added bytearray into the mix, as bytearray objects should
> > generally support the same operations as bytes (and they can be useful
> > *especially* for network programming).
> 
> So we'd define the *format* string as mutable to get a mutable result out
> of the formatting operations? This seems a little weird to me.
> 
> It also seems weird for a format method on a mutable type to *not* perform
> in-place mutation.

It's consistent with bytearray.join's behaviour:

>>> x = bytearray()
>>> x.join([b"abc"])
bytearray(b'abc')
>>> x
bytearray(b'')


Regards

Antoine.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Kristján Valur Jónsson
Thanks Nick.  This does seem to cover it all.  Perhaps it is worth mentioning
cp1252 as the Windows version of latin1, which _does_not_ cover all code points
and hence requires surrogateescape for a best-effort solution.

K




From: Nick Coghlan [[email protected]]
Sent: Thursday, January 09, 2014 18:08
To: Kristján Valur Jónsson
Cc: Victor Stinner; Antoine Pitrou; [email protected]
Subject: Re: [Python-Dev] Python3 "complexity"




http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
 is currently linked from the Unicode HOWTO. However, I'd be happy to offer it 
for direct inclusion to help make it more discoverable.




Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Barker
This has all gotten a bit complicated because everyone has been thinking in
terms of actual encodings and actual text files. But I think the use-case
here is something different:

A file with a bunch of bytes in it, _some_ of which are ascii, and the rest
are other bytes (maybe binary data, maybe non-ascii-encoded text).

I think this is the use-case that "just worked" in py2, but doesn't in py3
-- i.e. in py3 you have to choose either the binary interpretation or the
ascii one, but you can't have both. If you choose ascii, it will barf when
you try to decode it, if you choose binary, you lose the ability to do
simple stuff with the ascii subset -- parsing, substitution, etc.

Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
that guaranteed to work with any binary data, and round-trip accurately?

and will surrogateescape work for arbitrary binary data?

If this is a common need, then it would be nice for py3 to address it. I know
that I work with a couple of file formats that have text headers followed by
binary data (not as hard to deal with, but still harder in py3). And from
this discussion, it seems that "wire protocols" commonly mix ascii and
binary.

So the decisions to be made:

Is this a use-case worth supporting in the standard library?

If so, how?
  1) add some of the basic stuff to the bytes object - i.e. string
formatting, what this all started with.
  2) create a custom encoding that could losslessly convert this
mixture to/from a unicode object. I'm not sure if that is even possible,
but it would be kind of cool.
  3) create a new object, neither a string nor a bytes object that did what
we want (it would look a lot like the py2 string...)
  4) create a module for doing the stuff wanted with a bytes object (not
very OO)

Does that clarify the discussion at all?

On Thu, Jan 9, 2014 at 2:15 AM, Kristján Valur Jónsson <
[email protected]> wrote:

> This is the python 2 program:
> with open(fn1) as f1:
>     with open(fn2, 'w') as f2:
>         f2.write(process_text(f1.read()))
>

I think the key point here is that this worked because a common case was
ascii text and arbitrary binary mixed. As long as all the process_text()
stuff is ascii only, that would work, either with arbitrary binary data or
ascii-compatible encoding. The fact that it would NOT work with arbitrarily
encoded data doesn't mean it's not useful for this special, but perhaps
common, case.
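
A possible Python 3 spelling of the quoted program, following the latin-1 suggestion made elsewhere in this thread. This is a sketch only: `process_text` here is a hypothetical stand-in that touches ASCII text and nothing else, and `newline=""` is an added assumption so that no newline translation alters the binary parts:

```python
import os
import tempfile


def process_text(text):
    # Hypothetical processing step: an ASCII-only substitution.
    return text.replace("foo", "bar")


def convert(fn1, fn2):
    # latin-1 maps every byte to one character and back again, so the
    # non-ASCII bytes pass through process_text() unchanged.
    with open(fn1, encoding="latin-1", newline="") as f1:
        with open(fn2, "w", encoding="latin-1", newline="") as f2:
            f2.write(process_text(f1.read()))


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.dat")
    dst = os.path.join(tmp, "out.dat")
    payload = bytes(range(256)) + b"\r\nfoo\r\n"
    with open(src, "wb") as f:
        f.write(payload)
    convert(src, dst)
    with open(dst, "rb") as f:
        result = f.read()

print(result == bytes(range(256)) + b"\r\nbar\r\n")
```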

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[email protected]


Re: [Python-Dev] [Python-checkins] peps: PEP 460: add .format_map()

2014-01-09 Thread Eric V. Smith
I'm not sure how format_map helps in porting from 2 to 3, since it
doesn't exist in any version of 2.

Although that said, it's no doubt a useful feature, just not useful in
code that supports both 2 and 3 with a single code base or when porting
to 3.

Eric.

On 1/9/2014 4:02 PM, antoine.pitrou wrote:
> http://hg.python.org/peps/rev/8947cdc6b22e
> changeset:   5341:8947cdc6b22e
> user:Antoine Pitrou 
> date:Thu Jan 09 22:02:01 2014 +0100
> summary:
>   PEP 460: add .format_map()
> 
> files:
>   pep-0460.txt |  6 +-
>   1 files changed, 5 insertions(+), 1 deletions(-)
> 
> 
> diff --git a/pep-0460.txt b/pep-0460.txt
> --- a/pep-0460.txt
> +++ b/pep-0460.txt
> @@ -24,12 +24,16 @@
>similar in syntax to ``str.format()`` (accepting positional as well as
>keyword arguments).
>  
> +* ``bytes.format_map(...)`` and ``bytearray.format_map(...)`` for an
> +  API similar to ``str.format_map(...)``, with the same formatting
> +  syntax and semantics as ``bytes.format()`` and ``bytearray.format()``.
> +
>  
>  Rationale
>  =========
>  
>  In Python 2, ``str % args`` and ``str.format(args)`` allow the formatting
> -and interpolation of bytes strings.  This feature has commonly been used
> +and interpolation of bytestrings.  This feature has commonly been used
>  for the assembling of protocol messages when protocols are known to use
>  a fixed encoding.
>  
> 
> 
> 


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Antoine Pitrou
On Thu, 9 Jan 2014 13:36:05 -0800
Chris Barker  wrote:
> 
> Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
> that guaranteed to work with any binary data, and round-trip accurately?

Yes, it is.

> and will surrogateescape work for arbitrary binary data?

Yes, it will.
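
Both answers can be checked in a few lines. A sketch, using ASCII as the base codec for the `surrogateescape` case (an assumption; any ASCII-compatible codec could be substituted):

```python
data = bytes(range(256))

# latin-1: every byte maps to the code point with the same value.
via_latin1 = data.decode("latin-1")
latin1_ok = via_latin1.encode("latin-1") == data

# ascii + surrogateescape: bytes >= 0x80 become lone surrogates
# U+DC80..U+DCFF and are restored by the same handler on encode.
via_escape = data.decode("ascii", errors="surrogateescape")
escape_ok = via_escape.encode("ascii", errors="surrogateescape") == data

print(latin1_ok, escape_ok)
print(repr(via_latin1[0x80]), repr(via_escape[0x80]))
```

The difference is in what the non-ASCII bytes look like while they are text: ordinary characters with latin-1, unpaired surrogates with surrogateescape.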

Regards

Antoine.




Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Barker
On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou  wrote:

> > latin-1 guaranteed to work with any binary data, and round-trip
> accurately?
>
> Yes, it is.
>
> > and will surrogateescape work for arbitrary binary data?
>
> Yes, it will.
>

Then maybe this is really a documentation issue, after all.

I know I learned something.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[email protected]


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Brett Cannon
On Thu, Jan 9, 2014 at 5:00 PM, Chris Barker  wrote:

> On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou wrote:
>
>> > latin-1 guaranteed to work with any binary data, and round-trip
>> accurately?
>>
>> Yes, it is.
>>
>> > and will surrogateescape work for arbitrary binary data?
>>
>> Yes, it will.
>>
>
> Then maybe this is really a documentation issue, after all.
>
> I know I learned something.
>

I think the other issue is everyone is talking about keeping the data from
the file in a single object. If you slice it up into pieces and decode the
parts as necessary this also solves the issue. So if you had an HTTP header
you could do::

  raw_header, body = data.split(b'\r\n\r\n')
  header = raw_header.decode('ascii')  # Or whatever HTTP headers are encoded in.

Now that might not easily solve the issue of the ASCII text interspersed
(such as Kristján's "phone number in the middle of stuff" example), but it
will deal with the problem. And if the numbers were separated with clean
markers then this would probably still work.
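
A slightly fuller sketch of the slice-then-decode idea. The function name `split_message` and the sample message are hypothetical; `partition` is used instead of `split` so that a blank-line sequence inside the binary body does not cause a second split, and ASCII is assumed for the header as in the snippet above:

```python
def split_message(data):
    """Split an HTTP-style message into start line, header fields and raw body."""
    raw_header, sep, body = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating header and body")
    lines = raw_header.decode("ascii").split("\r\n")
    start_line = lines[0]
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    # Only the header was decoded; the body stays as bytes.
    return start_line, fields, body


message = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"\r\n"
    b"\x00\xff\r\n\r\n\x80"
)
start_line, fields, body = split_message(message)
print(start_line, fields, body)
```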


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Paul Moore
On 9 January 2014 22:00, Chris Barker  wrote:
> On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou  wrote:
>>
>> > latin-1 guaranteed to work with any binary data, and round-trip
>> > accurately?
>>
>> Yes, it is.
>>
>> > and will surrogateescape work for arbitrary binary data?
>>
>> Yes, it will.
>
>
> Then maybe this is really a documentation issue, after all.

Certainly, the idea that you can use the latin1 codec and you'll get
the same sort of "ascii works and you can safely ignore the rest"[1]
behaviour that you get in Python 2 is not well promoted, and is
non-obvious.

Paul

[1] Where "safely" means "probably not as safely as you think, but
I'll try not to nag you" :-) And of course you have to make sure you
don't *add* any content that uses unicode characters beyond 255, or
you get encoding errors. But you weren't going to do that, were you?
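
The caveat in the footnote, as a small sketch (the sample bytes are made up): ASCII-level edits round-trip through latin-1, but adding a character above U+00FF makes the final encode fail:

```python
# Decode "mostly ASCII" bytes, edit the ASCII part, encode again.
text = b"name=caf\xe9\x80".decode("latin-1")
edited = text.replace("name", "label")
encoded = edited.encode("latin-1")

# Adding a character beyond U+00FF (here the euro sign) cannot be encoded.
try:
    (edited + "\u20ac").encode("latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True

print(encoded, failed)
```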


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Ethan Furman

On 01/09/2014 02:00 PM, Chris Barker wrote:
> On Thu, Jan 9, 2014 at 1:45 PM, Antoine Pitrou wrote:
>> Chris Barker wrote:
>>> latin-1 guaranteed to work with any binary data, and round-trip accurately?
>>
>> Yes, it is.
>>
>>> and will surrogateescape work for arbitrary binary data?
>>
>> Yes, it will.
>
> Then maybe this is really a documentation issue, after all.
>
> I know I learned something.


If latin1 is used to convert binary to text, how convoluted is it to then take chunks of that text and convert to int, 
or some other variety of unicode?


For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

If that were decoded using latin1 how would I then get the first two bytes to the integer 256 and the last six bytes to 
their Cyrillic meaning?  (Apologies for not testing myself, short on time.)


--
~Ethan~


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Paul Moore
On 9 January 2014 22:08, Ethan Furman  wrote:
> For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
>
> If that were decoded using latin1 how would I then get the first two bytes
> to the integer 256 and the last six bytes to their Cyrillic meaning?
> (Apologies for not testing myself, short on time.)

I cannot conceive why you would. Slice the bytes then use
struct.unpack on the first 2 bytes and decode on the last 6. We're
talking about using latin1 for cases where you want to treat the text
as essentially ascii (with a few bits of binary junk you want to
ignore). Please don't take away the message that latin1 makes things
"just like Python 2.X" - that's completely the wrong idea.

Paul
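
A minimal sketch of the "slice the bytes" approach, taking the eight-byte value `b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'`. Decoding the tail as UTF-8 is an assumption made for illustration; the bytes themselves do not say which encoding they use.

```python
import struct

data = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'

# First two bytes: a big-endian signed 16-bit integer.
(number,) = struct.unpack('>h', data[:2])

# Remaining six bytes: decoded as UTF-8 here, which is an assumption
# about the data, not something the bytes themselves can tell you.
label = data[2:].decode('utf-8')

print(number)        # 256
print(ascii(label))  # '\u0440\u0443\u0400'
```

No latin-1 step is involved: each field goes straight from bytes to its final type.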


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Ethan Furman

On 01/09/2014 02:54 PM, Paul Moore wrote:
> On 9 January 2014 22:08, Ethan Furman wrote:
>> For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
>>
>> If that were decoded using latin1 how would I then get the first two bytes
>> to the integer 256 and the last six bytes to their Cyrillic meaning?
>> (Apologies for not testing myself, short on time.)
>
> I cannot conceive why you would.


Sorry, I was too short with my example.  My use case is binary files, with ASCII metadata and binary metadata, as well 
as ASCII-encoded numeric values, binary-coded numeric values, ASCII-encoded boolean values, and who-knows-what-(before 
checking the in-band metadata)-encoded text.  I have to process all of it, and before we say "It's just a documentation 
issue" I want to make sure it /is/ just a documentation issue.


--
~Ethan~


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Barker
On Thu, Jan 9, 2014 at 2:54 PM, Paul Moore wrote:
> > For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
> >
> > If that were decoded using latin1 how would I then get the first two bytes
> > to the integer 256 and the last six bytes to their Cyrillic meaning?
> > (Apologies for not testing myself, short on time.)
>
> I cannot conceive why you would. Slice the bytes then use
> struct.unpack on the first 2 bytes and decode on the last 6.

exactly.


> We're
> talking about using latin1 for cases where you want to treat the text
> as essentially ascii (with a few bits of binary junk you want to ignore).


as so --  I want to replace a bit of ascii text surrounded by arbitrary
binary:
(apologies for the py2...)

In [24]: b
Out[24]: '\x01\x00\xd1\x80\xd1a name\xd0\x80'

In [25]: u = b.decode('latin-1')

In [26]: u2 = u.replace('a name', 'a different name')

In [28]: b2 = u2.encode('latin-1')

In [29]: b2
Out[29]: '\x01\x00\xd1\x80\xd1a different name\xd0\x80'

-Chris

> Please don't take away the message that latin1 makes things
> "just like Python 2.X" - that's completely the wrong idea.
>
> Paul



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[email protected]


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Ethan Furman

On 01/09/2014 02:54 PM, Paul Moore wrote:
> On 9 January 2014 22:08, Ethan Furman wrote:
>> For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
>>
>> If that were decoded using latin1 how would I then get the first two bytes
>> to the integer 256 and the last six bytes to their Cyrillic meaning?
>> (Apologies for not testing myself, short on time.)
>
> Please don't take away the message that latin1 makes things
> "just like Python 2.X" - that's completely the wrong idea.


Sure is!

--> struct.unpack('>h', '\x01\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
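
A short sketch of why that call fails on Python 3 and one way around it, assuming the str came from latin-1-decoded data. The exact wording of the TypeError differs between Python versions.

```python
import struct

text = '\x01\x00'   # a str, e.g. a slice of latin-1-decoded data

failed = False
try:
    struct.unpack('>h', text)        # str is rejected in Python 3
except TypeError as exc:
    failed = True
    print('TypeError:', exc)

# Encoding with the same codec recovers the original two bytes.
(value,) = struct.unpack('>h', text.encode('latin-1'))
print(value)  # 256
```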

--
~Ethan~


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Barker
On Thu, Jan 9, 2014 at 3:14 PM, Ethan Furman  wrote:

> Sorry, I was too short with my example.  My use case is binary files, with
> ASCII metadata and binary metadata, as well as ASCII-encoded numeric
> values, binary-coded numeric values, ASCII-encoded boolean values, and
> who-knows-what-(before checking the in-band metadata)-encoded text.  I have
> to process all of it, and before we say "It's just a documentation issue" I
> want to make sure it /is/ just a documentation issue.
>

As I am coming to understand it -- yes, using latin-1 would let you work
with all that. You could decode the binary data using latin-1, which would
give you a unicode object, which would:

1) act like ascii for ascii values, for the normal string operations,
search, replace, etc, etc...

2) have a 1:1 mapping of indexes to bytes in the original.

3) be not-too-bad for memory and other performance (as I understand it py3
now has a cool unicode implementation that does not waste a  lot of bytes
for low codepoints)

4) would preserve the binary data that was not directly touched.

Though you'd still have to encode() to bytes to get chunks that could be
used as binary -- i.e. passed to the struct module, or to a frombytes() or
frombuffer() method of say numpy, or PIL or something...
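
The numbered points above can be checked with a short sketch, reusing the sample bytes from earlier in the thread. Point 3 (memory use) is not covered, and the variable names are illustrative.

```python
raw = b'\x01\x00\xd1\x80\xd1a name\xd0\x80'
text = raw.decode('latin-1')

# (1) ASCII content behaves like ordinary text.
start = text.find('a name')

# (2) Indexes line up one-to-one with the original bytes.
assert len(text) == len(raw)
assert all(ord(ch) == b for ch, b in zip(text, raw))
assert raw[start:start + 6] == b'a name'

# (4) Untouched binary data survives an edit and a re-encode.
edited = text.replace('a name', 'a different name').encode('latin-1')
assert edited == b'\x01\x00\xd1\x80\xd1a different name\xd0\x80'

# Binary chunks still need encoding back to bytes before use.
header = text[:2].encode('latin-1')
print(start, int.from_bytes(header, 'big'))  # 5 256
```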

But I'm no expert

-Chris





Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread INADA Naoki
latin1 is OK but is it Pythonic?

I've posted a suggestion to add 'bytes' as an alias for 'latin1'.
http://comments.gmane.org/gmane.comp.python.ideas/10315

I want one Pythonic way to handle "binary containing ascii (or latin1 or
utf-8 or other ascii compatible)".





-- 
INADA Naoki  


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread anatoly techtonik
On Thu, Jan 9, 2014 at 10:00 AM, Mark Lawrence  wrote:
> On 09/01/2014 06:50, Lennart Regebro wrote:
>>
>> On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney 
>> wrote:
>>>
>>> Kristján Valur Jónsson  writes:
>>>
>>>> Believe it or not, sometimes you really don't care about encodings.
>>>> Sometimes you just want to parse text files.
>>>
>>>
>>> Files don't contain text, they contain bytes. Bytes only become text
>>> when filtered through the correct encoding.
>>
>>
>> To be honest, you can define text as "A stream of bytes that are split
>> up in lines separated by a linefeed", and do some basic text
>> processing like that. Just very *basic*, but still. Replacing
>> characters. Extracting certain lines etc.
>>
>> This is harder in Python 3, as bytes does not have all the
>> functionality strings has, like formatting. This can probably be fixed
>> in Python 3.5, if the relevant PEP gets finished.
>>
>> For the battery analogy, that's like saying:
>>
>> "I want a battery."
>>
>> "What kind?"
>>
>> "It doesn't matter, as long as it's over 5V."
>>
>> //Lennart
>>
>
> "That Python 3 battery you sold me blew up when I tried using it".
>
> "We've been telling you for years that could happen".
>
> "I didn't think you actually meant it".

  "These new nuclear cells are awesome! But can you stop them from
leaking on their users?"

A1: "The nuclear power is radioactive. Accept it."

A2: "This is the basic stdlib container. You're supposed to protect yourself."

A3: "The world is changing. Everybody should learn nuclear fission to
use things properly."

  "..."

And while we are at it: even if the battery has become more advanced, there is
no reason to strip off the simple default interface. The interface is not an
abstract discussion here, but a real user experience study (I am going to
spread the UX virus), which starts with:

  1. expectations
  2. experience
  3. outcomes

and progressively iterates over 2 to get 3 matching 1 as closely as possible,
without trying to change 1. Changing 1 amounts to changing people. It is a
simple and natural solution that people practice every day on children and
subordinates. The only problem is that it is an ineffective, hard and useless
activity in an open source environment, because most people, by the nature of
their neural network processes, become conservative with age. That's why
people invented forks. However, for the encoding problem, there are some good
default solutions. You'll have to choose between different interests anyway,
but here it is:

  1. always open() text files in UTF-8 by default
  2. introduce autodetect mode to open functions
     1. read and transform on the fly, maintaining a buffer that stores
        original bytes and their mapping to letters. The mapping is updated
        as bytes frequency changes. When the buffer is full, you have the
        best candidate.
  3. provide sane error messages
     1. messages that users do actually understand
     2. messages that tell how to fix the problem

If an interface becomes more complicated, the last thing you should do is leave
the user alone, one-on-one with the interface's problems.
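
As a rough sketch of points 1 and 3 only (explicit UTF-8 plus an error message that says how to fix the problem), assuming a hypothetical helper named `read_text`. Nothing like it exists in the standard library under that name, and the message wording is made up.

```python
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Hypothetical helper: decode a whole file, explaining any failure."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        bad = data[exc.start]
        raise ValueError(
            f"{path}: byte 0x{bad:02x} at offset {exc.start} is not valid "
            f"{encoding}. If the file uses another encoding, pass it "
            f"explicitly, e.g. read_text(path, encoding='latin-1')."
        ) from exc


# Demo with a throwaway file holding Latin-1 encoded text.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'caf\xe9\n')

message = None
try:
    read_text(tmp.name)
except ValueError as exc:
    message = str(exc)
    print(message)

recovered = read_text(tmp.name, encoding='latin-1')
os.remove(tmp.name)
```

The sketch decodes the raw bytes itself, so it does not perform the newline translation that text-mode open() does.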

And to conclude, I am not saying that people should not learn about unicode,
but the learning curve should not be as steep as Python 3 demands it.


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Joao S. O. Bueno
On 9 January 2014 04:50, Lennart Regebro  wrote:
> To be honest, you can define text as "A stream of bytes that are split
> up in lines separated by a linefeed", and do some basic text
> processing like that. Just very *basic*, but still. Replacing
> characters. Extracting certain lines etc.

That is, until you hit a character which has a byte with the same
value of ASCII newline in the middle of a multi-byte character.

So, this approach is broken to start with.
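
A concrete instance of that failure, as a sketch. UTF-16-LE and the character U+010A are chosen purely because its encoded form happens to contain the byte 0x0A.

```python
text = 'A\u010aB\nC'   # U+010A is LATIN CAPITAL LETTER C WITH DOT ABOVE
encoded = text.encode('utf-16-le')

# U+010A is stored as the two bytes 0x0A 0x01, so a byte-level split
# on b'\n' cuts that character in half.
naive = encoded.split(b'\n')
print(len(naive))      # 3 pieces, although the text has only 2 lines

# Decoding first and splitting afterwards gives the expected lines.
lines = encoded.decode('utf-16-le').split('\n')
print(lines == text.split('\n'))  # True
```

UTF-8 does not have this particular problem, because bytes below 0x80 never occur inside a multi-byte sequence.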


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Angelico
On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik  wrote:
>   2. introduce autodetect mode to open functions
>      1. read and transform on the fly, maintaining a buffer that stores
>         original bytes and their mapping to letters. The mapping is updated
>         as bytes frequency changes. When the buffer is full, you have the
>         best candidate.
>

Bad idea. Bad, bad idea! No biscuit. Sit!

This sort of magic is what brings the "bush hid the facts" bug in
Windows Notepad. If byte value distribution is used to guess encoding,
there's no end to the craziness that can result. How do you know that
the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case
ASCII letters and not a 32-bit integer or floating-point value, or
some accented lower-case letter A's in EBCDIC, or anything else? Maybe
if you have a whole document, AND you know for sure that it's
linguistic text, then maybe - MAYBE - you could guess with reasonable
reliability. But even then, how can you be sure? Remember, too, you
might have to deal with something that's actually mis-encoded. If
you're told this is UTF-8 and you find the byte sequence ED B3 BF, do
you decide that it can't possibly be UTF-8 and pick a different
encoding to decode with? That would produce no end of trouble, where
the actual result you want is (most likely) to throw an error.
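
Both points can be reproduced with the built-in codecs. This is only a sketch: it shows that the same bytes decode cleanly under two encodings and that Python 3's strict UTF-8 decoder raises on ED B3 BF. It does not reproduce Notepad's actual heuristic.

```python
# The same 18 ASCII bytes are also perfectly valid UTF-16-LE.
raw = b'Bush hid the facts'
as_ascii = raw.decode('ascii')
as_utf16 = raw.decode('utf-16-le')   # nine unrelated CJK characters
print(len(as_ascii), len(as_utf16))  # 18 9

# ED B3 BF is the UTF-8-style encoding of the lone surrogate U+DCFF.
# The strict UTF-8 decoder rejects it rather than guessing.
rejected = False
try:
    b'\xed\xb3\xbf'.decode('utf-8')
except UnicodeDecodeError as exc:
    rejected = True
    print('rejected:', exc.reason)
```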

ChrisA


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-09 Thread Nick Coghlan
On 10 Jan 2014 03:32, "Antoine Pitrou"  wrote:
>
> On Fri, 10 Jan 2014 05:26:04 +1000
> Nick Coghlan  wrote:
> >
> > We should probably include format_map for consistency with the str API.
>
> Yes, you're right.
>
> > >However, I
> > > also added bytearray into the mix, as bytearray objects should
> > > generally support the same operations as bytes (and they can be useful
> > > *especially* for network programming).
> >
> > So we'd define the *format* string as mutable to get a mutable result out
> > of the formatting operations? This seems a little weird to me.
> >
> > It also seems weird for a format method on a mutable type to *not* perform
> > in-place mutation.
>
> It's consistent with bytearray.join's behaviour:
>
> >>> x = bytearray()
> >>> x.join([b"abc"])
> bytearray(b'abc')
> >>> x
> bytearray(b'')

Yeah, I guess I'm OK with us being consistent on that one. It's still
weird, but also clearly useful :)

Will the new binary format ever call __format__? I assume not, but it's
probably best to make that absolutely explicit in the PEP.

Cheers,
Nick.



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Steven D'Aprano
On Thu, Jan 09, 2014 at 02:08:57PM -0800, Ethan Furman wrote:

> If latin1 is used to convert binary to text, how convoluted is it to then 
> take chunks of that text and convert to int, or some other variety of 
> unicode?
> 
> For example:  b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
> 
> If that were decoded using latin1 how would I then get the first two bytes 
> to the integer 256 and the last six bytes to their Cyrillic meaning?  
> (Apologies for not testing myself, short on time.)

Not terribly convoluted, but there is some double-processing. When you 
know up-front that some data is non-text, you shouldn't convert it to 
text, otherwise you're just double-processing:

py> b = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'
py> s = b.decode('latin1')
py> num, = struct.unpack('>h', s[:2].encode('latin1'))
py> assert num == 0x100

Better to just go straight from bytes to the struct, if you can:

py> struct.unpack('>h', b[:2])
(256,)


As for the last six bytes and "their Cyrillic meaning", which Cyrillic 
meaning did you have in mind?

py> s = b'\x01\x00\xd1\x80\xd1\x83\xd0\x80'.decode('latin1')
py> for encoding in "cp1251 ibm866 iso-8859-5 koi8-r koi8-u mac_cyrillic".split():
...     print(s[-6:].encode('latin1').decode(encoding))
...
СЂСѓРЂ
╤А╤Г╨А
бба
я─я┐п─
я─я┐п─
—А—Г–А


I understand that Cyrillic is an especially poor choice, since there 
are many incompatible Cyrillic code-pages. On the other hand, it's also 
an especially good example of how you need to know the encoding before 
you can make sense of the data.

Again, note that if you know the encoding you are intending to use is 
not Latin-1, decoding to Latin-1 first just ends up double-handling. If 
you can, it is best to split your data into fields up front, and then 
decode each piece once only.


-- 
Steven


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Terry Reedy

On 1/9/2014 6:25 PM, Chris Barker wrote:
> as so --  I want to replace a bit of ascii text surrounded by arbitrary
> binary:
> (apologies for the py2...)
>
> In [24]: b
> Out[24]: '\x01\x00\xd1\x80\xd1a name\xd0\x80'
> In [25]: u = b.decode('latin-1')
> In [26]: u2 = u.replace('a name', 'a different name')
> In [28]: b2 = u2.encode('latin-1')
> In [29]: b2
> Out[29]: '\x01\x00\xd1\x80\xd1a different name\xd0\x80'


Just to check, with 3.4:

>>> print(b'\x01\x00\xd1\x80\xd1a name\xd0\x80'
...       .decode('latin-1')
...       .replace('a name', 'a different name')
...       .encode('latin-1')
...       == b'\x01\x00\xd1\x80\xd1a different name\xd0\x80')
True

The b prefix works in 2.6/7, so this code does the same thing in 2.6+ 
and 3.x.


--
Terry Jan Reedy



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Steven D'Aprano
On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
> On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik  
> wrote:
> >   2. introduce autodetect mode to open functions
> >      1. read and transform on the fly, maintaining a buffer that stores
> >         original bytes and their mapping to letters. The mapping is updated
> >         as bytes frequency changes. When the buffer is full, you have the
> >         best candidate.
> >
> 
> Bad idea. Bad, bad idea! No biscuit. Sit!
> 
> This sort of magic is what brings the "bush hid the facts" bug in
> Windows Notepad. If byte value distribution is used to guess encoding,
> there's no end to the craziness that can result.

I think that heuristics to guess the encoding have their role to play, 
if the caller understands the risks. For example, an application might 
give the user the choice of specifying the codec, or having the app 
guess it. (I dislike the term "Auto detect", since that implies a level 
of certainty which often doesn't apply to real files.)

There is already a third-party library, chardet, which does this. 
Perhaps the std lib should include this? Perhaps chardet should be 
considered best-of-breed "atomic reactor", but the std lib could include 
a "battery" to do something similar. I don't think we ought to dismiss 
this idea out of hand.


> How do you know that
> the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case
> ASCII letters and not a 32-bit integer or floating-point value, 

Presumably if you're reading a file intended to be text, they'll be 
meant to be text and not arbitrary binary blobs. Given that it is 2014 
and not 1974, chances are reasonably good that bytes 0x41 0x42 0x43 0x44 
are meant as ASCII letters rather than EBCDIC. But you can't be 
certain, and even if "ASCII capital A" is the right way to bet with
byte 0x41, it's much harder to guess what 0xC9 is intended as:

py> for encoding in "macroman cp1256 latin1 koi8_r".split():
...     print(b'\xC9'.decode(encoding))
...
…
ة
É
и


If you know the encoding via some out-of-band metadata, that's great. If 
you don't, or if the specified encoding is wrong, an application may not 
have the luxury of just throwing up its hands and refusing to process 
the data. Your web browser has to display something even if the web page 
lies about the encoding used or contains invalid data.

Even though encoding issues are more than 40 years old, making this 
problem older than most programmers, it's still new to many people. 
(Perhaps they haven't been paying attention, or living in denial that it 
would even happen to them, or they've just been lucky to be living in a 
pure ASCII world.) So a bit of sympathy to those struggling with this, 
but on the flip side, they need to HTFU and deal with it. Python 3 did 
not cause encoding issues, and in these days of code being interchanged 
all over the world, any programmer who doesn't have at least a basic 
understanding of this is like a programmer who doesn't understand why 
" cannot multiply correctly":

py> 0.7*7 == 4.9
False



-- 
Steven


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Lennart Regebro
On Thu, Jan 9, 2014 at 10:06 AM, Kristján Valur Jónsson
 wrote:
> Do I speak Chinese to my grocer because china is a growing force in the 
> world?  Or start every discussion with my children with a negotiation on what 
> language to use?

No, because your environment has a default language. And Python has a
default encoding. You only get problems when some file doesn't use the
default encoding.
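
A quick way to see which default is in play, as a sketch. The value printed depends on the platform, locale settings and Python version, so no particular output is implied.

```python
import locale
import tempfile

# The locale's preferred encoding, which text-mode open() has
# traditionally used when no encoding argument is given.
print(locale.getpreferredencoding(False))

with tempfile.TemporaryFile('w+') as f:                    # no encoding given
    implicit = f.encoding
with tempfile.TemporaryFile('w+', encoding='utf-8') as f:  # explicit choice
    explicit = f.encoding

print(implicit, explicit)
```

Passing `encoding=` explicitly is what removes the dependence on the environment.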

//Lennart


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Lennart Regebro
On Fri, Jan 10, 2014 at 2:03 AM, Joao S. O. Bueno  wrote:
> On 9 January 2014 04:50, Lennart Regebro  wrote:
>> To be honest, you can define text as "A stream of bytes that are split
>> up in lines separated by a linefeed", and do some basic text
>> processing like that. Just very *basic*, but still. Replacing
>> characters. Extracting certain lines etc.
>
> That is, until you hit a character which has a byte with the same
> value of ASCII newline in the middle of a multi-byte character.
>
> So, this approach is broken to start with.

For a very specific definition of broken, yes, namely that it will
fail with UTF-16 or EBCDIC. Files in those encodings are, by the above
definition of "text files", not text files. :-)

//Lennart


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Chris Angelico
On Fri, Jan 10, 2014 at 1:39 PM, Steven D'Aprano  wrote:
> On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
>> On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik  
>> wrote:
>> >   2. introduce autodetect mode to open functions
>> >      1. read and transform on the fly, maintaining a buffer that stores
>> >         original bytes and their mapping to letters. The mapping is updated
>> >         as bytes frequency changes. When the buffer is full, you have the
>> >         best candidate.
>> >
>>
>> Bad idea. Bad, bad idea! No biscuit. Sit!
>>
>> This sort of magic is what brings the "bush hid the facts" bug in
>> Windows Notepad. If byte value distribution is used to guess encoding,
>> there's no end to the craziness that can result.
>
> I think that heuristics to guess the encoding have their role to play,
> if the caller understands the risks. For example, an application might
> give the user the choice of specifying the codec, or having the app
> guess it. (I dislike the term "Auto detect", since that implies a level
> of certainty which often doesn't apply to real files.)
>
> There is already a third-party library, chardet, which does this.
> Perhaps the std lib should include this? Perhaps chardet should be
> considered best-of-breed "atomic reactor", but the std lib could include
> a "battery" to do something similar. I don't think we ought to dismiss
> this idea out of hand.

I don't deny that chardet has its place, but would you use it like
this (I'm assuming it works with Py3, the docs seem to imply Py2):

text = ""
with open("blah", "rb") as f:
    while True:
        data = f.read(256)
        if not data: break
        text += data.decode(chardet.detect(data)['encoding'])

Certainly not. But that's how the file-open-mode of "auto detect"
sounds. At very least, it has to do something like this _until_ it has
confidence; maybe it can retain the chardet state after the first
read, but it's still going to have to decode as little as you first
read. How can it handle this case?

first_char = open("blah", encoding="auto").read(1)

Somehow it needs to know how many bytes to read (and not read too many
more, preferably - buffering a line-ish is reasonable, buffering a
megabyte not so much) and figure out what's one character.
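
For contrast, when the encoding is known up front, the "how many bytes make one character" bookkeeping is handled by a stateful decoder. A minimal sketch using the standard codecs module; the three-byte chunk size is chosen only to force a split in the middle of a character.

```python
import codecs

raw = 'na\u00efve caf\u00e9'.encode('utf-8')  # contains two 2-byte characters

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for i in range(0, len(raw), 3):            # deliberately awkward 3-byte chunks
    chunk = raw[i:i + 3]
    pieces.append(decoder.decode(chunk))   # holds back an incomplete sequence
pieces.append(decoder.decode(b'', final=True))

text = ''.join(pieces)
print(text == raw.decode('utf-8'))  # True

# Decoding the first chunk on its own fails, because it ends mid-character.
chunk_failed = False
try:
    raw[:3].decode('utf-8')
except UnicodeDecodeError:
    chunk_failed = True
```

A guessing mode has no such decoder to lean on until the guess has been made, which is the difficulty described above.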

I see this as similar to the Python 2 input() function. It's not the
file-open builtin's job to do something as advanced and foot-shooting as
automatic charset detection. If you want that, you should be prepared
for its failures and the messes of partial reads, and call on chardet
yourself, same as you should use eval(input()) explicitly in Py3 (and,
in my opinion, eval(raw_input()) equally explicitly in Py2). I'm not
saying that chardet is bad, but I *am* saying, and I stand by this,
that an auto-detect option on file open is a bad idea.

Unix comes with a 'file' command which will tell you even more about
what something is. (For what it thinks are text files, I believe it
uses heuristics similar to chardet to guess an encoding.) Would you
want a parameter to the open() builtin that tries to read the file as
an image, or an audio file, or a document, or an executable, and
automatically decodes it to a PIL.Image, an mm.wave, etc, or execute
the code and return its stdout, all entirely automatically? I don't
think so. Not open()'s job.

ChrisA


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Ben Finney
Steven D'Aprano  writes:

> I think that heuristics to guess the encoding have their role to play,
> if the caller understands the risks.

I think, for a language whose developers espouse a principle “In the
face of ambiguity, refuse the temptation to guess”, heuristics have no
role to play in the standard library.

> There is already a third-party library, chardet, which does this.

As a third-party library, it's fine and quite useful.

> Perhaps the std lib should include this?

In my opinion, content-type guessing heuristics certainly don't belong
in the standard library.

-- 
 \  “Nothing is more sacred than the facts.” —Sam Harris, _The End |
  `\   of Faith_, 2004 |
_o__)  |
Ben Finney



Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Stephen J. Turnbull
INADA Naoki writes:

 > latin1 is OK but is it Pythonic?

Yes.  EIBTI, including being explicit that you're doing something that
has semantics that you are ignoring but may come back to bite you or
somebody who naively uses your module.

There's nothing un-Pythonic about using potentially dangerous idioms.
We assume that you know what you are doing and either have taken
measures to trap exceptional cases or are willing to accept the risk
of an unhandled exception.

 > I've posted a suggestion to add 'bytes' as an alias for 'latin1'.

Unpythonic.  Such alternative names hide the fact that there are
semantics that you may not want.  Only the programmer can know whether
it's safe.  If you want an ascii-compatible and space-efficient
representation that is safe even if the bytestream is something you
don't expect, you need to do something like I proposed.  If you don't
need efficiency, (encoding='ascii', errors='surrogateescape') is the
way to go.  But these still don't provide convenient interpolation of
binary data, as we discovered earlier.
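
A minimal sketch of the (encoding='ascii', errors='surrogateescape') option, with a made-up sample value. It illustrates the round trip and the safety property only, not the space-efficient representation mentioned above.

```python
raw = b'name=\xd1\x80\xff;'

text = raw.decode('ascii', errors='surrogateescape')
# Non-ASCII bytes become lone surrogates in the range U+DC80..U+DCFF.
print(ascii(text))   # 'name=\udcd1\udc80\udcff;'

# ASCII-level edits work, and the escaped bytes round-trip unchanged.
edited = text.replace('name', 'label')
assert edited.encode('ascii', errors='surrogateescape') == b'label=\xd1\x80\xff;'

# A strict encode refuses to write the escaped bytes out as if they were text.
strict_failed = False
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)  # True
```

With latin-1 the same bytes would instead be re-encoded to UTF-8 without complaint, as ordinary accented characters.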


Re: [Python-Dev] Python3 "complexity"

2014-01-09 Thread Stephen J. Turnbull
Chris Angelico writes:

 > I'm not saying that chardet is bad, but I *am* saying, and I stand
 > by this, that an auto-detect option on file open is a bad idea.

I have used it by default in Emacs and XEmacs since 1990, and I
certainly haven't experienced it as a bad idea at *any* time in more
than two decades.  Of course, it shouldn't be default in Python for
two reasons: (1) Emacsen are invariably interactive so very flexible
with error recovery, not so for Python, and (2) Emacsen can generally
assume that the files they open are more or less text in the first
place, which again is not true for Python.

 > Would you want a parameter to the open() builtin

It's not a parameter, it's a particular value for the encoding
parameter.

 > that tries to read the file as an image, or an audio file, or a
 > document, or an executable, and automatically decodes it to a
 > PIL.Image, an mm.wave, etc,

Emacsen do that, too.  It's not the sayonara Grand Slam in the 7th
game of the World Series spectacular win that text encoding detection
is, but it is very useful much of the time.

What it comes down to for all of the above is "consenting adults."
Python should *not* do any guessing by default, but if the programmer
or user explicitly requests a guess with "encoding=chardet", why in the
world would you want Python to do anything but give it the old college
try?  Of course any Python-supplied guesser should take a very
pessimistic approach and error unless it's quite certain, but

 > or execute the code and return its stdout, all entirely
 > automatically?

Now *that* is a really bad idea.  You shouldn't mix it with the
others.  (I'll also concede that many file formats -- Postscript, I'm
looking at you -- require special care to avoid arbitrary code
execution.)
