Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 11 January 2014 08:58, Ethan Furman  wrote:
> On 01/10/2014 02:42 PM, Antoine Pitrou wrote:
>>
>> On Fri, 10 Jan 2014 17:33:57 -0500
>> "Eric V. Smith"  wrote:
>>>
>>> On 1/10/2014 5:29 PM, Antoine Pitrou wrote:

>>>> On Fri, 10 Jan 2014 12:56:19 -0500
>>>> "Eric V. Smith"  wrote:
>>>>>
>>>>>
>>>>> I agree. I don't see any reason to exclude int and float. See Guido's
>>>>> messages http://bugs.python.org/issue3982#msg180423 and
>>>>> http://bugs.python.org/issue3982#msg180430 for some justification and
>>>>> discussion.
>>>>
>>>>
>>>> If you are representing int and float, you're really formatting a text
>>>> message, not bytes. Basically if you allow the formatting of int and
>>>> float instances, there's no reason not to allow the formatting of
>>>> arbitrary objects through __str__. It doesn't make sense to
>>>> special-case those two types and nothing else.
>>>
>>>
>>> It might not for .format(), but I'm not convinced. But for %-formatting,
>>> str is already special-cased for these types.
>>
>>
>> That's not what I'm saying. str.__mod__ is able to represent all kinds
>> of types through %s and calling __str__. It doesn't make sense for
>> bytes.__mod__ to only support int and float. Why only them?
>
>
> Because embedding the ASCII equivalent of ints and floats in byte streams is
> a common operation?

It's emphatically *NOT* a binary interpolation operation though - the
binary representation of the integer 1 is the byte value 1, not the
byte value 49. If you want the byte value 49 to appear in the stream,
then you need to interpolate the *ASCII encoding* of the string "1",
not the integer 1.
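
The distinction is easy to check at a Python 3 prompt. The snippet below is only an illustrative sketch (the variable names are made up for this example): it builds the one-byte binary representation of the integer 1 and, separately, the ASCII encoding of the string "1".

```python
# Binary representation of the integer 1: a single byte with value 1.
binary_one = (1).to_bytes(1, "big")

# ASCII encoding of the *string* "1": a single byte with value 49.
ascii_one = str(1).encode("ascii")

print(binary_one)       # b'\x01'
print(ascii_one)        # b'1'
print(list(ascii_one))  # [49]
```

The two results differ, which is the whole point: getting byte 49 into a stream means going through the text "1" first.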

If you want to manipulate text representations, do it in the text
domain. If you want to manipulate binary representations, do it in the
binary domain. The *whole point* of the text model change in Python 3
is to force programmers to *decide* which domain they're operating in
at any given point in time - while the approach of blurring the
boundaries between the two can be convenient for wire protocol and
file format manipulation, it is a horrendous bug magnet everywhere
else.
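
As a small illustration of being forced to decide (a sketch only; the names `payload`, `mixed_error` and `explicit` are invented here), Python 3 refuses to concatenate bytes and str implicitly, and accepts the operation once the text is explicitly encoded:

```python
payload = b"Content-Length: "

# Mixing the two domains implicitly is an error in Python 3.
try:
    payload + "42"
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

# Making the decision explicit: encode the text, then stay in the binary domain.
explicit = payload + "42".encode("ascii")
print(explicit)
```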

PEP 460 is just about adding back some missing functionality in the
binary domain (interpolating binary sequences together), not about
bringing back the problematic text model that allows particular text
representations to be interpreted as if they were also binary data.

That said, I actually think there's a valid use case for a Python 3
type that allows the bytes/text boundary to be blurred in making it
easier to port certain kinds of Python 2 code to Python 3
(specifically, working with wire protocols and file formats that
contain a mixture of encodings, but all encodings are *known* to at
least be ASCII compatible). It is highly unlikely that such a type
will *ever* be part of the standard library, though - idiomatic Python
3 code shouldn't need it, affected Python 2 code *can* be ported
without it (but may look more complicated due to the use of explicit
decoding and encoding operations, rather than relying on implicit
ones), and it should be entirely possible to implement it as an
extension module (modulo one bug in CPython that may impact the
approach, but we won't know for sure until people actually try it
out).

Fortunately, after years of my suggesting the idea to almost everyone
that complained about the move away from the broken POSIX text model
in Python 3, Benno Rice has started experimenting with such a type
based on a preliminary test case I wrote at linux.conf.au last week:
https://github.com/jeamland/asciicompat/blob/master/tests/ncoghlan.py

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 11 January 2014 12:28, Ethan Furman  wrote:
> On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
>>
>> On Fri, 10 Jan 2014 20:53:09 -0500
>> "Eric V. Smith"  wrote:
>>>
>>>
>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>>> 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>>
>>
>> Then we might as well not do anything, since any attempt to advance
>> things is met by stubborn opposition in the name of "not far enough".
>
>
> Heh, and here I thought it was stubborn opposition in the name of purity.
> ;)

No, it's "the POSIX text model is completely broken and we're not
letting people bring it back by stealth because they want to stuff
their esoteric use case back into the builtin data types instead of
writing their own dedicated type now that the builtin types don't
handle it any more".

Yes, we know we changed the text model and knocked wire protocols off
their favoured perch, and we're (thoroughly) aware of the fact that
wire protocol developers don't like the fact that the default model
now strongly favours the vastly more common case of application
development.

However, until Benno volunteered to start experimenting with
implementing an asciistr type yesterday, there have been *zero*
meaningful attempts at trying to solve the issues with wire protocol
manipulation outside the Python 3 core - instead there has just been a
litany of whining that Python 3 is different from Python 2, and a
complete and total refusal to attempt to understand *why* we changed
the text model.

The answer *should* be obvious: the POSIX based text model in Python 2
makes web frameworks easier to write at the expense of making web
applications *harder* to write, and the same is true for every other
domain where the wire protocol and file format handling is isolated to
widely used frameworks and support libraries, with the application
code itself operating mostly on text and structured data. With the
Python 3 text model, we decided that was a terrible trade-off, so the
core text model now *strongly* favours application code.

This means that it is now *boundary* code that may need additional helper
types, because the core types aren't necessarily going to cover all
those use cases any more. In particular, the bytes type is, and always
will be, designed for pure binary manipulation, while the str type is
designed for text manipulation. The weird kinda-text-kinda-binary
8-bit builtin type is gone, and *deliberately* so.

I've been saying for years that people should experiment with creating
a Python 3 extension type that behaves more like the Python 2 str type.
For the standard library,
we've never hit a case where the explicit encoding and decoding was so
complicated that creating such a type seemed simpler, so *we're* not
going to do it. After discussing it with me at LCA, Benno Rice offered
to try out the idea, just to determine whether or not it was actually
possible. If there are any CPython bugs that mean the idea *doesn't*
currently work (such as interoperability issues in the core types),
then I'm certainly happy for us to fix *those*. But we're never ever
going to change the core text model back to the broken POSIX one, or
even take steps in that direction.

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen Hansen
For not caring much, your own stubbornness is quite notable throughout this
discussion. Stones and glass houses. :)

That said:

Twisted and Mercurial aren't the only ones who are hurt by this, at all.
I'm aware of at least two other projects who are actively hindered in their
support or migration to Python 3 by the bytes type not having some basic
functionality that "strings" had in 2.0.

The purity crowd in here has brought up that it was an important and
serious decision to split Text from Bytes in Py3, and I actually agree with
that. However, it is missing some very real and very concrete use-cases --
there are multiple situations where there are byte streams which have a
known text-subset which they really, really do need to operate on.

There have been a number of examples given: PDF, HTTP, network streams that
switch inline from text-ish to binary and back again. But we can focus
that down to a very narrow and not at all uncommon situation in the latter.

Look at the HTTP Content-Length header. HTTP headers are fuzzy. My
understanding is, per the RFCs, their body can be arbitrary octets to the
exclusion of line feeds and DELs-- my understanding may be a bit off here,
and please feel free to correct me -- but the relevant specifications are a
bit fuzzy to begin with.

To my understanding of the spec, the header field name is essentially an
ASCII text field (sans separator), and the body is... anything, or nearly
anything. This is HTTP, which is surely one of the most used protocols in
the world.

The need to be able to assemble and disassemble such streams is a
real, valid use-case.
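
To make the "ASCII name, arbitrary value" shape concrete, here is a minimal sketch of disassembling one header line. It follows the reading of the spec given above rather than any RFC text, and the helper name `parse_header_line` is hypothetical:

```python
def parse_header_line(line):
    """Split one raw header line into (name, value).

    The name is decoded as ASCII text; the value is kept as opaque bytes,
    because nothing guarantees which encoding (if any) it uses.
    """
    name, sep, value = line.rstrip(b"\r\n").partition(b":")
    if not sep:
        raise ValueError("not a header line: %r" % (line,))
    return name.decode("ascii"), value.strip(b" \t")


name, value = parse_header_line(b"X-Custom: caf\xe9 \xff\r\n")
print(name)   # X-Custom
print(value)  # b'caf\xe9 \xff'
```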

But looking at it, now look to the Content-Length header I mentioned. It
seems those who are declaring a purity priority in bytes/string separation
think it reasonable to do things like:

  headers.append((b"Content-Length",
                  ("%d" % len(content)).encode("ascii")))

Or something. In the middle of processing a stream, you need to convert
this number into a string then encode it into bytes to just represent the
number as the extremely common, widely-accessible 7-bit ascii subset of its
numerical value. This isn't some rare, grandiose or fiendish undertaking,
or trying to merge Strings and Bytes back together: this is the simple
practical recognition that representing a number as its ascii-numerical
value is actually not at all uncommon.
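
For reference, the explicit format-then-encode approach looks roughly like this when written out in full. This is only a sketch; `content`, `headers` and `header_block` are illustrative names, and the payload is made up:

```python
content = b"\x89PNG binary payload"

headers = []
# Format the length in the text domain, then encode it to ASCII bytes.
headers.append((b"Content-Length", str(len(content)).encode("ascii")))

# Assemble the header block entirely in the binary domain.
header_block = b"".join(
    name + b": " + value + b"\r\n" for name, value in headers
)
print(header_block)  # b'Content-Length: 19\r\n'
```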

This position seems utterly astonishing in its ridiculousness to me. The
recognition that the number "123" may be represented as b"123" surprises me
as a controversial thing, considering how often I see it in real life.

There is a LOT of code out there which needs a little bit of a middle
ground between bytes and strings; it doesn't mean you are giving way and
allowing strings and bytes to merge and giving up on the Edict of
Separation. But there are real world use-cases where you simply need to be
able to do many basic "String" like operations on byte-streams.

The removal of the ability to use interpolation to construct such byte
strings was a major regression in Python 3 and is a big hurdle for more
than a few projects to upgrade.

I mean, it's not like the "bytes" type lacks knowledge of the subset of
bytes that happen to be 7-bit ascii-compatible and can't perform text-ish
operations on them--

  Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
bit (Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>> b"stephen hansen".title()
  b'Stephen Hansen'

How is this not a practical recognition that yes, while bytes are byte
streams and not text, a huge subset of bytes are text-y, and as long as we
maintain the barrier between higher characters and implicit conversion
therein, we're fine?
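
A few more of the ASCII-aware bytes methods behave the same way. The sketch below (variable names are illustrative only) shows that they act on ASCII letters and digits and leave bytes above 127 untouched:

```python
raw = b"caf\xc3\xa9 au lait"   # UTF-8 bytes containing a non-ASCII character

upper = raw.upper()            # only the ASCII letters are changed
digits_ok = b"12345".isdigit()
titled = b"stephen hansen".title()

print(upper)      # b'CAF\xc3\xa9 AU LAIT'
print(digits_ok)  # True
print(titled)     # b'Stephen Hansen'
```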

I don't see the difference here. There is a very real, practical need to
interpolate bytes. This very real, practical need includes the very real
recognition that converting 12345 to b'12345' is not something weird,
unusual, and subject to the thorny issues of Encodings. It is not violating
the doctrine of separation of powers between Text and Bytes.

Personally, I won't be converting my day job's codebase to Python 3 anytime
soon (where 'soon' is defined as 'within five years', assuming a best-case
scenario in which a number of third-party issues are resolved). But! I'm
aware of and involved with other projects, and this has bitten two of them
specifically. I'm sure there are others who are not aware of this list or
don't feel comfortable talking on it (as it is, I encouraged one of the
projects' coders to speak up, but they thought the question was a lost one
due to previous responses on the original issue ticket and gave up).

On Fri, Jan 10, 2014 at 6:04 PM, Antoine Pitrou  wrote:

> On Fri, 10 Jan 2014 20:53:09 -0500
> "Eric V. Smith"  wrote:
> >
> > So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
> > 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>
> Then we might as well not do anything, since an

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Glenn Linderman

On 1/11/2014 1:44 AM, Stephen Hansen wrote:
> There's been a number of examples given: PDF, HTTP, network streams
> that switch inline from text-ish to binary and back-again.. But, we
> can focus that down to a very narrow and not at all uncommon situation
> in the latter.


PDF has been mentioned a few times.  ReportLAB recently decided to 
convert to Python 3, and fairly quickly (from my perspective, it took 
them a _long_ time to decide to port, but once they decided to, then it 
seemed quick) produced an alpha version that passes many of their tests. 
I've not tried it yet, although it interests me, as I have some Python 2 
code written only because ReportLAB didn't support Python 3, and I 
wanted to generate some PDF files. I'll be glad to get rid of the Python 
2 code, once they are released.


But I guess they figured out a solution that wasn't onerous, I'd have to 
go re-read the threads to be sure, but it seems they are running one 
code base for both... not sure of the details of what techniques they 
used, or if they ever used the % operator :)


But I'm wondering, since they did what they did so quickly, if the 
"mixed bytes and str" use case is mostly, in fact, a mind-set issue... 
yes, likely some code has to change, but maybe the changes really aren't 
all that significant.


I wouldn't want to drag them into this discussion, I'd rather they get 
the port complete, but it would be interesting to know what they did, 
and how they did it, and what problems they had, etc. If anyone here 
knows that code a bit, perhaps the diffs could be examined in their 
repository to figure out what they did, and how much it impacted their 
code. I do know they switched XML parsers along the way, as well as 
dealing with string handling differences.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Kristján Valur Jónsson
I don't know what the fuss is about.  This isn't about breaking the text model.
It's about a convenient way to turn text into bytes using a default, lenient, 
way.  Not the other way round.
Here's my proposal:

b'foo%sbar' % (a,)

would implicitly apply the equivalent of the following function to every
object in the tuple:

def coerce_ascii(o):
    if has_bytes_interface(o):
        return o
    return o.encode('ascii', 'strict')

There's no need for special %d or %f formatting.  If fancier formatting
is required, e.g. exponents or precision, then by all means do it in the
str domain:

b'foo%sbar' % ("%.15f" % (42.2,))

Basically, let's just support simple bytes interpolation that will support 
coercing into bytes by means of strict ascii.
It's a one-way convenience, explicitly requested, and for consenting adults.
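
The proposal can be emulated today in pure Python. What follows is a minimal sketch, not the proposed implementation: `has_bytes_interface` is replaced by a plain `isinstance` check (an assumption), and `interpolate` is a hypothetical stand-in for `bytes.__mod__` that understands only `%s`.

```python
def coerce_ascii(obj):
    """Return obj as bytes: pass bytes-like objects through, else strict ASCII."""
    # Assumption: "has a bytes interface" is approximated by an isinstance check.
    if isinstance(obj, (bytes, bytearray, memoryview)):
        return bytes(obj)
    return obj.encode("ascii", "strict")


def interpolate(template, *args):
    """Hypothetical stand-in for b'...%s...' % args (supports %s only)."""
    parts = template.split(b"%s")
    if len(parts) - 1 != len(args):
        raise TypeError("number of %s placeholders does not match arguments")
    out = bytearray(parts[0])
    for arg, tail in zip(args, parts[1:]):
        out += coerce_ascii(arg)
        out += tail
    return bytes(out)


print(interpolate(b"foo%sbar", "%.2f" % (42.2,)))  # b'foo42.20bar'
print(interpolate(b"%s=%s", b"\xff\xfe", "ok"))    # b'\xff\xfe=ok'
```

Non-ASCII text is rejected rather than guessed at: `interpolate(b"%s", "caf\xe9")` raises `UnicodeEncodeError`, which is the strict behaviour the proposal asks for.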


-----Original Message-----
From: Python-Dev [mailto:[email protected]] 
On Behalf Of Nick Coghlan
Sent: 11. janúar 2014 08:43
To: Ethan Furman
Cc: [email protected]
Subject: Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) 
to Python 3.5

No, it's "the POSIX text model is completely broken and we're not letting 
people bring it back by stealth because they want to stuff their esoteric use 
case back into the builtin data types instead of writing their own dedicated 
type now that the builtin types don't handle it any more".




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson  wrote:

>
> Hi Juraj,
>

Hello Cameron.


>   data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
>

Thanks for the suggestion! The problem with "bytify" is that some items
might require different formatting than other items. For example, in the
"Cross-Reference Table" there are three different formats: a non-padded
integer ("1") and 10- and 5-digit integers ("03", "65535").
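
To illustrate why one uniform conversion is not enough, here is a sketch that formats the three kinds of number differently in the text domain and encodes each result to ASCII. The entry layout (10-digit offset, 5-digit generation number, a one-letter flag, 20 bytes in total) is an assumption based on the usual PDF cross-reference conventions, and `xref_entry` is a hypothetical helper:

```python
def xref_entry(offset, generation, in_use=True):
    """Format one cross-reference entry as 20 ASCII bytes (assumed layout)."""
    flag = "n" if in_use else "f"
    text = "%010d %05d %s \n" % (offset, generation, flag)
    return text.encode("ascii")


# Non-padded integer, e.g. an object count.
object_count = ("%d" % 1).encode("ascii")

# Zero-padded 10-digit offset and 5-digit generation number.
free_entry = xref_entry(3, 65535, in_use=False)

print(object_count)  # b'1'
print(free_entry)    # b'0000000003 65535 f \n'
```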


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano wrote:

>
> I'm sorry, I don't understand what you mean here. I'm honestly not
> trying to be difficult, but you sound confident that you understand what
> you are doing, but your description doesn't make sense to me. To me, it
> looks like you are conflating bytes and ASCII characters, that is,
> assuming that characters "are" in some sense identical to their ASCII
> representation. Let me explain:
>
> The integer that in English is written as 100 is represented in memory
> as bytes 0x0064 (assuming a big-endian C short), so when you say "an
> integer is written down AS-IS" (emphasis added), to me that says that
> the PDF file includes the bytes 0x0064. But then you go on to write the
> three character string "100", which (assuming ASCII) is the bytes
> 0x313030. Going from the C short to the ASCII representation 0x313030 is
> nothing like inserting the int "as-is". To put it another way, the
> Python 2 '%d' format code does not just copy bytes.
>
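
The two byte sequences being contrasted here can be reproduced with the standard `struct` module. This is a small sketch; the variable names are illustrative:

```python
import struct

# The integer 100 as a big-endian C short: two bytes, 0x00 0x64.
as_c_short = struct.pack(">h", 100)

# The three-character text "100" encoded as ASCII: 0x31 0x30 0x30.
as_ascii = str(100).encode("ascii")

print(as_c_short.hex())  # 0064
print(as_ascii.hex())    # 313030
```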

Sorry, I should've included an example: when I said "as-is" I meant "1",
"0", "0", so that would be your "0x313030".


> If you consider PDF as binary with occasional pieces of ASCII text, then
> working with bytes makes sense. But I wonder whether it might be better
> to consider PDF as mostly text with some binary bytes. Even though the
> bulk of the PDF will be binary, the interesting bits are text. E.g. your
> example:
>
> Even though the binary image data is probably much, much larger in
> length than the text shown above, it's (probably) trivial to deal with:
> convert your image data into bytes, decode those bytes into Latin-1,
> then concatenate the Latin-1 string into the text above.
>

This is similar to what Chris Barker suggested. I'm also not trying to be
difficult here, but please explain one thing to me. Treating bytes as if
they were Latin-1 is a bad idea, and that's why "%f" got dropped in the
first place, right? How is it then all right to put an image inside a
Unicode string?
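
For concreteness, the Latin-1 technique under discussion works mechanically as sketched below (the names and the "/Length" line are illustrative only). Every byte value 0-255 maps to the code point with the same number, so the round trip is lossless, but the resulting str holds code points that are not really text:

```python
image_data = bytes(range(256))          # stand-in for arbitrary binary data

header_text = "/Length %d\n" % len(image_data)

# Smuggle the binary data into a str by decoding it as Latin-1 ...
document = header_text + image_data.decode("latin-1")

# ... and recover the exact bytes by encoding as Latin-1 on output.
output = document.encode("latin-1")

print(output == header_text.encode("ascii") + image_data)  # True
```

The approach breaks as soon as any genuine text outside Latin-1 is mixed in: encoding a string containing a code point above 255 with "latin-1" raises `UnicodeEncodeError`.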

Also, apart from the in/out conversions, do any other difficulties come to
your mind?

> Please also take note that in Python 3.3 and better, the internal
> representation of Unicode strings containing only code points up to 255
> (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
> per character.
>

I guess you meant [C]Python...
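
The storage claim can be observed on CPython 3.3 or later with `sys.getsizeof`. This is a rough sketch: the exact sizes are an implementation detail, so other interpreters may print different numbers.

```python
import sys

n = 1000
latin1_text = "\xe9" * n     # every code point is below 256
bmp_text = "\u0101" * n      # every code point is above 255

# Growth in object size per extra character, measured by doubling the length.
latin1_per_char = (sys.getsizeof("\xe9" * (2 * n)) - sys.getsizeof(latin1_text)) / n
bmp_per_char = (sys.getsizeof("\u0101" * (2 * n)) - sys.getsizeof(bmp_text)) / n

print(latin1_per_char, bmp_per_char)
```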

In any case, thanks for the detailed reply.


[Python-Dev] Important background for PEP 460: Py 2/3 text model differences

2014-01-11 Thread Nick Coghlan
The PEP 460 discussion threads made it clear that some of the
participants that weren't around for the earlier parts of the Python 3
transition were struggling with the fundamental conceptual differences
between the Python 2 and Python 3 text models.

Since other folks (including Armin Ronacher) have also struggled with
that distinction, I added a new question and answer to my Python 3
Q&A: 
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
Am 11.01.2014 09:43, schrieb Nick Coghlan:
> On 11 January 2014 12:28, Ethan Furman  wrote:
>> On 01/10/2014 06:04 PM, Antoine Pitrou wrote:
>>>
>>> On Fri, 10 Jan 2014 20:53:09 -0500
>>> "Eric V. Smith"  wrote:


>>>> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
>>>> 3982. See for example http://bugs.python.org/issue3982#msg180432 .
>>>
>>>
>>> Then we might as well not do anything, since any attempt to advance
>>> things is met by stubborn opposition in the name of "not far enough".
>>
>>
>> Heh, and here I thought it was stubborn opposition in the name of purity.
>> ;)
> 
> No, it's "the POSIX text model is completely broken and we're not
> letting people bring it back by stealth because they want to stuff
> their esoteric use case back into the builtin data types instead of
> writing their own dedicated type now that the builtin types don't
> handle it any more".
> 
> Yes, we know we changed the text model and knocked wire protocols off
> their favoured perch, and we're (thoroughly) aware of the fact that
> wire protocol developers don't like the fact that the default model
> now strongly favours the vastly more common case of application
> development.
> 
> However, until Benno volunteered to start experimenting with
> implementing an asciistr type yesterday, there have been *zero*
> meaningful attempts at trying to solve the issues with wire protocol
> manipulation outside the Python 3 core

Can we please also include pseudo-binary file formats?  It's not "just"
wire protocols.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
Am 11.01.2014 10:44, schrieb Stephen Hansen:

> I mean, its not like the "bytes" type lacks knowledge of the subset of bytes
> that happen to be 7-bit ascii-compatible and can't perform text-ish operations
> on them--
> 
>   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit
> (Intel)] on win32
>   Type "help", "copyright", "credits" or "license" for more information.
>   >>> b"stephen hansen".title()
>   b'Stephen Hansen'
> 
> How is this not a practical recognition that yes, while bytes are byte streams
> and not text, a huge subset of bytes are text-y, and as long as we maintain 
> the
> barrier between higher characters and implicit conversion therein, we're fine?
> 
> I don't see the difference here. There is a very real, practical need to
> interpolate bytes. This very real, practical need includes the very real
> recognition that converting 12345 to b'12345' is not something weird, unusual,
> and subject to the thorny issues of Encodings. It is not violating the 
> doctrine
> of separation of powers between Text and Bytes.

This. Exactly. Thanks for putting it so nicely, Stephen.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Georg Brandl
Am 11.01.2014 14:49, schrieb Georg Brandl:
> Am 11.01.2014 10:44, schrieb Stephen Hansen:
> 
>> I mean, its not like the "bytes" type lacks knowledge of the subset of bytes
>> that happen to be 7-bit ascii-compatible and can't perform text-ish 
>> operations
>> on them--
>> 
>>   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 
>> bit
>> (Intel)] on win32
>>   Type "help", "copyright", "credits" or "license" for more information.
>>   >>> b"stephen hansen".title()
>>   b'Stephen Hansen'
>> 
>> How is this not a practical recognition that yes, while bytes are byte 
>> streams
>> and not text, a huge subset of bytes are text-y, and as long as we maintain 
>> the
>> barrier between higher characters and implicit conversion therein, we're 
>> fine?
>> 
>> I don't see the difference here. There is a very real, practical need to
>> interpolate bytes. This very real, practical need includes the very real
>> recognition that converting 12345 to b'12345' is not something weird, 
>> unusual,
>> and subject to the thorny issues of Encodings. It is not violating the 
>> doctrine
>> of separation of powers between Text and Bytes.
> 
> This. Exactly. Thanks for putting it so nicely, Stephen.

To elaborate: if the bytes type didn't have all this ASCII-aware functionality
already, I think we would have (and be using) a dedicated "asciistr" type right
now.  But it has the functionality, and it's way too late to remove it.

Georg




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread M.-A. Lemburg
On 11.01.2014 14:54, Georg Brandl wrote:
> Am 11.01.2014 14:49, schrieb Georg Brandl:
>> Am 11.01.2014 10:44, schrieb Stephen Hansen:
>>
>>> I mean, its not like the "bytes" type lacks knowledge of the subset of bytes
>>> that happen to be 7-bit ascii-compatible and can't perform text-ish 
>>> operations
>>> on them--
>>>
>>>   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 
>>> bit
>>> (Intel)] on win32
>>>   Type "help", "copyright", "credits" or "license" for more information.
>>>   >>> b"stephen hansen".title()
>>>   b'Stephen Hansen'
>>>
>>> How is this not a practical recognition that yes, while bytes are byte 
>>> streams
>>> and not text, a huge subset of bytes are text-y, and as long as we maintain 
>>> the
>>> barrier between higher characters and implicit conversion therein, we're 
>>> fine?
>>>
>>> I don't see the difference here. There is a very real, practical need to
>>> interpolate bytes. This very real, practical need includes the very real
>>> recognition that converting 12345 to b'12345' is not something weird, 
>>> unusual,
>>> and subject to the thorny issues of Encodings. It is not violating the 
>>> doctrine
>>> of separation of powers between Text and Bytes.
>>
>> This. Exactly. Thanks for putting it so nicely, Stephen.
> 
> To elaborate: if the bytes type didn't have all this ASCII-aware functionality
> already, I think we would have (and be using) a dedicated "asciistr" type 
> right
> now.  But it has the functionality, and it's way too late to remove it.

I think we need to step back a little from the purist view
of things and give more emphasis on the "practicality beats
purity" Zen.

I completely agree with Stephen, that bytes are in fact often
an encoding of text. If that text is ASCII compatible, I don't
see any reason why we should not continue to expose the C lib
standard string APIs available for text manipulations on bytes.

We don't have to be pedantic about the bytes/text separation.
It doesn't help in real life.

If you give programmers the choice they will - most of the time -
do the right thing. If you don't give them the tools, they'll
work around the missing features in a gazillion different
ways of which many will probably miss a few edge cases.

bytes already have most of the 8-bit string methods from Python 2,
so it doesn't hurt adding some more of the missing features
from Python 2 on top to make life easier for people dealing
with multiple/unknown encoding data.

BTW: I don't know why so many people keep asking for use cases.
Isn't it obvious that text data without known (but ASCII compatible)
encoding or multiple different encodings in a single data chunk
is part of life ? Most HTTP packets fall into this category,
many email messages as well. And let's not forget that we don't
live in a perfect world. Broken encodings are everywhere around
you - just have a look at your spam folder for a decent chunk
of example data :-)

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Antoine Pitrou
On Sat, 11 Jan 2014 08:26:57 +0100
Georg Brandl  wrote:
> Am 11.01.2014 03:04, schrieb Antoine Pitrou:
> > On Fri, 10 Jan 2014 20:53:09 -0500
> > "Eric V. Smith"  wrote:
> >> 
> >> So, I'm -1 on the PEP. It doesn't address the cases laid out in issue
> >> 3982. See for example http://bugs.python.org/issue3982#msg180432 .
> 
> I agree.
> 
> > Then we might as well not do anything, since any attempt to advance
> > things is met by stubborn opposition in the name of "not far enough".
> > 
> > (I don't care much personally, I think the issue is quite overblown
> > anyway)
> 
> So you wouldn't mind another overhaul of the PEP including a bit more
> functionality again? :)
>  I really think that practicality beats purity
> here.  (I'm not advocating free mixing bytes and str, mind you!)

The PEP already proposes a certain amount of practicality. I personally
*would* mind adding %d and friends to it. But of course someone can
fork the PEP or write another one.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 01:15, M.-A. Lemburg  wrote:
> On 11.01.2014 14:54, Georg Brandl wrote:
>> Am 11.01.2014 14:49, schrieb Georg Brandl:
>>> Am 11.01.2014 10:44, schrieb Stephen Hansen:
>>>
>>>> I mean, its not like the "bytes" type lacks knowledge of the subset of
>>>> bytes that happen to be 7-bit ascii-compatible and can't perform text-ish
>>>> operations on them--
>>>>
>>>>   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32
>>>> bit (Intel)] on win32
>>>>   Type "help", "copyright", "credits" or "license" for more information.
>>>>   >>> b"stephen hansen".title()
>>>>   b'Stephen Hansen'
>>>>
>>>> How is this not a practical recognition that yes, while bytes are byte
>>>> streams and not text, a huge subset of bytes are text-y, and as long as we
>>>> maintain the barrier between higher characters and implicit conversion
>>>> therein, we're fine?
>>>>
>>>> I don't see the difference here. There is a very real, practical need to
>>>> interpolate bytes. This very real, practical need includes the very real
>>>> recognition that converting 12345 to b'12345' is not something weird,
>>>> unusual, and subject to the thorny issues of Encodings. It is not
>>>> violating the doctrine of separation of powers between Text and Bytes.
>>>
>>> This. Exactly. Thanks for putting it so nicely, Stephen.
>>
>> To elaborate: if the bytes type didn't have all this ASCII-aware 
>> functionality
>> already, I think we would have (and be using) a dedicated "asciistr" type 
>> right
>> now.  But it has the functionality, and it's way too late to remove it.
>
> I think we need to step back a little from the purist view
> of things and give more emphasis on the "practicality beats
> purity" Zen.
>
> I completely agree with Stephen, that bytes are in fact often
> an encoding of text. If that text is ASCII compatible, I don't
> see any reason why we should not continue to expose the C lib
> standard string APIs available for text manipulations on bytes.
>
> We don't have to be pedantic about the bytes/text separation.
> It doesn't help in real life.

Yes, it bloody well does. The number of people who have told me that
using Python 3 is what allowed them to finally understand how Unicode
works vastly exceeds the number of wire protocol and file format devs
that have complained about working with binary formats being
significantly less tolerant of the "it's really like ASCII text"
mindset.

We are NOT going back to the confusing incoherent mess that is the
Python 2 model of bolting Unicode onto the side of POSIX:
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3

While that was an *expedient* (and, in fact, necessary) solution at
the time, the fact it is still thoroughly confusing people 13 years
later shows it is not a *comprehensible* solution.

> If you give programmers the choice they will - most of the time -
> do the right thing. If you don't give them the tools, they'll
> work around the missing features in a gazillion different
> ways of which many will probably miss a few edge cases.
>
> bytes already have most of the 8-bit string methods from Python 2,
> so it doesn't hurt adding some more of the missing features
> from Python 2 on top to make life easier for people dealing
> with multiple/unknown encoding data.

Because people that aren't happy with the current bytes type
persistently refuse to experiment with writing their own extension
type to figure out what the API should look like. Jamming speculative
API design into the core text model without experimenting in a third
party extension first is a straight up stupid idea.

Anyone that is pushing for this should be checking out Benno's first
draft experimental prototype for asciistr and be working on getting it
passing the test suite I created:
https://github.com/jeamland/asciicompat

The "Wah, you broke it and now I have completely forgotten how to
create custom types, so I'm just going to piss and moan until somebody
else fixes it" infantilism of the past five years in this regard has
frankly pissed me off.

Regards,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 01:56:56PM +0100, Juraj Sukop wrote:
> On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano wrote:

> > If you consider PDF as binary with occasional pieces of ASCII text, then
> > working with bytes makes sense. But I wonder whether it might be better
> > to consider PDF as mostly text with some binary bytes. Even though the
> > bulk of the PDF will be binary, the interesting bits are text. E.g. your
> > example:

10 0 obj
  << /Type /XObject
 /Width 100
 /Height 100
 /Alternates 15 0 R
 /Length 2167
  >>
stream
...binary image data...
endstream
endobj


> > Even though the binary image data is probably much, much larger in
> > length than the text shown above, it's (probably) trivial to deal with:
> > convert your image data into bytes, decode those bytes into Latin-1,
> > then concatenate the Latin-1 string into the text above.
> 
> This is similar to what Chris Barker suggested. I also don't try to be
> difficult here but please explain to me one thing. To treat bytes as if
> they were Latin-1 is bad idea, 

Correct. Bytes are not Latin-1. Here are some bytes which represent a 
word I extracted from a text file on my computer: 

b'\x8a\x75\xa7\x65\x72\x73\x74'

If you imagine that they are Latin-1, you might think that the word 
is a C1 control character ("VTS", or Vertical Tabulation Set) followed 
by "u§erst", but it is not. It is actually the German word "äußerst" 
("extremely"), and the text file was generated on a 1990s vintage 
Macintosh using the MacRoman "extended ASCII" code page.
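The point can be reproduced directly. This is a minimal sketch, assuming Python 3 and its standard 'mac_roman' and 'latin-1' codecs; the variable names are only for illustration.

```python
# The same seven bytes decode to different text depending on the codec.
raw = b'\x8a\x75\xa7\x65\x72\x73\x74'

as_macroman = raw.decode('mac_roman')   # the encoding the file was written in
as_latin1 = raw.decode('latin-1')       # a wrong guess that still "succeeds"

print(as_macroman)        # äußerst
print(repr(as_latin1))    # '\x8au§erst'
```

Neither decode raises, which is why a wrong guess about the encoding tends to go unnoticed until someone reads the output.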


> that's why "%f" got dropped in the first
> place, right? How is it then alright to put an image inside an Unicode
> string?

The point that I am making is that many people want to add formatting 
operations to bytes so they can put ASCII strings inside bytes. But (as 
far as I can tell) they don't need to do this, because they can treat 
Unicode strings containing code points U+0000 through U+00FF (i.e. the 
same range as handled by Latin-1) as if they were bytes. This gives you:

- convenient syntax, no need to prefix strings with b;

- mostly avoid needing to decode and encode strings, except at a 
  few points in your code;

- the full set of string methods;

- can easily include arbitrary octal or hex byte values, using \o and
  \x escapes;

- error checking: when you finally encode the text to bytes before 
  writing to a file, or sending over a wire, any code-point greater 
  than U+00FF will give you an exception unless explicitly silenced.
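The error-checking item in the list above can be sketched as follows, assuming Python 3. The sample strings are made up for illustration, and the euro sign is just one example of a code point above U+00FF.

```python
# Text that is meant to stand in for bytes must stay within U+0000..U+00FF.
ok = 'Tag:4\x00s\x14q\xd9'
print(ok.encode('latin-1'))   # b'Tag:4\x00s\x14q\xd9'

bad = ok + '\u20ac'           # EURO SIGN, outside the Latin-1 range
try:
    bad.encode('latin-1')
    raised = False
except UnicodeEncodeError as exc:
    raised = True
    print('refused:', exc.reason)

# "Explicitly silenced": an error handler substitutes instead of raising.
silenced = bad.encode('latin-1', errors='replace')
print(silenced)               # the euro sign becomes b'?'
```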

No need to wait for Python 3.5 to come out, you can do this *right now*.

Of course, this is a little bit "unclean", it breaks the separation of 
text and bytes by treating bytes *as if* they were Unicode code points, 
which they are not, but I believe that this is a practical technique 
which is not too hard to deal with. For instance, suppose I have a 
mixed format which consists of an ASCII tag, a number written in ASCII, 
a NULL separator, and some binary data:

# Using bytes
import struct
values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

=> gives data = b'Tag:4\x00s\x14q\xd9yzi\xf3'


That's a bit ugly, but not too ugly. I could write code like that. But 
if bytes had % formatting, I might write this instead:

data = b"Tag:%d\0%s" % (len(values), blob)


This is a small improvement, but I can't use it until Python 3.5 comes 
out. Or I could do this right now:


# Using text
values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)
data = "Tag:%d\0%s" % (len(values), blob.decode('latin-1'))

=> gives data = 'Tag:4\x00s\x14qÙyzió'

When I'm ready to transmit this over the wire, or write to disk, then I 
encode, and get:

data.encode('latin-1')
=> b'Tag:4\x00s\x14q\xd9yzi\xf3'


which is exactly the same as I got in the first place. In this case, I'm 
not using Latin-1 for the semantics of bytes to characters (e.g. byte 
\xf3 = char ó), but for the useful property that all 256 distinct bytes 
are valid in Latin-1. Any other encoding with the same property will do.
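Both routes, and the property they depend on, can be checked in a few lines. This is a sketch assuming Python 3; the names via_bytes, via_text and every_byte are introduced here only for illustration.

```python
import struct

values = [29460, 29145, 31098, 27123]
blob = b"".join(struct.pack(">h", n) for n in values)

# Bytes-only construction.
via_bytes = b"Tag:" + str(len(values)).encode('ascii') + b"\0" + blob

# Text construction, with the blob carried through Latin-1.
via_text = ("Tag:%d\0%s" % (len(values), blob.decode('latin-1'))).encode('latin-1')

print(via_bytes == via_text)   # True

# The property being relied on: every one of the 256 byte values round-trips.
every_byte = bytes(range(256))
round_trips = every_byte.decode('latin-1').encode('latin-1') == every_byte

# UTF-8 lacks that property, so it cannot be used the same way.
try:
    every_byte.decode('utf-8')
    utf8_accepts_all = True
except UnicodeDecodeError:
    utf8_accepts_all = False
```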

It is a little unfortunate that struct gives bytes rather than a str, 
but you can hide that with a simple helper function:

def b2s(bytes):
    return bytes.decode('latin1')

data = "Tag:%d\0%s" % (len(values), b2s(blob))



> Also, apart from the in/out conversions, do any other difficulties come to
> your mind?

No. If you accidentally introduce a non-Latin1 code point, when you 
encode you'll get an exception. 


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Antoine Pitrou
On Sun, 12 Jan 2014 01:34:26 +1000
Nick Coghlan  wrote:
> 
> Yes, it bloody well does. The number of people who have told me that
> using Python 3 is what allowed them to finally understand how Unicode
> works vastly exceeds the number of wire protocol and file format devs
> that have complained about working with binary formats being
> significantly less tolerant of the "it's really like ASCII text"
> mindset.

+1 to what Nick says. Forcing some constructs to be explicit leads
people to know about the issue and understand it, rather than sweep it
under the carpet as Python 2 encouraged them to do.

Yes, if you're dealing with a file format or network protocol, you'd
better know in which charset its textual information is being expressed.
It's a very sane question to ask yourself!

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 07:38 AM, Steven D'Aprano wrote:


> The point that I am making is that many people want to add formatting
> operations to bytes so they can put ASCII strings inside bytes. But (as
> far as I can tell) they don't need to do this, because they can treat
> Unicode strings containing code points U+0000 through U+00FF (i.e. the
> same range as handled by Latin-1) as if they were bytes.


So instead of blurring the line between bytes and text, you're blurring the line between text and bytes (with a few 
extra seat belts thrown in).  Besides being a bit awkward, this also means that any encoded text (even the plain ASCII 
stuff) is now being transformed three times instead of one:


  unicode to bytes
  bytes to unicode using latin1
  unicode to bytes

Even if the cost of moving those bytes around is cheap, it's not free.  When you're creating hundreds of PDFs at a time 
that's going to make a difference.


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread M.-A. Lemburg
On 11.01.2014 16:34, Nick Coghlan wrote:
> On 12 January 2014 01:15, M.-A. Lemburg  wrote:
>> On 11.01.2014 14:54, Georg Brandl wrote:
>>> On 11.01.2014 14:49, Georg Brandl wrote:
>>>> On 11.01.2014 10:44, Stephen Hansen wrote:
>>>>
>>>>> I mean, it's not like the "bytes" type lacks knowledge of the subset of
>>>>> bytes that happen to be 7-bit ascii-compatible and can't perform text-ish
>>>>> operations on them--
>>>>>
>>>>>   Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
>>>>>   Type "help", "copyright", "credits" or "license" for more information.
>>>>>   >>> b"stephen hansen".title()
>>>>>   b'Stephen Hansen'
>>>>>
>>>>> How is this not a practical recognition that yes, while bytes are byte
>>>>> streams and not text, a huge subset of bytes are text-y, and as long as we
>>>>> maintain the barrier between higher characters and implicit conversion
>>>>> therein, we're fine?
>>>>>
>>>>> I don't see the difference here. There is a very real, practical need to
>>>>> interpolate bytes. This very real, practical need includes the very real
>>>>> recognition that converting 12345 to b'12345' is not something weird,
>>>>> unusual, and subject to the thorny issues of Encodings. It is not violating
>>>>> the doctrine of separation of powers between Text and Bytes.
>>>>
>>>> This. Exactly. Thanks for putting it so nicely, Stephen.
>>>
>>> To elaborate: if the bytes type didn't have all this ASCII-aware
>>> functionality already, I think we would have (and be using) a dedicated
>>> "asciistr" type right now.  But it has the functionality, and it's way too
>>> late to remove it.
>>
>> I think we need to step back a little from the purist view
>> of things and give more emphasis on the "practicality beats
>> purity" Zen.
>>
>> I completely agree with Stephen, that bytes are in fact often
>> an encoding of text. If that text is ASCII compatible, I don't
>> see any reason why we should not continue to expose the C lib
>> standard string APIs available for text manipulations on bytes.
>>
>> We don't have to be pedantic about the bytes/text separation.
>> It doesn't help in real life.
> 
> Yes, it bloody well does. The number of people who have told me that
> using Python 3 is what allowed them to finally understand how Unicode
> works vastly exceeds the number of wire protocol and file format devs
> that have complained about working with binary formats being
> significantly less tolerant of the "it's really like ASCII text"
> mindset.
> 
> We are NOT going back to the confusing incoherent mess that is the
> Python 2 model of bolting Unicode onto the side of POSIX:
> http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
> 
> While that was an *expedient* (and, in fact, necessary) solution at
> the time, the fact it is still thoroughly confusing people 13 years
> later shows it is not a *comprehensible* solution.

FWIW: I quite liked the Python 2 model, but perhaps that's because
I already knew how Unicode works, so could use it to make my
life easier ;-)

Seriously, Unicode has always caused heated discussions and
I don't expect this to change in the next 5-10 years.

The point is: there is no 100% perfect solution either way and
when you acknowledge this, things don't look black and white anymore,
but instead full of colors :-)

Python 3 forces people to actually use Unicode; in Python 2 they
could easily avoid it. It's good to educate people on how it's
used and the issues you can run into, but let's not forget
that people are trying to get work done and we all love readable
code.

PEP 460 just adds two more methods to the bytes object which come
in handy when formatting binary data; I don't think it has potential
to muddy the Python 3 text model, given that the bytes
object already exposes a dozen other ASCII text methods :-)

>> If you give programmers the choice they will - most of the time -
>> do the right thing. If you don't give them the tools, they'll
>> work around the missing features in a gazillion different
>> ways of which many will probably miss a few edge cases.
>>
>> bytes already have most of the 8-bit string methods from Python 2,
>> so it doesn't hurt adding some more of the missing features
>> from Python 2 on top to make life easier for people dealing
>> with multiple/unknown encoding data.
> 
> Because people that aren't happy with the current bytes type
> persistently refuse to experiment with writing their own extension
> type to figure out what the API should look like. Jamming speculative
> API design into the core text model without experimenting in a third
> party extension first is a straight up stupid idea.
> 
> Anyone that is pushing for this should be checking out Benno's first
> draft experimental prototype for asciistr and be working on getting it
> passing the test suit

Re: [Python-Dev] Python3 "complexity"

2014-01-11 Thread Matěj Cepl
On 2014-01-10, 17:34 GMT, you wrote:
> From my experience, the concept of a default locale is deeply 
> flawed.  What if I log into a (Linux) machine using an old 
> latin-1 putty from the Windows XP era, have most file names 
> and contents in UTF-8 encoding, except for one directory where 
> people from eastern Europe upload files via FTP in whatever 
> encoding they choose. What should the "default" encoding be 
> now?

I know this stuff is really hard, and I know it only because I had 
to fight with it for years (being Czech, so not blessed by Latin-1 
covering my language … actually no living encoding supports 
it completely, but that’s mostly a theoretical issue … Latin-2 
used to work for us, and now everybody with a civilized OS uses 
UTF-8 of course; I am not sure what the current state of MS 
Windows is).

It seems to me that you have some fundamental principles muddled 
together.

a) The locale should always be set per system. I.e., in your 
example above you have only two variables: the locale of your 
Windows XP machine and the locale of the Linux box.
b) I know for a fact that PuTTY itself (even on Windows XP) CAN 
translate from UTF-8 on the server to whatever Windows has to 
offer. So there is no such thing as “latin-1 putty”.
c) Responsibility for filenames on the system lies with whatever 
actually saves the file. So, in this test case it is a matter of 
setting up the FTP server correctly (see for example 
http://rhn.redhat.com/errata/RHBA-2012-0187.html and 
https://bugzilla.redhat.com/show_bug.cgi?id=638873 which seem to 
indicate that vsftpd, and what else would you use?, should 
support UTF-8 in filenames). If the server locale supports 
Eastern European filenames and vsftpd supports translation to 
this encoding (hint, hint: UTF-8 does), then you are all set.

> That's why I make it a principle to always unset all LC_* and 
> LANG variables, except when working locally, which happens 
> rather rarely.

That’s a bad idea. Those variables ALWAYS have some value set 
(perhaps a default, which tends to be something like en_US.ASCII, 
which is not what you want; fortunately on most Unices these 
days it would be en_US.UTF8; the locale(1) command always gives 
some result).
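One way to see what a process actually ends up with is to ask Python itself. This is a rough sketch, assuming Python 3 on a POSIX-like system; which variables are set, and what encoding is reported, depends entirely on the machine it runs on.

```python
import locale
import os

# Environment variables that influence the locale; unset ones show as None.
for name in ('LC_ALL', 'LC_CTYPE', 'LANG'):
    print(name, '=', os.environ.get(name))

# The encoding Python infers from the current settings. Some answer always
# comes back, even when none of the variables above is set.
preferred = locale.getpreferredencoding(False)
print('preferred encoding:', preferred)
```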

Matěj



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Matěj Cepl
On 2014-01-11, 10:56 GMT, you wrote:
> I don't know what the fuss is about.

I just cannot resist:

When you are calm while everybody else is in the state of 
panic, you haven’t understood the problem.

-- one of many collections of Murphy’s Laws

Matěj



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 12:43 AM, Nick Coghlan wrote:


> In particular, the bytes type is, and always will be, designed for
> pure binary manipulation [...]


I apologize for being blunt, but this is a lie.

Let's take a look at the methods defined by bytes:


dir(b'')
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', 
'__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', 
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', 
'__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center', 'count', 'decode', 'endswith', 'expandtabs', 
'find', 'fromhex', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 
'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 
'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']


Are you really going to insist that expandtabs, isalnum, isalpha, isdigit, islower, isspace, istitle, isupper, ljust, 
lower, lstrip, rjust, splitlines, swapcase, title, upper, and zfill are pure binary manipulation methods?
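A few of those methods in action, as a small sketch assuming Python 3. The sample values are made up; the last two lines show that bytes outside the ASCII range are simply left alone.

```python
# These bytes methods only make sense under an ASCII reading of the data.
titled = b"stephen hansen".title()      # b'Stephen Hansen'
shouted = b"abc".upper()                # b'ABC'
digits = b"42".isdigit()                # True
padded = b"7".zfill(3)                  # b'007'
expanded = b"a\tb".expandtabs(4)        # b'a   b'

# Bytes outside the ASCII range are passed through untouched.
mixed = b"\xe9t\xe9".upper()            # b'\xe9T\xe9'
high_is_alpha = b"\xe9".isalpha()       # False

print(titled, shouted, digits, padded, expanded, mixed, high_is_alpha)
```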


Let's take a look at the repr of bytes:


bytes([48, 49, 50, 51])

b'0123'

Wow, that sure doesn't look like binary data!
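The same observation in slightly more detail, sketched for Python 3: the repr shows printable ASCII where it can and falls back to hex escapes otherwise, while indexing still yields integers.

```python
data = bytes([48, 49, 50, 51])
print(data)            # b'0123'
print(list(data))      # [48, 49, 50, 51]
print(data[0])         # 48: indexing yields an int, not a length-1 bytes

# Bytes with no printable ASCII reading fall back to hex escapes.
opaque = bytes([0, 255, 16])
print(opaque)          # b'\x00\xff\x10'
```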

Py3 did not go from three text models to two, it went to one good one (unicode strings) and one broken one (bytes).  If 
the aim was indeed for pure binary manipulation, we failed.  We left in bunches of methods which can *only* be 
interpreted as supporting ASCII manipulation.


Due to backwards compatibility we cannot now finish yanking those out, so either we live with a half-dead class 
screaming "I want to be ASCII!  I want to be ASCII!" or add back the missing functionality.


--
~Ethan~


[Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Victor Stinner
Hi,

I'm in favor of adding support of formatting integer and floating
point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and
precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without
alternate format ("{:#x}"). %s would also accept int and float for
convenience.
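For a feel of what these codes produce, here is a sketch that runs on Python 3.5 or later, where %-formatting for bytes was eventually added through the later PEP 461. It shows the shipped behaviour, which is an assumption as far as this proposal goes: in particular, the shipped %s does not accept an int, unlike the convenience suggested above.

```python
# Requires Python 3.5 or later.
samples = {
    'decimal': b'%d' % 42,
    'padded': b'%10d' % 42,
    'zero_padded': b'%010d' % 42,
    'signed': b'%+i' % 42,
    'octal': b'%o' % 8,
    'hex': b'%x' % 255,
    'float': b'%1.5f' % 3.14159265,
}
for name, value in samples.items():
    print(name, value)

# As shipped, %s wants a bytes-like object (or __bytes__), not a number.
try:
    b'x=%s' % 10
    percent_s_accepts_int = True
except TypeError:
    percent_s_accepts_int = False
```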

int and float subclasses would not be handled differently, their
__str__ and __format__ would be ignored.

Other int-like and float-like types (ex: defining __int__ or
__index__) are not supported. Explicit cast would be required.

For %s, the choice between string and number is made using
"(PyLong_Check() || PyFloat_Check())".

If you agree, I will modify the PEP. If Antoine disagrees, I will fork
the PEP 460 ;-)

---

%s should not support precision (ex: %.100s), use Unicode for that.

---

The PEP 460 should not reintroduce bytes+unicode, implicit decoding or
implement encoding.

b'x=%s' % 10 is well defined, it's pure bytes. If you consider that
bytes should not contain text, why does the bytes type have methods
like isalpha() or upper()? And why do binary files have a readline()
method? A "line" doesn't mean anything in pure bytes.

It's an example of "practicality beats purity". Python 3 should not
enforce Unicode if the developers *chose* to use bytes to handle mixed
binary/text protocols like HTTP.
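The readline() remark is easy to demonstrate. This is a sketch assuming Python 3; the request text and host name are made up, and io.BytesIO stands in for a binary file or socket file object.

```python
import io

# A binary stream holding an HTTP-style request: ASCII header lines, then a body.
stream = io.BytesIO(b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\nbody")

first = stream.readline()     # splits on the byte 0x0A, i.e. ASCII newline
print(first)                  # b'GET / HTTP/1.1\r\n'
```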

But I'm against adding "%r" and "%a" because they use Unicode and
would require an implicit encoding. type(ascii(obj)) is str, not
bytes. If you really want to use repr() and ascii(), encode the result
explicitly.
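The explicit route looks like this, as a minimal sketch assuming Python 3; obj is an arbitrary example value.

```python
obj = 'caf\xe9'                   # the text 'café'
text = ascii(obj)
print(type(text).__name__, text)  # str 'caf\xe9'

# The explicit encoding step that turns the str result into bytes.
encoded = text.encode('ascii')
print(encoded)                    # b"'caf\\xe9'"
```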

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 07:34 AM, Nick Coghlan wrote:

> On 12 January 2014 01:15, M.-A. Lemburg wrote:
>> We don't have to be pedantic about the bytes/text separation.
>> It doesn't help in real life.
>
> Yes, it bloody well does. The number of people who have told me that
> using Python 3 is what allowed them to finally understand how Unicode
> works . . .


We are not proposing a change to the unicode string type in any way.



> We are NOT going back to the confusing incoherent mess that is the
> Python 2 model of bolting Unicode onto the side of POSIX . . .


We are not asking for that.



>> bytes already have most of the 8-bit string methods from Python 2,
>> so it doesn't hurt adding some more of the missing features
>> from Python 2 on top to make life easier for people dealing
>> with multiple/unknown encoding data.
>
> Because people that aren't happy with the current bytes type
> persistently refuse to experiment with writing their own extension
> type to figure out what the API should look like. Jamming speculative
> API design into the core text model without experimenting in a third
> party extension first is a straight up stupid idea.


True, if this were a new API; but it isn't, it's the Py2 str API that was stripped out.  The one big difference being 
that if the result of %s (or %d or any other %) is not in the 0-127 range it errors out.


--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and NOT ALLOWING mojibake :)

2014-01-11 Thread Georg Brandl
On 11.01.2014 18:41, Victor Stinner wrote:
> Hi,
> 
> I'm in favor of adding support of formatting integer and floating
> point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and
> precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without
> alternate format ("{:#x}"). %s would also accept int and float for
> convenience.
> 
> int and float subclasses would not be handled differently, their
> __str__ and __format__ would be ignored.
> 
> Other int-like and float-like types (ex: defining __int__ or
> __index__) are not supported. Explicit cast would be required.
> 
> For %s, the choice between string and number is made using
> "(PyLong_Check() || PyFloat_Check())".
> 
> If you agree, I will modify the PEP. If Antoine disagrees, I will fork
> the PEP 460 ;-)
> 
> ---
> 
> %s should not support precision (ex: %.100s), use Unicode for that.
> 
> ---
> 
> The PEP 460 should not reintroduce bytes+unicode, implicit decoding or
> implement encoding.
> 
> b'x=%s' % 10 is well defined, it's pure bytes. If you consider that
> bytes should not contain text, why does the bytes type have methods
> like isalpha() or upper()? And why do binary files have a readline()
> method? A "line" doesn't mean anything in pure bytes.
> 
> It's an example of "practicality beats purity". Python 3 should not
> enforce Unicode if the developers *chose* to use bytes to handle mixed
> binary/text protocols like HTTP.
> 
> But I'm against adding "%r" and "%a" because they use Unicode and
> would require an implicit encoding. type(ascii(obj)) is str, not
> bytes. If you really want to use repr() and ascii(), encode the result
> explicitly.

I agree. For non-ASCII characters what ascii() gives you is almost always
not what you want anyway.

Georg




Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Antoine Pitrou
On Sat, 11 Jan 2014 18:41:49 +0100
Victor Stinner  wrote:
> 
> If you agree, I will modify the PEP. If Antoine disagrees, I will fork
> the PEP 460 ;-)

Please fork it.

> b'x=%s' % 10 is well defined, it's pure bytes.

It is well-defined? Then please explain to me what the general case of
  b'%s' % x
is supposed to call:

- does it call x.__bytes__? int.__bytes__ doesn't exist
- does it call bytes(x)? bytes(10) gives
  b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
- does it call x.__str__? you've reintroduced the Python 2 behaviour of
  conflating bytes and unicode
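The candidates really do give three different answers for the same integer. This is a sketch assuming Python 3; the to_bytes call is added here only to show the "binary representation" reading and is not one of the options listed above.

```python
x = 10

as_count = bytes(x)                   # ten zero bytes, as noted above
as_text = str(x).encode('ascii')      # the ASCII digits
as_binary = x.to_bytes(1, 'big')      # the single byte with value 10

print(as_count)    # b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
print(as_text)     # b'10'
print(as_binary)   # b'\n'
```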

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> On 01/11/2014 07:38 AM, Steven D'Aprano wrote:
> >
> >The point that I am making is that many people want to add formatting
> >operations to bytes so they can put ASCII strings inside bytes. But (as
> >far as I can tell) they don't need to do this, because they can treat
> >Unicode strings containing code points U+0000 through U+00FF (i.e. the
> >same range as handled by Latin-1) as if they were bytes.
> 
> So instead of blurring the line between bytes and text, you're blurring the 
> line between text and bytes (with a few extra seat belts thrown in).  

I'm not blurring anything. The people who designed the file format that 
mixes textual data and binary data did the blurring. Given that such 
formats exist, it is inevitable that we need to put text into bytes, or 
bytes into text. The situation is already blurred, we just have to 
decide how to handle it. There are three broad strategies:

1) Make bytes more string-like, so that we can process our data as 
bytes, but still do string operations on the bits that are ASCII.

2) Make strings more byte-like, so that we can process our data as 
strings, but do byte operations (like bit mask operations) on the parts 
that are binary data.

3) Don't do either. Keep the text parts of your data as text, and the 
binary parts of your data as bytes. Do your text operations on text, and 
your byte operations on bytes.

At some point, of course, they need to be combined. We have a choice:

* Right now, we can use text as the base, and combine bytes into the 
  text using Latin-1, and it Just Works.

* Or we can wait until (maybe) Python 3.5, when (perhaps) bytes objects 
  will be more text-like, and then use bytes as the base, and (with 
  luck) it Should Just Work.


There's another disadvantage with the second: treating bytes as if they 
were ASCII by default reinforces the same old harmful paradigm that text 
is ASCII that we're trying to get away from. That's a bad, painful idea 
that causes a lot of problems and buggy code, and should be resisted.

On the other hand, embedding arbitrary binary data in Unicode text 
doesn't reinforce any common or harmful paradigms. It just requires the 
programmer to forget about characters and concentrate on code points, 
since Latin-1 maps bytes to code points in a very convenient way:

Byte 0x00 maps to code point U+0000
Byte 0x01 maps to code point U+0001
Byte 0x02 maps to code point U+0002
...
Byte 0xFF maps to code point U+00FF


So to embed the binary data 0xDEADBEEF in your string, you can just use 
'\xDE\xAD\xBE\xEF' regardless of what character those code points happen 
to be.
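That mapping can be verified exhaustively in a couple of lines. A minimal sketch, assuming Python 3.5 or later for bytes.hex(); the variable names are illustrative.

```python
# Latin-1 maps byte value n to code point U+00nn for every n in 0..255.
identity = all(ord(bytes([n]).decode('latin-1')) == n for n in range(256))
print(identity)        # True

marker = '\xDE\xAD\xBE\xEF'
encoded = marker.encode('latin-1')
print(encoded.hex())   # deadbeef
```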

If we are manipulating data *as if it were text*, then we ought to treat 
it as text, not add methods to bytes that makes bytes text-like. If we 
are manipulating data *as if it were bytes*, doing byte-manipulation 
operations like bit-masking, then we ought to treat it as numeric bytes, 
not add numeric methods to text. Is that really a controversial opinion?
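Kept apart, the two kinds of operation look like this. It is a sketch assuming Python 3.5 or later; the header line is a made-up example.

```python
raw = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Byte-domain operation: keep the low nibble of each byte.
low_nibbles = bytes(b & 0x0F for b in raw)
print(low_nibbles.hex())             # 0e0d0e0f

# Text-domain operation: split and case-fold a header line.
header = 'Content-Type: Text/Plain'
name, _, value = header.partition(': ')
print(name.lower(), value.lower())   # content-type text/plain
```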


> Besides being a bit awkward, this also means that any encoded text (even 
> the plain ASCII stuff) is now being transformed three times instead of one:
> 
>   unicode to bytes
>   bytes to unicode using latin1
>   unicode to bytes

Where do you get this from? I don't follow your logic. Start with a text 
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))

Only the binary blobs need to be decoded. We don't need to encode the 
template to bytes, and the textual data doesn't get encoded until we're 
ready to send it across the wire or write it to disk. And when we do, 
since all the code points are in the range U+0000 to U+00FF, encoding it 
to Latin-1 ought to be a fast, efficient operation, possibly even just a 
mem copy.
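A runnable version of the same idea, sketched for Python 3. The blob contents are a hypothetical payload, and the template is shortened to a single line for brevity.

```python
import struct

# Hypothetical payload standing in for the binary field.
blob = struct.pack(">4h", 29460, 29145, 31098, 27123)

template = "\xDE\xAD\xBE\xEF\nName:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"

# Only the binary blob is decoded; the textual fields are used as they are.
text = template % ("George", 42, blob.decode('latin-1'))

# A single encode at the point where the data leaves the program.
wire = text.encode('latin-1')
print(wire)
```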

It's true that the individual binary data fields will need to be decoded 
from bytes, but unless you want Python to guess an encoding (which is 
the old broken Python 2 model), you're going to have to do that 
regardless.
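
A minimal end-to-end sketch of that flow in Python 3, assuming a stand-in blob that covers every byte value (the blob contents and the name "wire" are made up for illustration):

```python
# Stand-in binary payload covering every possible byte value.
blob = bytes(range(256))

template = "\xDE\xAD\xBE\xEF\nName:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"

# Only the binary blob is decoded; the template and the name are already text.
data = template % ("George", 42, blob.decode("latin-1"))

# A single encode at the boundary, just before writing to the wire or to disk.
wire = data.encode("latin-1")

assert wire.startswith(b"\xDE\xAD\xBE\xEF\nName:\x00\x00\x00George\n")
assert b"Age:\x00\x00\x00\x0042\n" in wire
assert wire.endswith(blob + b"\n")  # the blob survives byte for byte
```

This only works while every code point in the text is at most U+00FF; anything beyond that range makes the final encode fail.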


> Even if the cost of moving those bytes around is cheap, it's not free.  
> When you're creating hundreds of PDFs at a time that's going to make a 
> difference.

You've profiled it? Unless you've measured it, it doesn't exist. I'm not 
going to debate performance penalties of code you haven't written yet.



-- 
Steven
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Ethan Furman

On 01/11/2014 10:32 AM, Antoine Pitrou wrote:
> On Sat, 11 Jan 2014 18:41:49 +0100
> Victor Stinner  wrote:
>>
>> If you agree, I will modify the PEP. If Antoine disagree, I will fork
>> the PEP 460 ;-)
>
> Please fork it.

You've already stated you don't care that much and are willing to let the
PEP as-is be rejected.  Why not remove your name and let Victor have it
back?  Is he not the original author?  (If this is protocol just say so --
remember I'm still new to the ways of PyDev. :).


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread R. David Murray
tl;dr: At the end I'm volunteering to look at real code that is having
porting problems.

On Sat, 11 Jan 2014 17:33:17 +0100, "M.-A. Lemburg"  wrote:
> asciistr is interesting in that it coerces to bytes instead
> of to Unicode (as is the case in Python 2).
> 
> At the moment it doesn't cover the more common case bytes + str,
> just str + bytes, but let's assume it would, then you'd write
> 
> ...
> headers += asciistr('Length: %i bytes\n' % 123)
> headers += b'\n\n'
> body = b'...'
> socket.send(headers + body)
> ...
> 
> With PEP 460, you could write the above as:
> 
> ...
> headers += b'Length: %i bytes\n' % 123
> headers += b'\n\n'
> body = b'...'
> socket.send(headers + body)
> ...
> 
> IMO, that's more readable.
> 
> Both variants essentially do the same thing: they implicitly
> coerce ASCII text strings to bytes, so conceptually, there's
> little difference.

And if we are explicit:

headers = u'Length: %i bytes\n' % 123
headers += u'\n\n'
body = b'...'
socket.send(headers.encode('ascii') + body)

(I included the 'u' prefix only because we are talking about
shared-codebase python2/python3 code.)

That looks pretty readable to me, and it is explicit about what
parts are text and what parts are binary.

But of course we'd never do exactly that in any but the simplest of
protocols and scripts.

Instead we'd write a library that had one or more objects that modeled
our wire/file protocol.  For the text parts, the API would accept input as
text strings.  For the binary parts, it would accept input as bytes.  Then,
when reading or writing the data stream, we perform the appropriate
conversions on the appropriate parts.  Our library does a more complex
analog of 'socket.send(headers.encode('ascii') + body)', one that
understands the various parts and glues them together, encoding the
text parts to the appropriate encoding (often-but-not-always ascii)
as it does so.
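
A rough sketch of what such a model object could look like, under the assumption of a toy protocol with text headers and a binary body (the Message class and its method names are hypothetical, not taken from any real library):

```python
class Message:
    """Hypothetical wire-protocol model: text headers, binary body."""

    def __init__(self, body, header_encoding="ascii"):
        if not isinstance(body, bytes):
            raise TypeError("body must be bytes")
        self.headers = []  # list of (name, value) pairs, both text
        self.body = body
        self.header_encoding = header_encoding

    def add_header(self, name, value):
        # The text parts of the API only accept text.
        if not isinstance(name, str) or not isinstance(value, str):
            raise TypeError("header names and values must be text")
        self.headers.append((name, value))

    def serialize(self):
        # Conversion happens once, at the point where the stream is written.
        head = "".join("%s: %s\n" % (n, v) for n, v in self.headers) + "\n"
        return head.encode(self.header_encoding) + self.body


msg = Message(b"\x00\x01\xff")
msg.add_header("Length", "%i bytes" % 3)
wire = msg.serialize()
assert wire == b"Length: 3 bytes\n\n\x00\x01\xff"
```

The caller never mixes the two domains; only serialize() knows which encoding the text parts use.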

And yes, I have written code that does this in Python3.

What I haven't done is written that code to run in both Python3 and
Python2.  I *think* the only missing thing I would need to back-port
it is the surrogateescape error handler, but I haven't tried it.  And I
could probably conditionalize the code to use latin1 on python2 instead
and get away with it.

And please note that email is probably the messiest of messy binary
wire protocols.  Not only do you have bytes and text mixed in the same
data stream, with internal markers (in the text parts) that specify
how to interpret the binary, including what encodings each part of that
binary data is in for cases where that matters, you *also* have to deal
with the possibility of there being *invalid* binary data mixed in with
the ostensibly text parts, that you nevertheless are expected to both
preserve and parse around.

When I started adding back binary support to the email package, I was
really annoyed by the lack of certain string features in the bytes
type.  But in the end, it turned out to be really simple to instead
think of the text-with-invalid-bytes parts as *text*-with-invalid-bytes
(surrogateescaped bytes).
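
For illustration, a small Python 3 sketch of the surrogateescape round trip, using a made-up header line with one invalid byte in it:

```python
# A header line in which one byte (0xE9) is not valid ASCII.
raw = b"Subject: caf\xe9 menu\n"

text = raw.decode("ascii", errors="surrogateescape")
# The invalid byte is carried along as the lone surrogate U+DCE9.
assert text == "Subject: caf\udce9 menu\n"

# Ordinary str operations can parse around the escaped byte.
name, _, value = text.partition(": ")
assert name == "Subject"

# Encoding with the same error handler restores the original bytes exactly.
assert text.encode("ascii", errors="surrogateescape") == raw
```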

Now, if I was designing from the ground up I'd store the stuff that
was really binary as bytes in the model object instead of storing it as
surrogateescaped text, but that problem is a consequence of how we got from
there to here (python2-email to python3-email-that-didn't-handle-8bit-data
to python3-email-that-works) rather than a problem with the python3 core
data model.

So it seems like I'm with Nick and Antoine and company here.  The
byte-interpolation proposed by Antoine seems reasonable, but I don't
see the *need* for the other stuff.  I think that programs will
be cleaner if the text parts of the protocol are handled *as text*.

On the other hand, Ethan's point that bytes *does* have text methods
is true.  However, other than the perfectly-sensible-for-bytes split,
strip, and ends/startswith, I don't think I actually use any of them.


But!  Our goal should be to help people convert to Python3.  So how can
we find out what the specific problems are that real-world programs are
facing, look at the *actual code*, and help that project figure out the
best way to make that code work in both python2 and python3?

That seems like the best way to find out what needs to be added to
python3 or pypi:  help port the actual code of the developers who are
running into problems.

Yes, I'm volunteering to help with this, though of course I can't promise
exactly how much time I'll have available.

--David


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen J. Turnbull
M.-A. Lemburg writes:

 > I complete agree with Stephen, that bytes are in fact often
 > an encoding of text. If that text is ASCII compatible, I don't
 > see any reason why we should not continue to expose the C lib
 > standard string APIs available for text manipulations on bytes.

We already *have* a type in Python 3.3 that provides text
manipulations on arrays of 8-bit objects: str (per PEP 393).

 > BTW: I don't know why so many people keep asking for use cases.
 > Isn't it obvious that text data without known (but ASCII compatible)
 > encoding or multiple different encodings in a single data chunk
 > is part of life ?

Isn't it equally obvious that if you create or read all such ASCII-
compatible chunks as (encoding='ascii', errors='surrogateescape') that
you *don't need* string APIs for bytes?

Why do these "text chunks" need to be bytes in the first place?
That's why we ask for use cases.  AFAICS, reading and writing ASCII-
compatible text data as 'latin1' is just as fast as bytes I/O.  So
it's not I/O efficiency, and (since in this model we don't do any
en/decoding on bytes/str), it's not redundant en/decoding of bytes to
str and back.



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 04:15:35PM +0100, M.-A. Lemburg wrote:

> I think we need to step back a little from the purist view
> of things and give more emphasis on the "practicality beats
> purity" Zen.
> 
> I complete agree with Stephen, that bytes are in fact often
> an encoding of text. If that text is ASCII compatible, I don't
> see any reason why we should not continue to expose the C lib
> standard string APIs available for text manipulations on bytes.

Later in your post, you talk about the masses of broken encodings found 
everywhere (not just in your spam folder). How do the C lib standard 
string APIs help programmers to avoid broken encodings?


> We don't have to be pedantic about the bytes/text separation.
> It doesn't help in real life.

On the contrary, it helps a lot. To the extent that people keep that 
clean bytes/text separation, it helps avoid bugs. It prevents problems 
like this Python 2 nonsense:

s = "Straße"
assert len(s) == 6  # fails
assert s[5] == 'e'  # fails

Most problematic, printing s may (depending on your terminal settings) 
actually look like "Straße".

Not only is having a clean bytes/text separation the pedantic thing to 
do, it's also the right thing to do nearly always (notwithstanding a 
few exceptions, allegedly).
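
For comparison, a short sketch of how the same checks behave in Python 3, where the text and its UTF-8 encoding are distinct objects (the choice of UTF-8 here is an assumption for the example):

```python
s = "Stra\u00dfe"  # "Strasse" spelled with U+00DF, six code points

# As text, length and indexing count code points.
assert len(s) == 6
assert s[5] == "e"

# As bytes, the same word is seven bytes long in UTF-8,
# because U+00DF encodes to the two bytes 0xC3 0x9F.
encoded = s.encode("utf-8")
assert len(encoded) == 7
assert encoded[4:6] == b"\xc3\x9f"
```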


> If you give programmers the choice they will - most of the time -
> do the right thing. 

Unicode has been available in Python since version 2.2, more than a 
decade ago. And yet here we are, five point releases later (2.7), and 
the majority of text processing code is still using bytes. I'm not just 
pointing the finger at others. My 2.x only code almost always uses byte 
strings for text processing, and not always because it was old code I 
wrote before I knew better. The coders I work with do the same, only you 
can remove the word "almost". The code I see posted on comp.lang.python 
and Reddit and the tutor mailing list invariably uses byte strings. The 
beginners on the tutor list at least have an excuse that they are 
beginners.

A quarter of a century after Unicode was first published, nearly 
28 years since IBM first introduced the concept of "code pages" 
to PC users, and we still have programmers writing ASCII only 
string-handling code that, if it works with extended character sets, 
only works by accident. The majority of programmers still have *no idea* 
of even the most basic parts of Unicode. They've had the right tools 
for a decade, and ignored them.

Python 3 forces the issue, and my code is better for it.


> bytes already have most of the 8-bit string methods from Python 2,
> so it doesn't hurt adding some more of the missing features
> from Python 2 on top to make life easier for people dealing
> with multiple/unknown encoding data.

I personally think it was a mistake to keep text operations like upper() 
and lower() on bytes. I think it will compound the mistake to add even 
more text operations.


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 05:33:17PM +0100, M.-A. Lemburg wrote:

> FWIW: I quite liked the Python 2 model, but perhaps that's because
> I already knew how Unicode works, so could use it to make my
> life easier ;-)

/incredulous

I would really love to see you justify that claim. How do you use the 
Python 2 string type to make processing Unicode text easier?



-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread MRAB

On 2014-01-11 05:36, Steven D'Aprano wrote:
[snip]
> Latin-1 has the nice property that every byte decodes into the character
> with the same code point, and vice versa. So:
>
> for i in range(256):
>     assert bytes([i]).decode('latin-1') == chr(i)
>     assert chr(i).encode('latin-1') == bytes([i])
>
> passes. It seems to me that your problem goes away if you use Unicode
> text with embedded binary data, rather than binary data with embedded
> ASCII text. Then when writing the file to disk, of course you encode it
> to Latin-1, either explicitly:
>
> pdf = ... # Unicode string containing the PDF contents
> with open("outfile.pdf", "wb") as f:
>     f.write(pdf.encode("latin-1"))
>
> or implicitly:
>
> with open("outfile.pdf", "w", encoding="latin-1") as f:
>     f.write(pdf)


[snip]
The second example won't work because you're forgetting about the
handling of line endings in text mode.

Suppose you have some binary data bytes([10]).

You convert it into a Unicode string using Latin-1, giving '\n'.

You write it out to a file opened in text mode.

On Windows, that string '\n' will be written to the file as b'\r\n'.



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Antoine Pitrou
On Sat, 11 Jan 2014 10:38:01 -0800
Ethan Furman  wrote:
> On 01/11/2014 10:32 AM, Antoine Pitrou wrote:
> > On Sat, 11 Jan 2014 18:41:49 +0100
> > Victor Stinner  wrote:
> >>
> >> If you agree, I will modify the PEP. If Antoine disagree, I will fork
> >> the PEP 460 ;-)
> >
> > Please fork it.
> 
> You've already stated you don't care that much and are willing to let the PEP 
> as-is be rejected.  Why not remove your 
> name and let Victor have it back?  Is he not the original author?  (If this 
> is protocol just say so -- remember I'm 
> still new to the ways of PyDev. :).

Because the PEP is IMO a much saner compromise than what you're
trying to do (and would also stand a better chance of being accepted,
if it weren't for your stupid maximalist opposition).

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
> On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
>>
>>   unicode to bytes
>>   bytes to unicode using latin1
>>   unicode to bytes
>
> Where do you get this from? I don't follow your logic. Start with a text
> template:
>
> template = """\xDE\xAD\xBE\xEF
> Name:\0\0\0%s
> Age:\0\0\0\0%d
> Data:\0\0\0%s
> blah blah blah
> """
>
> data = template % ("George", 42, blob.decode('latin-1'))
>
> Only the binary blobs need to be decoded. We don't need to encode the
> template to bytes, and the textual data doesn't get encoded until we're
> ready to send it across the wire or write it to disk.


And what if your name field has data not representable in latin-1?

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)

So really your example should be:

data = template % (
    "George".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'),
    42, blob.decode('latin-1'))

Which is a mess.
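
The same failure can be reproduced in Python 3, along with the decode-as-Latin-1 step described above. This is only a sketch; UTF-8 stands in for whatever encoding the field really uses, and the variable names are made up:

```python
# The three Cyrillic code points from the interactive session above.
name = "\u0441\u0440\u0403"

# Encoding straight to Latin-1 fails: none of these code points is <= U+00FF.
try:
    name.encode("latin-1")
except UnicodeEncodeError:
    failed = True
else:
    failed = False
assert failed

# The extra step: encode the field to its real encoding first, then carry
# those bytes through the text template as Latin-1 code points.
smuggled = name.encode("utf-8").decode("latin-1")
assert smuggled.encode("latin-1") == b"\xd1\x81\xd1\x80\xd0\x83"
```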

--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Kristján Valur Jónsson
Hi there.
How about a compromise?
Personally, I think adding the full complement of integer/float formatting to 
bytes is a bit over the top.
How about just supporting two format specifiers?
%b : interpolate a bytes object.  If it doesn't have the buffer interface, 
error.
%s : interpolate a str object, encoded to ASCII using 'strict' conversion.  

This should cover the most common use cases.
In particular, you could do this:

headers.append('Content-Length: %s'%(len(data),))

And then subsequently:
packet = b'%b%b'%(b"".join(headers), data)

For more complex formatting, you delegate to the more capable string class, but 
benefit from automatic ASCII conversion:

Data = b"percentage = %s" % ("%4.2f" % (value,))

I think interpolating bytes objects is very important.  And support for 
automatic ASCII conversion in the process will help us cover all of the numeric 
use cases.
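
For comparison, the same two-step idea can be sketched with explicit conversions only: numbers are formatted in str, encoded strictly as ASCII, and then joined with the binary payload. The helper ascii_bytes and the header names are invented for this example:

```python
def ascii_bytes(text):
    """Encode text to bytes, failing loudly on anything outside ASCII."""
    return text.encode("ascii", "strict")


data = b"\x00\x01\x02binary payload"

headers = []
# Numeric formatting is delegated to str, then converted in one explicit step.
headers.append(ascii_bytes("Content-Length: %s\r\n" % (len(data),)))
headers.append(ascii_bytes("X-Percentage: %4.2f\r\n" % (12.5,)))

packet = b"".join(headers) + b"\r\n" + data
assert packet.startswith(b"Content-Length: 17\r\nX-Percentage: 12.50\r\n\r\n")
assert packet.endswith(data)
```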

K

-Original Message-
From: Python-Dev [mailto:[email protected]] 
On Behalf Of Victor Stinner
Sent: 11 January 2014 17:42
To: Python Dev
Subject: [Python-Dev] PEP 460: allowing %d and %f and mojibake

Hi,

I'm in favor of adding support of formatting integer and floating point 
numbers in the PEP 460: %d, %u, %o, %x, %f with padding and precision (%10d, 
%010d, %1.5f) and sign (%-i, %+i) but without alternate format ("{:#x}"). %s 
would also accept int and float for convenience.




Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Barry Warsaw
On Jan 11, 2014, at 10:38 AM, Ethan Furman wrote:

>You've already stated you don't care that much and are willing to let the PEP
>as-is be rejected.  Why not remove your name and let Victor have it back?  Is
>he not the original author?  (If this is protocol just say so -- remember I'm
>still new to the ways of PyDev. :).

From a procedural point of view, I would say that it's entirely appropriate
for a PEP to have open questions, alternatives, and options.  Have it lay out
the arguments pro and con and let Guido or the appointed PEP czar make the
final decision.  Then the PEP can be amended with those decisions, and if
folks still think more needs to be done, a follow up PEP can be filed.

-Barry


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Stephen J. Turnbull
MRAB writes:

 > > with open("outfile.pdf", "w", encoding="latin-1") as f:
 > >  f.write(pdf)
 > >
 > [snip]
 > The second example won't work because you're forgetting about the
 > handling of line endings in text mode.

Not so fast!  Forgot, yes (me too!), but not work?  Not quite:

with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
    f.write(pdf)

should do the trick.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
> MRAB writes:
>
>  > > with open("outfile.pdf", "w", encoding="latin-1") as f:
>  > >     f.write(pdf)
>  > >
>  > [snip]
>  > The second example won't work because you're forgetting about the
>  > handling of line endings in text mode.
>
> Not so fast!  Forgot, yes (me too!), but not work?  Not quite:
>
>     with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
>         f.write(pdf)
>
> should do the trick.


Well, it's good that there is a work-a-round.  Are we going to have a document listing all the work-a-rounds needed to 
program a bytes-oriented style using unicode?


--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread R. David Murray
On Sat, 11 Jan 2014 11:54:26 -0800, Ethan Furman  wrote:
> On 01/11/2014 11:49 AM, Stephen J. Turnbull wrote:
> > MRAB writes:
> >
> >   > > with open("outfile.pdf", "w", encoding="latin-1") as f:
> >   > >  f.write(pdf)
> >   > >
> >   > [snip]
> >   > The second example won't work because you're forgetting about the
> >   > handling of line endings in text mode.
> >
> > Not so fast!  Forgot, yes (me too!), but not work?  Not quite:
> >
> >  with open("outfile.pdf", "w", encoding="latin-1", newline="") as f:
> >      f.write(pdf)
> >
> > should do the trick.
> 
> Well, it's good that there is a work-a-round.  Are we going to have a 
> document listing all the work-a-rounds needed to 
> program a bytes-oriented style using unicode?

That's not a work-around (if you are talking specifically about the
newline="").  That's just the way the python3 IO library works.  If you
want to preserve the newlines in your data, but still have the text-io
machinery count them for deciding when to trigger io/buffering behavior,
you use newline=''.
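
A small sketch of the difference, using an in-memory stream so it runs the same on any platform. Passing newline="\r\n" is used here to imitate what the default text mode does on Windows; the helper written_bytes is made up for the example:

```python
import io


def written_bytes(text, newline):
    """Write text through a Latin-1 text layer and return the raw bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding="latin-1", newline=newline)
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()


# Binary data that happens to contain byte 10, decoded as Latin-1.
pdf = bytes([10]).decode("latin-1")
assert pdf == "\n"

# With newline translation, the byte 10 is silently expanded to 13, 10.
assert written_bytes(pdf, "\r\n") == b"\r\n"

# With newline="", the data passes through untouched.
assert written_bytes(pdf, "") == b"\n"
```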

It's not the most intuitive API, so I won't be surprised if a lot of
people don't know about it or get confused by it when they see it.
I first learned about it in the context of csv files, another one of
those legacy file protocols that are mostly-text-but-not-entirely.

--David


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Donald Stufft

On Jan 11, 2014, at 10:34 AM, Nick Coghlan  wrote:

> Yes, it bloody well does. The number of people who have told me that
> using Python 3 is what allowed them to finally understand how Unicode
> works vastly exceeds the number of wire protocol and file format devs
> that have complained about working with binary formats being
> significantly less tolerant of the "it's really like ASCII text"
> mindset.

FWIW as one of the people who it took Python3 to finally figure out how to
actually use unicode, it was the absence of encode on bytes and decode on
str that actually did it. Giving bytes a format method would not have affected
that either way I don’t believe.

-
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA





Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Ethan Furman

On 01/11/2014 11:22 AM, Antoine Pitrou wrote:
> On Sat, 11 Jan 2014 10:38:01 -0800
> Ethan Furman  wrote:
>> On 01/11/2014 10:32 AM, Antoine Pitrou wrote:
>>> On Sat, 11 Jan 2014 18:41:49 +0100
>>> Victor Stinner  wrote:
>>>>
>>>> If you agree, I will modify the PEP. If Antoine disagree, I will fork
>>>> the PEP 460 ;-)
>>>
>>> Please fork it.
>>
>> You've already stated you don't care that much and are willing to let the
>> PEP as-is be rejected.  Why not remove your name and let Victor have it
>> back?  Is he not the original author?  (If this is protocol just say so --
>> remember I'm still new to the ways of PyDev. :).
>
> Because the PEP is IMO a much saner compromise than what you're
> trying to do (and would also stand a better chance of being accepted,
> if it weren't for your stupid maximalist opposition).


Well, it's good to know you do care.  :)

--
~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Georg Brandl
On 11.01.2014 20:22, Antoine Pitrou wrote:
> On Sat, 11 Jan 2014 10:38:01 -0800
> Ethan Furman  wrote:
>> On 01/11/2014 10:32 AM, Antoine Pitrou wrote:
>> > On Sat, 11 Jan 2014 18:41:49 +0100
>> > Victor Stinner  wrote:
>> >>
>> >> If you agree, I will modify the PEP. If Antoine disagree, I will fork
>> >> the PEP 460 ;-)
>> >
>> > Please fork it.
>> 
>> You've already stated you don't care that much and are willing to let the 
>> PEP as-is be rejected.  Why not remove your 
>> name and let Victor have it back?  Is he not the original author?  (If this 
>> is protocol just say so -- remember I'm 
>> still new to the ways of PyDev. :).
> 
> Because the PEP is IMO a much saner compromise than what you're
> trying to do (and would also stand a better chance of being accepted,
> if it weren't for your stupid maximalist opposition).

Can you please stop throwing personal insults around?  You don't have to
resort to that level.

Georg



Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Serhiy Storchaka

On 11.01.14 21:40, Kristján Valur Jónsson wrote:
> How about a compromise?
> Personally, I think adding the full complement of integer/float formatting to
> bytes is a bit over the top.
> How about just supporting two format specifiers?
> %b : interpolate a bytes object.  If it doesn't have the buffer interface,
> error.
> %s : interpolate a str object, encoded to ASCII using 'strict' conversion.


%b is not supported in Python 2.7. And compatibility with Python 2.7 is 
the only purpose of this feature.




Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Georg Brandl
On 11.01.2014 22:01, Serhiy Storchaka wrote:
> On 11.01.14 21:40, Kristján Valur Jónsson wrote:
>> How about a compromise?
>> Personally, I think adding the full complement of integer/float formatting 
>> to bytes is a bit over the top.
>> How about just supporting two format specifiers?
>> %b : interpolate a bytes object.  If it doesn't have the buffer interface, 
>> error.
>> %s : interpolate a str object, encoded to ASCII using 'strict' conversion.
> 
> %b is not supported in Python 2.7. And compatibility with Python 2.7 is 
> only the purpose of this feature.

Not "only", but it is certainly an important one.

Georg



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Terry Reedy

On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:

> We already *have* a type in Python 3.3 that provides text
> manipulations on arrays of 8-bit objects: str (per PEP 393).
>
>   > BTW: I don't know why so many people keep asking for use cases.
>   > Isn't it obvious that text data without known (but ASCII compatible)
>   > encoding or multiple different encodings in a single data chunk
>   > is part of life ?
>
> Isn't it equally obvious that if you create or read all such ASCII-
> compatible chunks as (encoding='ascii', errors='surrogateescape') that
> you *don't need* string APIs for bytes?
>
> Why do these "text chunks" need to be bytes in the first place?
> That's why we ask for use cases.  AFAICS, reading and writing ASCII-
> compatible text data as 'latin1' is just as fast as bytes I/O.  So
> it's not I/O efficiency, and (since in this model we don't do any
> en/decoding on bytes/str), it's not redundant en/decoding of bytes to
> str and back.


The problem with some criticisms of using 'unicode in Python 3' is that 
there really is no such thing. Unicode in 3.0 to 3.2 used the old 
internal model inherited from 2.x. Unicode in 3.3+ uses a different 
internal model that is a game changer with respect to certain issues of 
space and time efficiency (and cross-platform correctness and 
portability). So at least some of the valid criticisms based on the old 
model are out of date and no longer valid.


--
Terry Jan Reedy



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Daniel Holth
On Sat, Jan 11, 2014 at 4:28 PM, Terry Reedy  wrote:
> On 1/11/2014 1:44 PM, Stephen J. Turnbull wrote:
>
>> We already *have* a type in Python 3.3 that provides text
>> manipulations on arrays of 8-bit objects: str (per PEP 393).
>>
>>   > BTW: I don't know why so many people keep asking for use cases.
>>   > Isn't it obvious that text data without known (but ASCII compatible)
>>   > encoding or multiple different encodings in a single data chunk
>>   > is part of life ?
>>
>> Isn't it equally obvious that if you create or read all such ASCII-
>> compatible chunks as (encoding='ascii', errors='surrogateescape') that
>> you *don't need* string APIs for bytes?
>>
>> Why do these "text chunks" need to be bytes in the first place?
>> That's why we ask for use cases.  AFAICS, reading and writing ASCII-
>> compatible text data as 'latin1' is just as fast as bytes I/O.  So
>> it's not I/O efficiency, and (since in this model we don't do any
>> en/decoding on bytes/str), it's not redundant en/decoding of bytes to
>> str and back.
>
>
> The problem with some criticisms of using 'unicode in Python 3' is that
> there really is no such thing. Unicode in 3.0 to 3.2 used the old internal
> model inherited from 2.x. Unicode in 3.3+ uses a different internal model
> that is a game changer with respect to certain issues of space and time
> efficiency (and cross-platform correctness and portability). So at least
> some the valid criticisms based on the old model are out of date and no
> longer valid.

-1 on adding more surrogateescapes by default. It's a pain to track
down where the encoding errors came from.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 12:45 PM, Donald Stufft wrote:
>
> FWIW as one of the people who it took Python3 to finally figure out how to
> actually use unicode, it was the absence of encode on bytes and decode on
> str that actually did it. Giving bytes a format method would not have affected
> that either way I don’t believe.


My biggest hurdle was realizing that ASCII was an encoding.

--
~Ethan~


[Python-Dev] test.support.check_warnings

2014-01-11 Thread Ethan Furman

The docs say this [1]:
==
 test.support.check_warnings(*filters, quiet=True)

A convenience wrapper for warnings.catch_warnings() that makes it easier to test that a warning was correctly 
raised. It is approximately equivalent to calling warnings.catch_warnings(record=True) with warnings.simplefilter() set 
to always and with the option to automatically validate the results that are recorded.


check_warnings accepts 2-tuples of the form ("message regexp", WarningCategory) as positional arguments. If one or 
more filters are provided, or if the optional keyword argument quiet is False, it checks to make sure the warnings are 
as expected: each specified filter must match at least one of the warnings raised by the enclosed code or the test 
fails, and if any warnings are raised that do not match any of the specified filters the test fails. To disable the 
first of these checks, set quiet to True.

==

What I want is to make sure that DeprecationWarnings are being raised:
==
with support.check_warnings(
        ("automatic int conversions have been deprecated",
         DeprecationWarning),
        quiet=False,
        ):
    exec("'%x' % pi")
    exec("'%x' % 3.14")
    exec("'%X' % 2.11")
    exec("'%o' % 1.79")
    exec("'%c' % pi")
==

But if I throw in something that doesn't raise a deprecation warning, the test 
still passes:
==
exec("'%d' % 3")
==

Am I doing something wrong?

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Mariano Reingart
On Fri, Jan 10, 2014 at 9:13 PM, Juraj Sukop  wrote:

>
>
>
> On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou wrote:
>
>> Also, when you say you've never encountered UTF-16 text in PDFs, it
>>  sounds like those people who've never encountered any non-ASCII data in
>> their programs.
>
>
> Let me clarify: one does not think in "writing text in Unicode"-terms in
> PDF. Instead, one records the sequence of "character codes" which
> correspond to "glyphs" or the glyph IDs directly. That's because one
> Unicode character may have more than one glyph and more characters can be
> shown as one glyph.
>
>
>
AFAIK (and just for the record), there could be both Latin1 text and UTF-16
in a PDF (and other encodings too), depending on the font used:

/Encoding /WinAnsiEncoding (mostly latin1 "standard" fonts)
/Encoding /Identity-H (generally for unicode UTF-16 True Type "embedded"
fonts)

For example, in PyFPDF (a PHP library ported to Python), the following code
writes out text that could be encoded in two different encodings:

s = sprintf("BT %.2f %.2f Td (%s) Tj ET", x*self.k, (self.h-y)*self.k, txt)

https://code.google.com/p/pyfpdf/source/browse/fpdf/fpdf.py#602

In Python2, txt is just a str, but in Python3 handling everything as a latin1
string obviously doesn't work for TTF in this case.

Best regards

Mariano Reingart
http://www.sistemasagiles.com.ar
http://reingart.blogspot.com


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Ethan Furman

On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

On Sat, 11 Jan 2014 18:41:49 +0100
Victor Stinner  wrote:


b'x=%s' % 10 is well defined, it's pure bytes.


It is well-defined? Then please explain me what the general case of
   b'%s' % x
is supposed to call:


This is the key question, isn't it?



- does it call x.__bytes__? int.__bytes__ doesn't exist


Perhaps that's the problem.  According to the docs:

 object.__bytes__(self)

Called by bytes() to compute a byte-string representation of an object. 
This should return a bytes object.


Obviously, with the plethora of different binary possibilities for representing a number (how many bytes? endianness? 
which complement?), we would be well within our rights to decide that the "byte-string representation" of the numeric 
types is the ASCII equivalent of their __repr__ or __str__, and implement __bytes__ appropriately for them.  Any other 
object that wants to be represented easily in a byte stream would also have to implement __bytes__.   If necessary we 
could add __bytes__ to str for /strict/ ASCII conversion (even latin-1 would have to be explicitly encoded)[1].
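
To make the idea concrete, here is a minimal sketch of what such a __bytes__ could look like. The Version class is hypothetical and exists only for illustration; the sketch assumes Python 3, where bytes() consults __bytes__. Note that bytes(3) currently means "three NUL bytes", which is why an ASCII int.__bytes__ would be a change in meaning:

```python
class Version:
    """Hypothetical type, used only to illustrate __bytes__."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __str__(self):
        return "%d.%d" % (self.major, self.minor)

    def __bytes__(self):
        # "Byte-string representation" chosen here: the ASCII form of __str__.
        return str(self).encode("ascii")


as_bytes = bytes(Version(3, 5))        # goes through Version.__bytes__
int_as_zeros = bytes(3)                # an int argument means "this many NULs"
int_as_ascii = str(3).encode("ascii")  # the explicit spelling of ASCII digits
```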


--
~Ethan~

[1] I'm iffy on this point as I'm not at all sure it's needed.


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Victor Stinner
2014/1/11 Ethan Furman :
>>> b'x=%s' % 10 is well defined, it's pure bytes.
>>
>> It is well-defined? Then please explain me what the general case of
>>b'%s' % x
>> is supposed to call:
>
> This is the key question, isn't it?

Python 2 and Python 3 are very different here.

In Python 2, the "s" format of PyArg_Parse may call the __str__()
method of an object.

In Python 3, the "y*" format of PyArg_Parse uses the Py_buffer API,
which has no Python-level slot (there is no protocol like a
__getbuffer__() method).  Py_buffer can only be implemented in C; for
example, bytes, bytearray and memoryview implement it. PyArg_Parse also
requires the buffer to be C-contiguous and to have a single segment
(it uses the PyBUF_SIMPLE flag).

Said differently, bytes%args and bytes.format() would *not* call any method.
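
As a rough illustration from the Python side (a sketch only; exports_buffer is a made-up helper name), memoryview() can be used to probe whether an object exports a buffer. It accepts bytes, bytearray and memoryview, and raises TypeError for int and str:

```python
def exports_buffer(obj):
    """Return True if obj supports the buffer protocol, False otherwise."""
    try:
        # memoryview() asks the object for a buffer and raises TypeError
        # when the object does not export one.
        with memoryview(obj):
            return True
    except TypeError:
        return False
```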

Victor


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Victor Stinner
Hi,

2014/1/11 Antoine Pitrou :
>> b'x=%s' % 10 is well defined, it's pure bytes.
>
> It is well-defined? Then please explain me what the general case of
>   b'%s' % x
> is supposed to call:
>
> - does it call x.__bytes__? int.__bytes__ doesn't exist
> - does it call bytes(x)? bytes(10) gives
>   b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
> - does it call x.__str__? you've reintroduced the Python 2 behaviour of
>   conflating bytes and unicode

I don't want to call any method from bytes%args, only Py_buffer API
would be used. So the pseudo-code becomes:

- try to get Py_buffer
- on failure, check if it's an int: yes? ok, format it as decimal
- otherwise, raise an error

Or:

- is the object an int? yes, format it as decimal. no, use Py_buffer
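
In Python terms, the second variant could be sketched roughly as follows. This is only an illustration of the pseudo-code above, not the proposed C implementation; interpolate_value is a hypothetical name, and memoryview() stands in for the Py_buffer lookup:

```python
def interpolate_value(value):
    """Return the bytes to insert for one %s argument (illustrative only)."""
    if isinstance(value, int):
        # Integers are formatted as decimal ASCII digits.
        return ("%d" % value).encode("ascii")
    try:
        # Everything else must export a buffer (bytes, bytearray, memoryview...).
        return bytes(memoryview(value))
    except TypeError:
        raise TypeError("cannot interpolate %s into bytes"
                        % type(value).__name__)
```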

--

I talked with Antoine to try to understand how and why we disagree.

Antoine prefers a pure API, whereas I'm trying to figure out whether it
would be possible to write code compatible with Python 2 and Python 3.

Using Antoine's PEP, it's possible to write code that works on Python 2
and Python 3 as long as it only manipulates byte strings.

The problem is that it's a pain to write code that works on both Python
versions when an argument is an integer. For example, the Python 2
code "Content-Length: %s\r\n" % 123 is written ("Content-Length:
%s\r\n" % 123).encode('ascii') in Python 3, so the Python 2 and
Python 3 code differs.

Supporting integer formatting would make it possible to write
b"Content-Length: %s\r\n" % 123, which would work on both Python 2 and
Python 3.

(u'Content-Length: %s\r\n' % 123).encode('ascii') works on both Python
versions, but converting existing Python 2 code to that form may
require more work.

--

Now I'm trying to find use cases in the Mercurial and Twisted source code
to see which features are required. First, I'm looking for a function
that needs to format a number in decimal in a bytes string.


In issue #3982, I saw:

"""
HTTP chunking uses ASCII mixed with binary (octets). With 2.6 you could write:

def chunk(block):
    return b'{0:x}\r\n{1}\r\n'.format(len(block), block)
"""

and

"""
'Content-length: {}\r\n'.format(length)
"""

But are the examples real use cases, or artificial examples?
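
For comparison, here is a sketch of how both snippets can be written so that the encoding step is explicit. It assumes Python 3 and a bytes-like block; content_length is a hypothetical helper name:

```python
def chunk(block):
    """Frame block as one HTTP chunk: hex size, CRLF, payload, CRLF."""
    # The size line is text, so it is formatted as str and encoded to ASCII.
    size = format(len(block), "x").encode("ascii")
    return size + b"\r\n" + bytes(block) + b"\r\n"


def content_length(length):
    """Build the header line as text, then encode it to ASCII bytes."""
    return ("Content-Length: %d\r\n" % length).encode("ascii")
```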

--

Augie Fackler gave an example from Mercurial:
"""
sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path':
'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so
we have to treat the whole thing as opaque bytes.  It's even more fun
for 'log', because then it's got localized strings in it as well.
"""

But here I disagree with the design of Mercurial: filenames should be
treated as text. If a filename were pure binary, you should not
write it to a terminal. Displaying binary data usually leads to
displaying random characters and changing terminal options (ex: text
starts blinking or is displayed in bold!?) :-)

For the localized string: again, it's also a design issue in my
opinion. A localized string is text, not binary data :-)

--

Another option is that I cannot find use cases because there are no use
cases for PEP 460 and the PEP is useless :-)

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 07:22:30PM +0000, MRAB wrote:

> >with open("outfile.pdf", "w", encoding="latin-1") as f:
> > f.write(pdf)
> >
> [snip]
> The second example won't work because you're forgetting about the
> handling of line endings in text mode.

So I did! Thank you for the correction.
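
A small sketch of the pitfall, assuming Python 3: in text mode with the default newline=None, every "\n" written is translated to os.linesep, so on Windows a blob containing the byte 0x0A would come out altered. Passing newline="" turns that translation off (or the file can simply be opened in binary mode). The helper name is hypothetical:

```python
import os
import tempfile


def roundtrip_text_mode(data):
    """Write bytes through a Latin-1 text file and read them back raw."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" disables newline translation on write.
        with open(path, "w", encoding="latin-1", newline="") as f:
            f.write(data.decode("latin-1"))
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


payload = bytes(range(256))  # includes 0x0A and 0x0D
```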



-- 
Steven


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Glenn Linderman

On 1/11/2014 1:50 PM, Ethan Furman wrote:

Perhaps that's the problem.  According to the docs:

 object.__bytes__(self)

Called by bytes() to compute a byte-string representation of an 
object. This should return a bytes object.



Obviously, with the plethora of different binary possibilities for 
representing a number (how many bytes? endianness? which complement?), 
we would be well within our rights to decide that the "byte-string 
representation" of the numeric types is the ASCII equivalent of their 
__repr__ or __str__, and implement __bytes__ appropriately for them.  
Any other object that wants to be represented easily in a byte stream 
would also have to implement __bytes__.   If necessary we could add 
__bytes__ to str for /strict/ ASCII conversion (even latin-1 would 
have to be explicitly encoded)[1]. 


In spite of Victor's explanation of internals, which I didn't 
understand, this sounds like a very interesting idea, conceptually: 
any object could implement its own __bytes__ representation.


On the other hand, it would probably have to be parameterized in the 
general case: for binary data values, one protocol or format may wish 
the data to be big-endian, and another may wish the data to be 
little-endian; for str, one protocol or format may require one encoding 
and another may require a different encoding, even (as for email) for 
different parts of the message. So it could be somewhat complex, yet 
would be very powerful in allowing complex objects, made up of other 
objects, some of which might have a variety of potential bytes formats 
(think TIFF files, for example) to convert themselves into a stream of 
bytes that fits the standard. On the flip side, one would want to 
convert the stream of bytes into the set of objects, which is a parsing 
problem.


This is a bit beyond what can be done automatically, just by calling 
__bytes__ with no parameters, though.
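
A sketch of what a parameterized conversion might look like, assuming a hypothetical Record type with a to_bytes() method that takes the byte order and the text encoding as arguments. __bytes__ itself can then only pick one fixed default:

```python
import struct


class Record:
    """Hypothetical record: an unsigned 32-bit id plus a length-prefixed label."""

    def __init__(self, ident, label):
        self.ident = ident
        self.label = label

    def to_bytes(self, byteorder="big", encoding="ascii"):
        prefix = ">" if byteorder == "big" else "<"
        raw_label = self.label.encode(encoding)
        # I = unsigned 32-bit id, H = unsigned 16-bit label length.
        header = struct.pack(prefix + "IH", self.ident, len(raw_label))
        return header + raw_label

    def __bytes__(self):
        # No parameters are available here, so one default has to be chosen.
        return self.to_bytes()
```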


What it may be, though, is a meta-operation from which the needed bytes 
operations can be determined. It may also not be an easy "compatible 
with existing Python 2 code with minor tweaks" solution, either. It 
would be more like a pickle protocol, but pickle defines its own 
formats, and thus is useless for creating standard formats.


I guess it would belong on python-ideas.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Matěj Cepl

On 2014-01-11, 18:09 GMT, you wrote:
>> We are NOT going back to the confusing incoherent mess that 
>> is the Python 2 model of bolting Unicode onto the side of 
>> POSIX . . .
>
> We are not asking for that.

Yes, you are. Maybe not you personally, but the number of people on 
this list (for F...k sake, this is for DEVELOPERS of the language, not 
some bloody users!) for whom the current suggestion is just a way to 
avoid Unicode and keep alive all those broken scripts which barf at me 
all the time is quite non-zero, I am afraid.

Best,

Matěj



[Python-Dev] byteformat() proposal: please critique

2014-01-11 Thread Terry Reedy
The following function interpolates bytes, bytearrays, and formatted 
strings, the latter two auto-converted to bytes, into a bytes (or 
auto-converted bytearray) format. This function automates much of what 
some people have recommended for combining ascii text and binary blobs. 
The test passes on 2.7.6 as well as 3.3.3, though a 2.7-only version 
would be simpler.

===

# bf.py -- Terry Jan Reedy, 2014 Jan 11
"Define byteformat(): a bytes version of str.format as a function."
import re

def byteformat(form, obs):
    '''Return bytes-formatted objects interpolated into bytes format.

    The bytes or bytearray format has two types of replacement fields.
    b'{}' and b'{:}': The object can be any raw bytes or bytearray object.
    b'{:<format_spec>}': The object can be any object ob that can be
    string-formatted with <format_spec>. Bytearrays are converted to bytes.

    The text encoding is the default (encoding="utf-8", errors="strict").
    Users should explicitly encode to bytes for any other encoding.
    The struct module can be used to produce bytes, such as binary-formatted
    integers, that are not encoded text.

    Test passes on both 2.7.6 and 3.3.3.
    '''

    if isinstance(form, bytearray):
        form = bytes(form)
    # Splitting leaves literal text at even indexes and the captured
    # format_spec of each replacement field at odd indexes.
    fields = re.split(b'{:?([^}]*)}', form)
    # print(fields)
    if len(fields) != 2*len(obs)+1:
        raise ValueError('Number of replacement fields not same as len(obs)')

    j = 1 # index into fields
    for ob in obs:
        if isinstance(ob, bytearray):
            ob = bytes(ob)
        field = fields[j]
        # Empty spec: raw bytes. Otherwise format as text, then encode.
        fields[j] = format(ob, field.decode()).encode() if field else ob
        j += 2
    return b''.join(fields)

# test code
bformat = (b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; "
           b"float: {:7.2f}; end")

objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
result = byteformat(bformat, objects)
result2 = byteformat(bytearray(bformat), objects)
strings = (ob.decode() if isinstance(ob, (bytes, bytearray)) else ob
           for ob in objects)
expect = bformat.decode().format(*strings).encode()

#print(result)
#print(result2)
print(expect)
assert result == result2 == expect

=
This has been edited from what I posted to issue 3982 to expand the 
docstrings and to work the same with both bytes and bytearrays on both 
2.7 and 3.3. When I posted before, I thought of it merely as a 
proof-of-concept prototype. After reading the seemingly endless 
discussion of possible variations of byte formatting with % and .format, 
I now present it as a real, concrete, proposal.


There are, of course, details that could be tweaked. The encoding uses 
the default, which on 3.x is (encoding='utf-8', errors='strict').  This 
could be changed to an explicit encoding='ascii'. If that were done, the 
encoding could be made a parameter that defaults to 'ascii'. The joiner 
could be defined as type(form)() so the output type matches the input 
form type. I did not do that because it complicates the test.
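
A sketch of those tweaks, for 3.x only: the encoding becomes a parameter defaulting to 'ascii', and the joiner is type(form)() so that a bytearray format yields a bytearray. The name byteformat2 and the parameter names are assumptions for illustration, not part of the function above:

```python
import re


def byteformat2(form, obs, encoding="ascii", errors="strict"):
    """Variant of byteformat() with an explicit encoding (3.x-only sketch)."""
    # Literal text lands at even indexes, format specs at odd indexes.
    fields = re.split(rb"\{:?([^}]*)\}", bytes(form))
    if len(fields) != 2 * len(obs) + 1:
        raise ValueError("Number of replacement fields not same as len(obs)")
    for j, ob in zip(range(1, len(fields), 2), obs):
        spec = fields[j]
        if spec:
            # Format as text first, then encode with the chosen encoding.
            text = format(ob, spec.decode("ascii"))
            fields[j] = text.encode(encoding, errors)
        elif isinstance(ob, (bytes, bytearray)):
            fields[j] = bytes(ob)
        else:
            raise TypeError("{} and {:} need bytes or bytearray, not %s"
                            % type(ob).__name__)
    # The output type follows the type of the format.
    return type(form)().join(fields)
```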


The coercion of interpolated bytearray objects to bytes is needed for 
2.7 because in 2.7, str/bytes.join raises TypeError for bytearrays in 
the input sequence. A 3.x-only version could drop this.


One objection to the function is that it is neither % nor .format. To me, 
this is an advantage in that a new function will not be expected to 
exactly match the % or .format behavior in either 2.x or 3.x. It 
eliminates the 'matching the old' arguments so we can focus on what 
actual functionality is needed. There is no need to convert true binary 
bytes to text with either latin-1 or surrogates. There is no need to add 
anything to bytes. The code above uses the built-in facilities that we 
already have, which to me should be the first thing to try, not the last.


One new feature that does not match old behavior is that {} and {:} are 
changed (in 3.x) to indicate bytes whereas {:s} continues to indicate 
(in 3.x) unicode text. ({:s} might be changed to mean unicode for 2.7 
also, but I did not explore that idea.) Similarly, a new function is 
free to borrow only the format_spec part of replacement fields and use 
format(ob, format_spec) to format each object. Anyone who 
needs the full power of str.format is free to use it explicitly. I think 
format_specs cover most of what people have asked for.


For future releases, the function could go in the string module. It 
could otherwise be added to existing or future 2&3 porting packages.


--
Terry Jan Reedy



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 04:28:34PM -0500, Terry Reedy wrote:

> The problem with some criticisms of using 'unicode in Python 3' is that 
> there really is no such thing. Unicode in 3.0 to 3.2 used the old 
> internal model inherited from 2.x. Unicode in 3.3+ uses a different 
> internal model that is a game changer with respect to certain issues of 
> space and time efficiency (and cross-platform correctness and 
> portability). So at least some of the valid criticisms based on the old 
> model are out of date and no longer valid.

While there are definitely performance savings (particularly of memory) 
regarding the FSR in Python 3.3, for the use-case we're talking about, 
Python 3.1 and 3.2 (and for that matter, 2.2 through 2.7) Unicode 
strings should be perfectly adequate. The textual data being used is 
ASCII, and the binary blobs are encoded to Latin-1, so everything is a 
subset of Unicode, namely U+0000 to U+00FF. That means there are no 
astral characters, and no behavioural differences between wide and 
narrow builds (apart from memory use).
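
This is easy to check: Latin-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding is lossless. A quick sketch:

```python
all_bytes = bytes(range(256))

# Decode every possible byte value as Latin-1 text.
as_text = all_bytes.decode("latin-1")

# Each byte N becomes code point N, so nothing exceeds U+00FF.
code_points = [ord(ch) for ch in as_text]
highest = max(code_points)

# Encoding back gives the original bytes, unchanged.
restored = as_text.encode("latin-1")
```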


-- 
Steven


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:

> AFAIK (and just for the record), there could be both Latin1 text and UTF-16
> in a PDF (and other encodings too), depending on the font used:
[...]
> In Python2, txt is just a str, but in Python3 handling everything as latin1
> string obviously doesn't work for TTF in this case.

Nobody is suggesting that you use Latin-1 for *everything*. We're 
suggesting that you use it for blobs of binary data that represent 
arbitrary bytes. First you have to get your binary data in the first 
place, using whatever technique is necessary. Here's one way to get a 
blob of binary data:


# encode four C shorts into a fixed-width struct
struct.pack(">hhhh", 23, 42, 17, 99)

Here's another way:

# encode a text string into UTF-16
"My name is Steven".encode("utf-16be")

Both examples return a bytes object containing arbitrary bytes. How do 
you combine those arbitrary bytes with a string template while still 
keeping all code-points under U+0100? By decoding to Latin-1.
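
Putting the pieces together, a minimal sketch (the template and its field names are made up for illustration, and ">hhhh" is assumed as the format for four big-endian shorts):

```python
import struct

# Two blobs of arbitrary bytes, obtained in two different ways.
shorts = struct.pack(">hhhh", 23, 42, 17, 99)
name = "My name is Steven".encode("utf-16be")

# A text template; the blobs are smuggled in as Latin-1 text.
template = "Shorts:%s\nName:%s\n"
record_text = template % (shorts.decode("latin-1"), name.decode("latin-1"))

# Every code point is below U+0100, so Latin-1 encoding restores the bytes.
record = record_text.encode("latin-1")
```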



-- 
Steven


Re: [Python-Dev] test.support.check_warnings

2014-01-11 Thread Brett Cannon
On Sat, Jan 11, 2014 at 3:45 PM, Ethan Furman  wrote:

> The docs say this [1]:
> ==
>  test.support.check_warnings(*filters, quiet=True)
>
> A convenience wrapper for warnings.catch_warnings() that makes it
> easier to test that a warning was correctly raised. It is approximately
> equivalent to calling warnings.catch_warnings(record=True) with
> warnings.simplefilter() set to always and with the option to automatically
> validate the results that are recorded.
>
> check_warnings accepts 2-tuples of the form ("message regexp",
> WarningCategory) as positional arguments. If one or more filters are
> provided, or if the optional keyword argument quiet is False, it checks to
> make sure the warnings are as expected: each specified filter must match at
> least one of the warnings raised by the enclosed code or the test fails,
> and if any warnings are raised that do not match any of the specified
> filters the test fails. To disable the first of these checks, set quiet to
> True.
> ==
>
> What I want is to make sure that DeprecationWarnings are being raised:
> ==
> with support.check_warnings(
>         ("automatic int conversions have been deprecated",
>          DeprecationWarning),
>         quiet=False,
>         ):
>     exec("'%x' % pi")
>     exec("'%x' % 3.14")
>     exec("'%X' % 2.11")
>     exec("'%o' % 1.79")
>     exec("'%c' % pi")
> ==
>
> But if I throw in something that doesn't raise a deprecation warning, the
> test still passes:
> ==
> exec("'%d' % 3")
> ==
>
> Am I doing something wrong?
>

You're assuming the context manager is doing something magical to verify
that every call in the block raises the expected warning. It only checks
the block as a whole: each filter has to match at least one warning raised
somewhere inside it. What you want to do is run each statement under its
own context manager, in a loop::

  for test in (...):
      with support.check_warnings(
              ("automatic int conversions have been deprecated",
               DeprecationWarning),
              quiet=False):
          exec(test)
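
The same per-statement check can be sketched with the standard warnings module alone. This is an illustration, not the test.support implementation; the helper name and its message_part parameter are hypothetical:

```python
import warnings


def statements_without_warning(statements, category, message_part=""):
    """Return the statements that did NOT emit a matching warning."""
    silent = []
    for statement in statements:
        # A fresh recorder per statement, so one warning cannot vouch
        # for a neighbouring statement that stayed silent.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            exec(statement, {"warnings": warnings})
        matched = any(
            issubclass(w.category, category)
            and message_part in str(w.message)
            for w in caught
        )
        if not matched:
            silent.append(statement)
    return silent
```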


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Cameron Simpson
On 11Jan2014 13:15, Juraj Sukop  wrote:
> On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson  wrote:
> >   data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
> 
> Thanks for the suggestion! The problem with "bytify" is that some items
> might require different formatting than other items. For example, in
> "Cross-Reference Table" there are three different formats: non-padded
> integer ("1"), 10- and 5-digit integers ("0000000003", "65535").

Well, this is partly my point: you probably want to exert more
control than is reasonable for the PEP to offer, and you're better
off with a helper function of your own. In particular, aside from
passing in a default char=>bytes encoding, you can provide your own
format hooks.

In particular, str already provides a completish % suite and you
have no issue with encodings in that phase because it is all Unicode.

So the points where you're treating PDF as text are probably best
tackled as text and then encoded with a helper like bytify when you
have to glom bytes and "textish" stuff together.

Crude example, hacked up from yours:

  data = b''.join( bytify( (
    ("%d %d obj ... stream" % (10, 0)),
    binary_image_data,
    "endstream endobj",
  ) ) )

where bytify swallows your encoding decisions.
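
bytify was never spelled out in this thread, so the following is only one possible shape for such a helper, assuming Python 3 and a single default encoding:

```python
def bytify(items, encoding="ascii"):
    """Yield each item as bytes: str is encoded, bytes-like passes through."""
    for item in items:
        if isinstance(item, (bytes, bytearray, memoryview)):
            yield bytes(item)
        elif isinstance(item, str):
            # The one place where the encoding decision is made.
            yield item.encode(encoding)
        else:
            raise TypeError("bytify cannot handle %s" % type(item).__name__)


binary_image_data = b"\x89PNG\r\n"  # stand-in payload for the example
data = b"".join(bytify((
    "%d %d obj ... stream" % (10, 0),
    binary_image_data,
    "endstream endobj",
)))
```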

Since encoding anything-not-bytes into a bytes sequence inherently
involves an encoding decision, I think I'm +1 on the PEP's aim of
never mixing bytes with non-bytes, keeping all the encoding decisions
in the caller's hands.

I quite understand not wanting to belabour the code with
".encode('ascii')" but that should be said somewhere, so best to
do so yourself in as compact and ergonomic a fashion as possible.

Cheers,
-- 
Cameron Simpson 

Serious error.
All shortcuts have disappeared.
Screen. Mind. Both are blank.
- Haiku Error Messages 
http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Kristján Valur Jónsson
No, I don't think it is.
The purpose is to make it easier to work with bytes objects.  There can be no 
python 2 compatibility when it comes to bytes/unicode conversion.


From: Python-Dev [[email protected]] on 
behalf of Serhiy Storchaka [[email protected]]
Sent: Saturday, January 11, 2014 21:01
To: [email protected]
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

11.01.14 21:40, Kristján Valur Jónsson wrote:
> How about a compromise?
> Personally, I think adding the full complement of integer/float formatting to 
> bytes is a bit over the top.
> How about just supporting two format specifiers?
> %b : interpolate a bytes object.  If it doesn't have the buffer interface, 
> error.
> %s : interpolate a str object, encoded to ASCII using 'strict' conversion.

%b is not supported in Python 2.7. And compatibility with Python 2.7 is
the only purpose of this feature.


Re: [Python-Dev] byteformat() proposal: please critique

2014-01-11 Thread Brett Cannon
On Sat, Jan 11, 2014 at 8:20 PM, Terry Reedy  wrote:

> The following function interpolates bytes, bytearrays, and formatted
> strings, the latter two auto-converted to bytes, into a bytes (or
> auto-converted bytearray) format. This function automates much of what some
> people have recommended for combining ascii text and binary blobs. The test
> passes on 2.7.6 as well as 3.3.3, though a 2.7-only version would be
> simpler.
>
> [snip code and test]
>
> There are, of course, details that could be tweaked. The encoding uses the
> default, which on 3.x is (encoding='utf-8', errors='strict').  This could
> be changed to an explicit encoding='ascii'. If that were done, the encoding
> could be made a parameter that defaults to 'ascii'. The joiner could be
> defined as type(form)() so the output type matches the input form type. I
> did not do that because it complicates the test.
>

With that flexibility this matches what I have been mulling in the back of
my head all day. Basically everything that goes in is assumed to be bytes
unless {:s} says to expect something which can be passed to str() and then
use some specified encoding in all instances (stupid example following as
it might be easier with bytes.join, but it gets the point across)::

  formatter = format_bytes('latin1', 'strict')
  http_response = formatter(b'Content-Type: {:s}\r\n\r\nContent-Length:
{:s}\r\n\r\n{}', 'image/jpeg', len(data), data)

Nothing fancy, just an easy way to handle having to call str.encode() on
every text argument that is to end up as bytes as Terry is proposing (and
I'm fine with defaulting to ASCII/strict with no arguments). Otherwise you
do what R. David Murray suggested and just have people rely on their own
API which accepts what they want and then spits out what they want behind
the scenes.
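
A minimal sketch of such a factory, to make the shape of the API concrete. format_bytes is hypothetical, as above; this version only understands {} (a raw bytes-like object) and {:s} (str() the argument, then encode it), and assumes Python 3:

```python
import re


def format_bytes(encoding="ascii", errors="strict"):
    """Return a formatter that encodes every {:s} argument the same way."""

    def formatter(template, *args):
        # Odd indexes hold b':s' for {:s} fields and None for bare {} fields.
        parts = re.split(rb"\{(:s)?\}", template)
        if len(parts) != 2 * len(args) + 1:
            raise ValueError("number of fields does not match arguments")
        out = [parts[0]]
        for i, arg in enumerate(args):
            if parts[2 * i + 1] is None:
                # {}: must already be bytes-like; TypeError otherwise.
                out.append(bytes(memoryview(arg)))
            else:
                # {:s}: text (or anything str() accepts), then encode.
                out.append(str(arg).encode(encoding, errors))
            out.append(parts[2 * i + 2])
        return b"".join(out)

    return formatter
```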

It basically comes down to how much tweaking of existing Python 2.7
%/.format() calls people will be expected to make. I'm fine with asking
people to call a function like what Terry is proposing as it can do away
with baking in that ASCII is reasonable as well as not require a bunch of
work without us having to argue over what bytes.format() should or should
not do. Personally I say bytes.format() is fine but it shouldn't do any
text encoding which makes its usefulness rather minor (much like the other
text-like methods that got carried forward in hopes that they would be
useful to people porting code; maybe we should consider taking them out in
Python 4 or something if we find out no one is using them).


>
> The coercion of interpolated bytearray objects to bytes is needed for 2.7
> because in 2.7, str/bytes.join raises TypeError for bytearrays in the input
> sequence. A 3.x-only version could drop this.
>

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Steven D'Aprano
On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:
> On 01/11/2014 10:36 AM, Steven D'Aprano wrote:
> >On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:
> >>
> >>   unicode to bytes
> >>   bytes to unicode using latin1
> >>   unicode to bytes
> >
> >Where do you get this from? I don't follow your logic. Start with a text
> >template:
> >
> >template = """\xDE\xAD\xBE\xEF
> >Name:\0\0\0%s
> >Age:\0\0\0\0%d
> >Data:\0\0\0%s
> >blah blah blah
> >"""
> >
> >data = template % ("George", 42, blob.decode('latin-1'))

Since the use-cases people have been speaking about include only ASCII 
(or at most, Latin-1) text and arbitrary binary bytes, my example is 
limited to showing only ASCII text. But it will work with any text data, 
so long as you have a well-defined format that lets you tell which parts 
are interpreted as text and which parts as binary data. If your file 
format is not well-defined, then you have bigger problems than dealing 
with text versus bytes.


> >Only the binary blobs need to be decoded. We don't need to encode the
> >template to bytes, and the textual data doesn't get encoded until we're
> >ready to send it across the wire or write it to disk.
> 
> And what if your name field has data not representable in latin-1?
> 
> --> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
> u'\u0441\u0440\u0403'

Where did you get those bytes from? You got them from somewhere. Who 
knows? Who cares? Once you have bytes, you can treat them as a blob of 
arbitrary bytes and write them to the record using the Latin-1 trick. If 
you're reading those bytes from some stream that gives you bytes, you 
don't have to care where they came from.

But what if you don't start with bytes? If you start with a bunch of 
floats, you'll probably convert them to bytes using the struct module. 
If you start with non-ASCII text, you have to convert them to bytes too. 
No difference here.

You ask the user for their name, they answer "срЃ" which is given to you 
as a Unicode string, and you want to include it in your data record. The 
specifications of your file format aren't clear, so I'm going to assume 
that:

1) ASCII text is allowed "as-is" (that is, the name "George" will be 
   in the final data file as b'George');

2) any other non-ASCII text will be encoded as some fixed encoding 
   which we can choose to suit ourselves;

   (if the encoding is fixed by the file format, then just use that)

3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up 
   being written as byte N, for any value of N between 0 and 255).


So, to write the ASCII name "George", we can just 

"Name:\0\0\0%s" % "George"

since we know it is already ASCII. (It's a literal, so that's obvious. 
But see below.) To write arbitrary binary data, we take the *bytes* and 
decode to Latin-1:

blob = bunch_o_bytes()  # Completely arbitrary.
"Data:\0\0\0%s" % blob.decode('latin-1')


Combine those two techniques to deal with non-ASCII names. First you 
have to get the non-ASCII name converted to *arbitrary bytes*, so any 
encoding that deals with the whole range of Unicode will do. Then you 
convert those arbitrary bytes into Latin-1. Here I'll use UTF-32, just 
because I can and I feel like being wasteful:

"Name:\0\0\0%s" % "срЃ".encode("utf-32be").decode("latin-1")

UTF-8 is a better choice, because it doesn't use as much space and 
gives you something which looks like ASCII in a hex editor:

name = "George" if random.random() < 0.5 else "срЃ"
"Name:\0\0\0%s" % name.encode("utf-8").decode("latin-1")

If you don't know whether your name is pure ASCII, then you have to 
encode first. Otherwise how do you know what bytes to use?

Aside: if this point is not *bleedingly obvious*, then you 
need to read Joel on Software on Unicode RIGHT NOW. 

http://www.joelonsoftware.com/articles/Unicode.html


If the name data happens to be pure ASCII, then encoding to UTF-8 and 
decoding to Latin-1 ends up being a no-op:

py> "George".encode("utf-8").decode("latin-1")
'George'


Of course, if I know that the name is ASCII ahead of time (I wrote it as 
a literal, so I think I would know...) then I can short-cut the whole 
process and just do this:

"Name:\0\0\0%s" % name_which_is_guaranteed_to_be_ascii


If I screw up and insert a non-Latin-1 character, then when I eventually 
write it to a file, it will give me a Unicode error, exactly as it 
should.
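
A minimal Python 3 sketch of the Latin-1 trick described above. The blob
contents and the record layout are made up for illustration; only str.encode,
bytes.decode and %-formatting on str are used.

```python
# Illustrative values only: any bytes object would do for the blob.
blob = bytes([0x00, 0x7F, 0x80, 0xFF])
name = "срЃ"

template = "Name:\0\0\0%s\nAge:\0\0\0\0%d\nData:\0\0\0%s\n"

# Text -> bytes (UTF-8), then bytes -> text (Latin-1), so each byte becomes
# exactly one code point in the range U+0000..U+00FF.
name_field = name.encode("utf-8").decode("latin-1")
blob_field = blob.decode("latin-1")

record_text = template % (name_field, 42, blob_field)

# The final Latin-1 encode maps each code point back to the same byte value.
record_bytes = record_text.encode("latin-1")

# A name that was never pre-encoded contains non-Latin-1 characters,
# so the final encode refuses it instead of writing garbage.
try:
    (template % (name, 42, blob_field)).encode("latin-1")
except UnicodeEncodeError:
    caught_unencoded_name = True
else:
    caught_unencoded_name = False
```

The round trip is lossless because Latin-1 maps the 256 byte values one-to-one
onto the first 256 code points.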


I've assumed that I can pick the encoding. That's rather like assuming 
that, given a bunch of floats, I can pick whether to represent them as C 
doubles or singles or something else, whatever suits my purposes. If I'm 
dealing with some existing file format, it probably defines the 
encoding, either explicitly or implicitly. When I don't have the choice 
of encoding, but have to use some damned stupid legacy encoding that 
only includes a fraction of Unicode, then:

name.encode("legacy encoding", errors="whatever")

will give me the bytes I need to use the Latin-1 trick on.

This w

Re: [Python-Dev] byteformat() proposal: please critique

2014-01-11 Thread Ethan Furman

On 01/11/2014 05:20 PM, Terry Reedy wrote:

The following function . . .


Thanks, Terry, for doing that.

--
~Ethan~
___
Python-Dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 Jan 2014 03:29, "Ethan Furman"  wrote:
>
> On 01/11/2014 12:43 AM, Nick Coghlan wrote:
>>
>>
>> In particular, the bytes type is, and always will be, designed for
>> pure binary manipulation [...]
>
>
> I apologize for being blunt, but this is a lie.
>
> Let's take a look at the methods defined by bytes:
>
> >>> dir(b'')
>
> ['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
> '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
> '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__',
> '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__',
> '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__',
> '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'center',
> 'count', 'decode', 'endswith', 'expandtabs', 'find', 'fromhex', 'index',
> 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
> 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition',
> 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip',
> 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title',
> 'translate', 'upper', 'zfill']
>
> Are you really going to insist that expandtabs, isalnum, isalpha,
> isdigit, islower, isspace, istitle, isupper, ljust, lower, lstrip, rjust,
> splitlines, swapcase, title, upper, and zfill are pure binary manipulation
> methods?

Do you think I don't know that? However, those are all *in-place*
modifications. Yes, they assume ASCII compatible formats, but they're a far
cry from encouraging combination of data from potentially different sources.

I'm also on record as considering this a design decision I regret,
precisely because it has resulted in experienced Python 2 developers
failing to understand that the Python 3 text model is *different* and they
may need to create a new type.

>
> Let's take a look at the repr of bytes:
>
> >>> bytes([48, 49, 50, 51])
>
> b'0123'
>
> Wow, that sure doesn't look like binary data!
>
> Py3 did not go from three text models to two, it went to one good one
> (unicode strings) and one broken one (bytes).  If the aim was indeed for
> pure binary manipulation, we failed.  We left in bunches of methods which
> can *only* be interpreted as supporting ASCII manipulation.

No, no, no. We made some concessions in the design of the bytes type to
*ease* development and debugging of ASCII compatible protocols *where we
believed we could do so without compromising the underlying text model
changes*.

Many experienced Python 2 developers are now suffering one of the worst
cases of paradigm lock I have ever seen as they keep trying to make the
Python 3 text model the same as the Python 2 one instead of actually
learning how Python 3 works and recognising that they may actually need to
create a new type for their use case and then potentially seek core dev
assistance if that type reveals new interoperability bugs in the core types
(or encounters old ones).

>
> Due to backwards compatibility we cannot now finish yanking those out, so
> either we live with a half-dead class screaming "I want to be ASCII!  I want
> to be ASCII!" or add back the missing functionality.

No, we don't - we treat the core bytes type as PEP 460 does, by adding a
*new* feature proposed by a couple people writing native Python 3 libraries
like asyncio that makes binary formats easier to deal with without carrying
forward even *more* broken assumptions from the Python 2 text model.
(Remember, I'm in favour of Antoine's updated PEP, because it's a real spec
for a new feature, rather than yet another proposal to bolt on even more
text specific formatting features from someone that has never bothered to
understand the reasons for the differences between the two versions).

People that want a full hybrid type back can then pursue the custom
extension type approach.

Cheers,
Nick.

>
>
> --
> ~Ethan~


Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

2014-01-11 Thread Nick Coghlan
On 12 Jan 2014 03:44, "Victor Stinner"  wrote:
>
> Hi,
>
> I'm in favor of adding support for formatting integer and floating
> point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and
> precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without
> alternate format ("{:#x}"). %s would also accept int and float for
> convenience.
>
> int and float subclasses would not be handled differently, their
> __str__ and __format__ would be ignored.
>
> Other int-like and float-like types (ex: defining __int__ or
> __index__) are not supported. Explicit cast would be required.

asciistr will support the *full* text formatting API, so I don't see any
reason to add this complexity to the core bytes type. However, I like the
basic binary interpolation feature proposed by the current version of the
PEP - it's a nice convenience method that doesn't compromise the text model
by introducing implicit serialisation of other types (whether text or
numbers).

For Python 2 folks trying to grok where the "bright line" is in terms of
the Python 3 text model: if your proposal includes *any* kind of implicit
serialisation of non binary data to binary, it is going to be rejected as
an addition to the core bytes type. If it avoids crossing that line (as the
buffer-API-only version of PEP 460 does), then we can talk.
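
One way to stay on the explicit side of that line is to spell out each
conversion step. A small sketch (the header name and value are only examples):

```python
content_length = 1024

# Explicit: structured data -> text -> bytes, with each step written out.
header = b"Content-Length: " + str(content_length).encode("ascii") + b"\r\n"

# Joining bytes directly with an int is refused rather than guessed at.
try:
    b"Content-Length: " + content_length
except TypeError:
    mixing_refused = True
else:
    mixing_refused = False
```

The encode call is where the choice of serialisation is made visible, which is
the step an implicit version would hide.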

Folks that want implicit serialisation (and I agree it has its uses) should
go help Benno get asciistr up to speed.

Cheers,
Nick.

>
> For %s, the choice between string and number is made using
> "(PyLong_Check() || PyFloat_Check())".
>
> If you agree, I will modify the PEP. If Antoine disagrees, I will fork
> the PEP 460 ;-)
>
> ---
>
> %s should not support precision (ex: %.100s), use Unicode for that.
>
> ---
>
> The PEP 460 should not reintroduce bytes+unicode, implicit decoding or
> implement encoding.
>
> b'x=%s' % 10 is well defined, it's pure bytes. If you consider that
> bytes should not contain text, why does the bytes type have methods
> like isalpha() or upper()? And why do binary files have a readline()
> method? A "line" doesn't mean anything in pure bytes.
>
> It's an example of "practicality beats purity". Python 3 should not
> enforce Unicode if the developers *chose* to use bytes to handle mixed
> binary/text protocols like HTTP.
>
> But I'm against adding "%r" and "%a" because they use Unicode and
> would require an implicit encoding. type(ascii(obj)) is str, not
> bytes. If you really want to use repr() and ascii(), encode the result
> explicitly.
>
> Victor


Re: [Python-Dev] Changing Clinic's output

2014-01-11 Thread Larry Hastings


On 01/08/2014 07:08 AM, Barry Warsaw wrote:

How hard would it be to put together some sample branches that provide
concrete examples of the various options?

My own opinion could easily be influenced by having some hands-on time with
actual code, and I suspect even Guido could be influenced if he could pull
some things up in his editor and take a look around.


I've uploaded a prototype here:

   https://bitbucket.org/larry/python-clinic-buffer

It's a clone of Python trunk, so if you already have a trunk handy, 
clone that first then "hg pull -u" from the above and it'll go a lot 
quicker.


The prototype adds some commands to Argument Clinic that allow you to 
specify where each bit of its output goes.  You have four choices:


 * You can write to the output block as before.
 * You can buffer up the text for writing out later in the same file.
 * You can write to a file on the side.
 * Or you can throw it away.

To learn how to run your own experiments, read "CLINIC.BUFFER.NOTES.TXT" 
in the root of the repository.  For your tl;dr pleasure I've included 
recipes for the proposed approaches so far.


I don't propose to check in the prototype in its current state.  But it 
should be sufficient for running everybody's experiments.  (If there's 
something you want to try that my prototype doesn't support, contact me 
and I should be able to throw in a feature for you.)



Happy experimenting,


//arry/


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Ethan Furman

On 01/11/2014 06:29 PM, Steven D'Aprano wrote:

On Sat, Jan 11, 2014 at 11:05:36AM -0800, Ethan Furman wrote:

On 01/11/2014 10:36 AM, Steven D'Aprano wrote:

On Sat, Jan 11, 2014 at 08:20:27AM -0800, Ethan Furman wrote:


   unicode to bytes
   bytes to unicode using latin1
   unicode to bytes


Where do you get this from? I don't follow your logic. Start with a text
template:

template = """\xDE\xAD\xBE\xEF
Name:\0\0\0%s
Age:\0\0\0\0%d
Data:\0\0\0%s
blah blah blah
"""

data = template % ("George", 42, blob.decode('latin-1'))


Since the use-cases people have been speaking about include only ASCII
(or at most, Latin-1) text and arbitrary binary bytes, my example is
limited to showing only ASCII text. But it will work with any text data,
so long as you have a well-defined format that lets you tell which parts
are interpreted as text and which parts as binary data.


Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: 
ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text.


And no, your example will not work with any text, it would completely moji-bake 
my dbf files.



Only the binary blobs need to be decoded. We don't need to encode the
template to bytes, and the textual data doesn't get encoded until we're
ready to send it across the wire or write it to disk.


No!  When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till 
the end!




And what if your name field has data not representable in latin-1?

--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
u'\u0441\u0440\u0403'


Where did you get those bytes from? You got them from somewhere.


For the sake of argument, pretend a user entered them in.


Who knows? Who cares? Once you have bytes, you can treat them as a blob of
arbitrary bytes and write them to the record using the Latin-1 trick.


No, I can't.  See above.


 If
you're reading those bytes from some stream that gives you bytes, you
don't have to care where they came from.


You're kidding, right?  If I don't know where they came from (a graphics field?  a note field?) how am I going to know 
how to treat them?




But what if you don't start with bytes? If you start with a bunch of
floats, you'll probably convert them to bytes using the struct module.


Yup, and I do.


If you start with non-ASCII text, you have to convert them to bytes too.
No difference here.


Really?  You just said above that "it will work with any text data" -- you 
can't have it both ways.



You ask the user for their name, they answer "срЃ" which is given to you
as a Unicode string, and you want to include it in your data record. The
specifications of your file format aren't clear, so I'm going to assume
that:

1) ASCII text is allowed "as-is" (that is, the name "George" will be
in the final data file as b'George');


User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII.  The user 
text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.




2) any other non-ASCII text will be encoded as some fixed encoding
which we can choose to suit ourselves;


Well, the user chooses it, we have to abide by their choice.  (It's kept in the 
file metadata.)



3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up
being written as byte N, for any value of N between 0 and 255).


In a couple field types, yes.  Usually the binary data is numeric or date related and there is conversion going on 
there, too, to give me the bytes I need.



[snip]


--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode characters in position
0-2: ordinal not in range(256)


That is backwards to what I've shown. Look at my earlier example again:


And you are not paying attention:

'\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
\--------------------------------------/  \-------------/
 a non-ascii compatible unicode string     to latin1 bytes

("срЃ".encode('some_non_ascii_encoding_such_as_cp1251').decode('latin-1'), 42,
 blob.decode('latin-1'))
  \----------------------------------------------------/  \---------------/
   getting the actual bytes I need                         and back into unicode
                                                           until I write them later

You did say to use a *text* template to manipulate my data, and then write it later, no?  Well, this is what it would 
look like.
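
A short Python 3 sketch of that contortion next to the plain bytes route. The
field layout, the cp1251 choice and the 16-bit age field are assumptions made
up for illustration, not the dbf format itself.

```python
import struct

name = "срЃ"
age = 42
field_codec = "cp1251"  # assumed: whichever codec the file metadata names

# Text-template route: each non-ASCII piece is encoded with its own codec,
# then decoded as Latin-1 so it can sit in a str until write time.
packed_age = struct.pack("<H", age)
text_record = "Name:%s Age:%s" % (
    name.encode(field_codec).decode("latin-1"),
    packed_age.decode("latin-1"),
)
via_text = text_record.encode("latin-1")

# Bytes route: the same pieces joined directly, with no Latin-1 detour.
via_bytes = b"".join([b"Name:", name.encode(field_codec), b" Age:", packed_age])
```

Both routes end with the same bytes; the difference is how many conversions
the code has to spell out along the way.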




Bytes get DECODED to latin-1, not encoded.

Bytes -> text is *decoding*
Text -> bytes is *encoding*


Pretend for a moment I know that, and look at my examples again.

I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible:  It must be ENcoded using the 
appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I c

Re: [Python-Dev] byteformat() proposal: please critique

2014-01-11 Thread Nick Coghlan
On 12 January 2014 12:13, Brett Cannon  wrote:
> With that flexibility this matches what I have been mulling in the back of
> my head all day. Basically everything that goes in is assumed to be bytes
> unless {:s} says to expect something which can be passed to str() and then
> use some specified encoding in all instances (stupid example following as it
> might be easier with bytes.join, but it gets the point across)::
>
>   formatter = format_bytes('latin1', 'strict')
>   http_response = formatter(
>       b'Content-Type: {:s}\r\n\r\nContent-Length: {:s}\r\n\r\n{}',
>       'image/jpeg', len(data), data)
>
> Nothing fancy, just an easy way to handle having to call str.encode() on
> every text argument that is to end up as bytes as Terry is proposing (and
> I'm fine with defaulting to ASCII/strict with no arguments). Otherwise you
> do what R. David Murray suggested and just have people rely on their own API
> which accepts what they want and then spits out what they want behind the
> scenes.
>
> It basically comes down to how much tweaking of existing Python 2.7
> %/.format() calls people will be expected to make. I'm fine with asking
> people to call a function like what Terry is proposing as it can do away
> with baking in that ASCII is reasonable as well as not require a bunch of
> work without us having to argue over what bytes.format() should or should
> not do. Personally I say bytes.format() is fine but it shouldn't do any text
> encoding which makes its usefulness rather minor (much like the other
> text-like methods that got carried forward in hopes that they would be
> useful to people porting code; maybe we should consider taking them out in
> Python 4 or something if we find out no one is using them).
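
The format_bytes factory in the quoted example is hypothetical. A rough sketch
of what such a helper might look like, assuming "{:s}" means "call str() and
encode with the configured codec" and "{}" means "bytes-like, inserted as-is"
(no other format specs are handled here):

```python
import re

# Matches only the two field forms used in the quoted example: {} and {:s}.
_FIELD = re.compile(rb"\{(:s)?\}")


def format_bytes(encoding="ascii", errors="strict"):
    """Return a formatter that fills a bytes template (hypothetical helper)."""

    def formatter(template, *args):
        remaining = iter(args)

        def substitute(match):
            value = next(remaining)
            if match.group(1):
                # {:s}: text-like argument, encoded explicitly.
                return str(value).encode(encoding, errors)
            # {}: must support the buffer protocol; str and int raise TypeError.
            return memoryview(value).tobytes()

        return _FIELD.sub(substitute, template)

    return formatter


data = b"\xff\xd8\xff"
formatter = format_bytes("latin1", "strict")
http_response = formatter(
    b"Content-Type: {:s}\r\nContent-Length: {:s}\r\n\r\n{}",
    "image/jpeg", len(data), data)
```

All encoding happens in one place with one declared codec, so the template
itself never has to guess how text becomes bytes.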

There are several that are useful for manipulating binary data *as
binary data*, including some of those that assume ASCII compatibility.
Even some of the odd ones (like bytes.title) which we considered
deprecating around 3.2 or so (if I recall correctly) were left because
they're useful for HTTP style headers.

The thing about them all is that even though they do assume ASCII
compatibility, they don't do any implicit conversions between raw
bytes and other formats - they're all purely about transforming binary
data. PEP 460 as it currently stands is in the same vein - it doesn't
blur the lines between binary data and other formats, but it *does*
make binary data easier to work with, and in a way that is a subset of
what Python 2 8-bit strings allowed, further increasing the size of
the Python 2/3 source compatible subset.
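
A small illustration of that point, using only existing bytes methods (the
header value is made up):

```python
header = b"content-type: text/html"

# ASCII-aware, but bytes in and bytes out: nothing is decoded or encoded.
name, _, value = header.partition(b": ")
titled = name.title()
shouted = value.upper()

# Bytes outside the ASCII letter range pass through untouched.
untouched = b"\xe9t\xe9".upper()
```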

The line that is crossed by suggestions like including number
formatting in PEP 460 is that those suggestions *do* introduce
implicit encoding from structured semantic data (a numeric value) to a
serialised format (the ASCII text representation of that number).
Implicitly encoding text (even with the ASCII codec and strict error
handling) similarly blurs the line between binary and text data again,
and is the kind of change that gets rejected as attempting to
reintroduce the Python 2 text model back into the Python 3 core types.

That said, while I don't think such a hybrid type is appropriate as
part of the *core* text model, I agree that such a type *could* be
useful when implementing protocol handling code. That's why I
suggested "asciicompat" to Benno as the package name for the home of
asciistr - I think it could be a good home for various utilities
designed for working with ASCII compatible binary protocols using a
more text-like API than that offered by the bytes type in Python 3.

I actually see much of this debate as akin to that over the API
changes between Google's original ipaddr module and the ipaddress API
in the standard library. The original ipaddr API is fine *if you
already know how IP networks work* - it plays fast and loose with
terminology, but in a way that you can deal with if you already know
the real meaning of the underlying concepts. However, anyone
attempting to go the other way (learning IP networking concepts from
the ipaddr API) will be hopelessly, hopelessly confused because the
terminology is used in *very* loose ways. So ipaddress tightened
things up and made the names more formally correct, aiming to make it
usable both as an address manipulation library *and* as a way of
learning the underlying IP addressing concepts.

I see the Python 2 str type as similar to the ipaddr API - if you
already know what you're doing when it comes to Unicode, then it's
pretty easy to work with. However, if you're trying to use it to
*learn* Unicode concepts, then you're pretty much stuffed, as you get
lost in a maze of twisty values, as the same data type is used with
very different semantics, depending on which end of a data
transformation you're on (although sometimes you'll get a different
data type, depending on the data *values* involved).

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia

Re: [Python-Dev] cpython (3.3): Update Sphinx toolchain.

2014-01-11 Thread Georg Brandl
Am 11.01.2014 21:11, schrieb Terry Reedy:
> On 1/11/2014 2:04 PM, georg.brandl wrote:
>> http://hg.python.org/cpython/rev/87bdee4d633a
>> changeset:   88413:87bdee4d633a
>> branch:  3.3
>> parent:  88410:05e84d3ecd1e
>> user:Georg Brandl 
>> date:Sat Jan 11 20:04:19 2014 +0100
>> summary:
>>Update Sphinx toolchain.
>>
>> files:
>>Doc/Makefile |  8 
>>1 files changed, 4 insertions(+), 4 deletions(-)
>>
>>
>> diff --git a/Doc/Makefile b/Doc/Makefile
>> --- a/Doc/Makefile
>> +++ b/Doc/Makefile
>> @@ -41,19 +41,19 @@
>>   checkout:
>>  @if [ ! -d tools/sphinx ]; then \
>>echo "Checking out Sphinx..."; \
>> -  svn checkout $(SVNROOT)/external/Sphinx-1.0.7/sphinx tools/sphinx; \
>> +  svn checkout $(SVNROOT)/external/Sphinx-1.2/sphinx tools/sphinx; \
>>  fi
> 
> Doc/make.bat needs to be similarly updated.

Indeed, thanks for the reminder.

Georg



Re: [Python-Dev] test.support.check_warnings

2014-01-11 Thread Ethan Furman

On 01/11/2014 05:37 PM, Brett Cannon wrote:


You're assuming the context manager is doing something magical to verify that
all calls in the block raise the expected exception. What you want to do is
execute it in a loop::

   for test in (...):
       with support.check_warnings(
               ("automatic int conversions have been deprecated",
                DeprecationWarning),
               quiet=False):
           exec(test)


Well, this is test.support!  I expect magic!  ;)

Thanks for setting me straight, got it working.
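
For anyone following along without test.support, the same one-block-per-snippet
idea can be sketched with the standard warnings module. The legacy_int function
and the snippets are stand-ins invented for this example:

```python
import warnings


def legacy_int(value):
    # Hypothetical function that emits the deprecation being tested.
    warnings.warn("automatic int conversions have been deprecated",
                  DeprecationWarning, stacklevel=2)
    return int(value)


snippets = ["legacy_int('1')", "legacy_int(2.5)"]
checked = []

for snippet in snippets:
    # A fresh recording block per snippet, so each one must warn by itself.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        exec(snippet, {"legacy_int": legacy_int})
    assert any(issubclass(w.category, DeprecationWarning) for w in caught), snippet
    checked.append(snippet)
```

With a single block around all the snippets, one warning from any of them
would satisfy the check, which is why the loop is needed.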

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 02:33, M.-A. Lemburg  wrote:
> On 11.01.2014 16:34, Nick Coghlan wrote:
>> While that was an *expedient* (and, in fact, necessary) solution at
>> the time, the fact it is still thoroughly confusing people 13 years
>> later shows it is not a *comprehensible* solution.
>
> FWIW: I quite liked the Python 2 model, but perhaps that's because
> I already knew how Unicode works, so could use it to make my
> life easier ;-)

Right, I tried to capture that in
http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_answers.html#what-actually-changed-in-the-text-model-between-python-2-and-python-3
by pointing out that there are two *very* different kinds of code to
consider when discussing text modelling.

Application code lives in a nice clean world of structured data, text
data and binary data, with clean conversion functions for switching
between them.

Boundary code, by contrast, has to deal with the messy task of
translating between them all.

The Python 2 text model is a convenient model for boundary code,
because it implicitly allows switching between binary and text
interpretations of a data stream, and that's often useful due to the
way protocols and file formats are designed.

However, that kind of implicit switching is thoroughly inappropriate
for *application* code. So Python 3 switches the core text model to
one where implicitly switching between the binary domain and the text
domain is considered a *bad* thing, and we object strongly to any
proposals which suggest blurring the boundaries again, since that is
going back to a boundary code model rather than an application code
one.

I've been saying for years that we may need a third type, but it has
been nigh on impossible to get boundary code developers to say
anything more useful than "I preferred the Python 2 model, that was
more convenient for me". Yes, we know it was (we do maintain both of
them, after all, and did the update for the standard library's own
boundary code), but application developers are vastly more common, so
boundary code developers lost out on that one and we need to come up
with solutions that *respect* the Python 3 text model, rather than
trying to change it back to the Python 2 one.

> Seriously, Unicode has always caused heated discussions and
> I don't expect this to change in the next 5-10 years.
>
> The point is: there is no 100% perfect solution either way and
> when you acknowledge this, things don't look black and white anymore,
> but instead full of colors :-)

It would be nice if more boundary code developers actually did that
rather than coming out with accusatory hyperbole and pining for the
halcyon days of Python 2 where the text model favoured their use case
over that of normal application developers.

> Python 3 forces people to actually use Unicode; in Python 2 they
> could easily avoid it. It's good to educate people on how it's
> used and the issues you can run into, but let's not forget
> that people are trying to get work done and we all love readable
> code.
>
> PEP 460 just adds two more methods to the bytes object which come
> in handy when formatting binary data; I don't think it has potential
> to muddy the Python 3 text model, given that the bytes
> object already exposes a dozen of other ASCII text methods :-)

I dropped my objections to PEP 460 once Antoine fixed it to respect
the boundaries between binary and text data. It's now a pure binary
interpolation proposal, and one I think is a fine idea - there's no
implicit encoding or decoding involved, it's just a tool for
manipulating binary data.

That leaves the implicit encoding and decoding to the third party
asciistr type, as it should be.

> asciistr is interesting in that it coerces to bytes instead
> of to Unicode (as is the case in Python 2).

Not quite - the idea of asciistr is that it is designed to be a
*hybrid* type, like str was in Python 2. If it interacts with binary
objects, it will give a binary result, if it interacts with text
objects, it will give a text result. This makes it potentially
suitable for use for constants in hybrid binary/text APIs like
urllib.parse, allowing them to be implemented using a shared code path
once again.

The initial experimental implementation only works with 7 bit ASCII,
but the UTF-8 caching in the PEP 393 implementation opens up the
possibility of offering a non-strict mode in the future, as does the
option of allowing arbitrary 8-bit data and disallowing interoperation
with text strings in that case.

> At the moment it doesn't cover the more common case bytes + str,
> just str + bytes, but let's assume it would,

Right, I suspect we have some overbroad PyUnicode_Check() calls in
CPython that will need to be addressed before this substitution works
seamlessly - that's one of the reasons I've been asking people to
experiment with the idea since at least 2010 and let us know what
doesn't work (nobody did though, until Benno agreed to try it out
because it sounded like an i

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Nick Coghlan
On 12 January 2014 04:38, R. David Murray  wrote:
> But!  Our goal should be to help people convert to Python3.  So how can
> we find out what the specific problems are that real-world programs are
> facing, look at the *actual code*, and help that project figure out the
> best way to make that code work in both python2 and python3?
>
> That seems like the best way to find out what needs to be added to
> python3 or pypi:  help port the actual code of the developers who are
> running into problems.
>
> Yes, I'm volunteering to help with this, though of course I can't promise
> exactly how much time I'll have available.

And, as has been the case for a long time, the PSF stands ready to
help with funding credible grant proposals for Python 3 porting
efforts. I believe some of the core devs (including David?) do
freelance and contract work, so that's an option definitely worth
considering if a project would like to support Python 3, but is having
difficulty getting there with purely volunteer effort.

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia


Re: [Python-Dev] [Python-checkins] cpython (3.3): Issue #19092 - Raise a correct exception when cgi.FieldStorage is given an

2014-01-11 Thread Nick Coghlan
On 12 January 2014 16:22, senthil.kumaran  wrote:
> summary:
>   Issue #19092 - Raise a correct exception when cgi.FieldStorage is given an
> invalid file-obj. Also use __bool__ to determine the bool of the FieldStorage
> object.

>  Library
>  ---
>
> +- Issue #19097: Raise the correct Exception when cgi.FieldStorage is given an
> +  Invalid fileobj.

You may want to tweak the tracker so the comment ends up on the
appropriate issue (#19092 is something else entirely)

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia