date:20090429

Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> OK, so you are saying that under PEP 383, utf-8b wouldn't be used
> anywhere on Windows by default.  That's not clear from your proposal.

You didn't read it carefully enough. The first three paragraphs of
the "Specification" section make that clear.

Regards,
Martin
___
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

>>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>>> interface, no decoding happens, matches in memory the file on disk with
>>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>>> interface.  Ambiguity.
>>>> Is that an alternative to A and B?
>>> I guess it is an adjunct to case B, the current PEP.
>>>
>>> It is what happens when using the PEP on a system that provides both
>>> bytes and str interfaces, and both get used.
>>
>> Your formulation is a bit too stenographic to me, but please trust me
>> that there is *no* ambiguity in the case you construct.
> 
> 
> No Martin, the point of reviewing the PEP is to _not_ trust you, even
> though you are generally very knowledgeable and very trustworthy.  It is
> much easier to find problems before something is released, or even
> coded, than it is afterwards.

Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).

> You assumed, and maybe I wasn't clear in my statement.
> 
> By "accessed via the str interface" I mean that (on Windows) the wide
> string interface would be used to obtain a file name.

What does that mean? What specific interface are you referring to to
obtain file names? Most of the time, file names are obtained by the
user entering them on the keyboard. GUI applications are completely
out of the scope of the PEP.

> Now, suppose that
> the file name returned contains "abc" followed by the half-surrogate
> U+DC10 -- four 16-bit codes.

Ok, so perhaps you might be talking about os.listdir here. Communication
would be much easier if I would not need to guess what you may mean.

Also, why is U+DC10 four 16-bit codes?

> Then, ask for the same filename via the bytes interface, using UTF-8
> encoding.

How do you do that on Windows? You cannot just pick an encoding, such
as UTF-8, and pass that to the byte interface, and expect it to work.
If you use the byte interface, you need to encode in the file system
encoding, of course.

Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU
USING? Please use specific Python code.

> The PEP says that the above name would get translated to
> "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes
> used to represent the half-surrogate that is actually in the file name,
> specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can
> be seen as two different names in memory.

You are relying on false assumptions here, namely that the UTF-8
encoding would play any role.

What would happen instead is that the "mbcs" encoding would be used. The
"mbcs" encoding, by design from Microsoft, will never report an error,
so the error handler will not be invoked at all.

> Now posit another file which, when accessed via the str interface, has
> the name "abc" followed by U+DCED U+DCB0 U+DC90.
> 
> Looks ambiguous to me.  Now if you have a scheme for handling this case,
> fine, but I don't understand it from what is written in the PEP.

You were just making false assumptions in your reasoning, assumptions
that are way beyond the scope of the PEP.
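
For reference, the byte arithmetic in the quoted example can be checked on its own. This is a minimal sketch, assuming a UTF-8 file system encoding (so it does not describe the Windows "mbcs" path discussed above) and assuming the error-handler names "surrogatepass" and "surrogateescape" from released Python 3 versions; the thread itself only speaks of utf-8b.

```python
# Sketch only: the handler names come from released Python 3, not from this thread.
name = "abc\udc10"                 # "abc" followed by the lone surrogate U+DC10
print(len(name))                   # 4 code points

# Raw UTF-8 style byte pattern for the lone surrogate.
raw = name.encode("utf-8", "surrogatepass")
print(raw)                         # b'abc\xed\xb0\x90'

# Decoding those bytes with the PEP 383 style handler escapes each invalid byte.
escaped = raw.decode("utf-8", "surrogateescape")
print(ascii(escaped))              # 'abc\udced\udcb0\udc90'

# The escaped form maps back to exactly the same bytes.
round_trip = escaped.encode("utf-8", "surrogateescape")
print(round_trip == raw)           # True
```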

Regards,
Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello


Glenn Linderman wrote:

> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.

The problem with this strategy is that paths are often sliced, so your two
codepoints could get separated. The good thing about the PEP's strategy is
that one character stays one character.
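
The slicing argument can be sketched as follows. This is only an illustration: it assumes the handler name "surrogateescape" under which the PEP's scheme appears in released Python 3, and `ESC` is a hypothetical stand-in for the unspecified escape codepoint of the two-codepoint proposal.

```python
# PEP-style escaping: one undecodable byte becomes one code point.
raw = b"ab\xffcd"                               # 0xFF is not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")
print(len(name))                                # 5
piece = name[2:3]                               # slice out just the escaped byte
print(piece.encode("utf-8", "surrogateescape")) # b'\xff'

# Two-codepoint escaping (hypothetical ESC): a slice can separate the pair.
ESC = "\u001b"                                  # assumption, not from the thread
two_cp = "ab" + ESC + "\u01ff" + "cd"           # 0xFF written as ESC + U+01FF
head, tail = two_cp[:3], two_cp[3:]
print(head.endswith(ESC), tail.startswith("\u01ff"))   # True True
```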


Baptiste


Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 12:17 AM, came the following characters from 
the keyboard of Martin v. Löwis:

>> OK, so you are saying that under PEP 383, utf-8b wouldn't be used
>> anywhere on Windows by default.  That's not clear from your proposal.
>
> You didn't read it carefully enough. The first three paragraphs of
> the "Specification" section make that clear.



Sorry, rereading those paragraphs even with this declaration in mind, 
does not make that clear.  It is not enough to have a solution that 
works; it is necessary to communicate that solution clearly enough that 
people understand it.  By the huge amount of feedback you have received, 
it is clear that either the solution doesn't work, or that it wasn't 
communicated clearly.


The following comments are an attempt to help you make the PEP clear, 
based on your above declaration that UTF-8b wouldn't be used on Windows. 
 I may still be unclear about what you mean, but if you can accept 
these enhancements to the PEP, then maybe we are approaching a common 
understanding; if not, you should be aware that the PEP still needs 
clarification.



In the first paragraph, you should make it clear that Python 3.0 does 
not use the Windows bytes interfaces, if it doesn't.  "Python uses 
*only* the wide character APIs..." would suffice.  As stated, it seems 
like Python *does* use the wide character APIs, but leaves open the 
possibility that it might use byte APIs also.  A short description of 
what happens on Windows when Python code uses bytes APIs would also be 
helpful.


In the second paragraph, it speaks of "currently" but then speaks of 
using the half-surrogates.  I don't believe that happens "currently". 
You did change tense, but that paragraph is quite confusing, currently, 
because of the tense change.  You should describe there the action that
is currently taken by Python for non-decodable bytes, and then in the
next paragraph talk about what the PEP changes.


The 4th paragraph is now confusing too... would it not be the decode 
error handler that returns the byte strings, in addition to the Unicode 
strings?


The 5th paragraph has apparently confused some people into thinking this
PEP only applies to locales using UTF-8 encodings; you should have an
"else clause" to clear that up, pointing out that the reverse encoding
of half-surrogates by other encodings already produces errors, and that
UTF-8 is a special case, not the only case.


The code added to the discussion has mismatched (), making me wonder if 
it is complete.  There is a reasonable possibility that only the final ) 
is missing.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson

On 29Apr2009 08:27, Martin v. Löwis wrote:
| > I would like utility functions to perform:
| >   os-bytes->funny-encoded
| >   funny-encoded->os-bytes
| > or explicit example code snippets for same in the PEP text.
| 
| Done!

Thanks!
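
For readers who want to see the shape of such helpers, here is a minimal sketch. The function names are hypothetical, the default encoding is an assumption, and "surrogateescape" is the handler name from released Python 3 rather than anything quoted in this message.

```python
def os_bytes_to_str(raw, encoding="utf-8"):
    """Decode OS bytes; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def str_to_os_bytes(name, encoding="utf-8"):
    """Encode back; escaped surrogates become the original bytes again."""
    return name.encode(encoding, "surrogateescape")


raw = b"caf\xe9.txt"                 # Latin-1 bytes, not valid UTF-8
name = os_bytes_to_str(raw)
print(ascii(name))                   # 'caf\udce9.txt'
print(str_to_os_bytes(name) == raw)  # True
```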
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Hrvoje Niksic

Zooko O'Whielacronx wrote:
>> If you switch to iso8859-15 only in the presence of undecodable
>> UTF-8, then you have the same round-trip problem as the PEP: both
>> b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a
>> way to unambiguously recover the original file name.
>
> Why do you say that?  It seems to work as I expected here:
>
>  >>> '\xff'.decode('iso-8859-15')
> u'\xff'
>  >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'

Here is what I mean by "switch to iso8859-15" only in the presence of 
undecodable UTF-8:

def file_name_to_unicode(fn, encoding):
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example 
file names.

>>> file_name_to_unicode(b'\xff', 'utf-8')
'ÿ'
>>> file_name_to_unicode(b'\xc3\xbf', 'utf-8')
'ÿ'

That is the ambiguity I was referring to -- two different byte sequences 
result in the same Unicode string.
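
For comparison, here is a minimal sketch of how the same two file names behave under surrogate-style escaping. It assumes the handler name "surrogateescape" from released Python 3; it is not the fallback function shown above.

```python
# The two example file names from above, decoded with surrogate escaping.
a = b"\xff".decode("utf-8", "surrogateescape")      # undecodable byte is escaped
b = b"\xc3\xbf".decode("utf-8", "surrogateescape")  # valid UTF-8 for U+00FF
print(ascii(a), ascii(b))                           # '\udcff' '\xff'
print(a == b)                                       # False

# Each string maps back to its own original byte sequence.
print(a.encode("utf-8", "surrogateescape"))         # b'\xff'
print(b.encode("utf-8", "surrogateescape"))         # b'\xc3\xbf'
```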


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello


Lino Mastrodomenico wrote:

> Only for the new utf-8b encoding (if Martin agrees), while the
> existing utf-8 is fine as is (or at least waaay outside the scope of
> this PEP).



This is questionable. This would have the consequence that \udcxx in a python 
string would sometimes mean a surrogate, and sometimes mean raw bytes, depending 
on the history of the string.


By contrast, if the new utf-8b codec were to *supersede* the old one, \udcxx would 
always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). 
Thus ambiguity could be avoided.


Baptiste


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello


Glenn Linderman wrote:

> If there is going to be a required transformation from de novo strings
> to funny-encoded strings, then why not make one that people can actually
> see and compare and decode from the displayable form, by using
> displayable characters instead of lone surrogates?




The problem with your "escape character" scheme is that the meaning is lost with 
slicing of the strings, which is a very common operation.




>> I thought half-surrogates were illegal in well formed Unicode. I confess
>> to being weak in this area. By "legitimate" above I meant things like
>> half-surrogates which, like quarks, should not occur alone?
>
> "Illegal" just means violating the accepted rules.  In this case, the
> accepted rules are those enforced by the file system (at the bytes or
> str API levels), and by Python (for the str manipulations).  None of
> those rules outlaw lone surrogates.  [...]




Python could just as well *specify* that lone surrogates are illegal, as their 
meaning is undefined by Unicode. If this rule is respected throughout the 
language, there is no ambiguity. It might be unrealistic on Windows, though.


This rule could even be specified only for strings that represent filesystem 
paths. Sure, they are the same type as other strings, but the programmer usually 
knows if a given string is intended to be a path or not.


Baptiste


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel

> Sure. However, that requires you to provide meaningful, reproducible
> counter-examples, rather than a stenographic formulation that might
> hint some problem you apparently see (which I believe is just not
> there).


Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
surrogates.  But such encodings are currently supported by Python, and they
are used as part of CESU-8 coding.  That's, in fact, a common way of
converting UTF-16 to UTF-8.  How are you going to deal with existing code
that relies on being able to code half surrogates as UTF-8?
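
The behaviour being asked about can be probed directly. This is a minimal sketch of what released CPython 3 does, as far as I know: the strict utf-8 codec refuses lone surrogates, and code that wants the raw three-byte form has to opt in. The handler name "surrogatepass" is not mentioned in this thread.

```python
lone = "\udc10"                      # a lone low surrogate

# The strict utf-8 codec refuses to encode it.
try:
    lone.encode("utf-8")
    strict_refused = False
except UnicodeEncodeError:
    strict_refused = True
print(strict_refused)                # True

# Opting in explicitly yields the three-byte form used in CESU-8 style data.
passed = lone.encode("utf-8", "surrogatepass")
print(passed)                        # b'\xed\xb0\x90'
```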

Tom

Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Antoine Pitrou

Thomas Breuel writes:
>
> The error checking isn't necessarily deficient.  For example, a safe and
> legitimate thing to do is for third party libraries to throw a C++ exception,
> raise a Python exception, or delete the half surrogate.

Do you have any concrete examples of this behaviour? When e.g. Nautilus shows
some illegal UTF-8 filenames in an UTF-8 locale, it replaces the offending bytes
with placeholders rather than crash in your face.

> PEP 383 is a proposal that suggests changing Python such that malformed
> Unicode strings become a required part of Python and such that Python writes
> illegal UTF-8 encodings to UTF-8 encoded file systems.

That's again a misleading statement.
It only writes an "illegal encoding" if it received one from the filesystem in
the first place. A clean filesystem will only receive clean filenames.

> Those are big changes, and it's legitimate to ask that PEP 383 address the
> implications of that choice before it's made.

No, it's legitimate to ask that /you/ back up your arguments with concrete
facts. It's difficult to demonstrate the non-existence of a problem. On the
other hand, you can easily demonstrate that it exists, if it really does.

By the way, most of those libraries under Unix would take a char * as input, so
they wouldn't deal with an "illegal unicode string", they would deal with the
original byte string.

Regards

Antoine.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 12:38 AM, came the following characters from 
the keyboard of Baptiste Carvello:

> Glenn Linderman wrote:
>
>> 3. When an undecodable byte 0xPQ is found, decode to the escape
>> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> The problem with this strategy is: paths are often sliced, so your 2
> codepoints could get separated. The good thing with the PEP's strategy
> is that 1 character stays 1 character.
>
> Baptiste



Except for half-surrogates that are in the file names already, which get 
converted to 3 characters.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 12:29 AM, came the following characters from 
the keyboard of Martin v. Löwis:

>>>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>>>> interface, no decoding happens, matches in memory the file on disk with
>>>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>>>> interface.  Ambiguity.
>>>>>
>>>>> Is that an alternative to A and B?
>>>>
>>>> I guess it is an adjunct to case B, the current PEP.
>>>>
>>>> It is what happens when using the PEP on a system that provides both
>>>> bytes and str interfaces, and both get used.
>>>
>>> Your formulation is a bit too stenographic to me, but please trust me
>>> that there is *no* ambiguity in the case you construct.
>>
>> No Martin, the point of reviewing the PEP is to _not_ trust you, even
>> though you are generally very knowledgeable and very trustworthy.  It is
>> much easier to find problems before something is released, or even
>> coded, than it is afterwards.
>
> Sure. However, that requires you to provide meaningful, reproducible
> counter-examples, rather than a stenographic formulation that might
> hint some problem you apparently see (which I believe is just not
> there).
>
>> You assumed, and maybe I wasn't clear in my statement.
>>
>> By "accessed via the str interface" I mean that (on Windows) the wide
>> string interface would be used to obtain a file name.
>
> What does that mean? What specific interface are you referring to to
> obtain file names? Most of the time, file names are obtained by the
> user entering them on the keyboard. GUI applications are completely
> out of the scope of the PEP.
>
>> Now, suppose that
>> the file name returned contains "abc" followed by the half-surrogate
>> U+DC10 -- four 16-bit codes.
>
> Ok, so perhaps you might be talking about os.listdir here. Communication
> would be much easier if I would not need to guess what you may mean.



os.listdir("")




> Also, why is U+DC10 four 16-bit codes?



It isn't.

First 16-bit code is U+0061
Second 16-bit code is U+0062
Third 16-bit code is U+0063
Fourth 16-bit code is U+DC10




>> Then, ask for the same filename via the bytes interface, using UTF-8
>> encoding.
>
> How do you do that on Windows? You cannot just pick an encoding, such
> as UTF-8, and pass that to the byte interface, and expect it to work.
> If you use the byte interface, you need to encode in the file system
> encoding, of course.
>
> Also, what do you mean by "ask for"?? WHAT INTERFACE ARE YOU
> USING? Please use specific Python code.



os.listdir(b"")

I find that on my Windows system, with all ASCII path file names, that I 
get quite different results when I pass os.listdir an empty str vs an 
empty bytes.


Rather than keep you guessing, I get the root directory contents from 
the empty str, and the current directory contents from an empty bytes. 
That is rather unexpected.


So I guess I'd better suggest that a specific, equivalent directory name 
be passed in either bytes or str form.




>> The PEP says that the above name would get translated to
>> "abc" followed by 3 half-surrogates, corresponding to the 3 UTF-8 bytes
>> used to represent the half-surrogate that is actually in the file name,
>> specifically U+DCED U+DCB0 U+DC90.  This means that one name on disk can
>> be seen as two different names in memory.
>
> You are relying on false assumptions here, namely that the UTF-8
> encoding would play any role.
>
> What would happen instead is that the "mbcs" encoding would be used. The
> "mbcs" encoding, by design from Microsoft, will never report an error,
> so the error handler will not be invoked at all.



So what you are saying here is that Python doesn't use the "A" forms of 
the Windows APIs for filenames, but only the "W" forms, and uses lossy 
decoding (from MS) to the current code page (which can never be UTF-8 on 
Windows).


You are further saying that Python doesn't give the programmer control 
over the codec that is used to convert from W results to bytes, so that 
on Windows, it is impossible to obtain a bytes result containing UTF-8 
from os.listdir, even though sys.setfilesystemencoding exists, and 
sys.getfilesystemencoding is affected by it, and the latter is 
documented as returning "mbcs", and as returning the codec that should 
be used by the application to convert str to bytes for filenames. 
(Python 3.0.1).


While I can hear a "that is outside the scope of the PEP" coming, this 
documentation is confusing, to say the least.




>> Now posit another file which, when accessed via the str interface, has
>> the name "abc" followed by U+DCED U+DCB0 U+DC90.
>>
>> Looks ambiguous to me.  Now if you have a scheme for handling this case,
>> fine, but I don't understand it from what is written in the PEP.
>
> You were just making false assumptions in your reasoning, assumptions
> that are way beyond the scope of the PEP.



Absolutely correct.  I was making what seemed to be reasonable 
assumptions about Python internals on Windows, and several of them are 
false, including misleading documentation for listdir (which doesn't 
specify that bytes and str parameters affect whether or

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread R. David Murray

On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote:
> On approximately 4/28/2009 7:40 PM, came the following characters from the
> keyboard of R. David Murray:
>> On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
>>> C. File on disk with the invalid surrogate code, accessed via the str
>>> interface, no decoding happens, matches in memory the file on disk with
>>> the byte that translates to the same surrogate, accessed via the bytes
>>> interface. Ambiguity.
>>
>> Unless I'm missing something, one of these is type str, and the other is
>> type bytes, so no ambiguity.
>
> You are missing that the bytes value would get decoded to a str; thus both
> are str; so ambiguity is possible.

Only if you as the programmer decode it.  Now, I don't understand the
subtleties of Unicode enough to know if Martin has already successfully
addressed this concern in another fashion, but personally I think that
if you as a programmer are comparing funnydecoded-str strings gotten
via a string interface with normal-decoded strings gotten via a bytes
interface, that we could claim that your program has a bug.

--David

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson

On 29Apr2009 02:56, Glenn Linderman wrote:
> os.listdir(b"")
>
> I find that on my Windows system, with all ASCII path file names, that I  
> get quite different results when I pass os.listdir an empty str vs an  
> empty bytes.
>
> Rather than keep you guessing, I get the root directory contents from  
> the empty str, and the current directory contents from an empty bytes.  
> That is rather unexpected.
>
> So I guess I'd better suggest that a specific, equivalent directory name  
> be passed in either bytes or str form.

I think you may have uncovered an implementation bug rather than an
encoding issue (because I'd expect "" and b"" to be equivalent).

In ancient times, "" was a valid UNIX name for the working directory.
POSIX disallows that, and requires people to use ".".

Maybe you're seeing an artifact; did python move from UNIX to Windows or the
other way around in its porting history? I'd guess the former.

Do you get differing results from listdir(".") and listdir(b".") ?
How's python2 behave for ""? (Since there's no b"" in python2.)
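
A quick way to run that comparison is sketched below. It only checks the element types and the number of entries for the current directory; it makes no claim about what "" or b"" return on any particular platform.

```python
import os

# List the current directory through both interfaces.
str_names = os.listdir(".")       # str argument: names come back as str
bytes_names = os.listdir(b".")    # bytes argument: names come back as bytes

print(all(isinstance(n, str) for n in str_names))
print(all(isinstance(n, bytes) for n in bytes_names))
print(len(str_names) == len(bytes_names))
```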

Cheers,
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

'Supposing a tree fell down, Pooh, when we were underneath it?'
'Supposing it didn't,' said Pooh after careful thought.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 4:07 AM, came the following characters from 
the keyboard of R. David Murray:

> On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote:
>> On approximately 4/28/2009 7:40 PM, came the following characters from
>> the keyboard of R. David Murray:
>>> On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote:
>>>> C. File on disk with the invalid surrogate code, accessed via the str
>>>> interface, no decoding happens, matches in memory the file on disk with
>>>> the byte that translates to the same surrogate, accessed via the bytes
>>>> interface. Ambiguity.
>>>
>>> Unless I'm missing something, one of these is type str, and the other is
>>> type bytes, so no ambiguity.
>>
>> You are missing that the bytes value would get decoded to a str; thus
>> both are str; so ambiguity is possible.
>
> Only if you as the programmer decode it.  Now, I don't understand the
> subtleties of Unicode enough to know if Martin has already successfully
> addressed this concern in another fashion, but personally I think that
> if you as a programmer are comparing funnydecoded-str strings gotten
> via a string interface with normal-decoded strings gotten via a bytes
> interface, that we could claim that your program has a bug.


Hopefully Martin will clarify the PEP as I suggested in another branch 
of this thread.  He has eventually convinced me that this ambiguity is 
not possible, via email discussion, but the PEP is certainly less than 
sufficiently explanatory to make that obvious.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 4:36 AM, came the following characters from 
the keyboard of Cameron Simpson:

> On 29Apr2009 02:56, Glenn Linderman wrote:
>> os.listdir(b"")
>>
>> I find that on my Windows system, with all ASCII path file names, that I
>> get quite different results when I pass os.listdir an empty str vs an
>> empty bytes.
>>
>> Rather than keep you guessing, I get the root directory contents from
>> the empty str, and the current directory contents from an empty bytes.
>> That is rather unexpected.
>>
>> So I guess I'd better suggest that a specific, equivalent directory name
>> be passed in either bytes or str form.
>
> I think you may have uncovered an implementation bug rather than an
> encoding issue (because I'd expect "" and b"" to be equivalent).
  


Me too.


> In ancient times, "" was a valid UNIX name for the working directory.
> POSIX disallows that, and requires people to use ".".
>
> Maybe you're seeing an artifact; did python move from UNIX to Windows or the
> other way around in its porting history? I'd guess the former.
>
> Do you get differing results from listdir(".") and listdir(b".") ?
  


No.  Both are the same as b""


> How's python2 behave for ""? (Since there's no b"" in python2.)


Python2 os.listdir("") produces the same thing as Python3 os.listdir(b"")
Python2 os.listdir(u"") produces the same thing as Python3 os.listdir("")


Another phenomenon of note:

I created a directory named ábc.  (Windows XP, Python 3.0.1, Python 
2.6.1, SetConsoleOutputCP(65001))

Python3 os.listdir(b".") prints it as b"\xe1bc"
Python2 os.listdir(".") prints it as b"\xe1bc"
Python2 os.listdir(u".") prints it as u"\xe1bc"
Python3 os.listdir(".") prints it as "bc"
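
The byte value in the bytes results is what a single-byte Windows code page would produce for the first letter. A minimal sketch, assuming the ANSI code page on that machine was Windows-1252 (the message does not say which code page was active):

```python
name = "\u00e1bc"                  # the directory name: a-acute followed by "bc"

as_cp1252 = name.encode("cp1252")  # assumed ANSI code page
as_utf8 = name.encode("utf-8")     # for contrast only
print(as_cp1252)                   # b'\xe1bc'
print(as_utf8)                     # b'\xc3\xa1bc'
```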


--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


[Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Stephen J. Turnbull

Thomas Breuel writes:

 > PEP 383 violated (2), and I think that's a bad thing.

The whole purpose of PEP 383 is to send the exact same bytes that were
read from the OS back to the OS => violating (2) (for whatever the
apparent system file-encoding is, not limited to UTF-8), and that has
overwhelmingly popular support.

Note that this won't happen automatically, either, AIUI.  The PEP's
proposed implementation is as an error handler, and this would need to
be specified explicitly.  It's not intended to be the default.

 > I think the best solution would be to use (3a) and fall back to (3b) if that
 > doesn't work.  If people try to write those strings, they will always get
 > written as correctly encoded UTF-8 strings.

The intended audience aren't trying to write anything in particular,
though.  They just want to repeat verbatim what the OS told them.

 > There is yet another option, which is arguably the "right" one: make the
 > results of os.listdir() subclasses of string that keep track of where they
 > came from.

Sure.  This has been mentioned by several people.  Martin has no
intention of doing it in PEP 383, though, so it will need a new PEP.
It has gotten strong pushback from several people, as well.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull

Baptiste Carvello writes:

 > By contrast, if the new utf-8b codec would *supersede* the old one,
 > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
 > surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false.  It could have come from a literal string
(similar to the text above ;-), a C extension, or a string slice (on
16-bit builds), and there may be other ways to do it.  The only way to
avoid ambiguity is to change the definition of a Python string to be
*valid* Unicode (possibly with Python extensions such as PEP 383 for
internal use only).  But Guido has rejected that in the past;
validation is the application's problem, not Python's.

Nor is a UCS-4 build exempt.  IIRC Guido specifically envisioned
Python strings being used to build up code point sequences to be
directly output, which means that a UCS-4 string might none-the-less
contain surrogates being added to a string intended to be sent as
UTF-16 output simply by truncating the 32-bit code units to 16 bits.
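
The literal-string case is easy to demonstrate. A minimal sketch, assuming the handler name "surrogateescape" from released Python 3: a surrogate typed as a literal compares equal to one produced by escaping a raw byte, so the string carries no record of its origin.

```python
from_literal = "\udcff"                                   # typed by a programmer
from_bytes = b"\xff".decode("utf-8", "surrogateescape")   # produced by escaping

print(from_literal == from_bytes)                         # True
print(from_literal.encode("utf-8", "surrogateescape"))    # b'\xff'
```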


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull

"Martin v. Löwis" writes:

 > I find the case pretty artificial, though: if the locale encoding
 > changes, all file names will look incorrect to the user, so he'll
 > quickly switch back, or rename all the files.

It's not necessarily the case that the locale encoding changes, but
rather the name of the file.  I have a couple of directories where I
have Japanese in both EUC-JP and UTF-8, for example.  (The
applications where I never bothered to do a conversion from EUC to
UTF-8 are things like stripping MIME attachments from messages and
saving them to files when I changed my default.)

So I have a little Emacs Lisp function that tries EUC or UTF8
depending on date and falls back to the other on a decode error.
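
A rough Python counterpart of such a fallback is sketched below. The function name is hypothetical, and it uses a fixed order of encodings instead of choosing the first attempt by date as the Emacs Lisp function does.

```python
def decode_with_fallback(raw, primary="utf-8", fallback="euc_jp"):
    """Try the primary encoding; on a decode error, try the fallback."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        return raw.decode(fallback)


text = "\u65e5\u672c\u8a9e"          # the word "Nihongo" in kanji
from_utf8 = decode_with_fallback(text.encode("utf-8"))
from_euc = decode_with_fallback(text.encode("euc_jp"))
print(from_utf8 == text, from_euc == text)   # True True
```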

Another possible situation would be a user program in the user's
locale communicating with a daemon running in some other locale (quite
likely POSIX).

So while out of scope of the PEP, I don't think it's at all
artificial.

[Python-Dev] string to float containing whitespace

2009-04-29 Thread skip

Someone please tell me I'm not going mad.  I could have sworn that once upon
a time attempting to convert numeric strings to ints or floats if they
contained whitespace raised an exception.  As far back as 1.5.2 it appears
that float(), string.atof() and string.atoi() allow whitespace.  Maybe I'm
thinking of trailing non-numeric, non-whitespace characters.

Skip

Re: [Python-Dev] string to float containing whitespace

2009-04-29 Thread Amaury Forgeot d'Arc

Hi,

2009/4/29  :
> Someone please tell me I'm not going mad.  I could have sworn that once upon
> a time attempting to convert numeric strings to ints or floats if they
> contained whitespace raised an exception.  As far back as 1.5.2 it appears
> that float(), string.atof() and string.atoi() allow whitespace.  Maybe I'm
> thinking of trailing non-numeric, non-whitespace characters.

You are maybe referring to the Decimal constructor:
   decimal.Decimal(" 123")
fails with 2.5, but works with 2.6. (issue 1780)

-- 
Amaury Forgeot d'Arc
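
For illustration, a short check of the Decimal behaviour described above, assuming Python 2.6 or later (where surrounding whitespace is accepted). The helper name `decimal_parses` is made up for this sketch.

```python
from decimal import Decimal, InvalidOperation


def decimal_parses(text):
    """Return True if Decimal() accepts the string under the default context."""
    try:
        Decimal(text)
        return True
    except InvalidOperation:
        return False


print(decimal_parses(" 123"))     # leading whitespace
print(decimal_parses("123 \n"))   # trailing whitespace
print(decimal_parses("1 23"))     # embedded whitespace
```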

Re: [Python-Dev] string to float containing whitespace

2009-04-29 Thread skip


Amaury> You are maybe referring to the Decimal constructor:
Amaury>    decimal.Decimal(" 123")
Amaury> fails with 2.5, but works with 2.6. (issue 1780)

Highly unlikely, since my recollection is from way back in the early days.
Also, I have yet to actually use the decimal module. :-/

Skip

[Python-Dev] Installing Python 2.5.4 from Source under Windows

2009-04-29 Thread Paul Franz

I have looked and looked and looked, but I cannot find any directions
on how to install the version of Python built using Microsoft's
compiler. It builds. I get the DLLs and the EXEs. But there is no
documentation that says how to install what has been built. I have read
every readme and stopped by the IRC channel, and there seems to be nothing.


Any ideas where I can look?

Paul Franz

Re: [Python-Dev] Installing Python 2.5.4 from Source under Windows

2009-04-29 Thread Aahz

On Wed, Apr 29, 2009, Paul Franz wrote:
>
> I have looked and looked and looked, but I cannot find any directions
> on how to install the version of Python built using Microsoft's
> compiler. It builds. I get the DLLs and the EXEs. But there is no
> documentation that says how to install what has been built. I have read
> every readme and stopped by the IRC channel, and there seems to be nothing.
>
> Any ideas where I can look?

Please use comp.lang.python -- python-dev is for discussion of core
development.
-- 
Aahz ([email protected])   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair

Re: [Python-Dev] Installing Python 2.5.4 from Source under Windows

2009-04-29 Thread Paul Franz


Ok. I will ask on the python-list.

Paul Franz

Aahz wrote:
> On Wed, Apr 29, 2009, Paul Franz wrote:
>> I have looked and looked and looked, but I cannot find any directions
>> on how to install the version of Python built using Microsoft's
>> compiler. It builds. I get the DLLs and the EXEs. But there is no
>> documentation that says how to install what has been built. I have read
>> every readme and stopped by the IRC channel, and there seems to be nothing.
>>
>> Any ideas where I can look?
>
> Please use comp.lang.python -- python-dev is for discussion of core
> development.


[Python-Dev] Proposed: add support for UNC paths to all functions in ntpath

2009-04-29 Thread Larry Hastings



I've written a patch for Python 3.1 that changes os.path so it handles 
UNC paths on Windows:


  http://bugs.python.org/issue5799

In a Windows path string, a UNC path functions *exactly* like a drive
letter.  This patch means that the Python path split/join functions
treat UNC share points as if they were drive letters.

For instance:
   >>> splitdrive("A:\\FOO\\BAR.TXT")
   ("A:", "\\FOO\\BAR.TXT")

With this patch applied:
   >>> splitdrive("\\\\HOSTNAME\\SHARE\\FOO\\BAR.TXT")
   ("\\\\HOSTNAME\\SHARE", "\\FOO\\BAR.TXT")

This methodology only breaks down in one place: there is no "default
directory" for a UNC share point.  E.g. you can say
   >>> os.chdir("c:")
or
   >>> os.chdir("c:foo\\bar")
but you can't say
   >>> os.chdir("\\\\hostname\\share")
But this is irrelevant to the patch.

Here's what my patch changes:
* Modify join, split, splitdrive, and ismount to add explicit support
 for UNC paths.  (The other functions pick up support from these four.)
* Simplify isabs and normpath, now that they don't need to be delicate
 about UNC paths.
* Modify existing unit tests and add new ones.
* Document the changes to the API.
* Deprecate splitunc, with a warning and a documentation remark.

This patch adds one subtle change I hadn't expected.  If you call
split() with a drive letter followed by a trailing slash, it returns the
trailing slash as part of the "head" returned.  E.g.
   >>> os.path.split("\\")
   ("\\", "")
   >>> os.path.split("A:\\")
   ("A:\\", "")
This is mentioned in the documentation, as follows:
   Trailing slashes are stripped from head unless it is the root
   (one or more slashes only).

For some reason, when os.path.split was called with a UNC path with only
a trailing slash, it stripped the trailing slash:
   >>> os.path.split("\\\\hostname\\share\\")
   ("\\\\hostname\\share", "")
My patch changes this behavior; you would now see:
   >>> os.path.split("\\\\hostname\\share\\")
   ("\\\\hostname\\share\\", "")
I think it's an improvement--this is more consistent.  Note that this
does *not* break the documented requirement that
os.path.join(*os.path.split(path)) == path; that continues to work fine.
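
For illustration, the behaviour described above can be tried on any platform by importing `ntpath` directly, assuming a recent Python 3 in which UNC handling along these lines is present. The sample paths and the helper name `round_trips` are made up for this sketch.

```python
import ntpath  # the Windows flavour of os.path; importable on any platform

drive_path = "A:\\FOO\\BAR.TXT"
unc_path = "\\\\HOSTNAME\\SHARE\\FOO\\BAR.TXT"
share_root = "\\\\hostname\\share\\"

# The UNC share point is reported in the "drive" slot, like "A:".
print(ntpath.splitdrive(drive_path))
print(ntpath.splitdrive(unc_path))

# The trailing slash of a bare share point stays in the head, as for "A:\\".
print(ntpath.split(share_root))


def round_trips(path):
    """Check the documented invariant join(*split(path)) == path."""
    return ntpath.join(*ntpath.split(path)) == path


for p in (drive_path, unc_path, share_root):
    print(p, round_trips(p))
```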


In the interests of full disclosure: I submitted a patch providing this
exact behavior just over ten years ago.  GvR accepted it into Python
1.5.2b2 (marked "*EXPERIMENTAL*") and removed it from 1.5.2c1.

You can read GvR's commentary upon removing it; see comments in
Misc/HISTORY dated "Tue Apr  6 19:38:18 1999".  If memory serves
correctly, the problems cited were only on Cygwin.  At the time Cygwin
used "ntpath", and it supported "//a/foo" as an alias for "A:\\FOO".
You can see how this would cause Cygwin problems.


In the intervening decade, two highly relevant things have happened:
* Python no longer uses ntpath for os.path on Cygwin.  Instead it uses
 posixpath.
* Cygwin removed the "//a/foo" drive letter hack.  In fact, I believe it
 now supports UNC paths.
Therefore this patch will have no effect on Cygwin users.


What do you think?


/larry/


Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> In the first paragraph, you should make it clear that Python 3.0 does
> not use the Windows bytes interfaces, if it doesn't.  "Python uses
> *only* the wide character APIs..." would suffice.

That's not quite exact. It uses both ANSI and Wide APIs - depending
on whether you pass bytes as input or strings. Please see the Python
source code to find out how this works, and what that means.

> As stated, it seems
> like Python *does* use the wide character APIs, but leaves open the
> possibility that it might use byte APIs also.  A short description of
> what happens on Windows when Python code uses bytes APIs would also be
> helpful.

I'm at a loss how to make the text clearer than it already is. I'm
really not good at writing long essays, with a lot of
explanatory-but-non-normative text. I also think that explanations do
not belong in the section titled specification, nor does a full
description of the status quo belong in the PEP at all. The reader
should consult the current Python source code if in doubt what the
status quo is.

> In the second paragraph, it speaks of "currently" but then speaks of
> using the half-surrogates.  I don't believe that happens "currently".
> You did change tense, but that paragraph is quite confusing, currently,
> because of the tense change.  You should describe there, the action that
> is currently taken by Python for non-decodable bytes, and then in the
> next paragraph talk about what the PEP changes.

Thanks, fixed.

> The 4th paragraph is now confusing too... would it not be the decode
> error handler that returns the byte strings, in addition to the Unicode
> strings?

No, why do you think so? That's intended as stated.

> The 5th paragraph has apparently confused some people into thinking this
> PEP only applies to locales using UTF-8 encodings; you should have an
> "else clause" to clear that up, pointing out that the reverse encoding
> of half-surrogates by other encodings already produces errors, that
> UTF-8 is a special case, not the only case.

I have fixed that by extending the third paragraph.

> The code added to the discussion has mismatched (), making me wonder if
> it is complete.  There is a reasonable possibility that only the final )
> is missing.

Indeed; this is now also fixed.

Regards,
Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a stenographic formulation that might
>> hint some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
> surrogates.  But such encodings are currently supported by Python, and
> they are used as part of CESU-8 coding.  That's, in fact, a common way
> of converting UTF-16 to UTF-8.  How are you going to deal with existing
> code that relies on being able to code half surrogates as UTF-8?

Can you please elaborate? What code specifically are you talking about?

Regards,
Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

>>> C. File on disk with the invalid surrogate code, accessed via the
>>> str interface, no decoding happens, matches in memory the file on disk
>>> with the byte that translates to the same surrogate, accessed via the
>>> bytes interface.  Ambiguity.
>> What does that mean? What specific interface are you referring to to
>> obtain file names? 
> 
> os.listdir("")
> 
> os.listdir(b"")
> 
> So I guess I'd better suggest that a specific, equivalent directory name
> be passed in either bytes or str form.

[Leaving the issue of the empty string apparently having different
meanings aside ...]

Ok. Now I understand the example. So you do

os.listdir("c:/tmp")
os.listdir(b"c:/tmp")

and you have a file in c:/tmp that is named "abc\uDC10".

> So what you are saying here is that Python doesn't use the "A" forms of
> the Windows APIs for filenames, but only the "W" forms, and uses lossy
> decoding (from MS) to the current code page (which can never be UTF-8 on
> Windows).

Actually, it does use the A form, in the second listdir example. This,
in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
a byte string; the listdirs should give

["abc\uDC10"]
[b"abc?"]

(not quite sure about the second - I only guess that CP_ACP will replace
the half surrogate with a question mark).

So where is the ambiguity here?

> You are further saying that Python doesn't give the programmer control
> over the codec that is used to convert from W results to bytes, so that
> on Windows, it is impossible to obtain a bytes result containing UTF-8
> from os.listdir, even though sys.setfilesystemencoding exists, and
> sys.getfilesystemencoding is affected by it, and the latter is
> documented as returning "mbcs", and as returning the codec that should
> be used by the application to convert str to bytes for filenames.
> (Python 3.0.1).

Not exactly. You *can* do setfilesystemencoding on Windows, but it has
no effect, as the Python file system encoding is never used on Windows.
For a string, it passes it to the W API as is; for bytes, it passes it
to the A API as-is. Python never invokes any codec here.
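
For illustration, the "str in, str out; bytes in, bytes out" convention described above can be observed on any platform with a throwaway directory. This is only a sketch: `os.fsencode` postdates this thread and is used here merely to obtain a bytes path, and the file name `abc.txt` is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one plain-ASCII file so both listings have something to show.
    open(os.path.join(tmp, "abc.txt"), "w").close()

    as_str = os.listdir(tmp)                  # str argument: list of str
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes argument: list of bytes

print(as_str, as_bytes)
```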

> While I can hear a "that is outside the scope of the PEP" coming, this
> documentation is confusing, to say the least.

Only because you are apparently unaware of the status quo. If you would
study the current Python source code, it would be all very clear.

> Things are a little clearer in the documentation for
> sys.setfilesystemencoding, which does say the encoding isn't used by
> Windows -- so why is it permitted to change it, if it has no effect?).

As in many cases: because nobody contributed code to make it behave
otherwise. It's not that the file system encoding is "mbcs" - the
file system encoding is simply unused on Windows (but that wasn't
always the case, in particular not when Windows 9x still had to
be supported).

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

> So while out of scope of the PEP, I don't think it's at all
> artificial.

Sure - but I see this as the same case as "the file got renamed".
If you have a LRU list in your app, and a file gets renamed, then
the LRU list breaks (unless you also store the inode number in the
LRU list, and lookup the file by inode number - or object UUID
on NTFS, possibly using distributed link tracking).

Regards,
Martin
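
For illustration, a sketch of the inode idea mentioned above: remember a file by its (device, inode) pair and look it up again after a rename. The helper names `file_id` and `find_by_id` and the file names are assumptions made for this example; it does not cover NTFS object UUIDs or link tracking.

```python
import os
import tempfile


def file_id(path):
    """Identify a file by (device, inode) rather than by name."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def find_by_id(directory, wanted):
    """Return the current name of the file with the given id, or None."""
    for entry in os.listdir(directory):
        if file_id(os.path.join(directory, entry)) == wanted:
            return entry
    return None


with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "report-old.txt")
    open(old, "w").close()
    remembered = file_id(old)          # what the LRU list would store

    os.rename(old, os.path.join(tmp, "report-new.txt"))
    found = find_by_id(tmp, remembered)

print(found)
```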

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy


Glenn Linderman wrote:
> On approximately 4/29/2009 4:36 AM, came the following characters from
> the keyboard of Cameron Simpson:
>> On 29Apr2009 02:56, Glenn Linderman wrote:
>>> os.listdir(b"")
>>>
>>> I find that on my Windows system, with all ASCII path file names,
>>> that I get quite different results when I pass os.listdir an empty
>>> str vs an empty bytes.
>>>
>>> Rather than keep you guessing, I get the root directory contents
>>> from the empty str, and the current directory contents from an empty
>>> bytes.  That is rather unexpected.
>>>
>>> So I guess I'd better suggest that a specific, equivalent directory
>>> name be passed in either bytes or str form.
>>
>> I think you may have uncovered an implementation bug rather than an
>> encoding issue (because I'd expect "" and b"" to be equivalent).
>
> Me too.

Sounds like an issue for the tracker.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy


Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a stenographic formulation that might
>> hint some problem you apparently see (which I believe is just not
>> there).
>
> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
> surrogates.

By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

> But such encodings are currently supported by Python, and
> they are used as part of CESU-8 coding.  That's, in fact, a common way
> of converting UTF-16 to UTF-8.  How are you going to deal with existing
> code that relies on being able to code half surrogates as UTF-8?



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 1:28 PM, came the following characters from 
the keyboard of Martin v. Löwis:

>>>> C. File on disk with the invalid surrogate code, accessed via the
>>>> str interface, no decoding happens, matches in memory the file on disk
>>>> with the byte that translates to the same surrogate, accessed via the
>>>> bytes interface.  Ambiguity.
>>> What does that mean? What specific interface are you referring to to
>>> obtain file names?
>>
>> os.listdir("")
>>
>> os.listdir(b"")
>>
>> So I guess I'd better suggest that a specific, equivalent directory name
>> be passed in either bytes or str form.
>
> [Leaving the issue of the empty string apparently having different
> meanings aside ...]
>
> Ok. Now I understand the example. So you do
>
> os.listdir("c:/tmp")
> os.listdir(b"c:/tmp")
>
> and you have a file in c:/tmp that is named "abc\uDC10".
>
>> So what you are saying here is that Python doesn't use the "A" forms of
>> the Windows APIs for filenames, but only the "W" forms, and uses lossy
>> decoding (from MS) to the current code page (which can never be UTF-8 on
>> Windows).
>
> Actually, it does use the A form, in the second listdir example. This,
> in turn (inside Windows), uses the lossy CP_ACP encoding. You get back
> a byte string; the listdirs should give
>
> ["abc\uDC10"]
> [b"abc?"]
>
> (not quite sure about the second - I only guess that CP_ACP will replace
> the half surrogate with a question mark).
>
> So where is the ambiguity here?


None.  But not everyone can read all the Python source code to try to 
understand it; they expect the documentation to help them avoid that. 
Because the documentation is lacking in this area, it makes your 
concisely stated PEP rather hard to understand.


Thanks for clarifying the Windows behavior, here.  A little more 
clarification in the PEP could have avoided lots of discussion.  It 
would seem that a PEP, proposed to modify a poorly documented (and 
therefore likely poorly understood) area, should be educational about 
the status quo, as well as presenting the suggested change.  Or is it 
the Python philosophy that the PEPs should be as incomprehensible as 
possible, to generate large discussions?



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] string to float containing whitespace

2009-04-29 Thread Martin v. Löwis

skip wrote:
> Someone please tell me I'm not going mad.  I could have sworn that once upon
> a time attempting to convert numeric strings to ints or floats if they
> contained whitespace raised an exception.  As far back as 1.5.2 it appears
> that float(), string.atof() and string.atoi() allow whitespace.  Maybe I'm
> thinking of trailing non-numeric, non-whitespace characters.

Maybe you remember truly *embedded* whitespace:

py> float("1. 3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for float(): 1. 3

Regards,
Martin
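
For illustration, a quick check of the distinction drawn above between surrounding and embedded whitespace. The helper name `parses` is made up for this sketch.

```python
def parses(text):
    """Return True if float() accepts the string."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Leading and trailing whitespace is stripped before conversion.
print(float(" 1.5 \n"))
print(int("\t42 "))

# Whitespace inside the number is rejected.
for text in ("1. 3", "1 000"):
    print(repr(text), parses(text))
```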

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> The whole purpose of PEP 383 is to send the exact same bytes that were
> read from the OS back to the OS => violating (2) (for whatever the
> apparent system file-encoding is, not limited to UTF-8), and that has
> overwhelmingly popular support.
> 
> Note that this won't happen automatically, either, AIUI.  The PEP's
> proposed implementation is as an error handler, and this would need to
> be specified explicitly.  It's not intended to be the default.

Actually, no: the error handler will be automatically used in all places
that convert file names to bytes. I have clarified the PEP to make that
explicit. IOW, it replaces the current "strict" setting in these cases.

Regards,
Martin
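
For illustration, a sketch of the round trip the PEP is after. It assumes a Python 3.1+ interpreter, where the handler discussed in this thread as "utf-8b" is available under the name `surrogateescape` (a name not used in the thread). The sample bytes are made up.

```python
raw = b"caf\xe9.txt"   # a Latin-1 encoded name; 0xE9 is not valid UTF-8 here

# Decoding maps the undecodable byte 0xE9 to the half surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the exact original bytes.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw)

# Without the handler, the strict codec refuses the lone surrogate.
try:
    name.encode("utf-8")
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False
print(strict_accepts)
```

In later releases `os.fsdecode` and `os.fsencode` apply this handler on POSIX systems, so application code rarely needs to name it explicitly.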

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson

On 29Apr2009 17:03, Terry Reedy  wrote:
> Thomas Breuel wrote:
>> Sure. However, that requires you to provide meaningful, reproducible
>> counter-examples, rather than a stenographic formulation that might
>> hint some problem you apparently see (which I believe is just not
>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of 
>> half surrogates. 
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

5.0 also disallows it. No surprise I guess.
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Out on the road, feeling the breeze, passing the cars.  - Bob Seger
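
For illustration, how this looks on a Python 3.1+ interpreter (an assumption; the Python 3.0 codec discussed in the thread behaved differently): the strict UTF-8 codec rejects a lone half surrogate, and the CESU-8-style byte sequence is only produced on request through the `surrogatepass` handler.

```python
lone = "\ud800"   # a single high (lead) surrogate

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Explicit opt-in: encode the surrogate as a three-byte sequence.
passed = lone.encode("utf-8", "surrogatepass")

print(strict_ok)
print(passed)
```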

Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Glenn Linderman

On approximately 4/29/2009 1:06 PM, came the following characters from 
the keyboard of Martin v. Löwis:


> Thanks, fixed.


Thanks for your fixes.  They are helpful.



> I'm at a loss how to make the text clearer than it already is. I'm
> really not good at writing long essays, with a lot of
> explanatory-but-non-normative text. I also think that explanations do
> not belong in the section titled specification, nor does a full
> description of the status quo belong in the PEP at all. The reader
> should consult the current Python source code if in doubt what the
> status quo is.



The status quo is what justifies the existence of the PEP.  If the 
status quo were perfect, there would be no need for the PEP.


The status quo should be described in the Rationale.  Some of it is. 
The rest of it should be.  If there is a need for this PEP for POSIX, 
but not Windows, the reason why should be given (Para 2 in Rationale 
seems to try to describe that, but doesn't go far enough), and also the 
reason that cross-platform code can install this PEP's error handler on 
both platforms, yet it won't affect bytes interfaces on Windows.  These 
are two omissions that have both caused large amounts of discussion.


Attempting to understand the Python source code is a good thing, but 
there is a lot to understand, and few will achieve a full understanding.





>> The 4th paragraph is now confusing too... would it not be the decode
>> error handler that returns the byte strings, in addition to the Unicode
>> strings?
>
> No, why do you think so? That's intended as stated.



Here, a use case, or several, in the PEP could help clarify why it would 
be the encode error handler that would return both the bytes string and 
the Unicode string.  And why the decode error handler would not need to.


Seems that if the decode handler preserved the bytes from the OS, and 
made them available as well as the decoded Unicode, that could be 
interesting to the application that is wanting to manipulate the file.


The encode handler is given the Unicode, so it is not clear why
it should also return it.  I guess if there is an error during the
encode process (can there be?) then having both the bytes and the
Unicode for comparison could be useful for error reporting.


But I shouldn't have to guess.  The PEP should explain how these things 
are useful.  The discussion section could be extended with use cases for 
both the encode and decode cases.



--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson

On 29Apr2009 22:14, Stephen J. Turnbull  wrote:
| Baptiste Carvello writes:
|  > By contrast, if the new utf-8b codec would *supercede* the old one,
|  > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
|  > surrogates are unused). Thus ambiguity could be avoided.
| 
| Unfortunately, that's false.  It could have come from a literal string
| (similar to the text above ;-), a C extension, or a string slice (on
| 16-bit builds), and there may be other ways to do it.  The only way to
| avoid ambiguity is to change the definition of a Python string to be
| *valid* Unicode (possibly with Python extensions such as PEP 383 for
| internal use only).  But Guido has rejected that in the past;
| validation is the application's problem, not Python's.
| 
| Nor is a UCS-4 build exempt.  IIRC Guido specifically envisioned
| Python strings being used to build up code point sequences to be
| directly output, which means that a UCS-4 string might none-the-less
| contain surrogates being added to a string intended to be sent as
| UTF-16 output simply by truncating the 32-bit code units to 16 bits.

Wouldn't you then be bypassing the implicit encoding anyway, at least to
some extent, and thus not trip over the PEP?
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

Clemson is the Harvard of cardboard packaging.
- overhead by WIRED at the Intelligent Printing conference Oct2006
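
For illustration, a sketch of the ambiguity under discussion, assuming Python 3.1+ and the `surrogateescape` handler name (not used in the thread): a half surrogate written directly in a string literal is indistinguishable from one produced by escaping an undecodable byte.

```python
# A string built by hand, e.g. from a literal or a C extension.
from_literal = "abc\udc80"

# A string produced by decoding a file name containing the stray byte 0x80.
from_bytes = b"abc\x80".decode("utf-8", "surrogateescape")

print(from_literal == from_bytes)

# Both therefore encode back to the same raw bytes.
encoded = from_literal.encode("utf-8", "surrogateescape")
print(encoded)
```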

Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath

2009-04-29 Thread Michael Foord

Larry Hastings wrote:
> I've written a patch for Python 3.1 that changes os.path so it handles
> UNC paths on Windows:
>
>   http://bugs.python.org/issue5799

+1 for the feature. I have to deal with Windows networks from time to
time and this would be useful.

Michael

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott



On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:

> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Forgive me if this has been covered. I've been reading this thread for
a long time and still have 100-odd replies to go...

How do I get a printable Unicode version of these path strings if they
contain non-Unicode data?

I'm guessing that an app has to understand that filenames come in two
forms, Unicode and bytes, if it's not UTF-8 data. Why not simply return
a string if it's valid UTF-8, and otherwise return bytes? Then in the
app you check the type of the object, string or bytes, and deal with
reporting errors appropriately.

Barry
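
For illustration, one possible answer to the "printable version" question, as a sketch only: it assumes Python 3.1+ with the `surrogateescape` handler name (not used in the thread), and the helper name `printable` is made up. Escaped bytes are shown as visible backslash escapes, and valid characters are left alone.

```python
def printable(name):
    """Return a copy of name in which half surrogates become \\udcXX escapes."""
    return name.encode("utf-8", "backslashreplace").decode("utf-8")


# A file name containing one undecodable byte (0xE9).
name = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

print(printable(name))
print(printable("plain.txt"))
```

The result is for display only; the original `name` is what should be passed back to the OS.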


Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath

2009-04-29 Thread Eric Smith


Michael Foord wrote:
> Larry Hastings wrote:
>> I've written a patch for Python 3.1 that changes os.path so it handles
>> UNC paths on Windows:
>>
>>   http://bugs.python.org/issue5799
>
> +1 for the feature. I have to deal with Windows networks from time to
> time and this would be useful.

+1 from me, too. I haven't looked at the implementation, but for sure
the feature would be welcome.

>> In the interests of full disclosure: I submitted a patch providing this
>> exact behavior just over ten years ago.  GvR accepted it into Python
>> 1.5.2b2 (marked "*EXPERIMENTAL*") and removed it from 1.5.2c1.

>> In the intervening decade, two highly relevant things have happened:
>> * Python no longer uses ntpath for os.path on Cygwin.  Instead it uses
>>  posixpath.
>> * Cygwin removed the "//a/foo" drive letter hack.  In fact, I believe it
>>  now supports UNC paths.
>> Therefore this patch will have no effect on Cygwin users.


Yes, cygwin supports UNC paths with //host/share, and they use 
/cygdrive/a, etc., to refer to physical drives. It's been that way for 
as long as I recall, at least 7 years.
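
As a rough illustration of what UNC-aware path handling looks like from Python, the sketch below calls ntpath.splitdrive and posixpath.splitdrive directly (both modules can be imported on any platform). It assumes a Python 3 release that includes the UNC support discussed in this thread; the host and share names are made up.

```python
import ntpath
import posixpath

# Hypothetical UNC path: \\host\share acts as the "drive", the rest is the path.
unc = r"\\host\share\dir\file.txt"
drive, rest = ntpath.splitdrive(unc)
print(drive)
print(rest)

# posixpath (what os.path is on Cygwin, per the message above) has no notion
# of drives, so the whole string comes back as the path component.
cyg_drive, cyg_rest = posixpath.splitdrive("//host/share/dir/file.txt")
print(repr(cyg_drive), cyg_rest)
```

Because Cygwin's os.path is posixpath, a change confined to ntpath leaves the second call untouched, which is consistent with the claim that Cygwin users see no effect.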


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson

On 29Apr2009 23:41, Barry Scott  wrote:
> On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
>> If the locale's encoding is UTF-8, the file system encoding is set to
>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Forgive me if this has been covered. I've been reading this thread for a
> long time and still have 100-odd replies to go...
>
> How do I get a printable Unicode version of these path strings if they
> contain non-Unicode data?

Personally, I'd use repr(). One might ask, what would you expect to see
if you were printing such a string?
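
A minimal sketch of the repr() idea, using the "surrogateescape" error handler, which is the name under which this mechanism eventually shipped in Python 3.1 and later (the thread still calls it utf-8b). The sample file name is made up.

```python
# A made-up file name containing one byte that is not valid UTF-8.
raw = b"caf\xff.txt"

# Decode the way the PEP proposes: the bad byte becomes the code point U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# ascii() (like repr()) escapes the lone surrogate instead of failing,
# so the result is plain ASCII and safe to print or log.
shown = ascii(name)
print(shown)
```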

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a
> string if it's valid UTF-8, otherwise return bytes? Then in the app you
> check the type of the object, string or bytes, and deal with reporting
> errors appropriately.

Because it complicates the app enormously, for every app.

It would be _nice_ to just call os.listdir() et al with strings, get
strings, and not worry.
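
To make that cost concrete, here is a hedged sketch of what calling code looks like when a listing can contain both str and bytes. The helper describe() and the sample listing are hypothetical; the mixed list mimics the "string if valid, otherwise bytes" proposal, not what any particular os.listdir() returns.

```python
def describe(entry):
    """Return a printable label for a directory entry that may be str or bytes."""
    if isinstance(entry, bytes):
        # Undecodable name: fall back to an escaped form for display.
        return "bytes:" + entry.decode("ascii", "backslashreplace")
    return "str:" + entry

# Hypothetical listing under the "str if valid, else bytes" scheme.
mixed_listing = ["x", b"\xff"]
labels = [describe(e) for e in mixed_listing]
print(labels)
```

Every function that joins, sorts, or prints names would need a branch like the one in describe().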

With strings becoming unicode in Python3, on POSIX you have an issue of
deciding how to get its filenames-are-bytes into a string and the
reverse. One could naively map the byte values to the same Unicode code
points, but that results in strings that do not contain the same
characters as the user/app expects for byte values above 127.
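
The naive mapping can be sketched as a Latin-1 decode, since Latin-1 assigns each byte value the code point with the same number. The example name is an assumption chosen for illustration.

```python
# Bytes as a UTF-8 user would have written them for the one-character name U+00E9.
raw = "\u00e9".encode("utf-8")     # two bytes: 0xC3 0xA9

# Naive mapping: each byte value becomes the code point with the same number.
naive = raw.decode("latin-1")

# The result has two characters (U+00C3, U+00A9) instead of the single U+00E9.
print(len(naive), ascii(naive))
```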

Since POSIX does not really have a filesystem level character encoding,
just a user environment setting that says how the current user encodes
characters into bytes (UTF-8 is increasingly common and useful, but
it is not universal), it is more useful to decode filenames on the
assumption that they represent characters in the user's (current) encoding
convention; that way when things are displayed they are meaningful,
and they interoperate well with strings made by the user/app. If all
the filenames were actually encoded that way when made, that works. But
different users may adopt different conventions, and indeed a user may
have used ASCII or an ISO8859-* encoding in the past and be transitioning
to something else now, so they will have a bunch of files in different
encodings.

The PEP uses the user's current encoding with a handler for byte
sequences that don't decode to valid Unicode scalar values in
a fashion that is reversible. That is, you get "strings" out of
listdir() and those strings will go back in (eg to open()) perfectly
robustly.
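
The round trip can be sketched at the codec level. This assumes the "surrogateescape" error handler, which is how the PEP's mechanism was eventually exposed in Python 3.1 and later; the thread's "utf-8b" corresponds to UTF-8 combined with that handler.

```python
raw = b"\xff"                      # not valid UTF-8 on its own

# Decoding: the undecodable byte 0xFF maps to the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original byte exactly.
back = name.encode("utf-8", "surrogateescape")

print(ascii(name), back)
```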

Previous approaches would either silently hide non-decodable names in
listdir() results or throw exceptions when the decode failed or mangle
things irreversibly. I believe Python3 went with the first option
there.

The PEP at least lets programs naively access all files that exist,
and create a filename from any well-formed unicode string provided that
the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

  - Glenn points out that strings that came _not_ from listdir, and that are
    _not_ well-formed unicode (== "have bare surrogates in them") but that
    were intended for use as filenames will conflict with the PEP's scheme -
    programs must know that these strings came from outside and must be
    translated into the PEP's funny-encoding before use in the os.*
    functions. Previous to the PEP they would get used directly and
    encode differently after the PEP, thus producing different POSIX
    filenames. Breakage.

  - Glenn would like the encoding to use Unicode scalar values only,
    using a rare-in-filenames character.
    That would avoid the issue with "outside" strings that contain
    surrogates. To my mind it just moves the punning from rare illegal
    strings to merely uncommon but legal characters.

  - Some parties think it would be better to not return strings from
    os.listdir but a subclass of string (or at least a duck-type of
    string) that knows where it came from and is also handily
    recognisable as not-really-a-string for purposes of deciding
    whether it is PEP-funny-encoded by direct inspection.
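
The first and third points can be illustrated with a small sketch. has_escaped_bytes() is a hypothetical helper, and the code assumes the U+DC80..U+DCFF range from the PEP together with the "surrogateescape" handler name used in released Python 3.

```python
import re

# Code points the PEP uses to carry undecodable bytes: U+DC80..U+DCFF.
_ESCAPED = re.compile("[\udc80-\udcff]")

def has_escaped_bytes(name):
    """True if name contains a lone surrogate in the PEP's escape range."""
    return _ESCAPED.search(name) is not None

from_listdir = b"\xff".decode("utf-8", "surrogateescape")
outside = "\udcff"   # same code point, but built by the program itself

# Inspection finds the surrogate, but cannot tell the two origins apart.
print(has_escaped_bytes(from_listdir), from_listdir == outside)
```

Since both strings compare equal, a plain str carries no record of where it came from, which is the motivation for the subclass idea in the last point.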

Cheers,
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

The peever can look at the best day in his life and sneer at it.
- Jim Hill, JennyGfest '95

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Aahz

On Thu, Apr 30, 2009, Cameron Simpson wrote:
>
> The lengthy discussion mostly revolves around:
> 
>   - Glenn points out that strings that came _not_ from listdir, and that are
>     _not_ well-formed unicode (== "have bare surrogates in them") but that
>     were intended for use as filenames will conflict with the PEP's scheme -
>     programs must know that these strings came from outside and must be
>     translated into the PEP's funny-encoding before use in the os.*
>     functions. Previous to the PEP they would get used directly and
>     encode differently after the PEP, thus producing different POSIX
>     filenames. Breakage.
>
>   - Glenn would like the encoding to use Unicode scalar values only,
>     using a rare-in-filenames character.
>     That would avoid the issue with "outside" strings that contain
>     surrogates. To my mind it just moves the punning from rare illegal
>     strings to merely uncommon but legal characters.
>
>   - Some parties think it would be better to not return strings from
>     os.listdir but a subclass of string (or at least a duck-type of
>     string) that knows where it came from and is also handily
>     recognisable as not-really-a-string for purposes of deciding
>     whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be
incorporated into the PEP.
-- 
Aahz ([email protected])   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair

Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Aahz

On Wed, Apr 29, 2009, "Martin v. Löwis" wrote:
>
> I'm at a loss how to make the text more clear than it already is. I'm
> really not good at writing long essays, with a lot of
> explanatory-but-non-normative text. I also think that explanations do
> not belong in the section titled specification, nor does a full
> description of the status quo belong in the PEP at all. The reader
> should consult the current Python source code if in doubt what the
> status quo is.

Perhaps not a full description of the status quo, but the PEP definitely
needs a good summary -- remember that PEPs are not just for the time that
they are written, but also for the future.  While telling people to "read
the source, Luke" makes some sense at a specific point in time, I don't
think that requiring a trawl through code history is fair.

And, yes, PEP-writing is painful.
-- 
Aahz ([email protected])   <*> http://www.pythoncraft.com/

"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur."  --Red Adair

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Thomas Breuel

>
> The whole purpose of PEP 383 is to send the exact same bytes that were
> read from the OS back to the OS => violating (2) (for whatever the
> apparent system file-encoding is, not limited to UTF-8),


It's fine to read a file name from a file system and write the same file
back as the same raw byte sequence.  That I don't have a problem with; it's
not quite right, but it's harmless.

The problem with this PEP is that the malformed unicode it produces can end
up in so many other places: as file names on another file system, in string
processing libraries, in text files, in databases, in user interfaces,
etc.   Some of those destinations will use the utf-8b decoder, so they will
get byte sequences that never could occur before and that are illegal under
unicode.

Nobody knows what will happen.  And, yes, Martin is proposing that this is
the default behavior.

There are several other issues that are unresolved: utf-8b makes some
current practices illegal; for example, it might break CESU-8 encodings.
Also, what are Jython and IronPython supposed to do on UNIX?  Can they
implement these semantics at all?


> and that has overwhelmingly popular support.


I think people don't fully understand the tradeoffs.  I certainly don't.
Although there is a slight benefit, there are unknown and potentially large
costs. We'd be changing Python's entire unicode string behavior for the sake
of one use case.  Since our uses of Python actually involve a lot of
unicode, I am wary of having malformed unicode crop up legally in Python
code.

And that's why I think this proposal should be shelved for a while until
people have had more time to try to understand the issues and also come up
with alternative proposals.  Once this is adopted and implemented in
C-Python, Python is stuck with it forever.

Tom

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Curt Hagenlocher

On Wed, Apr 29, 2009 at 8:16 PM, Thomas Breuel  wrote:
>
> Also, what are Jython and IronPython supposed to do on UNIX?  Can they
> implement these semantics at all?

IronPython will inherit whatever behavior Mono has implemented. The
Microsoft CLR defines the native string type as UTF-16 and all of the
managed APIs for things like file names and environmental variables
operate on UTF-16 strings -- there simply are no byte string APIs.

I assume that Mono does the same but I don't have any Mono experience.

--
Curt Hagenlocher
[email protected]

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Steven D'Aprano

On Thu, 30 Apr 2009 01:16:20 pm Thomas Breuel wrote:
> And that's why I think this proposal should be shelved for a while
> until people have had more time to try to understand the issues and
> also come up with alternative proposals.  Once this is adopted and
> implemented in C-Python, Python is stuck with it forever.

+1 on this. I'm going to quote the Zen here:

Now is better than never.
Although never is often better than *right* now.

I don't understand the proposal and issues. I see a lot of people 
claiming that they do, and then spending all their time either 
talking past each other, or disagreeing. If everyone who claims they 
understand the issues actually does, why is it so hard to reach a 
consensus?

I'd like to see some real examples of how things can break in the 
current system, and I'd like any potential solution to be made 
available as a third-party package before it goes into the standard 
library (if possible). Currently, we're reduced to trying to predict 
the consequences of implementing the PEP, instead of being able to 
try it out and see.

Even something like a test suite would be useful: here are a bunch of 
malformed file names, and this is what happens when you try to work 
with them. Please, let's see some code we can run, not more words.

-- 
Steven D'Aprano 

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy


Glenn Linderman wrote:
> On approximately 4/29/2009 1:28 PM, came the following characters from
>
>> So where is the ambiguity here?
>
> None.  But not everyone can read all the Python source code to try to
> understand it; they expect the documentation to help them avoid that.
> Because the documentation is lacking in this area, it makes your
> concisely stated PEP rather hard to understand.

If you think a section of the doc is grossly inadequate, and there is no
existing issue on the tracker, feel free to add one.

> Thanks for clarifying the Windows behavior, here.  A little more
> clarification in the PEP could have avoided lots of discussion.  It
> would seem that a PEP, proposed to modify a poorly documented (and
> therefore likely poorly understood) area, should be educational about
> the status quo, as well as presenting the suggested change.


Where the PEP proposes to change, it should start with the status quo. 
But Martin's somewhat reasonable position is that since he is not 
proposing to change behavior on Windows, it is not his responsibility to 
document what he is not proposing to change more adequately.  This 
means, of course, that any observed change on Windows would then be a 
bug, or at least a break of the promise.  On the other hand, I can see 
that this is enough related to what he is proposing to change that 
better doc would help.


tjr


Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> But I shouldn't have to guess.  The PEP should explain how these things
> are useful.  The discussion section could be extended with use cases for
> both the encode and decode cases.

See PEP 293.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

> How do I get a printable Unicode version of these path strings if they
> contain non-Unicode data?

Define "printable". One way would be to use a regular expression,
replacing all codes in a certain range with a question mark.
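
A minimal sketch of that regular-expression approach. The range is an assumption taken from the PEP's U+DC80..U+DCFF escape codes, printable_name() is a hypothetical helper, and the decode step uses the "surrogateescape" handler name from released Python 3.

```python
import re

def printable_name(name):
    """Replace escaped undecodable bytes (U+DC80..U+DCFF) with '?'."""
    return re.sub("[\udc80-\udcff]", "?", name)

# A made-up file name with one byte that is not valid UTF-8.
name = b"report\xff.txt".decode("utf-8", "surrogateescape")
print(printable_name(name))
```

The substitution is lossy, so the result is suitable for display only and should not be passed back to open().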

> I'm guessing that an app has to understand that filenames come in two
> forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a
> string if it's valid UTF-8, otherwise return bytes?

That would have been an alternative solution, and the one that 2.x uses
for listdir. People didn't like it.

Regards,
Martin

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> I don't understand the proposal and issues. I see a lot of people 
> claiming that they do, and then spending all their time either 
> talking past each other, or disagreeing. If everyone who claims they 
> understand the issues actually does, why is it so hard to reach a 
> consensus?

Because the problem is difficult, and any solution has trade-offs.
People disagree on which trade-offs are worse than others.

> I'd like to see some real examples of how things can break in the 
> current system

Suppose I create a new directory, and run the following script
in 3.x:

py> import os
py> open("x","w").close()
py> open(b"\xff","w").close()
py> os.listdir(".")
['x']

If I quit Python, I can now do

mar...@mira:~/work/3k/t$ ls
?  x
mar...@mira:~/work/3k/t$ ls -b
\377  x

As you can see, there are two files in the current directory, but
only one of them is reported by os.listdir. The same happens to
command line arguments and environment variables: Python might swallow
some of them.

> and I'd like any potential solution to be made 
> available as a third-party package before it goes into the standard 
> library (if possible).

Unfortunately, at least for my solution, this isn't possible. I need
to change the implementation of the existing file IO APIs.

> Currently, we're reduced to trying to predict 
> the consequences of implementing the PEP, instead of being able to 
> try it out and see.

In a sense, this is one of the primary points of the PEP process:
to discuss a specification before the effort to produce an
implementation is started.

> Even something like a test suite would be useful: here are a bunch of 
> malformed file names, and this is what happens when you try to work 
> with them. Please, let's see some code we can run, not more words.

Just try my example above, on a Linux system, in a UTF-8 locale.
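
The same experiment can be packaged as a self-contained script. This is a sketch under stated assumptions: a POSIX filesystem that accepts arbitrary bytes in file names (typical on Linux) and a Python 3 release that includes the PEP's mechanism; list_both_ways() is a hypothetical helper name. On a pre-PEP 3.x interpreter, the str listing would be the one that drops the undecodable name.

```python
import os
import tempfile

def list_both_ways(raw_name=b"\xff"):
    """Create one ordinary and one non-UTF-8 file name in a scratch
    directory, then list that directory as str and as bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        open(os.path.join(tmp, "x"), "w").close()
        # Use the bytes interface to create a name that is not valid UTF-8.
        open(os.path.join(os.fsencode(tmp), raw_name), "w").close()
        as_str = sorted(os.listdir(tmp))
        as_bytes = sorted(os.listdir(os.fsencode(tmp)))
    return as_str, as_bytes

as_str, as_bytes = list_both_ways()
print([ascii(n) for n in as_str])
print(as_bytes)
```

Comparing the two listings shows whether any file is hidden from the str interface, and os.fsencode() on each str name shows whether the original bytes can be recovered.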

Regards,
Martin

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

Curt Hagenlocher wrote:
> On Wed, Apr 29, 2009 at 8:16 PM, Thomas Breuel  wrote:
>> Also, what are Jython and IronPython supposed to do on UNIX?  Can they
>> implement these semantics at all?
> 
> IronPython will inherit whatever behavior Mono has implemented. The
> Microsoft CLR defines the native string type as UTF-16 and all of the
> managed APIs for things like file names and environmental variables
> operate on UTF-16 strings -- there simply are no byte string APIs.
> 
> I assume that Mono does the same but I don't have any Mono experience.

Marcin Kowalczyk once did a review, at

http://mail.python.org/pipermail/python-3000/2007-September/010450.html

It may have changed since then; at the time, Mono would omit
non-decodable files in directory listings, and would refuse to start
if a non-decodable command line argument is passed. The environment
variable MONO_EXTERNAL_ENCODINGS can be set to specify what
encodings should be tried in what order.
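
The general "try encodings in order" idea can be sketched in Python as follows. This only illustrates the technique; decode_first_match() is a hypothetical helper and does not reproduce Mono's actual algorithm or the syntax of the environment variable.

```python
def decode_first_match(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw, else None."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None

# 0xFF is invalid as UTF-8, so the fallback encoding is the one that matches.
result = decode_first_match(b"\xff", ["utf-8", "latin-1"])
print(result[1])
```

Unlike the PEP's scheme, a fallback chain like this does not record which encoding was used, so encoding the text again may not reproduce the original bytes.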

However, I don't think it is relevant for the PEP: as Curt says, these
details will be inherited from the VM; the mechanism proposed is really
specific to CPython. To implement it on the other VMs, those would have
to either implement it natively, or provide byte-oriented APIs to allow
Jython/IronPython to implement it on top of it (the latter being not
realistic or useful).

Regards,
Martin

Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

> Perhaps not a full description of the status quo, but the PEP definitely
> needs a good summary

I completely agree, and believe that the PEP *does* have a good
summary - it has both an abstract, and a rationale, and both say
exactly what I want them to say. If people want them to say different
things, they have to tell me what specifically they want it to say
(perhaps even with specific formulations). If they can't communicate
their requests to me, I can't comply.

Regards,
Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis

> Thanks for clarifying the Windows behavior, here.  A little more
> clarification in the PEP could have avoided lots of discussion.  It
> would seem that a PEP, proposed to modify a poorly documented (and
> therefore likely poorly understood) area, should be educational about
> the status quo, as well as presenting the suggested change.  Or is it
> the Python philosophy that the PEPs should be as incomprehensible as
> possible, to generate large discussions?

Certainly not. See PEP 277 for a description of a specification of
how file names are handled on Windows.

Large discussions could be reduced if readers would try to
constructively comment on the PEP, rather than making counter-proposals,
or making statements about the PEP without making their implied
assumptions explicit.

Regards,
Martin

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Jeroen Ruigrok van der Werven

-On [20090430 07:18], "Martin v. Löwis" ([email protected]) wrote:
>Suppose I create a new directory, and run the following script
>in 3.x:
>
>py> open("x","w").close()
>py> open(b"\xff","w").close()
>py> os.listdir(".")
>['x']

That is actually a regression in 3.x:

Python 2.6.1 (r261:67515, Mar  8 2009, 11:36:21) 
>>> import os
>>> open("x","w").close()
>>> open(b"\xff","w").close()
>>> os.listdir(".")
['x', '\xff']

[Apologies if that was completely clear through the entire discussion, but
I've lost track at a given point.]

-- 
Jeroen Ruigrok van der Werven  / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Heart is the engine of your body, but Mind is the engine of Life...

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel

On Wed, Apr 29, 2009 at 23:03, Terry Reedy  wrote:

> Thomas Breuel wrote:
>
>>> Sure. However, that requires you to provide meaningful, reproducible
>>> counter-examples, rather than a stenographic formulation that might
>>> hint some problem you apparently see (which I believe is just not
>>> there).
>>
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
>> surrogates.
>
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows
> that.

If we use conformance to Unicode 5.1 as the basis for our discussion, then
PEP 383 is off the table anyway.  I'm all for strict Unicode compliance.
But apparently, the Python community doesn't care.

CESU-8 is described in Unicode Technical Report #26, so it at least has some
official recognition.  More importantly, it's also widely used.  So, my
question: what are the implications of PEP 383 for CESU-8 encodings on
Python?
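
One concrete data point for this question, as a hedged sketch: the bytes ED B3 BF are what a naive UTF-8 encoder would emit for the half surrogate U+DCFF. The example shows how Python 3's strict decoder and the "surrogateescape" handler (the name under which the PEP's mechanism eventually shipped) treat them. It says nothing about CESU-8 support as a whole.

```python
encoded_surrogate = b"\xed\xb3\xbf"   # naive UTF-8 form of U+DCFF

# Strict UTF-8 rejects encoded surrogates outright.
try:
    encoded_surrogate.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape the bytes are treated as undecodable and escaped
# one by one, so encoding again restores the original bytes.
escaped = encoded_surrogate.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

print(strict_ok, len(escaped), round_trip == encoded_surrogate)
```

So such byte sequences are not decoded into the surrogate they nominally encode, but they do survive a decode and encode cycle unchanged.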

My meta-point is: there are probably many more such issues hidden away and
it is a really bad idea to rush something like PEP 383 out.  Unicode is hard
anyway, and tinkering with its semantics requires a lot of thought.

Tom

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Thomas Breuel

On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher wrote:

>  IronPython will inherit whatever behavior Mono has implemented. The
> Microsoft CLR defines the native string type as UTF-16 and all of the
> managed APIs for things like file names and environmental variables
> operate on UTF-16 strings -- there simply are no byte string APIs.

Yes.  Now think about the implications.  This means that adopting PEP 383
will make IronPython and Jython running on UNIX intrinsically incompatible
with CPython running on UNIX, and there's no way to fix that.

Tom

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

Jeroen Ruigrok van der Werven wrote:
> -On [20090430 07:18], "Martin v. Löwis" ([email protected]) wrote:
>> Suppose I create a new directory, and run the following script
>> in 3.x:
>>
>> py> open("x","w").close()
>> py> open(b"\xff","w").close()
>> py> os.listdir(".")
>> ['x']
> 
> That is actually a regression in 3.x:

Correct - and precisely the issue that this PEP wants to address.

For comparison, do os.listdir(u"."), though:

py> os.listdir(u".")
[u'x', '\xff']

Regards,
Martin

Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Martin v. Löwis

Thomas Breuel wrote:
> On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher wrote:
>
>> IronPython will inherit whatever behavior Mono has implemented. The
>> Microsoft CLR defines the native string type as UTF-16 and all of the
>> managed APIs for things like file names and environmental variables
>> operate on UTF-16 strings -- there simply are no byte string APIs.
>
> Yes.  Now think about the implications.  This means that adopting PEP
> 383 will make IronPython and Jython running on UNIX intrinsically
> incompatible with CPython running on UNIX, and there's no way to fix that.

*Not* adopting the PEP will also make CPython and IronPython
incompatible, and there's no way to fix that.

Regards,
Martin
