Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
> I propose to raise Unicode errors if a filename cannot be decoded on Windows, 
> instead of creating bogus filenames with question marks.

Can you please elaborate what APIs you are talking about exactly?

If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
this proposal. People that explicitly use bytes for file names deserve
to get whatever exact platform semantics the platform has to offer. This
is true on Unix, and it is also true on Windows.

Regards,
Martin
___
Python-Dev mailing list
python-dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Antoine Pitrou
On Tue, 25 Oct 2011 00:57:42 +0200
Victor Stinner  wrote:
> Hi,
> 
> I propose to raise Unicode errors if a filename cannot be decoded on Windows, 
> instead of creating bogus filenames with question marks. Because this 
> change is incompatible with Python 3.2, even if such filenames are unusable 
> and I consider the problem as a (Python?) bug, I would like your opinion on 
> such a change before working on a patch.

+1 from me.

Regards

Antoine.




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 13:20:12, you wrote:
> Victor Stinner writes:
>  > I propose to raise Unicode errors if a filename cannot be decoded
>  > on Windows, instead of creating bogus filenames with question
>  > marks.
> 
> By "bogus" you mean "sometimes (?) invalid and the OS will refuse to
> use them, causing a later hard-to-diagnose exception", rather than
> "not what the user thinks he wants", right?

If the ("Unicode") filename cannot be encoded to the ANSI code page, which is 
usually a small charset (e.g. cp1252 contains 256 code points), Windows 
replaces unencodable characters with question marks.

Imagine that the code page is ASCII: the ("Unicode") filename "hého.txt" will 
be encoded to b"h?ho.txt". You can display this string in a dialog, but you 
cannot open the file to read its content... If you pass the filename to 
os.listdir(), it is even worse because "?" is interpreted as a wildcard ("?" 
matches any single character in a filename pattern).
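
A minimal sketch of the effect described above. It uses Python's own codecs
and fnmatch instead of the Windows ANSI API, so it only approximates what
Windows does; the extra file names are made up for illustration.

```python
import fnmatch

name = "h\xe9ho.txt"  # "hého.txt" with a precomposed U+00E9

# Lossy encoding: the unencodable character silently becomes "?".
lossy = name.encode("ascii", errors="replace")

# Strict encoding: the same character raises an error instead.
try:
    name.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# Treated as a pattern, "?" matches any single character, so the lossy
# name also matches an unrelated file.
pattern = lossy.decode("ascii")
candidates = ("h\xe9ho.txt", "haho.txt", "hello.txt")
matches = [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]
```

The lossy name no longer identifies a single file, which is the ambiguity
the proposal wants to turn into an explicit error.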

I would like to raise an error in such a situation, because currently the user 
cannot be notified otherwise. The user may search for "?" in the filename, but 
Windows also replaces unencodable characters with a *similar glyph* (e.g. "é" 
is replaced by "e").

> In the "hard errors" case, a hearty +1 (I'm dealing with this in an
> experimental version of XEmacs and it's a right PITA if the codec
> doesn't complain). 

If you use MultiByteToWideChar and WideCharToMultiByte, you can be notified of 
errors using some flags, but the functions of the ANSI API don't give access to 
these flags...

> Backward compatibility is important, but here the
> costs of fixing such bugs outweigh the value of bug-compatibility.

I only want to change how unencodable filenames are handled; the bytes API will 
still be available. If your filesystem has the "8dot3name" feature enabled, it 
may work even for unencodable filenames (Windows generates names like 
HEHO~1.TXT).

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56, you wrote:
> > I propose to raise Unicode errors if a filename cannot be decoded on
> > Windows, instead of creating bogus filenames with question marks.
> 
> Can you please elaborate what APIs you are talking about exactly?

Basically, all functions processing filenames, so most functions of 
posixmodule.c. Some examples:

- os.listdir(): FindFirstFileA, FindNextFileA, FindClose
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...

> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
> this proposal. People that explicitly use bytes for file names deserve
> to get whatever exact platform semantics the platform has to offer. This
> is true on Unix, and it is also true on Windows.

My proposal is a fix for a bug reported by a user:
http://bugs.python.org/issue13247

I want to keep the bytes API for backward compatibility, and it will still 
work for non-ASCII characters, but only for non-ASCII characters encodable to 
the ANSI code page.

In practice, characters not encodable to the ANSI code page are very rare. For 
example: it's difficult to write such characters directly with the keyboard. I 
bet that very few people will notice the change.
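
As a rough illustration of "encodable to the ANSI code page", here is a
sketch that checks names against cp1252 with Python's strict codec. The
helper name and the sample file names are hypothetical, and a real ANSI
code page depends on the Windows locale.

```python
def encodable(name, codepage="cp1252"):
    """Return True if name survives a strict encode to the code page."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

names = [
    "h\xe9ho.txt",               # Latin-1 letter, present in cp1252
    "\u0142\xf3d\u017a.txt",     # Polish letters, absent from cp1252
    "\u65e5\u672c.txt",          # CJK characters, absent from cp1252
]
report = {n: encodable(n) for n in names}
```

Only names for which the check fails would be affected by the change; the
same Polish name is fine on a system whose code page is cp1250.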

Victor


Re: [Python-Dev] memcmp performance

2011-10-25 Thread Stefan Behnel

Richard Saunders, 25.10.2011 01:17:

> -On [20111024 09:22], Stefan Behnel wrote:
>   >>I agree. Given that the analysis shows that the libc memcmp() is
>   >>particularly fast on many Linux systems, it should be up to the Python
>   >>package maintainers for these systems to set that option externally through
>   >>the optimisation CFLAGS.
>
> Indeed, this is how I constructed my Python 3.3 and Python 2.7 :
> setenv CFLAGS '-fno-builtin-memcmp'
> just before I configured.
>
> I would like to revisit changing unicode_compare: adding a
> special arm for using memcmp when the "unicode kinds" are the
> same will only work in two specific instances:
>
> (1) the strings are the same kind, the char size is 1
> * We could add THIS to unicode_compare, but it seems extremely
> specialized by itself


But also extremely likely to happen. This means that the strings are pure 
ASCII, which is highly likely and one of the main reasons why the unicode 
string layout was rewritten for CPython 3.3. It allows CPython to save a 
lot of memory (thus clearly proving how likely this case is!), and it would 
also allow it to do faster comparisons for these strings.
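
A small Python sketch of why a byte-wise comparison is safe when every
character fits in one byte: the byte order of the latin-1 form equals the
code point order of the strings. This is only an illustration at the Python
level, not the C code of unicode_compare; the helper name and sample strings
are made up, and the little-endian counter-example is an added illustration
rather than something stated in the thread.

```python
import itertools

samples = ["abc", "abd", "ab", "\xe9t\xe9", "z", "\xff"]

def same_order_as_bytes(a, b, encoding):
    """True if a < b gives the same answer on the strings and on their bytes."""
    return (a < b) == (a.encode(encoding) < b.encode(encoding))

# One byte per character: byte order and code point order always agree.
latin1_ok = all(same_order_as_bytes(a, b, "latin-1")
                for a, b in itertools.permutations(samples, 2))

# Two-byte units stored little-endian: the low byte comes first, so a
# plain byte comparison can disagree with the code point order.
utf16le_ok = same_order_as_bytes("\xff", "\u0100", "utf-16-le")
```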


Stefan



Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56, you wrote:
> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
> this proposal. People that explicitly use bytes for file names deserve
> to get whatever exact platform semantics the platform has to offer. This
> is true on Unix, and it is also true on Windows.

For your information, it took me something like 3 months (when I was working 
on the issue #12281) to understand exactly how Windows handles undecodable 
bytes and unencodable characters. I did a lot of tests on different Windows 
versions (XP, Vista and Seven, the behaviour changed in Windows Vista). I had 
to take notes because it is really complex. Well, I wanted to understand 
exactly *all* code pages, including CP_UTF7 and CP_UTF8, not only the most 
common ones like cp1252 or cp932.

See the dedicated section in my book to learn more about these functions:

http://www.haypocalc.com/tmp/unicode-2011-07-20/html/operating_systems.html#encode-and-decode-functions

Some information is available in the MultiByteToWideChar and WideCharToMultiByte 
documentation, but it is not well explained :-p

Victor


Re: [Python-Dev] memcmp performance

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 10:44:16, Stefan Behnel wrote:
> Richard Saunders, 25.10.2011 01:17:
> > -On [20111024 09:22], Stefan Behnel wrote:
> >   >>I agree. Given that the analysis shows that the libc memcmp() is
> >   >>particularly fast on many Linux systems, it should be up to the
> >   >>Python package maintainers for these systems to set that option
> >   >>externally through the optimisation CFLAGS.
> > 
> > Indeed, this is how I constructed my Python 3.3 and Python 2.7 :
> > setenv CFLAGS '-fno-builtin-memcmp'
> > just before I configured.
> > 
> > I would like to revisit changing unicode_compare: adding a
> > special arm for using memcmp when the "unicode kinds" are the
> > same will only work in two specific instances:
> > 
> > (1) the strings are the same kind, the char size is 1
> > * We could add THIS to unicode_compare, but it seems extremely
> > specialized by itself
> 
> But also extremely likely to happen. This means that the strings are pure
> ASCII, which is highly likely and one of the main reasons why the unicode
> string layout was rewritten for CPython 3.3. It allows CPython to save a
> lot of memory (thus clearly proving how likely this case is!), and it would
> also allow it to do faster comparisons for these strings.

Python 3.3 already has some optimizations for latin1: the CPU and the C language 
process char* strings more efficiently than Py_UCS2 and Py_UCS4 strings. 
For example, we are using memchr() to search for a single character in a latin1 
string.
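
The different storage widths can be observed from Python. The sketch below
assumes CPython 3.3 or later (the flexible string layout discussed here);
the helper name is hypothetical and other implementations may report
different sizes.

```python
import sys

def bytes_per_char(ch, n=100):
    """Estimate storage per character by growing a string by n characters."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n

# One sample character for each storage width: char*, Py_UCS2, Py_UCS4.
per_char = [bytes_per_char(c) for c in ("a", "\u0100", "\U00010000")]
```

The estimate grows with the width needed for the widest character in the
string, which is why pure ASCII or latin1 text is the cheapest case.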

Victor


Re: [Python-Dev] [Python-checkins] cpython: #13251: update string description in datamodel.rst.

2011-10-25 Thread Petri Lehtinen
Hi,

ezio.melotti wrote:
> http://hg.python.org/cpython/rev/11d18ebb2dd1
> changeset:   73116:11d18ebb2dd1
> user:        Ezio Melotti
> date:        Tue Oct 25 09:23:42 2011 +0300
> summary:
>   #13251: update string description in datamodel.rst.
> 
> files:
>   Doc/reference/datamodel.rst |  20 ++--
>   1 files changed, 10 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/Doc/reference/datamodel.rst b/Doc/reference/datamodel.rst
> --- a/Doc/reference/datamodel.rst
> +++ b/Doc/reference/datamodel.rst
> @@ -276,16 +276,16 @@
>  single: integer
>  single: Unicode
>  
> - The items of a string object are Unicode code units.  A Unicode code
> - unit is represented by a string object of one item and can hold either
> - a 16-bit or 32-bit value representing a Unicode ordinal (the maximum
> - value for the ordinal is given in ``sys.maxunicode``, and depends on
> - how Python is configured at compile time).  Surrogate pairs may be
> - present in the Unicode object, and will be reported as two separate
> - items.  The built-in functions :func:`chr` and :func:`ord` convert
> - between code units and nonnegative integers representing the Unicode
> - ordinals as defined in the Unicode Standard 3.0. Conversion from and to
> - other encodings are possible through the string method :meth:`encode`.
> + A string is a sequence of values that represent Unicode codepoints.
> + All the codepoints in range ``U+0000 - U+10FFFF`` can be represented
> + in a string.  Python doesn't have a :c:type:`chr` type, and
> + every characters in the string is represented as a string object
  typo ^

Should be "character", right?

> + with length ``1``.  The built-in function :func:`chr` converts a
> + character to its codepoint (as an integer); :func:`ord` converts
> + an integer in range ``0 - 10FFFF`` to the corresponding character.

Actually chr() converts an integer to a string and ord() converts a
string to an integer. chr and ord are swapped in your text.
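
A quick check of the direction of each function, assuming Python 3.3 or
later (where sys.maxunicode is always 0x10FFFF):

```python
import sys

code = ord("\xe9")   # string of length 1 -> integer code point
char = chr(0xE9)     # integer code point -> string of length 1

# The two functions are inverses of each other.
round_trip = chr(ord("\u20ac")) == "\u20ac"

# chr() accepts the whole range up to sys.maxunicode and nothing beyond.
highest = chr(sys.maxunicode)
try:
    chr(sys.maxunicode + 1)
    out_of_range = False
except ValueError:
    out_of_range = True
```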

> + :meth:`str.encode` can be used to convert a :class:`str` to
> + :class:`bytes` using the given encoding, and :meth:`bytes.decode` can
> + be used to achieve the opposite.


Petri


Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

2011-10-25 Thread Petri Lehtinen
Hi,

victor.stinner wrote:
> http://hg.python.org/cpython/rev/c75427c0da06
> changeset:   73127:c75427c0da06
> user:        Victor Stinner
> date:        Tue Oct 25 13:34:04 2011 +0200
> summary:
>   Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
>   used with sys.setdlopenflags().
> 
> files:
>   Doc/library/os.rst |  13 +
>   Doc/library/sys.rst|  10 +-
>   Lib/test/test_posix.py |   7 +++
>   Misc/NEWS  |   3 +++
>   Modules/posixmodule.c  |  26 ++
>   5 files changed, 54 insertions(+), 5 deletions(-)

[snip]

> diff --git a/Misc/NEWS b/Misc/NEWS
> --- a/Misc/NEWS
> +++ b/Misc/NEWS
> @@ -341,6 +341,9 @@
>  Library
>  ---
>  
> +- Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

Typo: s/by/be/

> +  used with sys.setdlopenflags().
> +
>  - Issue #10278: Add clock_getres(), clock_gettime() and CLOCK_xxx constants to
>    the time module. time.clock_gettime(time.CLOCK_MONOTONIC) provides a
>    monotonic clock
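
For readers unfamiliar with the new constants, a minimal sketch of how they
combine with sys.setdlopenflags(). The helper name is hypothetical, the
choice of RTLD_NOW | RTLD_GLOBAL is just an example, and the guard covers
platforms where these attributes do not exist.

```python
import os
import sys

def load_flags_demo():
    """Temporarily set dlopen flags; return (old, new) or None if unsupported."""
    if not (hasattr(sys, "setdlopenflags") and hasattr(os, "RTLD_NOW")):
        return None  # for example Windows, or a Python older than 3.3
    old = sys.getdlopenflags()
    try:
        # Resolve all symbols at load time and export them globally.
        sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)
        new = sys.getdlopenflags()
    finally:
        # Restore the previous flags so later imports are unaffected.
        sys.setdlopenflags(old)
    return old, new

result = load_flags_demo()
```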


Petri


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
> My proposal is a fix for a bug reported by a user:
> http://bugs.python.org/issue13247

So your proposal is that abspath(b".") shall raise a UnicodeError in
this case?

Are you serious???

> In practice, characters not encodable to the ANSI code page are very rare. 
> For example: it's difficult to write such characters directly with the 
> keyboard. I bet that very few people will notice the change.

Except people running into the very issues you are trying to resolve.
I'm not sure these people are really helped by having their applications
crash all of a sudden.

Regards,
Martin


Re: [Python-Dev] Modules of plat-* directories

2011-10-25 Thread Martin v. Löwis
On 24.10.2011 14:06, Victor Stinner wrote:
> There are open issues related to plat-XXX.
> 
> On Monday 24 October 2011 00:03:42, Martin v. Löwis wrote:
>> no, we make no changes to them unless a user actually requests a change
> 
> Matthias Klose asked for socket SIO* constants in September 2006 (5 years ago).
> http://bugs.python.org/issue1565071
> 
> I would prefer to see such constants in the socket module.

These are not mutually exclusive. You can regenerate IN.py and still
add the constants to the socket module.

> Thiemo Seufer noticed that "the linux2 platform definition is incorrect for 
> several architectures, namely Alpha, PA-RISC(hppa), MIPS and SPARC." in 
> September 2008 (3 years ago). He proposed to add a sublevel: 
> Lib/plat-linux2/CDROM.py would become:
> 
>  - Lib/plat-linux2-alpha/CDROM.py
>  - Lib/plat-linux2-hppa/CDROM.py
>  - Lib/plat-linux2-mips/CDROM.py
>  - Lib/plat-linux2-sparc/CDROM.py
>  - (and a default for other platforms like Intel x86?)
> 
> => http://bugs.python.org/issue3990
> 
> I really don't like this idea (of adding the architecture in the directory 
> name) :-p

Neither do I. In the specific case, I'd generate four versions of
CDROM.py (with differing names), and provide a CDROM.py that imports the
right one.
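
A rough sketch of what such a dispatching CDROM.py could look like. The
per-architecture module names, the default name and the prefixes matched
against platform.machine() are all assumptions made for illustration;
nothing like this exists in the tree.

```python
import platform

# Hypothetical generated variants, keyed by a machine-name prefix.
_VARIANTS = (
    ("alpha", "CDROM_alpha"),
    ("parisc", "CDROM_hppa"),
    ("hppa", "CDROM_hppa"),
    ("mips", "CDROM_mips"),
    ("sparc", "CDROM_sparc"),
)
_DEFAULT = "CDROM_generic"  # hypothetical fallback, e.g. for Intel x86

def variant_for(machine):
    """Pick the module name that matches the given machine string."""
    machine = machine.lower()
    for prefix, module_name in _VARIANTS:
        if machine.startswith(prefix):
            return module_name
    return _DEFAULT

selected = variant_for(platform.machine())
# A real CDROM.py would then re-export the chosen module, for example:
#     from importlib import import_module
#     globals().update(vars(import_module(selected)))
```

Users would keep importing CDROM as before; only the selection logic is
new.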

> IMO plat-XXX is wrong by design.

I disagree. It's limited, not wrong.

> It would be better if at least these files 
> were regenerated at build, but Martin doesn't want to regenerate them. And 
> there is still the problem of Mac OS X which embeds 3 binaries for 3 
> architectures in the same "FAT" file.

These are problems, but not necessarily issues. Even if some of the
values are incorrect, the values that are correct may still be useful.

Regards,
Martin


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 00:57:42, Victor Stinner wrote:
> I propose to raise Unicode errors if a filename cannot be decoded on
> Windows, instead of creating bogus filenames with question marks.
> Because this change is incompatible with Python 3.2, even if such
> filenames are unusable and I consider the problem as a (Python?) bug, I
> would like your opinion on such a change before working on a patch.

Most people like the idea, so I wrote a patch and attached it to:

   http://bugs.python.org/issue13247

The patch only changes os.getcwdb() and os.listdir().

> We might use the PEP 383 to store undecodable bytes as surrogates
> (U+DC80-U+DCFF). But the situation is the opposite of the situation on
> UNIX: on Windows, the problem is more on encoding (text->bytes) than on
> decoding (bytes->text). On UNIX, problems occur when the system is
> misconfigured (e.g. wrong locale encoding). On Windows, problems occur
> when your application uses the old (ANSI) API, whereas your filesystem is
> fully Unicode compliant and you created Unicode filenames with a program
> using the new (Windows) API.
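
For reference, a minimal sketch of the PEP 383 mechanism mentioned in the
quoted text, using the "surrogateescape" error handler on a byte string
that is not valid UTF-8. The file name is made up for illustration.

```python
raw = b"h\xe9ho.txt"  # latin-1 bytes, not valid UTF-8

# Decoding: the undecodable byte 0xE9 is stored as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

This handles the decoding direction (bytes->text); it does not help when a
Unicode name has no representation in the ANSI code page, which is the
encoding problem described above.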

I only changed functions returning filenames, so os.mkdir() is unchanged for 
example.

We may also patch the other functions to simplify the source code.

Victor


Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 14:50:44, Petri Lehtinen wrote:
> Hi,
> 
> victor.stinner wrote:
> > http://hg.python.org/cpython/rev/c75427c0da06
> > changeset:   73127:c75427c0da06
> > user:        Victor Stinner
> > date:        Tue Oct 25 13:34:04 2011 +0200
> > 
> > summary:
> >   Issue #13226: Add RTLD_xxx constants to the os module. These constants
> >   can by
> > 
> > used with sys.setdlopenflags().
> > 
> > files:
> >   Doc/library/os.rst |  13 +
> >   Doc/library/sys.rst|  10 +-
> >   Lib/test/test_posix.py |   7 +++
> >   Misc/NEWS  |   3 +++
> >   Modules/posixmodule.c  |  26 ++
> >   5 files changed, 54 insertions(+), 5 deletions(-)
> 
> [snip]
> 
> > diff --git a/Misc/NEWS b/Misc/NEWS
> > --- a/Misc/NEWS
> > +++ b/Misc/NEWS
> > @@ -341,6 +341,9 @@
> > 
> >  Library
> >  ---
> > 
> > +- Issue #13226: Add RTLD_xxx constants to the os module. These constants
> > can by
> 
> Typo: s/by/be/

Fixed, thanks.

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Terry Reedy

On 10/25/2011 4:31 AM, Victor Stinner wrote:
> On Tuesday 25 October 2011 09:09:56, you wrote:
>>> I propose to raise Unicode errors if a filename cannot be decoded on
>>> Windows, instead of creating bogus filenames with question marks.
>>
>> Can you please elaborate what APIs you are talking about exactly?
>
> Basically, all functions processing filenames, so most functions of
> posixmodule.c. Some examples:

This seems way too broad. From your previous posts, I presumed that you 
only propose to change behavior when the user asks for the bytes 
versions of a unicode name that cannot be properly converted to a bytes 
version.

> - os.listdir():

os.listdir(unicode) works fine and should not be changed.
os.listdir(bytes) is what OP of issue wants changed.

> FindFirstFileA, FindNextFileA, FindClose

These are not Python names. Are they Windows API names?

> - os.lstat(): CreateFileA

This does not create a path and should not be changed as far as I can see.

> - os.getcwdb():

This you might change.

> getcwd()

This should not be, as no bytes are involved.

> - os.mkdir(): CreateDirectoryA
> - os.chmod(): SetFileAttributesA

Like os.lstat, these only accept a path and should do what they 
are supposed to do.

>> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
>> this proposal. People that explicitly use bytes for file names deserve
>> to get whatever exact platform semantics the platform has to offer. This
>> is true on Unix, and it is also true on Windows.
>
> My proposal is a fix for a bug reported by a user:
> http://bugs.python.org/issue13247
>
> I want to keep the bytes API for backward compatibility, and it will still
> work for non-ASCII characters, but only for non-ASCII characters encodable to
> the ANSI code page.
>
> In practice, characters not encodable to the ANSI code page are very rare. For
> example: it's difficult to write such characters directly with the keyboard. I
> bet that very few people will notice the change.


Actually, Windows makes switching keyboard setups rather easy once you 
enable the feature. It might be that people who routinely use non-'ansi' 
characters in file and directory names do not routinely ask for bytes 
versions thereof.


The doc says "All functions accepting path or file names accept both 
bytes and string objects, and result in an object of the same type, if a 
path or file name is returned." It does that now, though it says nothing 
about the encoding assumed for input bytes or used for output bytes. It 
does not mention raising exceptions, so doing so is a feature-change 
that would likely break code. Currently, exceptional situations are 
signalled with "'?' in returned_path" rather than with an exception 
object. It ('?') is a bad choice of signal though, given the other uses 
of '?' in paths.
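
A small sketch of the documented "same type in, same type out" behaviour,
together with the kind of '?' check that callers are currently left with.
The helper name is hypothetical, and the check is only a heuristic: outside
Windows, '?' is a legal character in a file name.

```python
import os
import tempfile

def suspicious(names):
    """Hypothetical helper: bytes names that may contain a substituted '?'."""
    return [n for n in names if b"?" in n]

with tempfile.TemporaryDirectory() as tmp:
    # Create one plain ASCII file so both calls have something to return.
    open(os.path.join(tmp, "hello.txt"), "w").close()
    as_text = os.listdir(tmp)                 # str in, list of str out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes in, list of bytes out

flagged = suspicious(as_bytes + [b"h?ho.txt"])
```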


--
Terry Jan Reedy




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Stephen J. Turnbull
In general I agree with what you write, Terry.  One clarification and
one comment, though.

Terry Reedy writes:

 > The doc says "All functions accepting path or file names accept both 
 > bytes and string objects, and result in an object of the same type, if a 
 > path or file name is returned." It does that now, though it says nothing 
 > about the encoding assumed for input bytes or used for output
 > bytes.

That's determined by the OS, and figuring that out is the end user's
problem.

 > It does not mention raising exceptions, so doing so is a
 > feature-change that would likely break code. Currently, exceptional
 > situations are signalled with "'?' in returned_path" rather than
 > with an exception object. It ('?') is a bad choice of signal
 > though, given the other uses of '?' in paths.

True, but this isn't really Python's problem.  And IIUC Martin's post,
it is hardly "exceptional": it isn't Python doing this, it's just
standard Windows behavior, which results in pathnames that are
perfectly acceptable to Windows APIs, but unreliable in use because
they have different semantics in different Windows APIs.  If that is
true, there are almost surely user programs that depend on this
behavior, even though it sucks.[1]

My original "hearty +1" was dependent on my understanding from
Victor's post that this substitution could cause later exceptions
because the filename is invalid (eg, contains illegal characters causing
Windows to signal an error).  If that's not true, I think the proper
remedy is to add a strong warning to pylint that use of those APIs is
supported (eg, for interaction with existing programs that use them)
but that they require careful error-checking for robust use.

As a card-carrying Unicode nazi I wouldn't mind tagging the bytes APIs
with a DeprecationWarning but I know that proposal is going nowhere so
I withdraw it in advance. 


Footnotes: 
[1]  Note that the original rationale for this was surely "since users
will have a very hard time using file names with this character in
them, using it as a substitution character internally will make the
problem evident and Sufficiently Smart Programs can deal with it."
