New submission from STINNER Victor <victor.stin...@haypocalc.com>:

Python 3 has a very important variable: the filesystem encoding, 
sys.getfilesystemencoding(). It is used to encode and decode filenames to 
access the filesystem, to encode program arguments in subprocess, etc.
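
As an illustration (a minimal sketch, not code from the patch), the encoding 
can be inspected and applied from Python. os.fsencode() and os.fsdecode() are 
the helpers in current Python 3; the sample filename is arbitrary.

```python
import os
import sys

# Name of the encoding Python uses to convert filenames on this system.
fs_encoding = sys.getfilesystemencoding()

# os.fsencode()/os.fsdecode() convert between str and bytes filenames
# using that encoding.
name = "example.txt"  # arbitrary, ASCII-only sample filename
raw = os.fsencode(name)
roundtrip = os.fsdecode(raw)

print(fs_encoding, raw, roundtrip)
```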

The encoding is hardcoded to "mbcs" on Windows and "utf-8" on Mac OS X. On 
other OSes, Python gets the encoding from the locale. The problem is that the 
code getting the locale encoding loads Python modules (e.g. locale), and Python 
uses a default encoding before the locale encoding is known. As a result, the 
filenames of modules and code objects created before Python sets the locale 
encoding are encoded with the old (default) encoding.

The default encoding is "utf-8". If the locale encoding is also "utf-8", there 
is no problem because the filenames are correctly encoded. If the locale 
encoding is different, we keep filenames encoded in the wrong encoding.

It gets worse when the locale encoding is unable to encode the filenames, 
e.g. with the ASCII encoding.
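
Both failure modes can be reproduced with plain string operations. This is 
only a sketch: the filename is hypothetical, and using the surrogateescape 
error handler for the early decoding is an assumption made for the example.

```python
# Hypothetical filename stored on disk in a Latin-1 locale.
raw = b"caf\xe9.py"

# Decoded early with the default encoding (utf-8): the byte 0xE9 is not
# valid UTF-8, so surrogateescape turns it into a lone surrogate.
early = raw.decode("utf-8", "surrogateescape")

# Decoded with the real locale encoding: the expected name.
expected = raw.decode("latin-1")

# An ASCII locale cannot encode the expected name at all.
try:
    expected.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii(early), ascii(expected), ascii_ok)
```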

--

A solution would be to avoid loading any Python module, but I don't think that 
is possible. The locale encoding can be something other than ascii, latin-1, 
utf-8 or mbcs. It can also be an alias like 'utf8' (instead of 'utf-8'), 
'iso-8859-1' (Python uses 'latin_1') or 'ANSI_x3.4_1968' (for 'ascii'), and 
encoding aliases are implemented in Lib/encodings/aliases.py, which is... a 
Python module.
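
The alias resolution can be observed with codecs.lookup(). A small sketch; 
canonical_name() is a helper name made up for this example.

```python
import codecs
from encodings.aliases import aliases  # the alias table is a Python module


def canonical_name(encoding):
    """Return the name of the codec that an encoding name or alias resolves to."""
    return codecs.lookup(encoding).name


# Each alias resolves to the same codec as the usual spelling.
same_utf8 = canonical_name("utf8") == canonical_name("utf-8")
same_latin1 = canonical_name("iso-8859-1") == canonical_name("latin_1")
same_ascii = canonical_name("ANSI_x3.4_1968") == canonical_name("ascii")

# The mapping itself is an ordinary dict defined in Lib/encodings/aliases.py.
latin1_target = aliases["iso_8859_1"]

print(same_utf8, same_latin1, same_ascii, latin1_target)
```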

--

I wrote a patch to reencode filenames of all module and code objects in 
initfsencoding() when the locale encoding is known.
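
The idea of reencoding a single filename can be sketched in Python as below. 
This is not the patch (which is C code): reencode_filename() is a hypothetical 
helper, and the use of the surrogateescape error handler is an assumption.

```python
def reencode_filename(name, old_encoding, new_encoding):
    """Recover the original bytes of `name`, then decode them again.

    `name` is assumed to have been decoded with `old_encoding` and the
    surrogateescape error handler, so encoding it back is lossless.
    """
    raw = name.encode(old_encoding, "surrogateescape")
    return raw.decode(new_encoding, "surrogateescape")


# A Latin-1 filename that was decoded too early with the default utf-8.
early = b"caf\xe9.py".decode("utf-8", "surrogateescape")

# Once the locale encoding (here latin-1) is known, fix the name.
fixed = reencode_filename(early, "utf-8", "latin-1")

print(ascii(early), ascii(fixed))
```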

I tested my patch on my import_unicode branch (branch to fix #8611, see also 
#9425: issue to merge the branch to py3k). I would like one or more reviews of 
the patch because it is long and complex. Please check for refleaks :-)

--

About the patch.

I don't know how to list *all* code objects, so I created a list that stores 
weak references to all code objects; the code object constructor fills it. The 
list is destroyed when initfsencoding() exits (early in Python 
initialization).
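
The registry idea can be mimicked in pure Python as follows. This is a sketch 
under loud assumptions: FakeCode and fix_all_filenames() are hypothetical 
stand-ins, whereas the real patch does this in C for actual code objects.

```python
import gc
import weakref


class FakeCode:
    """Hypothetical stand-in for a code object; it only carries a filename."""

    _registry = []  # weak references to every instance created so far

    def __init__(self, filename):
        self.filename = filename
        # The constructor fills the registry, without keeping objects alive.
        FakeCode._registry.append(weakref.ref(self))


def fix_all_filenames(fix):
    """Apply `fix` to the filename of every live object, then drop the list."""
    for ref in FakeCode._registry:
        obj = ref()
        if obj is not None:  # skip objects that were already collected
            obj.filename = fix(obj.filename)
    FakeCode._registry.clear()


a = FakeCode("a.py")
b = FakeCode("b.py")
del b          # the weak reference does not keep this object alive
gc.collect()   # make collection explicit for non-refcounting interpreters

fix_all_filenames(str.upper)  # str.upper stands in for the real reencoding
print(a.filename, FakeCode._registry)
```

Weak references mean the registry never extends the lifetime of an object; a 
dead reference simply returns None and is skipped.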

There is a FIXME: I don't know if sys.path_importer_cache keys should also be 
reencoded.

I tried to apply all remarks made on the first patch (posted on Rietveld for 
#9425). The patch now stores weak references instead of strong references to 
code objects in the code object list.

(r84168 creates PyModule_GetFilenameObject, a function needed by this patch)

----------
components: Interpreter Core, Unicode
files: reencode_modules_path.patch
keywords: patch
messages: 114191
nosy: haypo
priority: normal
severity: normal
status: open
title: Reencode filenames of all module and code objects when setting the 
filesystem encoding
versions: Python 3.2
Added file: http://bugs.python.org/file18560/reencode_modules_path.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9630>
_______________________________________