Eryk Sun added the comment:

The ANSI API is problematic because it returns file names best-fit encoded to 
the system code page, which loses information. For example:

    >>> os.listdir('.')
    ['ƠƨưƸǀLjǐǘǠǨǰǸ']

    >>> os.listdir(b'.')
    [b'O?u?|?iu?Kj?']
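
The byte name cannot be decoded back to the original, so it can't be used to 
open the file. Here is a minimal sketch of that loss that runs on any 
platform. It uses Python's cp1252 codec with errors='replace' as a stand-in 
for the system code page, which is an assumption for illustration only: the 
real ANSI API applies Windows best-fit mappings (such as 'Ơ' to 'O' above), 
which Python's codec does not reproduce.

```python
# Stand-in for the ANSI conversion: cp1252 with '?' substitution.
# Assumption: Windows best-fit maps some characters to look-alikes
# instead of '?', but the result is lossy in the same way.
name = '\u01a0\u01a8\u01b0'  # first three characters of the name above

encoded = name.encode('cp1252', errors='replace')
decoded = encoded.decode('cp1252')

print(encoded)          # the bytes a lossy conversion hands back
print(decoded == name)  # False: the original name is unrecoverable
```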

To somewhat work around this problem, listdir and scandir could return the 
cAlternateFileName of the WIN32_FIND_DATA struct if present. This is the 
classic 8.3 short name that Microsoft file systems create for MS-DOS 
compatibility. On NTFS, short name creation can be disabled in the registry or 
per volume, but I assume whoever does that knows what to expect.
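
The selection logic could look roughly like the sketch below. It is written 
in Python only for illustration; the real change would be in the C code behind 
listdir and scandir. The helper bytes_name is hypothetical, its long_name and 
short_name arguments stand in for the two name fields of a wide-character 
WIN32_FIND_DATA, and cp1252 stands in for the system code page.

```python
def bytes_name(long_name, short_name, encoding='cp1252'):
    """Pick a bytes file name that can still be used to reach the file.

    Hypothetical helper: long_name and short_name stand in for the
    cFileName and cAlternateFileName fields; encoding stands in for
    the system code page.
    """
    try:
        # Strict encoding succeeds only if nothing would be lost.
        return long_name.encode(encoding)
    except UnicodeEncodeError:
        if short_name:
            # Fall back to the 8.3 short name when one exists.
            return short_name.encode(encoding, errors='replace')
        # No short name (e.g. disabled on the volume): lossy result.
        return long_name.encode(encoding, errors='replace')

print(bytes_name('plain.txt', ''))
print(bytes_name('\u0100\u0101\u0102\u01030000', '0000~1'))
print(bytes_name('\u0100\u0101\u0102\u01030000', ''))
```

With a short name available the result stays usable as a path; without one 
the caller is no better off than with the current behavior.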

Also, since Python wouldn't need the short name for a wide-character path, 
there's no point in asking for it. (For NTFS it's a separate entry in the MFT. 
If it exists, which is known ahead of time, finding the entry requires a second 
lookup.) In this case it's better to call FindFirstFileExW and request only 
FindExInfoBasic, which omits the short name. Generally the difference is 
inconsequential, but in a 
contrived example with 10000 similarly-named files from "ĀāĂă0000" to 
"ĀāĂă9999" and short names from "0000~1" to "9999~1", skipping the short name 
lookup shaved about 10% off the total time. For this test, I replaced the 
FindFirstFileW call in posix_scandir with the following call:

    iterator->handle = FindFirstFileExW(path_strW,
                                        FindExInfoBasic,
                                        &iterator->file_data,
                                        FindExSearchNameMatch,
                                        NULL, 0);

----------
nosy: +eryksun

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25911>
_______________________________________