[issue45239] email.utils.parsedate_tz raises UnboundLocalError if time has more than 2 dots in it
New submission from Ben Hoyt : In going through some standard library code, I found that the email.utils.parsedate_tz() function (https://docs.python.org/3/library/email.utils.html#email.utils.parsedate_tz) has a bug if the time value is in dotted format and has more than 2 dots in it, for example: "12.34.56.78". Per the docs, it should return None in the case of invalid input. This is happening because in the case that handles the '.' separator (instead of the normal ':'), there's no else clause to return None when there's not 2 or 3 segments. >From >https://github.com/python/cpython/blob/dea59cf88adf5d20812edda330e085a4695baba4/Lib/email/_parseaddr.py#L118-L132: if len(tm) == 2: [thh, tmm] = tm tss = '0' elif len(tm) == 3: [thh, tmm, tss] = tm elif len(tm) == 1 and '.' in tm[0]: # Some non-compliant MUAs use '.' to separate time elements. tm = tm[0].split('.') if len(tm) == 2: [thh, tmm] = tm tss = 0 elif len(tm) == 3: [thh, tmm, tss] = tm # HERE: need "else: return None" else: return None We simply need to include that additional "else: return None" block in the '.' handling case. I'll submit a pull request that fixes this soon (and adds a test for this case). -- components: Library (Lib) messages: 402140 nosy: barry, benhoyt priority: normal severity: normal status: open title: email.utils.parsedate_tz raises UnboundLocalError if time has more than 2 dots in it versions: Python 3.10 ___ Python tracker <https://bugs.python.org/issue45239> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue45239] email.utils.parsedate_tz raises UnboundLocalError if time has more than 2 dots in it
Ben Hoyt added the comment: For reference, here's a repro case: $ python3.10 -c 'import email.utils; \ email.utils.parsedate_tz("Wed, 3 Apr 2002 12.34.56.78+0800")' Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python3.10/email/_parseaddr.py", line 50, in parsedate_tz res = _parsedate_tz(data) File "/usr/local/lib/python3.10/email/_parseaddr.py", line 134, in _parsedate_tz thh = int(thh) UnboundLocalError: local variable 'thh' referenced before assignment -- ___ Python tracker <https://bugs.python.org/issue45239> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue45239] email.utils.parsedate_tz raises UnboundLocalError if time has more than 2 dots in it
Change by Ben Hoyt : -- keywords: +patch pull_requests: +26853 stage: -> patch review pull_request: https://github.com/python/cpython/pull/28452 ___ Python tracker <https://bugs.python.org/issue45239> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue30659] Use ** correctly for "kwargs" in signature of namedtuple._replace()
New submission from Ben Hoyt: The documentation for namedtuple._replace() needs a tweak: currently its signature shows "kwargs" without the ** notation, indicating it's a single, positional argument. It should have the ** to indicate it's actually any number of keyword arguments. See https://docs.python.org/3/library/collections.html#collections.somenamedtuple._replace and compare with the docs for the signature of Logger.debug() for example: https://docs.python.org/3/library/logging.html#logging.Logger.debug -- assignee: docs@python components: Documentation messages: 295949 nosy: benhoyt, docs@python, rhettinger priority: normal pull_requests: 2225 severity: normal status: open title: Use ** correctly for "kwargs" in signature of namedtuple._replace() versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7 ___ Python tracker <http://bugs.python.org/issue30659> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29540] Add compact=True flag to json.dump/dumps
Ben Hoyt added the comment: Agreed. Seems much the same as other argument constants, like pickle.HIGHEST_PROTOCOL for the "protocol" argument. These are not changing the API, just adding a helper constant to avoid the magic values. -Ben On Tue, Mar 14, 2017 at 12:24 PM, Brett Cannon wrote: > > Brett Cannon added the comment: > > I agree with David that I don't see how adding a constant to the module is > really a complication of an API. > > -- > > ___ > Python tracker > <http://bugs.python.org/issue29540> > ___ > -- ___ Python tracker <http://bugs.python.org/issue29540> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue31803] Remove not portable time.clock(), replaced by time.perf_counter() and time.process_time()
Ben Hoyt added the comment: I don't think this is a good idea. I still use time.clock() out of habit (on Windows), and from the PR it's clear that the standard library and tests do too. I wouldn't be at all surprised to find others do as well. I realize it's been deprecated since 3.3, but without a DeprecatingWarning, that's easy to miss. What about adding a DeprecatingWarning for 3.7 and deprecate it later, maybe in 3.9? (I know that's a long time, but time.clock() is an old function so we should be cautious!) -- nosy: +benhoyt ___ Python tracker <https://bugs.python.org/issue31803> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue34712] Style fixes in examples of "Input and Output" tutorial section
New submission from Ben Hoyt : There are a couple of examples in the "Input and Output" section of the tutorial that use an unusual / non-PEP 8 Python style or whitespace. I think our examples, especially in the tutorial, should reflect good, PEP 8 style. I'll fix these in a PR which I'll attach soon. -- assignee: docs@python components: Documentation messages: 325550 nosy: akuchling, benhoyt, docs@python, rhettinger priority: normal severity: normal status: open title: Style fixes in examples of "Input and Output" tutorial section versions: Python 3.8 ___ Python tracker <https://bugs.python.org/issue34712> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue34712] Style fixes in examples of "Input and Output" tutorial section
Change by Ben Hoyt : -- keywords: +patch pull_requests: +8784 stage: -> patch review ___ Python tracker <https://bugs.python.org/issue34712> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue34712] Style fixes in examples of "Input and Output" tutorial section
Ben Hoyt added the comment: GitHub PR link with changes and commit notes attached. -- ___ Python tracker <https://bugs.python.org/issue34712> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Thanks, Tim! Works for me! A couple of code review comments: 1) On 2.7, guess_type(s)[0] is a byte string as usual if the type doesn't exist in the registry, but it's a unicode string if it came from the registry. Seems like it should be a byte string in all cases (the mime type can only be ASCII char). I would say .encode('ascii') and if it raises UnicodeError, ignore that key. 2) Would 'subkeyname[0] == "."' be better as subkeyname.startswith(".")? More idiomatic, and won't bomb out if subkeyname is zero length (though that probably can't happen). Relatedly, "not subkeyname.startswith()" with an early-continue would avoid an indent and is what the rest of the code does. 3) I believe the default_encoding variable is not needed anymore. That was used in the old registry code. 4) Super-minor: "raises EnvironmentError" would be the Pythonic way to say "throws EnvironmentError". 5) Would it be worth adding a test for 'foo.png' as well, as that was another super-common type that was wrong? -- ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: All looks great. I like what you've done with default_encoding now. Thanks, Tim (and Dave for the original report). -- ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10551] mimetypes read from the registry should not overwrite standard mime mappings
Ben Hoyt added the comment: This is definitely a real issue, and makes mimetypes.guess_type() useless out of the box on Windows. However, I believe the reason it's broken is that the fix for Issue4969 doesn't actually work, and I'm not sure this is possible with the Windows registry. You see, "MIME\Database\Content Type" in the Windows registry is a mime type -> file extension mapping, *not the other way around*. But read_windows_registry() tries to use it as a file extension -> mime type mapping, and bad things happen, because there are multiple mime types for certain file extensions. As far as I can tell, there's nothing in the Windows registry that says which is the "canonical" mime type for a given extension. Again, this is because Microsoft intends it (and uses it) as a mime type -> extension mapping. See more here: http://msdn.microsoft.com/en-us/library/ms775148(v=vs.85).aspx For example, in my "MIME\Database\Content Type" we have: image/jpeg -> .jpg image/jpg -> .jpg image/pjpeg -> .jpg And read_windows_registry() picks the last one for .jpg, which in this case is image/pjpeg -- NOT what users expect. In short, I think the fix for Issue4969 is broken as is, and that you can't actually use the mime types database in the Windows registry in this way. I suggest reverting the fix for Issue4969. Or, we could get clever and only use the Windows registry value if there's a single mime type -> extension mapping for a given extension, and if there's more than one (meaning it'd be ambiguous), use the mimetypes default from types_map / common_types. -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue10551> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
New submission from Ben Hoyt: I asked recently on python-dev [1] about adding a "st_winattrs" attribute to stat result objects on Windows, to return the full set of Windows file attribute bits, such as "hidden" or "compressed" status. Copying from that thread for a bit more context here: Python's os.stat() simply discards most of the file attribute information fetched via the Win32 system calls. On Windows, os.stat() calls CreateFile to open the file and get the dwFileAttributes value, but it throws it all away except the FILE_ATTRIBUTE_DIRECTORY and FILE_ATTRIBUTE_READONLY bits. See CPython source at [2]. Given that os.stat() returns extended, platform-specific file attributes on Linux and OS X platforms (for example, st_blocks, st_rsize, etc), it seems that Windows is something of a second-class citizen here. There are several questions on StackOverflow about how to get this information on Windows, and one has to resort to ctypes. For example, [3]. To solve this problem, I think we should add a "st_winattrs" attribute to the object returned by os.stat() on Windows. And we should add the relevant Win32 FILE_ATTRIBUTE_* constants to the "stat" module. Then, similarly to existing code like hasattr(st, 'st_blocks') on Linux, you could write a cross-platform function to determine if a file was hidden, something like so: def is_hidden(path): if startswith(os.path.basename(path), '.'): return True st = os.stat(path) return getattr(st, 'st_winattrs', 0) & FILE_ATTRIBUTE_HIDDEN != 0 There was general support for this idea on python-dev (see [4] [5] [6]), so I'd love to see this in Python 3.5. Basically we need to add a "st_winattrs" integer attribute to the win32_stat struct in posixmodule.c and add the FILE_ATTRIBUTE_* constants to _stat.c, as well as adding Windows-specific tests and documentation. I've never compiled CPython on Windows, but I aim to provide a patch for this at some stage soonish. Feedback and other patches welcome in the meantime. :-) [1] https://mail.python.org/pipermail/python-dev/2014-June/134990.html [2] https://github.com/python/cpython/blob/master/Modules/posixmodule.c#L1462 [3] http://stackoverflow.com/a/6365265 [4] https://mail.python.org/pipermail/python-dev/2014-June/134993.html [5] https://mail.python.org/pipermail/python-dev/2014-June/135006.html [6] https://mail.python.org/pipermail/python-dev/2014-June/135007.html -- components: Library (Lib) messages: 220266 nosy: benhoyt, ethan.furman, zach.ware priority: normal severity: normal status: open title: Returning Windows file attribute information via os.stat() type: enhancement versions: Python 3.5 ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: Fair call -- "st_file_attributes" sounds good to me, and matches the prefix of the FILE_ATTRIBUTES_* constants better too. -- ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: I've got a patch for this. Need to finish the docs and add tests, and then I'll post here. -- ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21745] Devguide: mention requirement to install Visual Studio SP1 on Windows
New submission from Ben Hoyt: Per my email on core-mentorship, the instructions for compiling CPython on Windows at https://docs.python.org/devguide/setup.html#windows are good, however I did have one issue where the dev guide didn't help. During the link step, I got this error: LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt It took me a while to figure out how to fix it. I found this question on StackOverflow: http://stackoverflow.com/questions/10888391/error-link-fatal-error-lnk1123-failure-during-conversion-to-coff-file-inval The first part of the answer didn't help (because incremental linking is already disabled). But the second part, kinda tucked away there, is what fixed it for me -- simply installing Visual Studio 2010 SP1 here: http://www.microsoft.com/en-us/download/details.aspx?id=23691 After this, I restarted my PC and it linked fine. So I suggest a small addition to the dev guide to mention this -- adding the following paragraph after the "Python 3.3 and later" paragraph: - You'll also need to install the Visual Studio `Service Pack 1 (SP1) <http://www.microsoft.com/en-us/download/details.aspx?id=23691>`_. If you don't install this service pack, you may receive errors like the following during linking: ``LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt``. - Patch attached. -- assignee: docs@python components: Documentation files: visual_studio_sp1.patch keywords: patch messages: 220418 nosy: benhoyt, docs@python priority: normal severity: normal status: open title: Devguide: mention requirement to install Visual Studio SP1 on Windows type: enhancement Added file: http://bugs.python.org/file35607/visual_studio_sp1.patch ___ Python tracker <http://bugs.python.org/issue21745> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21745] Devguide: mention requirement to install Visual Studio SP1 on Windows
Changes by Ben Hoyt : -- nosy: +r.david.murray ___ Python tracker <http://bugs.python.org/issue21745> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: Full patch to add this to Python 3.5 attached: * code changes to posixmodule.c and _stat.c * tests added in test_os.py and test_stat.py * docs added to os.rst and stat.rst -- keywords: +patch Added file: http://bugs.python.org/file35615/issue21719.patch ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21745] Devguide: mention requirement to install Visual Studio SP1 on Windows
Ben Hoyt added the comment: Cool, thanks for applying. Out of curiosity, how often is the online devguide HTML updated? -- ___ Python tracker <http://bugs.python.org/issue21745> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: Updated patch attached based on code reviews. I'm replying to the code review here as when I tried to reply on bugs.python.org/review I got a Python exception, "AttributeError: NoneType has no attribute something or other". FYI, it seems Django is set up in debug mode, because it was the full colourful traceback with local vars etc. I didn't save this file, sorry! Zach Ware = > Either way, these should probably be defined in stat.py as well (sorry; I > should > have said 'as well as' instead of 'instead of' when I mentioned _stat.c in the > email thread). I don't think so -- I thought the purpose of moving them to _stat.c was so you could use the windows.h constants rather than hard-coded integers? But if they're in stat.py, why do they need to be in _stat.c as well? http://bugs.python.org/review/21719/diff/12149/Modules/_stat.c#newcode563 > Modules/_stat.c:563: // For some reason FILE_ATTRIBUTE_INTEGRITY_STREAM and > PEP 7 bans C++ style comments. Good point. stoneleaf = > 'covert' should be 'convert' - can you fix this as part of your patch? ;) Fixed in latest patch. > Instead of spelling out FILE_ATTRIBUTE_ each time, can we abbreviate that part > to FA_ ? I disagree. I realize FILE_ATTRIBUTE_* is longish, but it's what it's called in the Windows API, so will make these constants much more findable and obvious to Windows programmers (consistent with the Win32 documentation). > On non-Windows machines 'st_file_attributes' will be 0? No, it won't be present at all (like the other non-cross-platform attributes). > There's no attribute for the value 8? Correct -- see the Win32 API documentation and loewis's comment. haypo = > It would be nice to document these options here :) I don't think we should duplicate the Win32 documentation. Ours will just be less thorough and potentially get out of date. I agree we should add a link to the Win32 docs though. > Instead of == 0, you can use assertFalse. Could, but I don't like this when I'm doing bit/integer comparisons. Explicit is better. > The purpose of the _stat module is to only provide data from the OS, to not > hardcode constants. So -1 to provide these constants on all platforms. Agreed. See above. > If the constant is only available on new Visual Studio version, we may > hardcode > the constant to avoid the dependency on a specific Visual Studio version. For > example, in unicodeobject.c I used: > > #ifndef WC_ERR_INVALID_CHARS > # define WC_ERR_INVALID_CHARS 0x0080 > #endif That's a good way of doing it. I'll use this approach. > Modules/posixmodule.c:1437: unsigned long st_file_attributes; > Should we use a DWORD here? I don't think so. None of the other values in win32_stat (e.g., st_dev) are defined using DWORD or Win32-style types. See also loewis's comment. loewis == > We shouldn't replicate Microsoft's documentation (if for no other reason than > copyright). Instead, linking to > > http://msdn.microsoft.com/en-us/library/windows/desktop/gg258117.aspx > > might be helpful (as long as Microsoft preserves this URL). Agreed. I'll make this change. -- Added file: http://bugs.python.org/file35621/issue21719-2.patch ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: > The idea is that _stat.c will use system-provided values wherever > possible, but stat.py should be as accurate as we can make it to > provide a backup for when _stat isn't around (either when it's just > not built, which isn't an issue on Windows, or in another Python > implementation). Fair call. I've added the constants to stat.py as well. Updated patch attached. -- Added file: http://bugs.python.org/file35622/issue21719-3.patch ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: Uploading a (hopefully final! :-) patch to fix Zach Ware's points from the code review: 1) use stat.FILE_ATTRIBUTE_DIRECTORY constant in test_os.py 2) break line length in stat.rst doc source -- Added file: http://bugs.python.org/file35632/issue21719-4.patch ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: Great, thanks for committing! -- ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Ah, thanks for making this an issue of its own! As I commented over at Issue10551, it's a serious problem, and makes mimetypes.guess_type() unusable out of the box on Windows. Yes, the fix in Issue4969 uses "MIME\Database\Content Type", which is a mime type -> file extension mapping, *not the other way around*. So while this patch is definitely an improvement (for the most part it doesn't produce wrong values!), but I'm not sure it's the way to go, for a few reasons: 1) Many of the most important keys aren't in the Windows registry (in HKEY_CLASSES_ROOT, where this patch looks). This includes .png, .jpg, and .gif. All of these important types fall back to the hard-coded "types_map" in mimetypes.py anyway. 2) Some that do exist are wrong in the registry (or at the least, different from the built-in "types_map"). This includes .zip, which is "application/x-zip-compressed" (at least in my registry) but should be "application/zip". 3) It's slowish, as it has to load about 6000 registry keys (and more as you install more stuff on your system), but only about 200 of those have the "Content Type" subkey. On my machine (Windows 7, 64 bit CPython) this adds over 100ms to the startup time even on subsequent runs when cached -- and I think 100ms is pretty significant. Issue4969's version takes about 25ms, and reverting this completely would of course take 0ms. 4) Users and other programs can (and sometimes do!) change the Content Type keys in the registry -- whereas one wants mime type mappings to be consistent across systems. This last point is debatable for various reasons, and I think the above three points should carry the day, but I include it here for completeness. ;-) For these reasons, I think we should revert the fix for Issue4969 and leave Windows users to get the default types_map as before, which is at least consisent -- and for mimetypes.guess_type(), you want consistency. -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Either way -- this needs to be reverted or fixed. It's a nasty gotcha for folks writing Python web services at the moment. I'm still for reverting, per my reasons above. Dave Chambers, I'm not for "faster but broken" but for "faster and fixed" -- from what I've shown above, it's the Windows registry that's broken, so removing read_windows_registry() entirely would fix this (and as a bonus, be faster and simplify the code :-). Per your suggestion http://bugs.python.org/issue15207#msg177092 -- I don't understand how mimetypes.py would know the types "that aren't hardcoded". R. David Murray, I don't understand the advantage of trying to maintain a list of "Windows fixes". What if this list was wrong, or there was a Windows update which broke more mime types? Why can't we just avoid the complication and go back to the hardcoded types for Windows? -- ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Actually, I was suggesting using the hardcoded types for Windows only (i.e., only removing read_windows_registry). Several bugs have been opened on problems with the Windows registry mimetypes, but as far as I know this isn't an issue on Linux -- in other words, if Linux/other systems ain't broke, no need to fix them. -- ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Any update on this, Tim or other Windows developers? -- ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue21719] Returning Windows file attribute information via os.stat()
Ben Hoyt added the comment: BTW, thanks for the mention in "What's New in Python 3.5": https://docs.python.org/3.5/whatsnew/3.5.html#os Can I make one small suggestion for a tweak there? A link to the docs for os.stat() would be good. So maybe instead of mentioning "os.stat_result" -- which isn't really a documented class -- say "The return value of :func:`os.stat` now has..." where os.stat is a link to https://docs.python.org/3.5/library/os.html#os.stat with the documentation for that. -- ___ Python tracker <http://bugs.python.org/issue21719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Raymond, there are very compelling timings/benchmarks for this -- not so much the original issue here (generator vs list, that's not really an issue) but having a scandir() function that returns the stat-like info from the OS so you don't need extra stat calls. This speeds up os.walk() by 7-20 times on Windows and 4-5 times on Linux. See more at: https://github.com/benhoyt/scandir#benchmarks I've written a draft PEP that I've sent to the PEP editors (if you're interested, it's at https://github.com/benhoyt/scandir/blob/master/PEP.txt). If any of the PEP editors are listening here ... would love some feedback on that at some stage. :-) Victor -- development has started outside the stdlib here: https://github.com/benhoyt/scandir and PyPI module here: https://pypi.python.org/pypi/scandir Both are being used by various people. -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Thanks! Will post the PEP to python-dev in the next day or two. -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Nick -- sorry, already posted to python-dev before seeing your latest. However, I think it's the right place, as there's already been a fair bit of hashing this idea and API out on python-ideas first and then also python-dev. See links in the PEP here: http://legacy.python.org/dev/peps/pep-0471/#previous-discussion -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue10551] mimetypes read from the registry should not overwrite standard mime mappings
Ben Hoyt added the comment: Mark, are you referring to part 3 of this issue, the image/pjpeg type of problem? This was fixed in Python 2.7.6 -- see changeset http://hg.python.org/cpython/rev/e8cead08c556 and http://bugs.python.org/issue15207 -- ___ Python tracker <http://bugs.python.org/issue10551> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18577] lru_cache enhancement: lru_timestamp helper function
Ben Hoyt added the comment: I really like this idea (and am needing this functionality), but I don't think this API (or implementation) is very nice: 1) It means you have to change your function signature to use the timeout feature. 2) Specifying the interval in minutes seems odd (most similar timeouts in Python are specified in seconds). I would love to see an optional timeout=seconds keyword arg to the lru_cache() decorator, or some other way to support this. Raymond, what do you think would be the simplest way to hook this in? One way I think would be nice (and also support other neat things) is to allow you to specify the dict-like object that's used for the cache (defaults to "dict", of course). So the lru_cache signature would change to: def lru_cache(maxsize=100, typed=False, cache=None): ... >From looking at the source, cache would need to support these methods: get, >clear, __setitem__, __contains__, __len__, __delitem__ Would this just work? Or could there be a race condition if __contains__ (key in cache) returned True but then cache.get(key) returned False a bit later? In any case, this seems nice and general to me, and would mean you could implement a simple ExpiringDict() and then pass that as cache=ExpiringDict(expiry_time). Thoughts? -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue18577> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Yes, PEP 471 has been accepted, and I've got a mostly-finished C implementation of os.scandir() for CPython 3.5, as well as tests and docs. If you want a sneak preview, see posixmodule_scandir*.c, test/test_scandir.py, and os.rst here: https://github.com/benhoyt/scandir It's working well on Windows, but the Linux version has a couple of tiny issues yet (core dumps ;-). Given that os.scandir() will solve this issue (as well as the bigger performance problem due to listdir throwing away file type info), can we close this issue and open another one to track the implementation of os.scandir() / PEP 471? -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
New submission from Ben Hoyt: Opening this to track the implementation of PEP 471: os.scandir() [1]. This supercedes Issue #11406 (and possibly others)? The implementation is most of the way there, but not yet done as a CPythono 3.5 patch. Before I have a proper patch, it's available on GitHub [2] -- see posixmodule_scandir*.c, test/test_scandir.py, and os.rst. [1] http://legacy.python.org/dev/peps/pep-0471/ [2] https://github.com/benhoyt/scandir -- components: Library (Lib) messages: 227933 nosy: benhoyt, haypo, mmarkk, pitrou, tim.golden priority: normal severity: normal status: open title: PEP 471 implementation: os.scandir() directory scanning function type: enhancement versions: Python 3.5 ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Okay, I've opened http://bugs.python.org/issue22524, but I don't have the permissions to close this one, so could someone with bugs.python.org superpowers please do that? -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Attaching my first patch here. It includes docs, the implementation of scandir and DirEntry in posixmodule.c, and tests in test_scandir.py. It does not yet include the changes for os.walk() to use scandir; I'm hoping to do that in a separate patch for easier code review. I'd love some code and documentation review, as well as some thoughts on these open questions: 1) posixmodule.c is getting kinda huge and unwieldy; it'd be nice to implement this in a separate C file, but that comes with its own problems, as several functions would need to be made available externally. 2) While writing the tests, I was getting issues with Windows (apparently) not deleting the symlinks immediately in my teardown function, so for the moment I just don't call the teardown function at all. Not great; I think we'll want a better solution. 3) Partly due to #2, I haven't yet added tests for the file-not-found condition if a file is found by the directory scanning but then deleted before a DirEntry.is_dir() or DirEntry.is_file() call. I intend to add these tests, and there's a TODO comment in test_scandir.py so I remember. 4) test_scandir.py will need some further cleanup for inclusion in CPython; it's currently taken straight from my scandir module, which supports Python down to Python 2.6. -- keywords: +patch Added file: http://bugs.python.org/file36831/scandir-1.patch ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Thanks for the initial response and code review, Victor. I'll take a look and respond further in the next few days. In the meantime, however, I'm definitely open to splitting scandir out into its own C file. This will mean a little refactoring (making some functions public/non-static). Based on the numbers so far, I'm not so keen on implementing just the sys calls in C and the rest in Python. I already do basically this with ctypes in the scandir module, and it's slowish. I'll send proper numbers through soon, but here's what I remember from running benchmark.py on my Windows laptop with SSD drive: ctypes version: os.walk() 9x faster with scandir CPython 3.5 C version (debug): os.walk() 24x faster with scandir CPython 3.5 C version (release): os.walk() 55x faster with scandir So you do get a lot of speedup from just the ctypes version, but you get a lot more (55/9 = 6x more here) by using the all-C version. Again, these numbers are from memory -- I'll send exact ones later. One of the problems is that creating the DirEntry objects and calling their methods is fairly expensive, and if this is all done in Python, you lose a lot. I believe scandir() would end up being slower than listdir() in many cases. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Here are the actual numbers (instead of just from memory) running on my Windows laptop, which happens to have an SSD drive, so os.walk() with scandir is particularly good here. python 3.4: benchmark.py -c python (all implemented in Python using ctypes): os.walk took 1.011s, scandir.walk took 0.057s -- 17.8x as fast python 3.4: benchmark.py -c c (using _scandir.c, so the iteration implemented in C, the DirEntry object in Python): os.walk took 1.014s, scandir.walk took 0.039s -- 25.8x as fast python 3.5: benchmark.py -c os (using the new all-C version in posixmodule.c): os.walk took 0.983s, scandir.walk took 0.019s -- 52.3x as fast So as you can see, there's still another 2x speedup to be gained from having everything in C. I know it's a bit more to review, but my thinking is if we're going to speed this baby up, let's speed it right up! I haven't done these tests on Linux yet. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue6689] subprocess doesn't pass arguments correctly on Linux when shell=True
Changes by Ben Hoyt : -- nosy: +benhoyt type: behavior -> feature request ___ Python tracker <http://bugs.python.org/issue6689> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue6689] subprocess doesn't pass arguments correctly on Linux when shell=True
Ben Hoyt added the comment: Oops, didn't intend to change the type, changing back. -- type: feature request -> behavior ___ Python tracker <http://bugs.python.org/issue6689> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29549] Improve docstring for str.index
Ben Hoyt added the comment: Good call. Additionally, it's weird to me that the docstring for str.find() says "Return -1 on failure." because the string not being found is not a "failure". Like the docs (https://docs.python.org/3.5/library/stdtypes.html#str.find) say, the str.find() docstring should say "Return -1 if sub is not found." -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue29549> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue29540] Add compact=True flag to json.dump/dumps
Ben Hoyt added the comment: I agree with the confusion (PR proposes separators=COMPACT, issue compact=True). I like the concept but not either of the APIs proposed. I *much* more often want to get pretty output -- if I had a dime for every time I've written "json.dumps(obj, sort_keys=True, indent=4)" I'd be rich enough to buy an entire cup of Starbucks coffee. But then you'd need a pretty=True option as well, which would be mutually exclusive with compact=True, so not great. But what about a style= (or "format="?) parameter, which defaults to 'default' (or just None) meaning the same as now. If you pass format='pretty' you get "sort_keys=True, indent=4" and if you pass format='compact' you get "separators=(',', ':')". -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue29540> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28609] argparse claims '*' positional argument is required in error output
Changes by Ben Hoyt : -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue28609> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue28609] argparse claims '*' positional argument is required in error output
Ben Hoyt added the comment: I definitely agree with Reuben here -- I just ran into this issue while creating a "production quality" tool, and the help message produced by argparse with nargs='*' confused me too. It's pretty clear that this is a simple bug in argparse's production of that usage message: it says ARGUMENT is required, but it isn't. However, the workaround of specifying an (unused) default is a reasonable workaround in the meantime -- thanks Paul. -- ___ Python tracker <http://bugs.python.org/issue28609> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27719] Misleading note about "args" attribute in "User-defined Exceptions" section of tutorial
New submission from Ben Hoyt: In the official tutorial in the "User-defined Exceptions" section (https://docs.python.org/3.5/tutorial/errors.html#user-defined-exceptions) there's a note about a user-defined Exception subclass as follows: "In this example, the default __init__() of Exception has been overridden. The new behavior simply creates the value attribute. This replaces the default behavior of creating the args attribute." That last sentence is wrong: it used to be that way (it is in Python 2.x and I believe in Python pre 3.2), but now the implementation of BaseException.__new__ now sets args, so even when you override __init__ and don't call super() args is set. I think that's what you want, so I'm putting this down to a documentation bug. I think the sentence "This replaces the default behavior of creating the args attribute." should simply be removed (BaseException.__init__ basically does nothing now). This change happened for Python 3.3 and was backported to Python 3.2. See also: * The relevant part of BaseException.__new__ in the CPython codebase: https://github.com/python/cpython/blob/601ee5fab5df81a25611da0667030de531c1cda9/Objects/exceptions.c#L44-L48 * The issue where this behaviour was changed: http://bugs.python.org/issue1692335 * The commit where it was changed: https://hg.python.org/cpython/rev/68e2690a471d (on GitHub at https://github.com/python/cpython/commit/a71a87e695b05a67bd22c6ac74311b1f56f3932e) * You can repro this and confirm that "args" is definitely set with the attached script: $ python exception_args_test.py sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0) Error args: ('error!',) -- assignee: docs@python components: Documentation files: exception_args_test.py messages: 272256 nosy: benhoyt, docs@python, georg.brandl, sbt priority: normal severity: normal status: open title: Misleading note about "args" attribute in "User-defined Exceptions" section of tutorial versions: Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6 Added file: http://bugs.python.org/file44058/exception_args_test.py ___ Python tracker <http://bugs.python.org/issue27719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27719] Misleading note about "args" attribute in "User-defined Exceptions" section of tutorial
Ben Hoyt added the comment: Note that I added the committers from issue 1692335 to the Nosy List -- probably overzealous and probably not the folks who maintain the docs, but oh well. :-) -- ___ Python tracker <http://bugs.python.org/issue27719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27719] Misleading note about "args" attribute in "User-defined Exceptions" section of tutorial
Ben Hoyt added the comment: Okay, patch attached -- I think it's best to simply remove that sentence. Side note: this is why GitHub is so good -- click edit, remove the sentence, click create pull request (approx time 30 seconds). I've got a new machine since I last did Python development, so to create an actual patch I needed to: download Mercurial, clone the repo, install Sphinx, edit the file, build the docs, create the patch (approx time 30 minutes). On the positive side, each of those steps was painless and trouble-free. :-) -- keywords: +patch Added file: http://bugs.python.org/file44071/issue27719-1.patch ___ Python tracker <http://bugs.python.org/issue27719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27719] Misleading note about "args" attribute in "User-defined Exceptions" section of tutorial
Ben Hoyt added the comment: Removing that whole example sounds good to me, thanks. -- ___ Python tracker <http://bugs.python.org/issue27719> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15207] mimetypes.read_windows_registry() uses the wrong regkey, creates wrong mappings
Ben Hoyt added the comment: Okay, I'm looking at the diff between mt-tip-noregistry.txt and mt-tip-newregistry.txt, and I've attached a file showing the lines that are *different* between the two, as well as the Apache mime.types value for that file extension. In most cases, noregistry gives the right mime type, and newregistry is wrong. However, in a few cases, the registry value is right (i.e., according to Apache's mime.types). However, I think that's a totally separate issue, and maybe we should probably open a bug to update a few of the hard-coded mappings in mimetypes.py. The cases where noregistry is right (according to Apache): * .aif * .aifc * .aiff * .avi * .sh * .wav * .xsl *. zip The cases where noregistry is wrong (according to Apache): * .bmp is hard-coded as "image/x-ms-bmp", but it should be image/bmp * .dll and .exe are hard-coded as "application/octet-stream", but should be "application/x-msdownload" * .ico is hard-coded as "image/vnd.microsoft.icon" but should be "image/x-icon" * .m3u is hard-coded as "application/vnd.apple.mpegurl" but should be "audio/x-mpegurl" None of these are standardized IANA mime types, and they're not particularly common for web servers to be serving, so it probably doesn't matter too much that the current hard-coded values are wrong. Also, I'm guessing web browsers will interpret the older type image/x-ms-bmp as image/bmp anyway. So maybe we should open another issue to fix the hard-coded types in mimetypes.py shown above, but again, that's another issue. The other thing here is all the *new types* that the registry adds, such as ".acrobatsecuritysettings". I don't see that these add much value -- just a bunch of types that depend on the programs you have installed. And in my mind at least, the behaviour of mimetypes.guess_type() should not change based on what programs you have installed. In short, "noregistry" gives more accurate results in most cases that matter, and I still strongly feel we should revert to that. (The only alternative, in my opinion, is to switch to Dave Chambers' version of read_windows_registry(), but not call it by default.) -- Added file: http://bugs.python.org/file29913/different.txt ___ Python tracker <http://bugs.python.org/issue15207> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15535] Fix pickling efficiency of named tuples in 2.7.3
Ben Hoyt added the comment: I just hit this issue in a big way -- would have been nice for this fix to go into Python 2.7.4. :-) It was quite hard to track down (as in, a day or two of debugging :-) because the symptoms didn't point directly to namedtuple. In our setup we pickle/unpickle some big files, and the symptoms we noticed were extremely high memory usage after *un*pickling -- as in, 3x what we were getting before upgrading from Python 2.6. We first tracked it down to unpickling, and then from there narrowed it down to namedtuple. The first "fix" I discovered was that I could use pickletools.optimize() to reduce the memory-usage-after-unpickling back down to sensible levels. I don't know enough about pickle to know exactly why this is -- perhaps fragmentation due to extra unpickling data structures allocated on the heap, that optimize() removes? Here's the memory usage of our Python process after unpickling a ~9MB pickle file (protocol 2) which includes a lot of namedtuples. This is on Python 2.7.4 64-bit. With the original collections.py -- "normal" means un-optimized pickle, "optimized" means run through pickletools.optimize(): Memory usage after loading normal: 106664 KB Memory usage after loading optimized: 31424 KB With collections.py modified so namedtuple's templates include "def __getstate__(self): return None": Memory usage after loading normal: 33676 KB Memory usage after loading optimized: 26392 KB So you can see the Python 2.7 version of namedtuple makes the process use basically 3x the RAM when unpickled (without pickletools.optimize). Note that Python 2.6 does *not* do this (it doesn't define __dict__ or use OrderedDict so doesn't have this issue). And for some Python 3.3(.1) doesn't have the issue either, even though that does define __dict__ and use OrderedDict. I guess Python 3.3 does pickling (or garbage collection?) somewhat differently. You can verify this yourself using the attached unpickletest.py script. Note that I'm running on Windows 7, but I presume this would happen on Linux/OS X too, as this issue has nothing to do with the OS. The script should work on non-Windows OSes, but you have to type in the RAM usage figures manually (using "top" or similar). Note that I'm doing a gc.collect() just before fetching the memory usage figure just in case there's uncollected cyclical garbage floating around, and I didn't want that to affect the measurement. I'm not sure I fully understand the cause (of where all this memory is going), or the fix for that matter. The OrderedDict is being pickled along with the namedtuple instance, because an OrderedDict is returned by __dict__, and pickle uses that. But is that staying in memory on unpickling? Why does optimizing the pickle fix the RAM usage issue to a large extent? In any case, I've made the __getstate__ fix in our code, and that definitely fixes the RAM usage for us. (We're also going to be optimizing our pickles from now on.) -- nosy: +benhoyt Added file: http://bugs.python.org/file30021/unpickletest.py ___ Python tracker <http://bugs.python.org/issue15535> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Ah, this is great. I definitely like the idea of a generator version of os.listdir(). And I like the name iterdir() -- it fits with iteritems() etc. They've gone in 3.x, of course, but listdir didn't change to an iterator, so... See also Betterwalk, my work-in-progress with very similar goals: https://github.com/benhoyt/betterwalk#readme It implements iterdir(), as well as iterdir_stat() which yields (name, stat) tuples. iterdir_stat() is especially important on Windows, where the directory iteration functions (FindFirstFile/FindNextFile) already give you full stat information. The intent of Betterwalk is to use these functions to speed up os.walk() by 2-3 times (and that's with the ctypes version, so it'll only get better in pure C). So I'm +1 for adding iterdir(), and I'd love to see iterdir_stat() so users can write fast os.walk() type functions without resorting to C. I'll look over the attached patches at some stage, especially the Windows code. -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Some folks have asked about benchmarks. I don't know about iterdir() vs listdir() -- I kind of suspect the speed gains there wouldn't be big. However, the reason I'm keen on iterdir_stat() is that I'm seeing it speed up os.walk() by a factor of 10 in my recent tests (note that I've made local mods, so these results aren't reproducible for others yet). This is doing a walk on a dir tree with 7800 files and 155 dirs: Using fast _betterwalk Priming the system's cache... Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 1/3... Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 2/3... Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 3/3... os.walk took 0.178s, BetterWalk took 0.017s -- 10.5x as fast Sometimes Windows will go into this "I'm really caching stat results good" mode -- I don't know what heuristic determines this -- and then I'm seeing a 40x speed increase. And no, you didn't read that wrong. :-) Sorry, I'm getting carried away. This bug is really more about iterdir. But seeing Martin suggested the stat/d_type info... -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue15535] Fix pickling efficiency of named tuples in 2.7.3
Ben Hoyt added the comment: 2.7 fix works for me, thanks! Just curious -- why the different fix for 3.3 (addition of __getstate__ instead of removal of __dict__)? -- ___ Python tracker <http://bugs.python.org/issue15535> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Thanks. I thought about that -- but I think I *want* to benchmark it when they're cached, so that we're comparing apples with apples, cached system calls with cached systems calls. The benchmark would almost certainly be a lot "better" (BetterWalk would be even faster) if I was comparing the non-cached results. I'll think about it some more though. Thoughts? -Ben On Fri, May 3, 2013 at 7:03 PM, Charles-François Natali < rep...@bugs.python.org> wrote: > > Charles-François Natali added the comment: > > > However, the reason I'm keen on iterdir_stat() is that I'm seeing it > speed up os.walk() by a factor of 10 in my recent tests (note that I've > made local mods, so these results aren't reproducible for others yet). This > is doing a walk on a dir tree with 7800 files and 155 dirs: > > > > Using fast _betterwalk > > Priming the system's cache... > > Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 1/3... > > Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 2/3... > > Benchmarking walks on C:\Work\betterwalk\benchtree, repeat 3/3... > > os.walk took 0.178s, BetterWalk took 0.017s -- 10.5x as fast > > > > Sometimes Windows will go into this "I'm really caching stat results > good" mode -- I don't know what heuristic determines this -- and then I'm > seeing a 40x speed increase. And no, you didn't read that wrong. :-) > > I/O benchmarks shouldn't use timeit or repeated calls: after the first > run, most of your data is in cache, so subsequent runs are > meaningless. > > I don't know about Windows, but on Linux you should do something like: > # echo 3 > /proc/sys/vm/drop_caches > > to start out clean. > > -- > > ___ > Python tracker > <http://bugs.python.org/issue11406> > ___ > -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: That's right: if we have a separate scandir() that returns (name, stat) tuples, then a plain iterdir() is pretty much unnecessary -- callers just ignore the second stat value if they don't care about it. I'd slightly prefer the name iterdir_stat(), as that almost makes the (name, stat) return values explicit in the name. But that's kind of bikeshedding -- scandir() works too. -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: > I find iterdir_stat() ugly :-) I like the scandir name, which has some > precedent with POSIX. Fair enough. I'm cool with scandir(). > scandir() cannot return (name, stat), because on POSIX, readdir() only > returns d_name and d_type (the type of the entry): to return a stat, we would > have to call stat() on each entry, which would defeat the performance gain. Yes, you're right. I "solved" this in BetterWalk with the solution you propose of returning a stat_result object with the fields it could get "for free" set, and the others set to None. So on Linux, you'd get a stat_result with only st_mode set (or None for DT_UNKNOWN), and all the other fields None. However -- st_mode is the one you're most likely to use, usually looking just for whether it's a file or directory. So calling code would look something like this: files = [] dirs = [] for name, st in scandir(path): if st.st_mode is None: st = os.stat(os.path.join(path, name)) if stat.S_ISDIR(st.st_mode): dirs.append(name) else: files.append(name) Meaning you'd get the speed improvements 99% of the time (when st_mode) was set, but if st_mode is None, you can call stat and handle errors and whatnot yourself. > That's why scandir would be a rather low-level call, whose main user would be > walkdir, which only needs to know the entry time and not the whole stat > result. Agreed. This is in the OS module after all, and there's tons of stuff that's OS-dependent in there. However, I think that doing something like the above, we can make it usable and performant on both Linux and Windows for use cases like walking directory trees. > Also, I don't know which information is returned by the readdir equivalent on > Windows, but if we want a consistent API, we have to somehow map d_type and > Windows's returned type to a common type, like DT_FILE, DT_DIRECTORY, etc > (which could be an enum). The Windows scan directory functions (FindFirstFile/FindNextFile) return a *full* stat (or at least, as much info as you get from a stat in Windows). We *could* map them to a common type -- but I'm suggesting that common type might as well be "stat_result with None meaning not present". That way users don't have to learn a completely new type. > The other approach would be to return a dummy stat object with only st_mode > set, but that would be kind of a hack to return a dummy stat result with only > part of the attributes set (some people will get bitten by this). We could document any platform-specific stuff, and places you'd users could get bitten. But can you give me an example of where the stat_result-with-st_mode-or-None approach falls over completely? -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: Yeah, I very much agree with what Nick says -- we really want a way to expose what the platform provides. It's less important (though still the ideal), to expose that in a platform-independent way. Today the only way to get access to opendir/readdir on Linux and FindFirst/Next on Windows is by using a bunch of messy (and slowish) ctypes code. And yes, os.walk() would be the main cross-platform abstraction built on top of this. Charles gave this example of code that would fall over: size = 0 for name, st in scandir(path): if stat.S_ISREG(st.st_mode): size += st.st_size I don't see it, though. In this case you need both .st_mode and .st_size, so a caller would check that those are not None, like so: size = 0 for name, st in scandir(path): if st.st_mode is None or st.st_size is None: st = os.stat(os.path.join(path, name)) if stat.S_ISREG(st.st_mode): size += st.st_size One line of extra code for the caller, but a big performance gain in most cases. Stinner said, "But as usual, a benchmark on a real platform would be more convicing". Here's a start: https://github.com/benhoyt/betterwalk#benchmarks -- but it's not nearly as good as it gets yet, because those figures are still using the ctypes version. I've got a C version that's half-finished, and on Windows it makes os.walk() literally 10x the speed of the default version. Not sure about Linux/opendir/readdir yet, but I intend to do that too. -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: > A normal "caller" would never expect a stat object to be partially populated: > if a function has a prototype returning a stat object, then I definitely > expect it to be a regular stat object, with all the fields guaranteed by > POSIX set (st_size, st_ino, st_dev...). I don't think that's true in general, or true of how other Python APIs work. For instance, many APIs return a "file-like object", and you can only do certain things on that object, depending on what the documentation says, or what EAFP gets you. Some file-like object don't support seek/tell, some don't support close, etc. I've seen plenty of walk-like-a-duck checks like this: if hasattr(f, 'close'): f.close() Anyway, my point boils down to: * scandir() is a new function, so there aren't old trends or things that will break * we clearly document it as returning a tuple of (name, st), where st is a "stat-like object" whose invididual fields are None if they couldn't be determined for free with the directory scanning * in fact, that's kind of the point of the "st" object in this function, so the example could be the one I gave above where you call os.stat() if either of the fields you want is None * if that's clear in the documentation (of this new function) and the first example shows you exactly how it's meant to be used, I think that's pretty sane and sensible... -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue11406] There is no os.listdir() equivalent returning generator instead of list
Ben Hoyt added the comment: > Please bring this up on python-dev. Good idea. Thread started: http://mail.python.org/pipermail/python-dev/2013-May/126119.html -- ___ Python tracker <http://bugs.python.org/issue11406> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Thanks, Victor and Antone. I'm somewhat surprised at the 2-3x numbers you're seeing, as I was consistently getting 4-5x in the Linux tests I did. But it does depend quite a bit on what file system you're running, what hardware, whether you're running in a VM, etc. Still, 2-3x faster is a good speedup! The numbers are significantly better on Windows, as you can see. Even the smallest numbers I've seen with "--scandir os" are around 12x range on Windows. In any case, Victor's last tests are "right" -- I presume we'll have *some* C, so what we want to be comparing is "benchmark.py --scandir c" versus "benchmark.py --scandir os": the some C version versus the all C version in the attached CPython 3.5 patch. BTW, Victor, "Generic" isn't really useful. I just used it as a test case that calls listdir() and os.stat() to implement the scandir/DirEntry interface. So it's going to be strictly slower than listdir + stat due to using listdir and creating all those DirEntry objects. Anyway, where to from here? Are we agreed given the numbers that -- especially on Linux -- it makes good performance sense to use an all-C approach? -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: I dunno, I'd be happy if you implement this, but it does mean *more* C code, not less. :-) I feel this would be a nice-to-have, but we should get the thing working first, and then do the multiple-OS-calls-in-one optimization. In any case, if you implement this, I think you should do so against the CPython 3.5 patch, not the _scandir.c/scandir_helper code. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Attaching updated patch (scandir-2.patch) per Victor's code review here: http://bugs.python.org/review/22524/diff/13005/Modules/posixmodule.c Note that I haven't included test_scandir.py this time, as I haven't made any changes there, and it's not really ready for Python 3.5 yet. -- Added file: http://bugs.python.org/file36963/scandir-2.patch ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Akira, note the "Using slower ctypes version of scandir" -- this is the older, ctypes implementation of scandir. On Linux, depending on file system etc, that can often be slower. You need to at least be using the "fast C version" in _scandir.c, which is then half C, half Python. But ideally you'd use the all-C version that I've provided as a CPython 3.5 patch. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Victor, I'd love to push forward with this (I doubt it's going to make alpha 1 on February 8, but hopefully alpha 2 on March 8). It seems pretty settled that we need to use the all-C version I've created, especially for Linux. Can you please review scandir-2.patch? It's on Rietveld here: http://bugs.python.org/review/22524/#ps13104 You can see what changed between scandir-1.patch and scandir-2.patch more easily here: https://github.com/benhoyt/scandir/commit/d4e2d7083319ed8988aef6ed02d670fb36a00a33 Also, here is just the scandir C code in its own file: https://github.com/benhoyt/scandir/blob/d4e2d7083319ed8988aef6ed02d670fb36a00a33/posixmodule_scandir_main.c One other open question is: should it go into CPython's posixmodule.c as I have it in the patch, or should I try to separate it out? -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Hi Victor, I thank you for your efforts here, especially your addition of DirEntry.inode() and your work on the tests. However, I'm a bit frustrated that you just re-implemented the whole thing without discussion: I've been behind scandir and written the first couple of implementations, and I asked if you could review my code, but instead of reviewing it or interacting with my fairly clear desire for the all-C version, you simply turn up and say "I've rewritten it, can you please review?" You did express concern off list about an all-C vs part-Python implementation, but you said "it's still unclear to me which implementation should be added to Python" and "I don't feel able today to take the right decision. We may use python-dev to discuss that". But I didn't see that discussion at all, and I had expressed fairly clearly (both here and off list), with benchmarks, why I thought it should be the all-C version. So it was a bit of a let down for a first-time Python contributor. Even a simple question would have been helpful here: "Ben, I really think this much C code for a first version is bug-prone; what do you think about me re-implementing it and comparing?" TLDR: I'd like to see os.scandir() in Python even if it means the slower implementation, but some discussion would have been nice. :-) So if it's not too late, I'll continue that discussion in my next message. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: To continue the actual "which implementation" discussion: as I mentioned last week in http://bugs.python.org/msg235458, I think the benchmarks above show pretty clearly we should use the all-C version. For background: PEP 471 doesn't add any new functionality, and especially with the new pathlib module, it doesn't make directory iteration syntax nicer either: os.scandir() is all about letting the OS give you whatever info it can *for performance*. Most of the Rationale for adding scandir given in PEP 471 is because it can be so so much faster than listdir + stat. My original all-C implementation is definitely more code to review (roughly 800 lines of C vs scandir-6.patch's 400), but it's also more than twice as fast. On my Windows 7 SSD just now, running benchmark.py: Original scandir-2.patch version: os.walk took 0.509s, scandir.walk took 0.020s -- 25.4x as fast New scandir-6.patch version: os.walk took 0.455s, scandir.walk took 0.046s -- 10.0x as fast So the all-C implementation is literally 2.5x as fast on Windows. (After both tests, just for a sanity check, I ran the ctypes version as well, and it said about 8x as fast for both runs.) Then on Linux, not a perfect comparison (different benchmarks) but shows the same kind of trend: Original scandir-2.patch benchmark (http://bugs.python.org/msg228857): os.walk took 0.860s, scandir.walk took 0.268s -- 3.2x as fast New scandir-6.patch benchmark (http://bugs.python.org/msg235865) -- note that "1.3x faster" should actually read "1.3x as fast" here: bench: 1.3x faster (scandir: 164.9 ms, listdir: 216.3 ms) So again, the all-C implementation is 2.5x as fast on Linux too. And on Linux, the incremental improvement provided by scandir-6 over listdir is hardly worth it -- I'd use a new directory listing API for 3.2x as fast, but not for 1.3x as fast. Admittedly a 10x speed gain (!) on Windows is still very much worth going for, so I'm positive about scandir even with a half-Python implementation, but hopefully the above shows fairly clearly why the all-C implementation is important, especially on Linux. Also, if the consensus is in favour of slow but less C code, I think there are further tweaks we can make to the Python part of the code to improve things a bit more. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Akari: yes, I'm aware of this, and it's definitely a concern to me -- it may be that scandir has a bug with counting the size of certain non-file directory entries, or my implementation of os.walk() via scandir is not quite correct. I'll look at it before too long! -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: BTW, I replied to Victor privately, but for the thread: no worries, and apology accepted! :-) I'm working on updating my C patch with his feedback, and will update this issue hopefully in the next few days. -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue25911] Regression: os.walk now using os.scandir() breaks bytes filenames on windows
Ben Hoyt added the comment: Just to clarify what Victor and Zachary mentioned: bytes filenames have been deprecated only *on Windows* since Python 3.3; bytes filenames are still fine on POSIX. -- ___ Python tracker <http://bugs.python.org/issue25911> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue25994] File descriptor leaks in os.scandir()
Ben Hoyt added the comment: I'm not sure this is actually a leak, because (looking at the code) ScandirIterator_close() is called not just in the destructor, but also at the end of iterating, just before raising StopIteration. Is the issue that if an exception is raised or you stop iterating before the end, then it's less determined when the destructor/close is called? I think we could fairly easily add an explicit close method to the iterator and make it a context manager (add __enter__ and __exit__ methods to the iterator). Is this what you're thinking of: # manual close it = scandir.scandir(path) first_entry = next(it) it.close() with scandir.scandir(path) as it: first_entry = next(it) -- ___ Python tracker <http://bugs.python.org/issue25994> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26032] Use scandir() to speed up pathlib globbing
Ben Hoyt added the comment: Guido, it's true that in almost all cases you get the speedup (no system call), and it's very much worth using. But the idea with the docs being non-committal is because being specific would make the docs fairly complex. I believe it's as follows for is_file/is_dir/is_symlink: * no system call required on Windows or Unix if the entry is not a symlink * unless you're on Unix with some different file system (maybe a network FS?) where d_type is DT_UNKNOWN * some other edge case which I've probably forgotten :-) Do you think the docs should try to make this more specific? -- ___ Python tracker <http://bugs.python.org/issue26032> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue25596] Use scandir() to speed up the glob module
Ben Hoyt added the comment: Nice! Good to see scandir() getting used to speed up other functions, functions that folks are using a ton already, like glob(). -- ___ Python tracker <http://bugs.python.org/issue25596> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26111] On Windows, os.scandir will keep a handle on the directory until the iterator is exhausted
Changes by Ben Hoyt : -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue26111> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26248] Improve scandir DirEntry docs, especially re symlinks and caching
New submission from Ben Hoyt: Per Guido's comment about DirEntry documentation on Issue 26032, especially http://bugs.python.org/issue26032#msg257665, it'd be good to improve the docs of the scandir DirEntry methods with regard to symlinks and caching. Attaching my stab at a documentation fix. Changes here are: 1) Clarify that the return values of is_dir()/is_file()/etc are cached separately for follow_symlinks True and False. 2) Be more specific about when the functions require a system call, and how it relates to caching and follow_symlinks. 3) DRY up common stuff between is_dir and is_file by saying "Caching, system calls made, and exceptions raised are as per is_dir" in is_file. 4) Tweak to the first paragraph of docs for is_dir/is_file to simplify: assume the follow_symlinks=True default, then note the follow_symlinks=False non-default case after. I think they're all improvements, though I'm not sure about #3. Is it better to just repeat those couple of paragraphs verbatim? I'm also not 100% sure about mentioning the DT_UNKNOWN thing. But you really can't get more specific about when system calls are required on Unix without mentioning it. -- assignee: docs@python components: Documentation files: direntry_doc_improvements.patch keywords: patch messages: 259282 nosy: benhoyt, docs@python, gvanrossum, haypo, serhiy.storchaka priority: normal severity: normal status: open title: Improve scandir DirEntry docs, especially re symlinks and caching versions: Python 3.5, Python 3.6 Added file: http://bugs.python.org/file41765/direntry_doc_improvements.patch ___ Python tracker <http://bugs.python.org/issue26248> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26032] Use scandir() to speed up pathlib globbing
Ben Hoyt added the comment: Guido, I've made some tweaks and improvements to the DirEntry docs here: http://bugs.python.org/issue26248 -- the idea is to fix the issues you mentioned to clarify when system calls are required with symlinks, mentioning that the results are cached separately for follow_symlinks True and False, etc. -- ___ Python tracker <http://bugs.python.org/issue26032> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26248] Improve scandir DirEntry docs, especially re symlinks and caching
Ben Hoyt added the comment: Thanks, Victor! -- ___ Python tracker <http://bugs.python.org/issue26248> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue26248] Improve scandir DirEntry docs, especially re symlinks and caching
Ben Hoyt added the comment: Seeing this has been merged (thanks Victor), can this issue be closed? Guido, are you happy with the changes given your comments at http://bugs.python.org/issue26032#msg257665? -- ___ Python tracker <http://bugs.python.org/issue26248> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue13617] Reject embedded null characters in wchar* strings
Ben Hoyt added the comment: Note that this (or a very similar issue) also affects os.listdir() on Windows: os.listdir(bytes_path_with_nul) raises ValueError as expected, but os.listdir(unicode_path_with_nul) does not. Test case: >>> import os >>> os.mkdir('foo') >>> os.listdir(b'foo\x00zzz') Traceback (most recent call last): File "", line 1, in ValueError: listdir: embedded null character in path >>> os.listdir('foo\x00zzz') [] However, this is not the case on Linux, as there both calls raise an appropriate ValueError. This needs to be fixed in posixmodule.c's path_converter() function. I'm in the middle of implementing PEP 471 (os.scandir), so don't want to create a proper patch right now, but the fix is to add these lines in posixmodule.c path_converter() after the if (length > 32767) {...} block: if ((size_t)length != wcslen(wide)) { FORMAT_EXCEPTION(PyExc_ValueError, "embedded null character in %s"); Py_DECREF(unicode); return 0; } We should also add test to test_os.py like the following: def test_listdir_nul_in_path(self): self.assertRaises(ValueError, os.listdir, 'y\x00z') self.assertRaises(ValueError, os.listdir, b'y\x00z') -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue13617> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Okay, here's the next version of the patch. I've updated scandir-2.patch with a number of modifications and several fixes to the C code. This includes Victor's scandir-6.patch test suite (with minor mods) and my original documentation. Note that I haven't looked at either the documentation or the tests too closely, but I'm uploading the patch in any case so Victor (or others) can start reviewing the C code. -- Added file: http://bugs.python.org/file38259/scandir-7.patch ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23535] os.path.join() wrong concatenation of "C:" on Windows
Ben Hoyt added the comment: Sorry, but this is operating as designed and documented. See the docs here: https://docs.python.org/3.4/library/os.path.html#os.path.join "On Windows ... since there is a current directory for each drive, os.path.join("c:", "foo") represents a path relative to the current directory on drive C: (c:foo), not c:\foo. So I think this issue should be closed as not a bug. -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue23535> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Updated version of the patch after Victor's code review of scandir-7 and a couple of his comments offline. Full commit log is available on my GitHub project in posixmodule_scandir_main.c if anyone's interested. Again, I haven't looked closely at the docs or tests yet, but hopefully the C code is getting pretty close now! KNOWN ISSUE: There's a reference leak in the POSIX version (found with "python -m test -R 3:2 test_os"). I'm a Windows guy and this is my first serious Python C extension code, so I'm not sure the best way to track this down. Would someone like to either point out the bug :-) or help me with how to track this down? -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Oops, I'm sorry re previous comment -- looks like I forgot to attach scandir-8.patch. Now attached. Please re-read my previous comment and review. :-) -- Added file: http://bugs.python.org/file38374/scandir-8.patch ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Attaching a first cut at this -- basically the implementation I use for walk() in scandir.py on GitHub. One thing that's really weird to me: are the os.walk() unit tests actually being run? In test_os.py, I notice everything's in WalkTest.setUp, which is kinda weird -- and it's not actually running. At least when I add a "1/0" inside setUp() to force an error and run "python -m test test_os" I don't get any errors... -- keywords: +patch nosy: +benhoyt Added file: http://bugs.python.org/file38385/os_walk_1.patch ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23619] Python 3.5.0a2 installer fails
Ben Hoyt added the comment: Same exact issue here. I didn't have a Python 3.5 alpha 1 previously installed, and I tried running the installer normally and also (after uninstalling) with right-click, "Run as administrator". Both do the same thing for me: pop up a dialog box at the end of installation that says "The program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing from your computer..." I also get the same dialog box when I run "C:\Program Files\Python 3.5\python.exe" -- nosy: +benhoyt ___ Python tracker <http://bugs.python.org/issue23619> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Hi Victor. I'm attaching a patch to the What's New doc. I know it's a minor thing, but I really read the What's New in Python x.y page whenever it comes out, so making this as specific as possible is good. The changes are: 1. Be more specific and flesh it out slightly: for example, instead of just saying "PEP 471 adds a new directory iteration function", say very briefly why it was added (to speed up os.walk) and what it achieves. 2. Move "PEP and implementation by..." under correct correct PEP -- it's currently under PEP 475. 3. Be more specific (as per point 1) under os.scandir(), and link to issue 4. Add a note about the os.walk() performance improvement under Optimizations. I realize the changes to os.walk() haven't been committed yet, but hopefully they will be quite soon. Any feedback? Are you happy to commit these changes? I intend to do some review of the scandir/DirEntry docs as well. I'll send those in the next few days. -- Added file: http://bugs.python.org/file38421/whatsnew.patch ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue22524] PEP 471 implementation: os.scandir() directory scanning function
Ben Hoyt added the comment: Thanks. Will do! -- ___ Python tracker <http://bugs.python.org/issue22524> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: @Scott Dial: just a response about this benchmark: note that this benchmark isn't really valid, as it says "Using slower ctypes version of scandir", which is the slow all-Python version. You want it to be saying "Using Python 3.5's builtin os.scandir()". -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: To Victor and Serhiy: 1) Serhiy's point about not needing to build the symlinks set when followlinks is True is a good one, because it'll never get used. Just a "if not followlinks: ..." around that try/except would do it. Though this is a small optimization, as I expect 95% of people use os.walk() with followlinks=False, which is the default. 2) Much as I don't want to admit it :-), I think Serhiy has a point regarding the change in behaviour. Can we accept this tiny change? This happens because the previous implementation of os.walk() calls path.islink(), which does a real syscall, after the yield returns. So if the caller modifies "dirnames" and adds a symlink to a directory it won't be in the symlinks set. Or if (as Serhiy's example shows) they change a symlink-to-a-directory to a directory the new implementation doesn't do another syscall to check, so differs from the old one -- in Serhiy's example, the old os.walk() will recurse into the changed symlink-to-a-directory-that's-now-a-directory, whereas the new os.walk() won't recurse. Arguably the new behaviour is just as good here, but it is different. And Serhiy's function unsymlink() is a reasonable example of how this might play out. The os.walk() docs explicitly allow modifying "dirnames" -- not just removing and reordering, but also adding entries, which could surface the difference in behaviour: "the caller can modify the dirnames list in-place ... even to inform walk() about directories the caller creates or renames before it resumes walk() again". See here: https://docs.python.org/3.4/library/os.html#os.walk So I do think it's a real issue. Will this really be an issue in practice? I don't know; I'm tempted to think not. Obviously if we have to call stat/islink again, it defeats half of the purpose of scandir. But I don't see a way around it if we want to keep the exact old os.walk() behaviour. We could fix the behaviour if a directory/symlink was added to "dirnames" fairly easily by testing whether it changed, but I don't see a way to fix the unsymlinks() example without a syscall. Thoughts? -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Note specifically in the unsymlink() example Serhiy gave, you probably don't *want* os.walk() to recurse into a symlink-to-a-directory-that's-changed-to-a-directory, and you probably haven't thought about it doing so, so maybe the new behaviour is fine? -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Thanks, Victor. I haven't quite grokked all the changes here -- it's gotten somewhat more complicated with the scandir_it and manual next() call -- but I ran some benchmarks (via a hacked version of my scandir project's benchmark.py). The results were surprising, and in a good way: Dev version in hg (no extra islink syscall): Windows: 13.1x as fast (68.8x as fast in funky caching mode) Linux: 7.8x as fast With Victor's fast_bottom_up patch (100% correct behaviour): Windows: 9.4x as fast (50.2x as fast in funky caching mode) Linux: 6.5x as fast So os.walk() will still be 10x as fast on Windows if you apply this patch, and 6x as fast on my Linux VM. I haven't dug too deeply to know quite why the numbers are this good, especially on Linux, but that's what I'm seeing, which is great! -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: > I don't understand your benchmark. Do you mean that os.walk() is slower > with fast_bottom-up.patch because islink() is called or because I replaced > "for entry in scandir(top):" with "entry = next(scandir_it)"? No, sorry, I was making two separate comments: 1) the code's gotten quite a bit more complex (and if it needs to be that way for correctness, I'm okay with that), and 2) I'm surprised at how fast it still is. > Are you testing the top-bottom or bottom-up? My benchmark.py calls os.walk() with topdown=True, which is the default. I was testing the Python 3.4 version of os.walk() via listdir against your fast_bottom-up.patch. I'm keen to look into this a bit further, but it won't be today. -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Victor, great work on pushing this out, especially with the "modifying the directories" fix. (And thanks Serhiy for pushing on correctness here.) Couple of comments/questions about your new os.walk() implementation. 1) The new implementation is more complex. Of course, most of this is necessary due to the topdown directory issue. However, one thing I'm not sure about is why you create scandir_it manually and use a while True loop, adding complexity and making it require two versions of the error handling. Why not a simple "for entry in scandir(top): ..." with a try/except OSError around the whole loop? I could well be missing something here though. 2) In this commit http://bugs.python.org/review/23605/diff/14181/Lib/os.py -- which is not the final one, I don't quite understand the catch_oserror thing. Presumably this turned out to be unnecessary, as it's not in the final version? 3) Really minor thing: in one of the comments, you misspell "symbolik". Should be "symbolic". -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Thanks for the explanation (and the comment fix). > What's your point about complexity? Would you like to drop os.scandir > changes in os.walk(), or do you have a simpler version to propose? No, not at all! I was just noting it and trying to brainstorm any ways to simplify it (while keeping the current behavior). But I'm not sure there is a good simplification, so that's fine. -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue23605] Use the new os.scandir() function in os.walk()
Ben Hoyt added the comment: Yep, I'm good -- go ahead and close! -- ___ Python tracker <http://bugs.python.org/issue23605> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue24013] Improve os.scandir() and DirEntry documentation
New submission from Ben Hoyt: Victor Stinner's documentation for os.scandir and DirEntry is a great start (https://docs.python.org/dev/library/os.html#os.scandir), however there are a few mistakes in it, and a few ways I think it could be improved. Attaching a patch with the following overall changes: 1. Tweak the "see also" note on os.listdir() to mention *why* you'd want to use scandir -- namely that it includes file attribute info, and performance. 2. Move that str/bytes note in the scandir() docs down a bit, as I think that's really a detail and the other stuff is more important. 3. More details on why you'd want to use scandir -- to "increase the performance of code that also needs file type or file" -- and be more specific about what system calls are and are not required. 4. Replace "POSIX" with "Unix", per the rest of the os module docs when talking about generic Unix-like operating systems. It seems "POSIX" is used specifically when talking about the POSIX standard. 5. Remove half-true statement in DirEntry docs, "if a file is deleted between calling scandir and stat, a FileNotFoundError can be raised" -- but the method docs state that they specifically don't raise FileNotFoundError. 6. Removed "unfortunately, the behaviour of errors depends on the platform". This is always the case and doesn't add anything. 7. Tried to simplify and clarify the is_dir() and is_file() docs. Not sure I've succeeded here. This is hard! 8. Added "Availability: Unix, Windows." to scandir docs like listdir and most other os functions have. 9. Remove the see also about how "the listdir function returns the names of the directory entries" as that's already stated/implied above. -- assignee: docs@python components: Documentation files: scandir_doc_tweaks.patch keywords: patch messages: 241560 nosy: benhoyt, docs@python, haypo, serhiy.storchaka priority: normal severity: normal status: open title: Improve os.scandir() and DirEntry documentation versions: Python 3.5 Added file: http://bugs.python.org/file39131/scandir_doc_tweaks.patch ___ Python tracker <http://bugs.python.org/issue24013> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com