[issue38656] mimetypes for python 3.7.5 fails to detect matroska video
David K. Hess added the comment:

Hi, I'm the author of the commit that's been fingered. Some comments about the behavior being reported:

First, as pointed out by @xtreak, the mimetypes module does use the mimetypes files present on the platform to add to its built-in list of mimetypes. In this case, "video/x-matroska" and ".mkv" are not found in the mimetypes module and were never there; they are coming from the host OS.

Also, for better or worse, the mimetypes module has an internal "init" method that does more than just instantiate a MimeTypes instance for default use: https://github.com/python/cpython/blob/5c0c325453a175350e3c18ebb10cc10c37f9595c/Lib/mimetypes.py#L345 It also loads these system files (and Windows Registry entries on Win32) into a fresh MimeTypes instance.

So, addressing what @The Compiler is seeing, properly resetting the mimetypes module really involves calling mimetypes.init(). By historical design, instantiating a MimeTypes class instance directly will not use host OS system mime type files.

As to why this commit is causing a change in the observed behavior: the problem corrected in this commit was that the mimetypes module had non-deterministic behavior related to initialization. In the original init code, the module-level mime types tables were changed (really corrupted) after first load, and you could never reinitialize the module back to a known good state (i.e. to the original module defaults, without information from the host OS).

So, realistically, the behavior currently observed is the correct behavior given the presence and historical nature of the init function. The fact that a fresh MimeTypes instance, without having been init()'d and with no filenames provided, returned an OS entry prior to this commit was really part of the initialization bug which was fixed.

Regarding the ranger bug, the main thing is that you should not use a MimeTypes instance directly unless you run it through the same initializations that the init code does.

Anyway, that's my perspective having waded through all of that during the original BPO. I don't claim it's the correct one but that's where we are at.

-- ___ Python tracker <https://bugs.python.org/issue38656> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
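For readers following along, the difference between the module-level init() path and a bare MimeTypes instance might look roughly like the sketch below. The helper name make_os_aware_mimetypes is hypothetical and only approximates the loading steps the comment attributes to init(); it is not the module's own code, and what gets printed for ".mkv" depends on the Python version and on what the host OS provides.

```python
import mimetypes
import os
import sys


def make_os_aware_mimetypes():
    """Hypothetical helper: a MimeTypes instance fed with host OS data."""
    db = mimetypes.MimeTypes()  # starts from the module's built-in tables
    if sys.platform == "win32":
        # On Windows the OS-level MIME data lives in the registry.
        db.read_windows_registry()
    for path in mimetypes.knownfiles:  # well-known mime.types locations
        if os.path.isfile(path):
            db.read(path)
    return db


# Module-level API: init() rebuilds the shared database, OS files included.
mimetypes.init()
print(mimetypes.guess_type("movie.mkv"))

# Bare instance: on versions with the fix, no OS files are consulted,
# so a type that only the OS knows about may come back as (None, None).
bare = mimetypes.MimeTypes()
print(bare.guess_type("movie.mkv"))

# Instance that went through roughly the same loading steps as init().
os_aware = make_os_aware_mimetypes()
print(os_aware.guess_type("movie.mkv"))
```

Code that needs OS-provided types and has no reason to hold its own instance can simply call the module-level functions such as mimetypes.guess_type() instead.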
[issue38656] mimetypes for python 3.7.5 fails to detect matroska video
David K. Hess added the comment: The documentation you quoted does read to me as compatible? The database it is referring to is the one hardcoded in the module, not the one assembled from that and the host OS. But maybe this is just the vagaries of language and perspective at play. Anyway, I do agree it is an unexpected behavior change from the perspective of a user of the MimeTypes class directly. To get the best context for this change, it's useful to run through the long history of the issue that drove it: https://bugs.python.org/issue4963 Note that that discussion never touched on the use case of instantiating a MimeTypes class directly, and there are apparently no test cases covering this particular scenario either. With no awareness of this perspective/use case, it didn't get directly addressed. Perhaps all MimeTypes instances should auto-load system files unless a new __init__ param selects this new "clean" behavior?
[issue40139] mimetypes module racy
David K. Hess added the comment: I’m not sure I can shed any light on this particular bug, but I would say that based on my dealings with this module, it is definitely not thread-safe. That means that if you are going to have multiple threads accessing it simultaneously, you really should have a mutex around that access, ensuring only one thread is running through the code in this module at a time. Now in reality, asyncio and other cooperatively scheduled multitasking packages like gevent are not going to unpredictably yield control to another task the way true threads will. So, in this particular case, since the init code doesn’t use async or await, I don’t think there is a chance of an initialization race bug there. As to the bug witnessed, the only thing I can suggest is to add a considerable amount of debugging that logs the argument to guess_type and prints out the mimetypes module’s internal state if and when this happens again. My best guess, based on the amount of work that method does to inspect the passed-in URL, is that it has something to do with the URL itself.
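The mutex suggestion above could be sketched as follows. The lock and the wrapper name guess_type_locked are hypothetical and not part of the mimetypes module; the sketch only helps if every thread in the program goes through the same wrapper.

```python
import mimetypes
import threading

# Hypothetical program-wide lock guarding all access to the mimetypes module.
_mimetypes_lock = threading.Lock()


def guess_type_locked(url, strict=True):
    """Call mimetypes.guess_type while holding the shared lock."""
    with _mimetypes_lock:
        return mimetypes.guess_type(url, strict=strict)


results = []


def worker():
    # list.append is safe to call from several threads in CPython.
    results.append(guess_type_locked("https://example.com/index.html"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the first call to guess_type triggers the module's lazy initialization, taking the lock around every call also serializes that initialization.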
[issue38656] mimetypes for python 3.7.5 fails to detect matroska video
David K. Hess added the comment: @michael-lazar a documentation change seems the path of least resistance given the complicated history of this module. +1 from me.
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
David K. Hess added the comment: Thank you Steve! Nice to see this one make it across the finish line.
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
Changes by David K. Hess: pull_requests: +3096
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
David K. Hess added the comment: FYI, PR opened: https://github.com/python/cpython/pull/3062
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
David K. Hess added the comment:

Ok, I followed @r.david.murray's advice and decided to take a shot at this.

First, I noticed that I couldn't reproduce the non-deterministic behavior that I reported above on the latest code (i.e. pre-3.7). After doing some research it appears this was the sequence of events:

1) Pre-3.3, hashing was stable and this wasn't a problem.
2) Hash randomization became the default in version 3.3 and this non-determinism showed up.
3) A new dict implementation was introduced in 3.6, key orders became stable between runs, and this non-determinism was gone. However, as the notes on the new dict implementation indicate, this ordering should not be relied upon.

I also looked at some other issues:

* 6626 - The patch here basically rewrote the module. I agreed with the last comment on that issue that it probably doesn't need that.
* 24527 - Related to the .init() problems discussed here in r.david.murray's excellent analysis of the init behavior.
* 1043134 - Where the preferred extension issue was addressed via a proposed new map.

My approach with this patch is to address the init problem, the non-determinism and the preferred extension issue.

For the init, I made two changes:

1) I added new references to the initial values of the maps so they could be retained between init() calls. I also modified MimeTypes.__init__ to refer to these.
2) I modified the init() function to check the files argument as r.david.murray suggested. If it is supplied, then the existing database is used and the files are added to it. If it is not supplied, then the module reinitializes from scratch.

I'll update the documentation to reflect this if the commit passes muster.

For the non-determinism and preferred extension, I changed the two extension type maps to be OrderedDicts. I then sorted the entries passed to the OrderedDict constructor by mime type and placed the preferred extension as the first extension to be processed. This guarantees that it will be the extension returned by guess_extension. The OrderedDict also guarantees that guess_all_extensions will always build and return the same value.

The commit can be reviewed here: https://github.com/davidkhess/cpython/commit/ecabb1cb57e7e066a693653f485f2f687dcc7f6b

I'll open a PR if and when this approach gets enough positive feedback.
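The ordering idea described in this comment can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the code from the commit: the map contents and the function names all_extensions_for and preferred_extension_for are hypothetical stand-ins for the module's tables and its guess_all_extensions / guess_extension functions.

```python
from collections import OrderedDict

# Hypothetical extension -> type map. Entries are grouped by MIME type and
# the preferred extension of each type is listed first.
types_map = OrderedDict([
    (".jpg", "image/jpeg"),   # preferred extension for image/jpeg
    (".jpe", "image/jpeg"),
    (".jpeg", "image/jpeg"),
    (".txt", "text/plain"),   # preferred extension for text/plain
    (".bat", "text/plain"),
])


def all_extensions_for(mime_type):
    """Return every extension for mime_type, in insertion order."""
    return [ext for ext, typ in types_map.items() if typ == mime_type]


def preferred_extension_for(mime_type):
    """Return the first (preferred) extension for mime_type, or None."""
    extensions = all_extensions_for(mime_type)
    return extensions[0] if extensions else None


print(preferred_extension_for("image/jpeg"))
print(all_extensions_for("text/plain"))
```

Since iteration follows insertion order, the result no longer depends on hash randomization, and whichever extension is inserted first for a type is the one a "first match" lookup returns.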
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
David K. Hess added the comment: Pushed more commits so here's a branch compare: https://github.com/python/cpython/compare/master...davidkhess:fix-issue-4963
[issue4963] mimetypes.guess_extension result changes after mimetypes.init()
David K. Hess added the comment: Are there any committers watching this issue that are able to review the PR? https://github.com/python/cpython/pull/3062 It's close to 6 months old now with no action on it. I'm willing to help but doing so and then having the PR gather dust is pretty discouraging. Thanks in advance!