[issue45150] Add a file_digest() function in hashlib
Aur Saraf added the comment: Tarek, Are you still working on this? Would you like me to take over? Aur -- nosy: +Aur.Saraf ___ Python tracker <https://bugs.python.org/issue45150> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Aur Saraf added the comment: OK, I'll give it a go.
Aur Saraf added the comment: The PR contains a draft implementation; I would appreciate some review before I implement the same interface on all the built-in hashes as well as the OpenSSL hashes.
Aur Saraf added the comment: The rationale behind `from_raw_file()` and the special treatment of non-buffered I/O is that there is no `read_buffer()` API or other clean way to say "I want to read just what's currently in the buffer, so that from now on I can read directly from the file descriptor without harm."

If you want to read from a buffered file object, just call `from_file()`. If you want to be sure you get the full performance benefit, call `from_raw_file()`. If you pass an eligible file object to `from_file()`, you get the benefit anyway, because why not.
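(`from_file()` / `from_raw_file()` are the draft names from the PR and may change. For readers wondering what "eligible" means here, this is the distinction the fast path hinges on: `open(path, 'rb')` returns a `BufferedReader`, while `open(path, 'rb', buffering=0)` returns a raw `FileIO` object, an `io.RawIOBase`, whose file descriptor can be read directly. A quick check:)

```python
import io
import os
import tempfile

# Create a small file to open two different ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"some data to hash")
    path = tmp.name

buffered = open(path, "rb")            # io.BufferedReader
raw = open(path, "rb", buffering=0)    # io.FileIO, an io.RawIOBase

# Only the unbuffered object is a RawIOBase, i.e. eligible for the
# direct-from-file-descriptor fast path discussed above.
is_buffered_raw = isinstance(buffered, io.RawIOBase)   # False
is_raw_raw = isinstance(raw, io.RawIOBase)             # True

buffered.close()
raw.close()
os.unlink(path)
```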
Aur Saraf added the comment: I forgot an important warning: this is the first time I have written C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of newbie errors.

I'm especially not confident that it's OK to do no special handling of signals. Can read() return 0 if it was interrupted by a signal? That would stop the hash calculation midway while behaving as if it had succeeded, which sounds suspiciously like something we don't want. Also, I should probably support signals, because a long operation like this is something the user may well want to interrupt. May I have some guidance, please? Would it be enough to copy the code from _Py_Read() in fileutils.c and add an outer loop, so that we can do many reads with the GIL released and still call PyErr_CheckSignals() when needed with the GIL held?
Aur Saraf added the comment: I added an attempt to handle signals. I don't think it's working: when I press Ctrl+C while hashing a long file, the KeyboardInterrupt is only raised after waiting roughly the amount of time it usually takes the C code to return. Then again, maybe that's not a good test?
Aur Saraf added the comment: I don't think an HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point; the cost of supporting it is very low.

I/O in C is a world of pain in general. In the specific case of `io.RawIOBase` objects (non-buffered binary files), to my understanding it's not _that_ terrible (am I right? Does my I/O code work as-is?). As I understand it, providing a fast path *just for this case*, one that calculates the hash without taking the GIL for every chunk, would be very nice to have for many use cases.

Now, we could just be happy with `file_digest()` having an `if` on `isinstance(f, io.RawIOBase)` that silently chooses the fast code path. But since non-buffered binary files are so hard to tell apart from other kinds of file-like objects, as a user of this code I would like a way to say "I want the fast path; please raise if I accidentally passed the wrong thing and got the regular path". We could have `file_digest('sha256', open(path, 'rb', buffering=0), ensure_fast_io=True)`, but I think for this use case `raw_file_digest('sha256', open(path, 'rb', buffering=0))` is cleaner. In all other cases you just call `file_digest()`, probably get the Python I/O rather than the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing.

For the same reason, I think the fast path should only support hash names, not constructors/functions/etc., since supporting those would complicate it: new-object-can-be-accessed-without-the-GIL wouldn't necessarily apply. Does this make sense?