[issue24301] gzip module failing to decompress valid compressed file
Ruben Vorderman added the comment: ping -- ___ Python tracker <https://bugs.python.org/issue24301> ___
[issue46267] gzip.compress incorrectly ignores level parameter
New submission from Ruben Vorderman :

def compress(data, compresslevel=_COMPRESS_LEVEL_BEST, *, mtime=None):
    """Compress data in one shot and return the compressed string.

    compresslevel sets the compression level in range of 0-9.
    mtime can be used to set the modification time.
    The modification time is set to the current time by default.
    """
    if mtime == 0:
        # Use zlib as it creates the header with 0 mtime by default.
        # This is faster and with less overhead.
        return zlib.compress(data, level=compresslevel, wbits=31)
    header = _create_simple_gzip_header(compresslevel, mtime)
    trailer = struct.pack("<LL", zlib.crc32(data), (len(data) & 0xffffffff))
    # Wbits=-15 creates a raw deflate block (no gzip header or trailer).
    return header + zlib.compress(data, wbits=-15) + trailer

Note that the final zlib.compress call does not pass compresslevel, so the level parameter is silently ignored whenever mtime is not 0.

--
___ Python tracker <https://bugs.python.org/issue46267> ___
[issue46267] gzip.compress incorrectly ignores level parameter
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +28622 stage: -> patch review pull_request: https://github.com/python/cpython/pull/30416 ___ Python tracker <https://bugs.python.org/issue46267> ___
[issue46267] gzip.compress incorrectly ignores level parameter
Ruben Vorderman added the comment: ping -- ___ Python tracker <https://bugs.python.org/issue46267> ___
[issue43612] zlib.compress should have a wbits argument
New submission from Ruben Vorderman :

zlib.compress can currently only be used to output zlib blocks. Arguably `zlib.compress(my_data, level, wbits=-15)` is even more useful, as it gives you a raw deflate block. That is quite interesting if you are writing your own file format and want to use compression, but would like to use a different hash.

Also, gzip.compress(data, level, mtime) is extremely slow because it instantiates a GzipFile object which then streams a bytes object, explicitly not taking advantage of the fact that the bytes object is already entirely in memory (I will create another bug for this). zlib.compress(my_data, level, wbits=31) should be faster in all possible circumstances, but that option is not available now.

--
components: Library (Lib)
messages: 389437
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: zlib.compress should have a wbits argument
versions: Python 3.10
___ Python tracker <https://bugs.python.org/issue43612> ___
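To illustrate what the proposed argument would enable, here is a minimal sketch. The one-shot zlib.compress calls with wbits are the feature requested here (they do not exist in the version this report is filed against); the zlib.compressobj round trip is the currently available workaround.

import zlib

data = b"some bytes worth compressing" * 100

# Current workaround: a compressobj is needed to choose the framing.
# wbits=-15 -> raw deflate, wbits=31 -> gzip framing, wbits=15 -> zlib framing.
raw_compressor = zlib.compressobj(1, zlib.DEFLATED, -15)
raw_block = raw_compressor.compress(data) + raw_compressor.flush()

gzip_compressor = zlib.compressobj(1, zlib.DEFLATED, 31)
gzip_block = gzip_compressor.compress(data) + gzip_compressor.flush()

# Proposed one-shot equivalents (hypothetical until this issue is resolved):
# raw_block = zlib.compress(data, level=1, wbits=-15)
# gzip_block = zlib.compress(data, level=1, wbits=31)

# Both round-trip with the matching wbits value.
assert zlib.decompress(raw_block, wbits=-15) == data
assert zlib.decompress(gzip_block, wbits=31) == data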
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
New submission from Ruben Vorderman :

When working on python-isal, which aims to provide faster drop-in replacements for the zlib and gzip modules, I found that gzip.compress and gzip.decompress are suboptimally implemented, which hurts performance.

gzip.compress and gzip.decompress both do the following things:
- Instantiate a BytesIO object to mimic a file
- Instantiate a GzipFile object to compress or read the file

That means there is way more Python code involved than strictly necessary. Also, the 'data' is already fully in memory, but the data is streamed anyway. That is quite a waste.

I propose the following:

- The documentation should make it clear that zlib.decompress(..., wbits=31) and zlib.compress(..., wbits=31) (after bpo-43612 has been addressed) are both quicker but come with caveats: zlib.compress cannot set mtime, and zlib.decompress does not take multi-member gzip into account.

- For gzip.compress: the GzipFile._write_gzip_header function should be moved to a module-wide _gzip_header function that returns a bytes object. GzipFile._write_gzip_header can call this function. gzip.compress can also call this function to create a header. gzip.compress then calls zlib.compress(data, wbits=-15) (after bpo-43612 has been fixed) to create a raw deflate block. A gzip trailer can be easily created by calling zlib.crc32(data) and len(data) & 0xffffffff and packing those into a struct (a rough sketch of this construction follows below this report). See here for an example implementation: https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L242

- For gzip.decompress it becomes more involved. A read_gzip_header function can be created, but the current implementation returns EOFErrors if the header is incomplete due to a truncated file, instead of BadGzipFile errors. This makes it harder to implement something that is not a major break from the current gzip.decompress. Apart from the header, the implementation is straightforward: do a while-true loop in which all operations are performed. Validate the header and report the end of the header. Create a zlib.decompressobj(wbits=-15). Decompress all the data from the end of the header. Flush. Extract the crc and length from the first 8 bytes of the unused data. data = decompobj.unused_data[8:]. If not data: break. For a reference implementation check here: https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L300. Note that the decompress function is quite straightforward; checking the header while maintaining backwards compatibility with gzip.decompress, however, is not so simple.

And that brings me to another point: should non-descriptive EOFErrors be raised when reading the gzip header, or should informative BadGzipFile errors be thrown when the header is parsed? I tend towards the latter. For example BadGzipFile("Truncated header") instead of EOFError, or at least EOFError("Truncated gzip header"). I am aware that this confounds this issue with another issue, but these things are coupled in the implementation so both need to be solved at the same time.

Given the headaches that gzip.decompress gives, it might be easier to solve gzip.compress first in a first PR and do gzip.decompress later.

--
messages: 389438
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip.compress and gzip.decompress are sub-optimally implemented.
___ Python tracker <https://bugs.python.org/issue43613> ___
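To make the gzip.compress side of the proposal concrete, here is a minimal sketch of the header + raw deflate + trailer construction described above. It is an illustration only: gzip_compress_oneshot is a made-up name, the header is written out by hand because the proposed module-level header helper does not exist yet, and a zlib.compressobj with wbits=-15 stands in for the zlib.compress wbits argument requested in bpo-43612.

import gzip
import struct
import time
import zlib

def gzip_compress_oneshot(data: bytes, compresslevel: int = 9, mtime=None) -> bytes:
    """Sketch of a one-shot gzip.compress: header + raw deflate block + trailer."""
    if mtime is None:
        mtime = time.time()
    # Minimal 10-byte gzip header: magic, CM=8 (deflate), no flags, MTIME, XFL, OS=255 (unknown).
    header = struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, 0, int(mtime), 0, 255)
    compressor = zlib.compressobj(compresslevel, zlib.DEFLATED, -15)
    deflated = compressor.compress(data) + compressor.flush()
    # Trailer: CRC32 and uncompressed length modulo 2**32, little-endian.
    trailer = struct.pack("<LL", zlib.crc32(data), len(data) & 0xffffffff)
    return header + deflated + trailer

# The result round-trips through the stdlib decompressor.
assert gzip.decompress(gzip_compress_oneshot(b"hello world" * 10)) == b"hello world" * 10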
[issue43612] zlib.compress should have a wbits argument
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +23768 stage: -> patch review pull_request: https://github.com/python/cpython/pull/25011 ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43612] zlib.compress should have a wbits argument
Change by Ruben Vorderman : -- type: -> enhancement ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
Change by Ruben Vorderman : -- type: -> performance ___ Python tracker <https://bugs.python.org/issue43613> ___
[issue43621] gzip._GzipReader should only throw BadGzipFile errors
New submission from Ruben Vorderman :

BadGzipFile is properly documented: https://docs.python.org/3/library/gzip.html#gzip.BadGzipFile . However, gzip._GzipReader throws EOFError when a stream is truncated, which means that upstream code has to catch both BadGzipFile and EOFError in its exception handling when opening a gzip file for reading. When a gzip file is truncated it is also a "bad gzip file" in my opinion, so there is no reason to have an extra class of errors.

It also throws zlib.error when zlib crashes for some reason. That means there is some corruption in the raw deflate block, so it is a "bad gzip file" as well and the error message should reflect that.

This won't break people's code. If they are already catching EOFError, zlib.error and BadGzipFile, it changes nothing. If they only catch BadGzipFile, they will have fewer annoying errors that pop through.

I can make the PR, but of course not without any feedback. I am curious what other people think.

--
components: Library (Lib)
messages: 389494
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip._GzipReader should only throw BadGzipFile errors
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9
___ Python tracker <https://bugs.python.org/issue43621> ___
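A small sketch of the situation this report describes: code that wants to treat any corrupt input as a bad gzip file currently has to catch three exception types. The filename is a placeholder.

import gzip
import zlib

def read_gzip_or_none(path):
    # Today: truncation surfaces as EOFError and deflate corruption as
    # zlib.error, so all three exception types must be caught.
    try:
        with gzip.open(path, "rb") as f:
            return f.read()
    except (gzip.BadGzipFile, EOFError, zlib.error) as error:
        print(f"{path} is not a valid gzip file: {error}")
        return None

# With the proposed change, `except gzip.BadGzipFile` alone would be enough.
read_gzip_or_none("maybe_truncated.gz")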
[issue43612] zlib.compress should have a wbits argument
Change by Ruben Vorderman : -- components: +Extension Modules -Library (Lib) ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43621] gzip._GzipReader should only throw BadGzipFile errors
Change by Ruben Vorderman : -- type: -> behavior ___ Python tracker <https://bugs.python.org/issue43621> ___
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
Ruben Vorderman added the comment: I created bpo-43621 for the error issue. There should only be BadGzipFile. Once that is fixed, having only one error type will make it easier to implement some functions that are shared across the gzip.py codebase. -- ___ Python tracker <https://bugs.python.org/issue43613> ___
[issue43612] zlib.compress should have a wbits argument
Change by Ruben Vorderman : -- versions: +Python 3.11 ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43612] zlib.compress should have a wbits argument
Ruben Vorderman added the comment: A patch was created, but has not been reviewed yet. -- ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +26386 stage: -> patch review pull_request: https://github.com/python/cpython/pull/27941 ___ Python tracker <https://bugs.python.org/issue43613> ___
[issue43612] zlib.compress should have a wbits argument
Change by Ruben Vorderman : -- pull_requests: +26387 pull_request: https://github.com/python/cpython/pull/27941 ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43612] zlib.compress should have a wbits argument
Ruben Vorderman added the comment: Thanks for the review, Lukasz! It was fun to create the PR and optimize the performance for gzip.py as well. -- ___ Python tracker <https://bugs.python.org/issue43612> ___
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
Change by Ruben Vorderman : -- resolution: -> fixed stage: patch review -> resolved status: open -> closed ___ Python tracker <https://bugs.python.org/issue43613> ___
[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.
Ruben Vorderman added the comment: Issue was solved by moving code from _GzipReader to separate functions and maintaining the same error structure. This solved the problem with maximum code reuse and full backwards compatibility. -- ___ Python tracker <https://bugs.python.org/issue43613> ___
[issue45387] GzipFile.write should be buffered
New submission from Ruben Vorderman :

Please consider the following code snippet:

import gzip
import sys

with gzip.open(sys.argv[1], "rt") as in_file_h:
    with gzip.open(sys.argv[2], "wt", compresslevel=1) as out_file_h:
        for line in in_file_h:
            # Do processing on line here
            modified_line = line
            # End processing
            out_file_h.write(modified_line)

This is very slow, because write is called for every line. This is the current implementation of write: https://github.com/python/cpython/blob/c379bc5ec9012cf66424ef3d80612cf13ec51006/Lib/gzip.py#L272

It:
- Checks if the file is not closed
- Checks if the correct mode is set
- Checks if the file is not closed (again, but in a different way)
- Checks if the data is bytes, bytearray or something that supports the buffer protocol
- Gets the length
- Compresses the data
- Updates the size and offset
- Updates the checksum

Doing this for every line written is very costly and creates a lot of overhead in Python calls. We spend a lot of time in Python and a lot less in the fast C zlib code that does the actual compression.

This problem is already solved on the read side. A _GzipReader object is used for reading. This is put in an io.BufferedReader, which is used as the underlying buffer for GzipFile.read. This way, lines are read quite fast from a GzipFile without the checksum etc. being updated on every line read.

A similar solution should be written for write. I volunteer (I have done some other work on gzip.py already), although I cannot give an ETA at this time.

--
messages: 403289
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: GzipFile.write should be buffered
___ Python tracker <https://bugs.python.org/issue45387> ___
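A rough sketch of the kind of write buffering described above, applied from user code rather than inside gzip.py: wrapping the binary GzipFile in an io.BufferedWriter batches many small writes into large ones before GzipFile.write (and its checksum and offset bookkeeping) is invoked. This is a workaround sketch, not the proposed patch; the buffer size is an arbitrary choice and the file names come from sys.argv as in the snippet above.

import gzip
import io
import sys

with gzip.open(sys.argv[1], "rt") as in_file_h:
    raw_gzip = gzip.open(sys.argv[2], "wb", compresslevel=1)
    # Buffer in user code; the proposal is to do something equivalent inside GzipFile.
    buffered = io.BufferedWriter(raw_gzip, buffer_size=128 * 1024)
    with io.TextIOWrapper(buffered) as out_file_h:
        for line in in_file_h:
            modified_line = line  # processing would happen here
            out_file_h.write(modified_line)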
[issue45387] GzipFile.write should be buffered
Change by Ruben Vorderman : -- components: +Library (Lib) type: -> performance versions: +Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue45387> ___
[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
New submission from Ruben Vorderman :

A 'struct.error: unpack requires a buffer of 8 bytes' is thrown when a gzip trailer is truncated, instead of the EOFError raised in the 3.10 and prior releases.

--
components: Library (Lib)
messages: 404165
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
type: behavior
versions: Python 3.11
___ Python tracker <https://bugs.python.org/issue45507> ___
[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +27296 stage: -> patch review pull_request: https://github.com/python/cpython/pull/29023 ___ Python tracker <https://bugs.python.org/issue45507> ___
[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
Ruben Vorderman added the comment: It turns out there is a bug where FNAME and/or FCOMMENT flags are set in the header, but no error is thrown when NAME and COMMENT fields are missing. -- ___ Python tracker <https://bugs.python.org/issue45507> ___
[issue45509] Gzip header corruption not properly checked.
New submission from Ruben Vorderman :

The following headers are currently allowed while being wrong:
- Headers with the FCOMMENT flag set, but with incomplete or missing COMMENT bytes.
- Headers with the FNAME flag set, but with incomplete or missing NAME bytes.
- Headers with FHCRC set: the CRC is read, but not checked.

(A sketch of what such headers look like follows below.)

--
components: Library (Lib)
messages: 404174
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Gzip header corruption not properly checked.
type: behavior
versions: Python 3.11
___ Python tracker <https://bugs.python.org/issue45509> ___
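For readers unfamiliar with the layout, here is a small sketch of the malformed headers this report is about, built by hand from the RFC 1952 field order (ID1, ID2, CM, FLG, MTIME, XFL, OS, then optional fields). The byte strings are constructed purely for illustration; per this report, the truncated variants currently pass through the module's header parsing without an error.

import struct

FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16

def simple_header(flags: int) -> bytes:
    # The 10 fixed bytes: magic (0x1f 0x8b), CM=8 (deflate), FLG, MTIME, XFL, OS.
    return struct.pack("<BBBBLBB", 0x1F, 0x8B, 8, flags, 0, 0, 255)

# Correct: FNAME flag with a zero-terminated file name following the fixed header.
good = simple_header(FNAME) + b"archive.txt\x00"

# Wrong: FNAME flag set, but the name bytes are missing entirely.
missing_name = simple_header(FNAME)

# Wrong: FHCRC flag set, but the stored CRC16 is a bogus value that does not
# match the CRC32 of the preceding header bytes.
bad_header_crc = simple_header(FHCRC) + struct.pack("<H", 0xBEEF)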
[issue45509] Gzip header corruption not properly checked.
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +27300 stage: -> patch review pull_request: https://github.com/python/cpython/pull/29028 ___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
Change by Ruben Vorderman : -- pull_requests: +27301 pull_request: https://github.com/python/cpython/pull/29029 ___ Python tracker <https://bugs.python.org/issue45507> ___
[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility
Ruben Vorderman added the comment: bump. This is a regression introduced by https://github.com/python/cpython/pull/27941 -- ___ Python tracker <https://bugs.python.org/issue45507> ___
[issue45509] Gzip header corruption not properly checked.
Ruben Vorderman added the comment: Bump. This is a bug that allows corrupted gzip files to be processed without error. Therefore I bump this issue in the hopes someone will review the PR. -- ___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45509] Gzip header corruption not properly checked.
Ruben Vorderman added the comment: Ping -- ___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45509] Gzip header corruption not properly checked.
Ruben Vorderman added the comment:

1. Quite a lot. I tested it for the two most common use cases.

import timeit
import statistics

WITH_FNAME = """
from gzip import GzipFile, decompress
import io
fileobj = io.BytesIO()
g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
g.write(b'')
g.close()
data = fileobj.getvalue()
"""

WITH_NO_FLAGS = """
from gzip import decompress
import zlib
data = zlib.compress(b'', wbits=31)
"""

def benchmark(name, setup, loops=1, runs=10):
    print(f"{name}")
    results = [timeit.timeit("decompress(data)", setup, number=loops)
               for _ in range(runs)]
    # Calculate microseconds
    results = [(result / loops) * 1_000_000 for result in results]
    print(f"average: {round(statistics.mean(results), 2)}, "
          f"range: {round(min(results), 2)}-{round(max(results), 2)} "
          f"stdev: {round(statistics.stdev(results), 2)}")

if __name__ == "__main__":
    benchmark("with_fname", WITH_FNAME)
    benchmark("with_noflags", WITH_NO_FLAGS)

BEFORE:
with_fname
average: 3.27, range: 3.21-3.36 stdev: 0.05
with_noflags
average: 3.24, range: 3.14-3.37 stdev: 0.07

AFTER:
with_fname
average: 4.98, range: 4.85-5.14 stdev: 0.1
with_noflags
average: 4.87, range: 4.69-5.05 stdev: 0.1

That is a dramatic increase in overhead. (Okay, the decompressed data is empty, but still.)

2. Haven't tested this yet, but the regression is quite unacceptable already.

3. Not that I know of. But if it is set, it is safe to assume they care. Nevertheless this is a bit of an edge case.

--
___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45509] Gzip header corruption not properly checked.
Ruben Vorderman added the comment:

I increased the performance of the patch and added the file used for benchmarking. I also test the FHCRC changes now.

The benchmark tests headers with different flags, concatenated to a DEFLATE block with no data and a gzip trailer. The data is fed to gzip.decompress. Please note that this is the *worst-case* performance overhead. When there is actual data to decompress the overhead will be smaller, and when GzipFile is used the overhead will be smaller as well.

BEFORE (current main branch):
$ ./python benchmark_gzip_read_header.py
with_fname
average: 3.01, range: 2.9-4.79 stdev: 0.19
with_noflags
average: 2.99, range: 2.93-3.04 stdev: 0.02
All flags (incl FHCRC)
average: 3.13, range: 3.05-3.16 stdev: 0.02

AFTER (bpo-45509 PR):
with_fname
average: 3.09, range: 3.01-4.63 stdev: 0.16
with_noflags
average: 3.1, range: 3.03-3.38 stdev: 0.04
All flags (incl FHCRC)
average: 4.09, range: 4.05-4.49 stdev: 0.04

An increase of 0.1 microsecond in the most common use cases, roughly 3%. But now the FNAME field is correctly checked for truncation. With FHCRC the overhead is increased by 33%, but this is worth it, because the header is now actually checked. As it should be.

--
Added file: https://bugs.python.org/file50458/benchmark_gzip_read_header.py
___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45875] gzip.decompress performance can be improved with memoryviews
New submission from Ruben Vorderman :

The current implementation uses a lot of bytestring slicing. While it is much better than the 3.10 and earlier implementations, it can still be further improved by using memoryviews instead. Possibly. I will check this out.

--
components: Library (Lib)
messages: 406816
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip.decompress performance can be improved with memoryviews
type: performance
versions: Python 3.11
___ Python tracker <https://bugs.python.org/issue45875> ___
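For context, a minimal sketch of the trade-off being considered: slicing a bytes object copies the sliced data, while slicing a memoryview creates a zero-copy view of the same buffer. Whether that wins in gzip.decompress depends on how often new views have to be created, which is what the follow-up comment reports on.

data = bytes(range(256)) * 4096  # roughly 1 MiB of sample data

# bytes slicing: each slice allocates and copies.
header, rest = data[:10], data[10:]

# memoryview slicing: no copy is made until bytes() is called explicitly.
view = memoryview(data)
header_view, rest_view = view[:10], view[10:]

assert bytes(header_view) == header
assert bytes(rest_view) == rest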
[issue45875] gzip.decompress performance can be improved with memoryviews
Ruben Vorderman added the comment: Tried and failed. It seems that the overhead of creating a new memoryview object outweighs the performance gained by it. -- ___ Python tracker <https://bugs.python.org/issue45875> ___
[issue45509] Gzip header corruption not properly checked.
Ruben Vorderman added the comment:

I have found that using the timeit module provides more precise measurements.

For a simple gzip header (as returned by gzip.compress or zlib.compress with wbits=31):

./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with FNAME (returned by gzip itself and by Python's GzipFile):

./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with all flags set:

./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'

Since performance is most critical for in-memory compression and decompression, I have now optimized for the no-flags case.

Before (current main): 50 loops, best of 5: 469 nsec per loop
After (PR): 100 loops, best of 5: 390 nsec per loop

For the most common case of only FNAME set:
before: 20 loops, best of 5: 1.48 usec per loop
after: 20 loops, best of 5: 1.45 usec per loop

For the case where FHCRC is set:
before: 20 loops, best of 5: 1.62 usec per loop
after: 10 loops, best of 5: 2.43 usec per loop

So this PR is now a clear win for decompressing anything that has been compressed with gzip.compress. It is neutral for normal file decompression. There is a performance cost associated with correctly checking the header, but that is expected. It is better than the alternative of not checking it.

--
Added file: https://bugs.python.org/file50459/benchmark_gzip_read_header.py
___ Python tracker <https://bugs.python.org/issue45509> ___
[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.
New submission from Ruben Vorderman :

Python now uses the excellent timsort for most (all?) of its sorting. But this is not the fastest sort available for one particular use case: if the number of possible values in the array is limited, it is possible to perform a counting sort: https://en.wikipedia.org/wiki/Counting_sort. In a counting sort each value maps to an integer corresponding to its relative value. The values are then counted using key = map_to_key(value); count_array[key] += 1. From this count_array a new array of sorted values can be constructed. (See the wikipedia article for a more detailed explanation.)

For the Python bytes and bytearray types this is extremely simple to implement: all 256 possible values can be directly used as keys.

Rough implementation:
- Use the buffer protocol to get a pointer to the bytes/bytearray internal c-string.
- Declare a count array: Py_ssize_t count_array[256] (use memset to set it to 0).
- Iterate over the c-string and add each value to the count array: count_array[buffer[i]] += 1.
- Allocate a new bytes (or bytearray) object; in the case of bytearray the sorting can be performed in place when bytearray.sort() is used.
- Iterate over the count_array. Get the number of values for each key and use memset to insert the sequences of keys into position.

(A pure-Python sketch of the same idea follows below this report.)

The most obvious way to implement this algorithm would be as bytearray.sort(), which sorts in place, and as bytes.sort(), which returns a new sorted bytes object. This is much, much faster than using bytes(sorted(bytes)).

I made a quick cython implementation for speed testing here: https://github.com/rhpvorderman/bytes_sort/blob/main/bytes_sort.pyx

Currently, to get a sorted bytestring one has to do bytes(sorted(my_bytes)). Test results:

# First make sure there is no regression when sorting an empty string
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b''))"
50 loops, best of 5: 560 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 565 nsec per loop

# Test result for very small strings
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b'abc'))"
50 loops, best of 5: 628 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 578 nsec per loop
# Even on a very small, already sorted string, a counting sort is faster.

# Test with a proper string
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b'Let\'s test a proper string now. One that has some value to be sorted.'))"
10 loops, best of 5: 2.32 usec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 674 nsec per loop

More than three times faster!

--
components: C API
messages: 407032
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Bytes and bytesarrays can be sorted with a much faster count sort.
type: performance
versions: Python 3.11
___ Python tracker <https://bugs.python.org/issue45902> ___
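As a language-level illustration of the counting sort described above (not the proposed C implementation, which would work on the raw buffer and use memset), here is a minimal pure-Python sketch; the function name is made up for this example.

def counting_sort_bytes(data: bytes) -> bytes:
    """Sort a bytes object by counting occurrences of each of the 256 possible values."""
    counts = [0] * 256
    for value in data:  # iterating over bytes yields ints 0-255
        counts[value] += 1
    # Emit each byte value `count` times, in ascending order.
    return b"".join(bytes([value]) * count for value, count in enumerate(counts))

assert counting_sort_bytes(b"gzip") == bytes(sorted(b"gzip"))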
[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.
Ruben Vorderman added the comment: Also I didn't know if this should be in Component C-API or Interpreter Core. But I guess this will be implemented as C-API calls PyBytes_Sort and PyByteArray_SortInplace so I figured C-API is the correct component here. -- ___ Python tracker <https://bugs.python.org/issue45902> ___
[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.
Ruben Vorderman added the comment:

I changed the cython script a bit to use a more naive implementation without memset. Now it is always significantly faster than bytes(sorted(my_bytes)).

$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 495 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 519 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 594 nsec per loop

--
___ Python tracker <https://bugs.python.org/issue45902> ___
[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.
Ruben Vorderman added the comment:

Sorry for the spam. I see I made a typo in the timeit script. Next time I will be more diligent when making these kinds of reports, triple-checking them beforehand and sending them once.

I used -c instead of -s, and now all the setup time is also included. This confounds the results. The proper test commands should be:

python -m timeit -s "from bytes_sort import bytes_sort, bytearray_sort_inplace" "bytes_sort(b'My string here')"
python -m timeit "bytes(sorted(b'My string here'))"

Using just sorted, to purely compare the sorting algorithms without the overhead of creating a new bytes object:

python -m timeit "sorted(b'My string here')"

Correct comparison results:

# String = b''
using bytes(sorted(...)): 188 nsec
using sorted(...): 108 nsec
using bytes_sort: 125 nsec  # Some overhead here, setting up the count array

# String = b'abc'
using bytes(sorted(...)): 252 nsec
using sorted(...): 145 nsec
using bytes_sort: 136 nsec  # Overhead compared to sorted already negated when sorting 3 items(!)

# String = b'Let\'s test a proper string now. One that has some value to be sorted.'
using bytes(sorted(...)): 1830 nsec (reported as 1.83 usec)
using sorted(...): 1550 nsec (reported as 1.55 usec)
using bytes_sort: 220 nsec

--
___ Python tracker <https://bugs.python.org/issue45902> ___
[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.
Ruben Vorderman added the comment:

I used it for the median calculation of FASTQ quality scores (https://en.wikipedia.org/wiki/FASTQ_format). But in the end I used the frequency table to calculate the median more quickly. So, as you say, the frequency table turned out to be more useful.

Having said that, the usefulness depends on how many times 8-bit data is passed into sorted (bytes, bytearrays, and most Python strings are 8-bit, I believe). I raised this issue not because I want a .sort() method on bytes or bytearrays, but mostly because I think Python's sorted function can be improved with regards to 8-bit data. I think it is an interesting thing to consider, depending on how often this occurs.

For example:

sorted(b'Let\'s test a proper string now. One that has some value to be sorted.')

and

list(bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.'))

return the same result (a list of integers), but the bytes_sort implementation is 2.5 times faster. So sorted is not optimally implemented here. Since sorted now basically throws everything into list.sort, an alternative code path using bytes.sort could be considered (if there are enough use cases for it).

--
___ Python tracker <https://bugs.python.org/issue45902> ___
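Since the comment mentions computing the median directly from the frequency table, here is a hedged sketch of that approach for 8-bit data such as FASTQ quality scores: it avoids sorting entirely by walking the 256-entry count table until the middle position is reached. This is an illustration only, not the code used in the project referred to above.

def median_from_counts(data: bytes) -> float:
    """Median of 8-bit values via a frequency table instead of sorting."""
    if not data:
        raise ValueError("median of empty data is undefined")
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    n = len(data)
    # Positions of the middle element(s) in the virtually sorted sequence.
    lower_index, upper_index = (n - 1) // 2, n // 2
    running = 0
    lower = upper = None
    for value, count in enumerate(counts):
        if count == 0:
            continue
        running += count
        if lower is None and running > lower_index:
            lower = value
        if running > upper_index:
            upper = value
            break
    return (lower + upper) / 2

assert median_from_counts(bytes([10, 20, 30])) == 20
assert median_from_counts(bytes([10, 20, 30, 40])) == 25.0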
[issue24301] gzip module failing to decompress valid compressed file
Ruben Vorderman added the comment:

From the spec: https://datatracker.ietf.org/doc/html/rfc1952

    2.2. File format

    A gzip file consists of a series of "members" (compressed data sets).
    The format of each member is specified in the following section. The
    members simply appear one after another in the file, with no additional
    information before, between, or after them.

Gzip files with garbage after them are corrupted or not spec compliant. Therefore the gzip module should raise an error in this case.

--
nosy: +rhpvorderman
___ Python tracker <https://bugs.python.org/issue24301> ___
[issue45875] gzip.decompress performance can be improved with memoryviews
Change by Ruben Vorderman : -- stage: -> resolved status: open -> closed ___ Python tracker <https://bugs.python.org/issue45875> ___
[issue24301] gzip module failing to decompress valid compressed file
Ruben Vorderman added the comment: Whoops. Sorry, I spoke before my turn. If gzip implements it, it seems only logical that python's *gzip* module should too. I believe it can be fixed quite easily. The code should raise a warning though. I will make a PR. -- ___ Python tracker <https://bugs.python.org/issue24301> ___
[issue24301] gzip module failing to decompress valid compressed file
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +28076 stage: -> patch review pull_request: https://github.com/python/cpython/pull/29847 ___ Python tracker <https://bugs.python.org/issue24301> ___
[issue43316] python -m gzip handles error incorrectly
New submission from Ruben Vorderman :

`python -m gzip -d myfile` will throw an error because myfile does not end in '.gz'. That is fair (even though a bit redundant: GzipFile contains a header check, so why bother checking the extension?). The problem is how this error is thrown:

1. The error is printed to stdout instead of stderr.
2. The tool exits with exit code 0.

This is not the behaviour that is expected when using python -m gzip in a script. The error is even codified in a test:
https://github.com/python/cpython/blob/1f433406bd46fbd00b88223ad64daea6bc9eaadc/Lib/test/test_gzip.py#L776

    def test_decompress_infile_outfile_error(self):
        rc, out, err = assert_python_ok('-m', 'gzip', '-d', 'thisisatest.out')
        self.assertIn(b"filename doesn't end in .gz:", out)
        self.assertEqual(rc, 0)
        self.assertEqual(err, b'')

This should be assert_python_failure, out and err should be swapped, and the exit code should be something other than 0. From the Zen of Python: errors should never pass silently.

I am willing to fix this in a PR, but first I would like some feedback on how to solve this exactly. I propose raising a ValueError("can not determine output filename: 'myfile' does not end in '.gz'").

--
components: Library (Lib)
messages: 387622
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: python -m gzip handles error incorrectly
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9
___ Python tracker <https://bugs.python.org/issue43316> ___
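A hedged sketch of what correct command-line error handling could look like inside gzip's main(); this illustrates the desired behaviour (message on stderr, non-zero exit status) rather than the exact fix that was eventually merged, and the filename is a placeholder.

import sys

def main():
    filename = "myfile"  # stands in for the parsed command-line argument
    if not filename.endswith(".gz"):
        # sys.exit with a string prints the message to stderr and exits with
        # status 1, so shell scripts can detect the failure.
        sys.exit(f"filename doesn't end in .gz: {filename!r}")

A test for this behaviour would then use assert_python_failure and inspect err instead of out.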
[issue43317] python -m gzip could use a larger buffer
New submission from Ruben Vorderman :

python -m gzip reads in chunks of 1024 bytes: https://github.com/python/cpython/blob/1f433406bd46fbd00b88223ad64daea6bc9eaadc/Lib/gzip.py#L599

This hurts performance somewhat. Using io.DEFAULT_BUFFER_SIZE will improve it. Also, 'io.DEFAULT_BUFFER_SIZE' is better than 'ARBITRARY_NUMBER_WITH_NO_COMMENT_EXPLAINING_WHY'.

With 1024-byte blocks:

Decompression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null
  Time (mean ± σ): 926.9 ms ± 7.7 ms  [User: 901.2 ms, System: 59.1 ms]
  Range (min … max): 913.3 ms … 939.4 ms  10 runs

Compression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null
  Time (mean ± σ): 2.514 s ± 0.030 s  [User: 2.469 s, System: 0.125 s]
  Range (min … max): 2.472 s … 2.563 s  10 runs

With io.DEFAULT_BUFFER_SIZE:

Decompression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null
  Time (mean ± σ): 839.9 ms ± 7.3 ms  [User: 816.0 ms, System: 57.3 ms]
  Range (min … max): 830.1 ms … 851.3 ms  10 runs

Compression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null
  Time (mean ± σ): 2.275 s ± 0.024 s  [User: 2.247 s, System: 0.096 s]
  Range (min … max): 2.254 s … 2.322 s  10 runs

Speedups:
- Decompression: 840 / 927 = 0.906, roughly a 9% reduction in runtime
- Compression: 2.275 / 2.514 = 0.905, roughly a 9% reduction in runtime

It is not stellar, but it is a quite nice improvement for such a tiny change.

--
components: Library (Lib)
messages: 387624
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: python -m gzip could use a larger buffer
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9
___ Python tracker <https://bugs.python.org/issue43317> ___
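For reference, the change under discussion amounts to something like the following copy loop; this is a simplified sketch of gzip's command-line read/write path, not the actual diff.

import io

def copy_stream(src, dst):
    # Read io.DEFAULT_BUFFER_SIZE chunks (8 KiB on most builds) instead of a
    # hard-coded 1024 bytes; fewer Python-level iterations for the same data.
    while True:
        chunk = src.read(io.DEFAULT_BUFFER_SIZE)
        if not chunk:
            break
        dst.write(chunk)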
[issue43316] python -m gzip handles error incorrectly
Change by Ruben Vorderman : -- type: -> behavior ___ Python tracker <https://bugs.python.org/issue43316> ___
[issue43317] python -m gzip could use a larger buffer
Change by Ruben Vorderman : -- type: -> performance ___ Python tracker <https://bugs.python.org/issue43317> ___
[issue43317] python -m gzip could use a larger buffer
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +23430 stage: -> patch review pull_request: https://github.com/python/cpython/pull/24645 ___ Python tracker <https://bugs.python.org/issue43317> ___
[issue43316] python -m gzip handles error incorrectly
Ruben Vorderman added the comment: That sounds perfect, I didn't think of that. I will make a PR. -- ___ Python tracker <https://bugs.python.org/issue43316> ___
[issue43316] python -m gzip handles error incorrectly
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +23432 stage: -> patch review pull_request: https://github.com/python/cpython/pull/24647 ___ Python tracker <https://bugs.python.org/issue43316> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
New submission from Ruben Vorderman :

The gzip file format is quite ubiquitous, and so is its first (?) free/libre implementation zlib, with the gzip command line tool. It uses the DEFLATE algorithm. Lately some faster algorithms (most notably zstd) have popped up which have better speed and compression ratio than zlib. Unfortunately switching over to zstd will not be seamless; it is not compatible with zlib/gzip in any way.

Luckily some developers have tried to implement DEFLATE in a faster way. Most notable are libdeflate (https://github.com/ebiggers/libdeflate) and Intel's storage acceleration library (https://github.com/intel/isa-l). These libraries provide the libdeflate-gzip and igzip utilities respectively. These can compress and decompress the same gzip files: an igzip-compressed file can be read with gzip and vice versa.

To give an idea of the speed improvements that can be obtained, here are some benchmarks. All benchmarks were done using hyperfine (https://github.com/sharkdp/hyperfine). The system was a Ryzen 5 3600 with 2x16GB DDR4-3200 memory, running Debian 10. All benchmarks were performed on a tmpfs, which lives in memory, to prevent IO bottlenecks. The test file was a 5 million read FASTQ file of 1.6 GB (https://en.wikipedia.org/wiki/FASTQ_format). These types of files are common in bioinformatics at 100+ GB sizes, so they are a good real-world benchmark. I benchmarked pigz on one thread as well, as it implements zlib but in a faster way than gzip. Zstd was benchmarked as a comparison.

Versions:
gzip 1.9 (provided by debian)
pigz 2.4 (provided by debian)
igzip 2.25.0 (provided by debian)
libdeflate-gzip 1.6 (compiled by conda-build with the recipe here: https://github.com/conda-forge/libdeflate-feedstock/pull/4)
zstd 1.3.8 (provided by debian)

By default level 1 is chosen for all compression benchmarks. Time is the average over 10 runs.

COMPRESSION
program              time            size   memory
gzip                 23.5 seconds    657M   1.5M
pigz (one thread)    22.2 seconds    658M   2.4M
libdeflate-gzip      10.1 seconds    623M   1.6G (reads entire file in memory)
igzip                4.6 seconds     620M   3.5M
zstd (to .zst)       6.1 seconds     584M   12.1M

DECOMPRESSION. All programs decompressed the file created using gzip -1. (Even zstd, which can also decompress gzip.)
program              time            memory
gzip                 10.5 seconds    744K
pigz (one-thread)    6.7 seconds     1.2M
libdeflate-gzip      3.6 seconds     2.2G (reads in mem before writing)
igzip                3.3 seconds     3.6M
zstd (from .gz)      6.4 seconds     2.2M
zstd (from .zst)     2.3 seconds     3.1M

As shown by the above benchmarks, using Intel's Storage Acceleration Libraries may improve performance quite substantially, offering very fast compression and decompression. This gets igzip into the zstd ballpark in terms of speed while still offering backwards compatibility with gzip.

Intel's Storage Acceleration Libraries (isa-l) come with a BSD-3-Clause license, so there should be no licensing issues when using that code inside of CPython.

--
components: Library (Lib)
messages: 375533
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
versions: Python 3.10
___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment: This has to be in a PEP. I am sorry I misplaced it on the bug tracker. -- resolution: -> not a bug stage: -> resolved status: open -> closed ___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment:

nasm or yasm will work. I only have experience building it with nasm. But yes, that is indeed a dependency. Personally I do not see the problem with adding nasm as a build dependency, as it opens up possibilities for even more performance optimizations in Python by moving parts of the code to assembly. But I can imagine that there might be complications with updating the build system.

libdeflate does not have this problem, as it is entirely in C. So it could be interesting to use that. I think the unfortunate use of memory is due to the libdeflate-gzip coding, and not necessarily because of the library. Samtools is a tool that uses the libdeflate library to great effect, and its memory usage is fine.

> No, the bug tracker seems fine for this.

Well, one thing is that libdeflate and isa-l use different compression ratios for the levels. isa-l does not support anything above 3; libdeflate supports 1-12. So it is not a one-to-one mapping.

--
___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment:

> That might be an option then. CPython could use the existing library if it is available.

Dynamic linking indeed seems like a great option here! Users who care about this will probably have the 'isal' and 'libdeflate' packages installed on their machine anyway. I packaged isa-l on conda-forge, so it is available via the anaconda/miniconda installer.

--
___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment: I just found out that libdeflate does not support streaming: https://github.com/ebiggers/libdeflate/issues/73 . I should have read the manual better. So that explains the memory usage. Because of that I don't think it is suitable for use in CPython. -- ___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41586] Allow to set pipe size on subprocess.Popen.
New submission from Ruben Vorderman :

Pipes block when reading from an empty pipe or when writing to a full pipe. When this happens, the program waiting on the pipe still uses a lot of CPU cycles while waiting for the pipe to stop blocking.

I found this while working with xopen, a library that pipes data into an external gzip process. (This is more efficient than using Python's gzip module, because the subprocess escapes the GIL, so your main algorithm can fully utilize one CPU core while the compression is offloaded to another.) It turns out that increasing the pipe size on Linux from the default of 64KB to the maximum allowed pipe size in /proc/sys/fs/pipe-max-size (1024KB) drastically improves performance: https://github.com/marcelm/xopen/issues/35. TLDR: full utilization of CPU cores, a 40%+ decrease in wall-clock time and a 20% decrease in the number of compute seconds (indicating that 20% was wasted waiting on blocking pipes).

However, doing this with subprocess is quite involved as it is now (a sketch of these steps is shown below):

1. You have to find out which constants to use in fcntl for setting the pipe size (these constants are not available in Python).
2. You have to start the Popen process with stdout routed to subprocess.PIPE.
3. You have to get my_popen_process.stdout.fileno().
4. Use fcntl.fcntl to modify the pipe size.

It would be much easier to do `subprocess.Popen(args, pipesize=1024 * 1024)`, for example. I am currently working on a PR implementing this. It will also make F_GETPIPE_SZ and F_SETPIPE_SZ available in the fcntl module.

--
components: Library (Lib)
messages: 375636
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Allow to set pipe size on subprocess.Popen.
versions: Python 3.10
___ Python tracker <https://bugs.python.org/issue41586> ___
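A hedged sketch of the four manual steps listed above, as they have to be done before the proposed change. The F_SETPIPE_SZ and F_GETPIPE_SZ values are the Linux-specific fcntl commands (1024 + 7 and 1024 + 8) and are hard-coded here because, at the time of this report, the fcntl module did not yet expose them; the gzip command is just an example child process.

import fcntl
import subprocess

F_SETPIPE_SZ = 1031  # Linux: F_LINUX_SPECIFIC_BASE (1024) + 7
F_GETPIPE_SZ = 1032  # Linux: F_LINUX_SPECIFIC_BASE (1024) + 8

proc = subprocess.Popen(
    ["gzip", "-c"], stdin=subprocess.PIPE, stdout=subprocess.PIPE
)
# Enlarge both pipes to 1 MiB (must not exceed /proc/sys/fs/pipe-max-size).
for pipe in (proc.stdin, proc.stdout):
    fcntl.fcntl(pipe.fileno(), F_SETPIPE_SZ, 1024 * 1024)

# With the proposed change this would simply be:
# proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE,
#                         stdout=subprocess.PIPE, pipesize=1024 * 1024)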
[issue41586] Allow to set pipe size on subprocess.Popen.
Change by Ruben Vorderman : -- keywords: +patch pull_requests: +21035 stage: -> patch review pull_request: https://github.com/python/cpython/pull/21921 ___ Python tracker <https://bugs.python.org/issue41586> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment:

> Within the stdlib, I'd focus only on using things that can be used in a 100% api compatible way with the existing modules.
> Otherwise creating a new module and putting it up on PyPI to expose the functionality from the libraries you want makes sense and will be easier to make available to everyone on existing Python versions rather than waiting on getting something into CPython.

Agreed. 100% backwards compatibility is more important than speed. And getting it available to users is faster as a module than implementing it in CPython. I already have a PR open on the xopen module to implement the use of the igzip program. Implementing a module like the gzip module in CPython will be more work, but I will certainly consider it. Thanks for the suggestion!

> There is a caveat to using any of these: how well maintained and security vetted are all of the code paths in the implementation? zlib proper gets massive security attention. Its low rate of change and staleness are a feature.

I didn't consider that, so I looked at the CVE page for zlib. The latest issues are from 2017, and before that 2005. This is the 2017 report: https://wiki.mozilla.org/images/0/09/Zlib-report.pdf. Note how it states that the old compiler support etc. are a basis for vulnerabilities; zlib-ng got rid of precisely these parts. On the other hand, Mozilla notes that zlib is a great candidate for periodic security audits, precisely for the same reasons you mention.

> FWIW I tend to avoid software provided by Intel given any other choice.

I know the feeling. They rightfully have a very bad reputation for things they did in the past. But this particular code is open source and compilable with open source compilers. Part of it is written in assembly, to get the speed advantage. I benchmarked it on my AMD processor and I too get enormous speed advantages out of it.

> even less about Intels self serving arm-ignoring oddity.

They *do* provide instructions to build for arm, right on their README: https://github.com/intel/isa-l#building-isa-l. I think it is very unfair to be so dismissive just because Intel pays the developers. This is good work, which speeds up bioinformatics workloads, which in turn helps us to help more patients.

On the whole I think the arguments to make a module are very strong, so I think that is the appropriate way forward. I'd love everyone to switch to more efficient deflate algorithms, but CPython may not be the right project to drive this change. At least this discussion is now here as a reference for other people who are curious about improving this performance aspect.

--
___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment:

> If you take this route, please don't write it directly against the CPython C-API (as you would for a CPython stdlib module).

Thanks for reminding me of this. I was planning to take the laziest route possible anyway, reusing as much code from cpython as possible. I will probably start with a minimal implementation of these base classes https://github.com/python/cpython/blob/master/Lib/_compression.py using cython, using the gzip implementation in cpython as guidance.

--
___ Python tracker <https://bugs.python.org/issue41566> ___
[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)
Ruben Vorderman added the comment: Hi, thanks all for the comments and the help. I have created the bindings using Cython. The project is still a work in progress as of this moment. I leave the link here for future reference. Special thanks to the Cython developers for enabling these bindings. https://github.com/rhpvorderman/python-isal -- resolution: -> third party stage: needs patch -> resolved status: open -> closed ___ Python tracker <https://bugs.python.org/issue41566> ___