How does this compare with tar.gz? Gzip and zip have a 32K window, so they won't recognize common text in widely separated files. Most newer compressors have larger windows.
ZPAQ groups files by extension and compresses in 16 or 64 MB blocks depending on compression level. It uses LZ77, BWT, or context mixing, all of which will recognize duplicate strings within the same block. It also deduplicates across blocks by using a rolling hash function to mark fragment boundaries and storing SHA1 hashes. When used for incremental backups, it compares hashes and stores duplicates (same hash) as pointers to the old fragments. ZPAQ is append only, so you can roll it back to extract old versions of the same file.

ZPAQ fragments average 64K, but this is an option. Smaller fragments find more matches, but require more space to store the 20-byte hashes. It computes a rolling hash that marks a boundary with probability 2^-16 (16 bits are 0). The hash window is variable size, containing the last 32 bytes not predicted by an order 1 model (a 256 byte table). It works like this:

  If x is predicted then h=(h+x)*314159265;
  Else h=(h+x)*271828182;

where x is the next byte and arithmetic is modulo 2^32 (h is an unsigned int). The multiplier 271828182 is even, so each unpredicted byte shifts out the 32nd newest bit; 314159265 is odd, so predicted bytes shift nothing out.

You can find ZPAQ, including a description of the compression algorithms, at http://mattmahoney.net/dc/zpaq.html

On Wed, May 27, 2020, 11:12 AM stefan.reich.maker.of.eye via AGI < [email protected]> wrote:

> Ha, fun exercise. Not much of a point as I will explain below, but here
> goes. Compression can be vastly sped up with some simple tweaks BTW.
>
> Compression done [971490 ms]
>
> Archive /home/stefan/linecomp-demo/enwik8.lc stats:
> 98315K of text compressed into 38126K
>
> So worse than gzip, which is to be expected - my compressor excels at
> multiple similar files, not one huge file.
>
> If you supply only one file, there isn't really anything to exploit for
> the algorithm and it ends up duplicating gzip's work in a less efficient
> way.
>
> Also my format is line-based which is great for comparing source code
> revisions.. and rather randomly good or bad for other stuff.
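
Coming back to the ZPAQ fragmenting described earlier in this message, here is a rough Python sketch of the rolling hash and the SHA1 deduplication step. It is not ZPAQ's actual code. The names split_fragments and dedupe are made up for illustration. I am assuming that "16 bits are 0" means the high 16 bits of h (h < 2^16), that the hash and the order 1 table are reset at each boundary, and I have added min_size and max_size guards of my own so the sketch does not cut after every byte on runs of zeros.

```python
import hashlib

MASK32 = 0xFFFFFFFF  # arithmetic is modulo 2^32


def split_fragments(data, zero_bits=16, min_size=4096, max_size=1 << 19):
    """Split data into content-defined fragments (illustrative sketch).

    zero_bits=16 gives a boundary probability of about 2^-16 per byte.
    min_size and max_size are assumptions added for this sketch.
    """
    limit = 1 << (32 - zero_bits)  # h < limit means the high zero_bits bits are 0
    fragments = []
    start = 0
    h = 0
    prev = 0
    o1 = [0] * 256  # order 1 model: predicted next byte for each previous byte
    for i, x in enumerate(data):
        if x == o1[prev]:
            # Predicted byte: odd multiplier, nothing is shifted out of h.
            h = ((h + x) * 314159265) & MASK32
        else:
            # Unpredicted byte: even multiplier shifts h left by one bit.
            h = ((h + x) * 271828182) & MASK32
        o1[prev] = x
        prev = x
        size = i + 1 - start
        if (h < limit and size >= min_size) or size >= max_size:
            fragments.append(data[start:i + 1])
            start = i + 1
            # Assumption: restart the hash and model for the next fragment.
            h, prev, o1 = 0, 0, [0] * 256
    if start < len(data):
        fragments.append(data[start:])
    return fragments


def dedupe(fragments, index):
    """Store each new fragment under its 20-byte SHA1; return the key list.

    index maps SHA1 digest -> fragment bytes. A fragment whose digest is
    already present is not stored again; only its key (the "pointer") is kept.
    """
    refs = []
    for frag in fragments:
        key = hashlib.sha1(frag).digest()
        index.setdefault(key, frag)
        refs.append(key)
    return refs
```

Running dedupe a second time over the same fragments with the same index adds nothing to the index and returns the same list of keys, which is the incremental backup case: only the pointers are stored.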
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tb2cf064c700f181c-Mefc3cccf303f9abc05eeeefc
Delivery options: https://agi.topicbox.com/groups/agi/subscription
