On Jul 9 2007, Domingos Soares wrote:

>Hi,
>
>> It might be interesting to focus on compression algorithms which are
>> optimized for particular workloads and data types, an Oracle database for
>> example.
>
>Yes, I agree. That is what I meant when I said "The study might be
>extended to the analysis of data in specific applications (e.g. web
>servers, mail servers and others) in order to develop compression
>schemes for specific environments...". However, I was not considering
>it as a major task, but a minor one. How important such a feature
>would be to opensolaris?
Some specific cases where you could find extra compression would be:

- differencing multiple versions of documents (doc/xls/html). This type of
  delta compression is currently possible using SmartVersion from
  http://www.smartversion.com/ ; I haven't seen delta compression in other
  non-backup-related compression tools. As I understand it, ZFS snapshots
  are filesystem-wide deltas.

- media types known to be further recompressible: some older AVI and
  QuickTime video actually compresses quite well using ZIP or RAR. The RAR
  format itself has a multimedia compression option to enable algorithms
  that work better on multimedia content.

>> It might be worthwhile to have some sort of adaptive compression whereby
>> ZFS could choose a compression algorithm based on its detection of the
>> type of data being stored.
>
> That's definitely a great idea. I'm just afraid that would be a bit
>hard to identify the data type of a given block or set of blocks in
>order to adapt the compression algorithm to it. At the file level it
>would be pretty easy in most cases, but at the block level we don't
>have a clue about what kind of data are inside the block. The
>identification process would depend on some statistical properties of
>the data and I don't know how hard it would be to scan the blocks and
>process them on a reasonable amount of time, and the whole thing must
>be done before the compression really starts.

Wouldn't the fact that ZFS is an integrated filesystem make it easier to
identify file types than it is for a standard block device with a
filesystem overlaid on it? I read in another post that with compression
enabled, ZFS attempts to compress the data and stores it compressed only
if it compresses enough (a rough sketch of that kind of check is at the
end of this message).

As far as identifying the file type/data type, how about:

1.) The ZFS block compression system reads the ZFS file table to identify
    which blocks are the beginning of files (or, for new writes, it is
    notified that file.ext is being written starting at block ####, e.g.
    block 9,000,201).

2.) The ZFS block compression system reads block ####, identifies the file
    type (probably from the file header) and applies the most appropriate
    compression format, or the default if no type is recognized (see the
    header-sniffing sketch at the end of this message).

An approach for maximal compression: the algorithm selection could be

1.) attempt to compress using BWT, and store the block compressed if BWT
    does better than no compression;

2.) when the CPU is otherwise idle, use 10% of the spare CPU cycles to
    "walk the disk", trying to recompress each block with each of the
    supported compression algorithms, and ultimately store that block in
    the most space-efficient format (a sketch of the keep-the-smallest
    step is also at the end of this message).

This technique would result in a filesystem that tends to compact its data
ever more tightly as the data sits in it. It could be compared to flakes
'settling' in a cereal box: the contents may have had a lot of 'air space'
before shipment, but are now 'compressed'. The recompression step might
even be part of a periodic disk-scrubbing pass meant to check and recheck
previously written data to make sure the sectors it sits on aren't going
bad.

An aging (2002) but thorough comparison of many archivers/algorithms is
Jeff Gilchrist's Archive Comparison Test:
http://compression.ca/
http://compression.ca/act/act-win.html
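To make the "stores it compressed if it compresses enough" behaviour
concrete, here is a rough user-level sketch in Python. It is not ZFS code:
the 12.5% savings threshold is my own guess for illustration, and zlib
just stands in for whatever algorithm the filesystem would actually use.

    import os
    import zlib

    # Store a block compressed only if compression saves enough space.
    # The 12.5% savings threshold is a guess for illustration, not
    # necessarily what ZFS uses; zlib stands in for the real algorithm.
    def maybe_compress(block, min_saving=0.125):
        compressed = zlib.compress(block)
        if len(compressed) <= len(block) * (1 - min_saving):
            return 'compressed', compressed
        return 'raw', block

    # Repetitive data shrinks and is stored compressed; random data
    # doesn't shrink enough and is stored as-is.
    print(maybe_compress(b'A' * 4096)[0])       # compressed
    print(maybe_compress(os.urandom(4096))[0])  # raw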
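On the header-sniffing idea in steps 1.) and 2.) above, this is roughly
what "read the first block, identify the file type, pick a compressor"
could look like. The magic numbers are real, but the type-to-algorithm
table and the 'lzjb' default are only my assumptions; again, this is an
illustration, not existing ZFS code.

    # Guess a data type from the first bytes of a file's first block and
    # map it to a compression choice. The magic numbers are real; the
    # type-to-algorithm mapping is just a guess, not anything ZFS does.
    MAGIC = [
        (b'\x1f\x8b',         'gzip-data', None),    # already compressed
        (b'\x89PNG',          'png',       None),    # already compressed
        (b'\xff\xd8\xff',     'jpeg',      None),    # already compressed
        (b'%PDF',             'pdf',       'gzip'),
        (b'\xd0\xcf\x11\xe0', 'ole2-doc',  'gzip'),  # old .doc/.xls
    ]

    def pick_compressor(first_block, default='lzjb'):
        """Return (detected type, algorithm); None means leave uncompressed."""
        for magic, name, algo in MAGIC:
            if first_block.startswith(magic):
                return name, algo
        return 'unknown', default

    print(pick_compressor(b'%PDF-1.4 ...'))       # ('pdf', 'gzip')
    print(pick_compressor(b'\x89PNG\r\n\x1a\n'))  # ('png', None)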
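And for the "walk the disk" recompression pass, a sketch of the
keep-whichever-is-smallest step, using the compressors that happen to ship
with Python (bz2 is the BWT-based one, so it stands in for the BWT
attempt). An in-kernel version would use whatever algorithms ZFS actually
supports and would have to throttle itself to idle CPU time.

    import bz2
    import lzma
    import zlib

    # Candidate algorithms; bz2 is the BWT-based one. In a real filesystem
    # the list would be whatever compressors the pool supports.
    CANDIDATES = {
        'none': lambda block: block,
        'zlib': zlib.compress,
        'bz2':  bz2.compress,
        'lzma': lzma.compress,
    }

    def best_encoding(block):
        """Try every candidate and keep whichever output is smallest."""
        results = {name: fn(block) for name, fn in CANDIDATES.items()}
        winner = min(results, key=lambda name: len(results[name]))
        return winner, results[winner]

    # A background "settling" pass would call this on each block during idle
    # time (or alongside a scrub) and rewrite any block that shrinks.
    name, data = best_encoding(b'the quick brown fox jumps over the lazy dog ' * 100)
    print(name, len(data))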