On Jul 9 2007, Domingos Soares wrote:

>Hi,
>
>> It might be interesting to focus on compression algorithms which are
>> optimized for particular workloads and data types, an Oracle database for
>> example.
>
>Yes, I agree. That is what I meant when I said "The study might be
>extended to the analysis of data in specific applications (e.g. web
>servers, mail servers and others) in order to develop compression
>schemes for specific environments...". However, I was not considering
>it as a major task, but a minor one. How important such a feature
>would be to opensolaris?

Some specific cases where you could find extra compression would be:
-differencing multiple versions of documents (doc/xls/html). This type of 
delta compression is currently possible using SmartVersion from 
http://www.smartversion.com/; I haven't seen delta compression in other 
non-backup-related compression tools. As I understand it, ZFS snapshots are 
filesystem-wide deltas. (A rough sketch of the versioned-document idea 
follows this list.)

-media types known to be further recompressible: some older AVI and 
QuickTime video actually compresses quite well using ZIP or RAR. The RAR 
format itself has a multimedia compression option that enables algorithms 
which work better on multimedia content.
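
To illustrate the versioned-document case, here is a rough user-land sketch 
(Python, using zlib's preset-dictionary feature; the document contents are 
made up). It only shows the principle -- a later version can compress better 
when the earlier version is available as a dictionary -- and is not how ZFS 
or SmartVersion actually implement deltas:

  import zlib

  # Two made-up versions of the same document; v2 is a small edit of v1.
  old_version = b"Quarterly report, draft 1. Sales were flat this quarter. " * 50
  new_version = old_version.replace(b"draft 1", b"draft 2") + b"Appendix: new figures attached."

  # Plain compression of the new version on its own.
  plain = zlib.compress(new_version, 9)

  # Compression with the old version supplied as a preset dictionary, so
  # text shared between the versions can be encoded as back-references.
  # (zlib only uses the last 32 KB of the dictionary.)
  comp = zlib.compressobj(level=9, zdict=old_version)
  delta = comp.compress(new_version) + comp.flush()
  print("plain: %d bytes, against old version: %d bytes" % (len(plain), len(delta)))

  # The reader must hold the same old version to decompress.
  decomp = zlib.decompressobj(zdict=old_version)
  assert decomp.decompress(delta) == new_version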

>> It might be worthwhile to have some sort of adaptive compression whereby
>> ZFS could choose a compression algorithm based on its detection of the
>> type of data being stored.
>
>  That's definitely a great idea. I'm just afraid that would be a bit
>hard to identify the data type of a given block or set of blocks in
>order to adapt the compression algorithm to it. At the file level it
>would be pretty easy in most cases, but at the block level we don't
>have a clue about what kind of data are inside the block. The
>identification process would depend on some statistical properties of
>the data and I don't know how hard it would be to scan the blocks and
>process them on a reasonable amount of time, and the whole thing must
>be done before the compression really starts.

Wouldn't ZFS, being an integrated filesystem, have an easier time 
identifying file types than a compression layer sitting on a standard block 
device with a filesystem overlaid on it?

I read in another post that with compression enabled, ZFS attempts to 
compress the data and stores it compressed only if it shrinks enough. As for 
identifying the file type/data type, how about:
1.) The ZFS block compression system reads the ZFS file table to identify 
which blocks are the beginning of files (or, for new writes, it is notified 
that file.ext is being written starting at block #### (e.g. block 
9,000,201)).
2.) The ZFS block compression system reads block ####, identifies the file 
type (probably from the file header) and applies the most appropriate 
compression format, or the default if none is recognized. A rough sketch of 
this step is below.
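
Something like this, in user-land Python, for step 2 -- the magic numbers 
are real, but the mapping from file type to "compression choice" is invented 
for illustration; nothing like this exists in ZFS today:

  # Sniff the first block's magic bytes to pick a compression strategy.
  MAGIC = [
      (b"\xff\xd8\xff", "store"),   # JPEG, already compressed
      (b"\x89PNG",      "store"),   # PNG
      (b"\x1f\x8b",     "store"),   # gzip
      (b"PK\x03\x04",   "store"),   # ZIP (also docx/xlsx containers)
      (b"%PDF",         "light"),   # PDF, mixed content
      (b"RIFF",         "heavy"),   # old AVI/WAV, often recompressible
  ]

  def pick_algorithm(first_block, default="default"):
      for magic, algo in MAGIC:
          if first_block.startswith(magic):
              return algo
      return default

  print(pick_algorithm(b"\x89PNG\r\n\x1a\n...."))   # -> "store"
  print(pick_algorithm(b"just plain text"))         # -> "default"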

An approach for maximal compression:
The algorithm selection could be:
1.) attempt to compress using BWT, and store the block compressed if BWT 
does better than no compression
2.) when the CPU is otherwise idle, use 10% of the spare cycles to "walk the 
disk", trying to recompress each block with each of the supported 
compression algorithms, and ultimately store that block in the most 
space-efficient format (see the sketch below).

This technique would result in a file system that tends to compact its data 
ever more tightly the longer the data sits in it. It could be compared to 
flakes 'settling' in a cereal box: the contents may have had a lot of 'air 
space' before shipment, but are now 'compressed'. The recompression step 
might even be folded into a periodic disk-scrubbing pass that rechecks 
previously written data to make sure the sectors it sits on aren't going 
bad.
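
A user-land sketch of the try-everything-and-keep-the-smallest idea, using 
Python's stock zlib/bz2/lzma modules (bz2 stands in for the BWT step; the 
list of "supported algorithms" is hypothetical):

  import zlib, bz2, lzma

  COMPRESSORS = {
      "zlib":  lambda b: zlib.compress(b, 9),
      "bzip2": lambda b: bz2.compress(b, 9),    # BWT-based
      "lzma":  lambda b: lzma.compress(b),
  }

  def best_encoding(block):
      # Keep whichever output is smallest, or the raw block if nothing helps.
      best_name, best_data = "raw", block
      for name, fn in COMPRESSORS.items():
          candidate = fn(block)
          if len(candidate) < len(best_data):
              best_name, best_data = name, candidate
      return best_name, best_data

  block = b"GET /index.html HTTP/1.0\r\n" * 300   # made-up, highly repetitive block
  name, data = best_encoding(block)
  print("stored as %s: %d -> %d bytes" % (name, len(block), len(data)))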

An aging (2002) but thorough comparison of many archivers/algorithms is 
Jeff Gilchrist's Archive Comparison Test: http://compression.ca/ 
http://compression.ca/act/act-win.html
