On Nov 2, 2006, at 08:30, Robert Nelson wrote:
Landon,

I've changed the code so that the encryption code prefixes the data block with a block length prior to encryption. The decryption code accumulates data until a full data block is decrypted before passing it along to the decompression code.

The code now works for all four scenarios with encryption and compression: none, encryption, compression, and encryption + compression. Unfortunately the code is no longer compatible with previously encrypted backups. I could add some more code to make the encryption-only case work like before. However, since this is a new feature in 1.39 and there shouldn't be a lot of existing backups, I would prefer to invalidate the previous backups and keep the code simpler.

Also, I think we should have a design rule that says any data filters like encryption, compression, etc. must maintain the original buffer boundaries. This will allow us to define arbitrary, dynamically extensible filter stacks in the future.

What do you think?
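(For reference, the framing Robert describes amounts to roughly the following; the helper name here is made up, it's not the actual Bacula code:)

    #include <arpa/inet.h>   /* htonl */
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: prepend a 32-bit length to a plaintext block so the
     * decryption side can accumulate a complete block before handing it to the
     * decompression code.  The framed buffer is what then gets encrypted. */
    size_t frame_block(const uint8_t *data, uint32_t len, uint8_t *out)
    {
        uint32_t prefix = htonl(len);            /* fixed byte order on the volume */
        memcpy(out, &prefix, sizeof(prefix));    /* length prefix                  */
        memcpy(out + sizeof(prefix), data, len); /* plaintext block                */
        return sizeof(prefix) + len;
    }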
I was thinking about this on the way to work. My original assumption was that Bacula used the zlib streaming API to maintain state during file compression/decompression, but this is not the case. Reality is something more like this:
Backup:
- Set up the zlib stream context.
- For each file block (not each file), compress the block via deflate(stream, Z_FINISH) and reinitialize the stream.
- After all files (and blocks) are compressed, destroy the stream context.
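In zlib terms, that per-block pattern looks roughly like this (a sketch only, not the real backup code; it assumes the output buffer is big enough for the whole compressed block):

    #include <zlib.h>

    /* Sketch: each block is compressed as its own complete deflate stream,
     * so no history is carried from one block to the next. */
    int compress_block(z_stream *strm, Bytef *in, uInt in_len,
                       Bytef *out, uInt out_len)
    {
        strm->next_in   = in;
        strm->avail_in  = in_len;
        strm->next_out  = out;
        strm->avail_out = out_len;
        if (deflate(strm, Z_FINISH) != Z_STREAM_END)  /* finish this block's stream   */
            return -1;
        return deflateReset(strm) == Z_OK ? 0 : -1;   /* reinitialize for next block  */
    }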
Restore:
- For each block, call uncompress(), which does not handle streaming.

This is unfortunate -- reinitializing the stream for each block significantly degrades compression efficiency, because:
1) block boundaries are dynamic and may be set arbitrarily,
2) the LZ77 algorithm may cross block boundaries, referring back to up to 32k of previous input data (http://www.gzip.org/zlib/rfc-deflate.html#overview),
3) the Huffman coding context comprises the entire block,
4) there's no need to limit zlib block size to Bacula's block size.
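For contrast, a streaming approach would keep one deflate context alive per file and let the dictionary span block boundaries -- something along these lines (again just a sketch, with made-up names):

    #include <zlib.h>

    /* Sketch of the streaming alternative: only the final block of a file is
     * flushed with Z_FINISH, so the 32k LZ77 window can reach back across
     * block boundaries.  Z_SYNC_FLUSH flushes output to a byte boundary
     * without discarding the dictionary.  Assumes out_len is large enough
     * for this call's output. */
    int compress_block_streaming(z_stream *strm, Bytef *in, uInt in_len,
                                 Bytef *out, uInt out_len, int last_block)
    {
        int rc;

        strm->next_in   = in;
        strm->avail_in  = in_len;
        strm->next_out  = out;
        strm->avail_out = out_len;
        rc = deflate(strm, last_block ? Z_FINISH : Z_SYNC_FLUSH);
        return (rc == Z_OK || rc == Z_STREAM_END) ? 0 : -1;
    }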
The next question is this -- given that we *should* stream the data, does it make sense to enforce downstream block boundaries in the upstream filter? I'm siding in favor of requiring streaming support, and thus letting each filter implementor worry about their own block buffering, since they can far better encapsulate the necessary state and implementation -- and most already do.
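To make the idea concrete, a streaming filter interface could be as simple as something like this (purely illustrative; these names don't exist anywhere in Bacula):

    #include <stddef.h>

    /* Sketch: each filter accepts whatever it is handed, does its own
     * internal buffering, and pushes output to the next filter in the stack
     * whenever it has some. */
    typedef struct filter {
        void *state;                                 /* filter-private context    */
        int (*write)(struct filter *f,
                     const char *data, size_t len);  /* consume a chunk of data   */
        int (*flush)(struct filter *f);              /* drain any buffered output */
        struct filter *next;                         /* downstream filter, if any */
    } filter_t;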
The one other thing I am unsure of is whether the zlib streaming API correctly handles streams that have been written as per above -- each Bacula data block as an independent 'stream'. If zlib DOES handle this, it should be possible to modify the backup and restore implementation to use the stream API correctly while maintaining backwards compatibility. This would fix the encryption problem AND increase compression efficiency.
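If zlib does cope with that, the restore side could consume the existing per-block format through the stream API with a loop roughly like this (a sketch only; the real record handling is more involved):

    #include <zlib.h>

    /* Sketch: decode data written as consecutive, independent deflate streams
     * (one per Bacula block).  When inflate() hits the end of one stream it
     * returns Z_STREAM_END with the remaining input still queued, so we reset
     * the stream and keep going.  Assumes out_len can hold all the output. */
    int decompress_blocks(z_stream *strm, Bytef *in, uInt in_len,
                          Bytef *out, uInt out_len)
    {
        strm->next_in   = in;
        strm->avail_in  = in_len;
        strm->next_out  = out;
        strm->avail_out = out_len;

        while (strm->avail_in > 0) {
            int rc = inflate(strm, Z_NO_FLUSH);
            if (rc == Z_STREAM_END) {
                if (inflateReset(strm) != Z_OK)  /* next block starts a new stream   */
                    return -1;
            } else if (rc != Z_OK || strm->avail_out == 0) {
                return -1;                       /* error or output buffer exhausted */
            }
        }
        return 0;
    }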
With my extremely large database backups, I sure wouldn't mind increased compression efficiency =)
Some documentation on the zlib API is available here (I had a little difficulty googling this): http://www.freestandards.org/spec/booksets/LSB-Core-generic/LSB-Core-generic/libzman.html
Cheers, Landon