On Wed, Oct 02, 2013 at 09:40:03PM -0700, Ben Pfaff wrote:
> On Wed, Oct 02, 2013 at 01:03:27PM -0400, Hugo Alejandro wrote:
> > A few days ago I was recruited to work in the analysis of large surveys,
> > what caught my attention is the use of the format *.zsav above *.sav.
> >
> > Apparently this file format supports higher compression ratio and is more
> > efficient with large databases to reduce their size on disk and be faster to
> > compress-decompress to create a ZIP file (or other format) with a *.sav file.
> >
> > This file type is very recent, included in SPSS version 21 and improved in
> > the current version 22.
>
> This is very interesting.  Thank you for bringing this to our
> attention.
>
> The .zsav file format appears to be the same as .sav format up to the
> data portion of the file, except that the "magic" at the beginning of
> the file is $FL3 instead of $FL2.
>
> The data portion of the file starts at offset 837 (0x345).  Its
> contents, with my speculation about their meaning, is:
>
> 00000345  45 03 00 00 00 00 00 00 - Byte offset of this block, 0x345.
> 0000034d  14 07 00 00 00 00 00 00 - byte offset of the next block, 0x714.
> 00000355  30 00 00 00 00 00 00 00 - Length of next block's header, 0x30 bytes.
>
> It is followed by 951 (0x3b7) bytes of data compressed with the
> "deflate" algorithm.  When inflated, these expand to 1120 (0x460) bytes
> that exactly match the data portion of the original physiology.sav,
> which starts at offset 729 (0x2d9) in the original file.
>
> The file ends with an additional 48 (0x30) bytes starting at offset 1812
> (0x714).  Their contents, with my speculation about their meaning, are:
>
> 00000714  9c ff ff ff ff ff ff ff - Value -100, dunno why (compression bias?)
> 0000071c  00 00 00 00 00 00 00 00 - ?
> 00000724  00 f0 3f 00 01 00 00 00 - ?
> 0000072c  45 03 00 00 00 00 00 00 - Starting offset of previous block, 0x345.
> 00000734  5d 03 00 00 00 00 00 00 - Starting offset of data block, 0x35d.
> 0000073c  60 04 00 00             - Inflated data size, 0x460 bytes.
> 00000740  b7 03 00 00             - Compressed data size, 0x3b7 bytes.
>
> From here, I think that the next step would have to be to look at both
> the .sav and .zsav versions of files.  I would be most interested in
> larger files (say, 1 MB in size), because I think that it is likely that
> some of the mysteries above would be cleared up if there were more
> compressed blocks in the file (or perhaps we would find out that there
> is only ever a single compressed block).
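
For anyone who wants to poke at a file along the same lines, the layout
guessed in the quoted message can be sketched in Python.  This is only an
illustration of that guess, not a description of the real format: the
function names (read_single_block, build_example) are made up here, and
the sketch assumes little-endian fields, a single compressed block, and
a zlib-wrapped deflate stream (a raw deflate stream would need wbits=-15
instead).  The two "?" fields are carried along as opaque values.

```python
import struct
import zlib

# Field layouts as guessed in the quoted message (all little-endian).
HEADER = struct.Struct("<qqq")        # this offset, trailer offset, trailer length
TRAILER = struct.Struct("<qq8sqqII")  # -100, ?, ?, prev offset, data offset, sizes


def read_single_block(buf, data_start):
    """Parse the guessed layout from `buf`; return (bias, inflated data)."""
    this_ofs, trailer_ofs, trailer_len = HEADER.unpack_from(buf, data_start)
    if this_ofs != data_start or trailer_len != TRAILER.size:
        raise ValueError("layout does not match the single-block guess")
    (bias, _unknown1, _unknown2, prev_ofs, comp_ofs,
     inflated_len, comp_len) = TRAILER.unpack_from(buf, trailer_ofs)
    if prev_ofs != this_ofs or comp_ofs + comp_len != trailer_ofs:
        raise ValueError("trailer offsets are inconsistent")
    # Assumption: zlib-wrapped stream.  For raw deflate, pass wbits=-15.
    data = zlib.decompress(buf[comp_ofs:comp_ofs + comp_len])
    if len(data) != inflated_len:
        raise ValueError("inflated size does not match the trailer")
    return bias, data


def build_example(payload, data_start=0x345):
    """Build a synthetic buffer with the same layout, to exercise the parser."""
    comp = zlib.compress(payload)
    comp_ofs = data_start + HEADER.size
    trailer_ofs = comp_ofs + len(comp)
    header = HEADER.pack(data_start, trailer_ofs, TRAILER.size)
    trailer = TRAILER.pack(-100, 0, bytes.fromhex("00f03f0001000000"),
                           data_start, comp_ofs, len(payload), len(comp))
    # Zero bytes stand in for the dictionary that precedes the data portion.
    return bytes(data_start) + header + comp + trailer
```

With a real file, one would read the whole file into `buf` and pass the
offset of the data portion (0x345 in the example above) as `data_start`.
The consistency checks are there so that a file with more than one
compressed block fails loudly instead of being silently misread.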

Some of this matches up, but some of it is weird:

000035a  5a 03 00 00 00 00 00 00 - Byte offset of this block, 0x35a.
0000362  12 94 03 00 00 00 00 00 - Byte offset of the next block, 0x39412.
000036a  48 00 00 00 00 00 00 00 - Length of next block's header, 0x48 bytes.

...then compressed data, then...

0039412  9c ff ff ff ff ff ff ff - Value -100, dunno why (compression bias?)
003941a  00 00 00 00 00 00 00 00 - ?
0039422  00 f0 3f 00 02 00 00 00 - ?
003942a  5a 03 00 00 00 00 00 00 - Starting offset of previous block, 0x35a.
0039432  72 03 00 00 00 00 00 00 - Starting offset of data block, 0x372.
003943a  00 f0 3f 00             - Inflated data size, 0x3ff000 bytes.
003943e  49 7c 03 00             - Compressed data size, 0x37c49 bytes.
0039442  5a f3 3f 00 00 00 00 00 - 0x3ff35a = 0x35a + 0x3ff000
                                   = current byte offset if no compression
003944a  bb 7f 03 00 00 00 00 00 - ?
0039452  00 bf 06 00             - ?
0039456  57 14 00 00             - ?

In particular, when I decompress the compressed data block, only the
beginning of it looks the same as in the not-compressed version of the
file.  There is something weird going on.  Before I go to a lot of
trouble to try to chase that down, would you mind making sure for me
that both versions of the file really have the same data in them?

Thanks,

Ben.

_______________________________________________
Pspp-users mailing list
Pspp-users@gnu.org
https://lists.gnu.org/mailman/listinfo/pspp-users