Hi Jason! I think this question would get a better response on the Parquet mailing list: https://parquet.apache.org/community/
Very interesting observation, though. I looked into it and didn't find any obvious explanation. The entire size of the file is taken up by the "data" column, since storing only df[['data']] yields the same Parquet file size. But I have no idea how HDF5 gets a compression ratio 2x better. I guess they must be applying some kind of encoding that suits your data very well.

Good luck!

Remi

On Mon, Nov 2, 2020 at 21:41, Jason Sachs <jmsa...@gmail.com> wrote:

> Hi all--
>
> I've been getting started with Parquet as a storage alternative to HDF5 and
> it has a lot of attractive qualities including compression flexibility and
> efficiency.
>
> But I'm stumped for storage efficiency in Parquet with one type of data
> that I have.
>
> This is a large series of "ragged" packets arriving as a stream, where each
> packet consists of up to 255 bytes of binary data. The vast majority of the
> packets have lengths between 96 and 112 bytes. I need to store each of them
> with a 64-bit timestamp.
>
> I can get a good storage efficiency with HDF5 with the following table
> schema using pytables:
>
> class StoredPacket(pt.IsDescription):
>     timetick = pt.UInt64Col(pos=0)
>     length = pt.UInt16Col(pos=1)
>     data = pt.UInt8Col(pos=2,shape=(255,))
>
> This stores packet data as an array of uint8 with length 255. I zero-pad
> the packet to length 255 and store the length as well in a separate column.
>
> I have created a sample file in a Github gist:
> https://gist.github.com/jason-sachs/aa6dbdaced806bb76bc7a347dfc303dc (see
> test1.h5) along with a Python script convert_test1.py that converts it to a
> Pandas DataFrame and stores it via Parquet. But the Parquet files are
> almost twice as large as the .h5 file no matter what storage technique I
> use; brotli is best but slow, and zstd is almost as good as brotli but much
> faster.
>
> Any suggestions on how I might improve storage efficiency in Parquet?
> I have a lot of flexibility with how I can store the data; my only
> requirement is that I can retrieve the data packets quickly from the
> storage file. I offer this sample file as a test case.
>
> (py3) C:\tmp\git\dv\test-h5-gist>python convert_test1.py
> Table overview:
>        timetick  length  data
> 0            16      99  b'\x00\x00\x00\x98:B\x1a\xbev\x90\xb2\x00\x00\...
> 1            32      99  b'\x01\x08\x00\xbf:\x8b\x1a{r=\xb2\x88\x00\t\x...
> 2            48      99  b'\x02\x10\x00\xe7:\x9c\x1c\x1at:\xb3\x10\x01\...
> 3            64      99  b"\x03\x18\x00\x0f;\x16\x1bOt|\xb2\x98\x01\x19...
> 4            80      99  b'\x04 \x007;c\x1b\xddt~\xb2 \x02!\x00<;x\x1a\...
> ...         ...     ...  ...
> 16413    262080      99  b'{\xd8\xff\x1d+\xe6\xc5H)r\xc1X\xfd\xd9\xff +...
> 16414    262096      99  b'|\xe0\xff6+g\xc5A,\x0c\xc3\xe0\xfd\xe1\xff9+...
> 16415    262112      99  b'}\xe8\xffN+\xd3\xc4")D\xc2h\xfe\xe9\xffQ+M\x...
> 16416    262128      99  b"~\xf0\xffg+=\xc5E';\xc2\xf0\xfe\xf1\xffj+\xf...
> 16417    262144      99  b"\x7f\xf8\xff\x81+\x13\xc4\xdd'\x15\xc2x\xff\...
>
> [16418 rows x 3 columns]
>
> Packets with tags >= 128:
>        timetick  length  data
> 179        2864      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 307        4896      35  b'\xca Did gyre and gimble in the wabe:\x00\x...
> 340        5408      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 362        5744      31  b'\xca And the mome raths outgrabe.\x00\x00\x...
> 651       10352       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 1403      22368      32  b'\xca"Beware the Jabberwock, my son!\x00\x00\...
> 1741      27760      44  b'\xca The jaws that bite, the claws that cat...
> 2115      33728      33  b'\xcaBeware the Jubjub bird, and shun\x00\x00...
> 2162      34464      30  b'\xca The frumious Bandersnatch!"\x00\x00\x0...
> 2278      36304       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 2405      38320      34  b'\xcaHe took his vorpal sword in hand:\x00\x0...
> 2675      42624      41  b'\xca Long time the manxome foe he sought --...
> 2896      46144      33  b'\xcaSo rested he by the Tumtum tree,\x00\x00...
> 3611      57568      31  b'\xca And stood awhile in thought.\x00\x00\x...
> 4089      65200       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 5231      83456      36  b'\xcaAnd, as in uffish thought he stood,\x00\...
> 5236      83520      38  b'\xca The Jabberwock, with eyes of flame,\x0...
> 5427      86560      40  b'\xcaCame whiffling through the tulgey wood,\...
> 6904     110176      26  b'\xca And burbled as it came!\x00\x00\x00\x0...
> 7003     111744       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 7286     116256      44  b'\xcaOne, two! One, two! And through and thro...
> 8226     131280      39  b'\xca The vorpal blade went snicker-snack!\x...
> 8370     133568      35  b'\xcaHe left it dead, and with its head\x00\x...
> 8849     141216      27  b'\xca He went galumphing back.\x00\x00\x00\x...
> 10326    164832       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 11867    189472      37  b'\xca"And, has thou slain the Jabberwock?\x00...
> 12392    197856      35  b'\xca Come to my arms, my beamish boy!\x00\x...
> 12936    206544      34  b"\xcaO frabjous day! Callooh! Callay!'\x00\x0...
> 13794    220256      26  b'\xca He chortled in his joy.\x00\x00\x00\x0...
> 13905    222016       1  b'\xca\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...
> 14690    234560      36  b"\xca'Twas brillig, and the slithy toves\x00\...
> 15317    244576      35  b'\xca Did gyre and gimble in the wabe;\x00\x...
> 15840    252928      30  b'\xcaAll mimsy were the borogoves,\x00\x00\x0...
> 16339    260896      31  b'\xca And the mome raths outgrabe.\x00\x00\x...
>
> (py3) C:\tmp\git\dv\test-h5-gist>ls -l test1.*
> -rw-rw-rw- 1 user group  908773 Nov 2 13:07 test1.h5
> -rw-rw-rw- 1 user group 1611025 Nov 2 13:35 test1.pq
>
> (py3) C:\tmp\git\dv\test-h5-gist>h5ls -v -r test1.h5
> Opened "test1.h5" with sec2 driver.
> / Group
>     Attribute: CLASS scalar
>         Type: 5-byte null-terminated UTF-8 string
>         Data: "GROUP"
>     Attribute: PYTABLES_FORMAT_VERSION scalar
>         Type: 3-byte null-terminated UTF-8 string
>         Data: "2.1"
>     Attribute: TITLE null
>         Type: 1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type: 3-byte null-terminated UTF-8 string
>         Data: "1.0"
>     Location: 1:96
>     Links: 1
> /data Group
>     Attribute: CLASS scalar
>         Type: 5-byte null-terminated UTF-8 string
>         Data: "GROUP"
>     Attribute: TITLE null
>         Type: 1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type: 3-byte null-terminated UTF-8 string
>         Data: "1.0"
>     Location: 1:1024
>     Links: 1
> /data/packets Dataset {16418/Inf}
>     Attribute: CLASS scalar
>         Type: 5-byte null-terminated UTF-8 string
>         Data: "TABLE"
>     Attribute: FIELD_0_FILL scalar
>         Type: native unsigned long long
>         Data: 0
>     Attribute: FIELD_0_NAME scalar
>         Type: 8-byte null-terminated UTF-8 string
>         Data: "timetick"
>     Attribute: FIELD_1_FILL scalar
>         Type: native unsigned short
>         Data: 0
>     Attribute: FIELD_1_NAME scalar
>         Type: 6-byte null-terminated UTF-8 string
>         Data: "length"
>     Attribute: FIELD_2_FILL scalar
>         Type: native unsigned char
>         Data: 0
>     Attribute: FIELD_2_NAME scalar
>         Type: 4-byte null-terminated UTF-8 string
>         Data: "data"
>     Attribute: NROWS scalar
>         Type: native long long
>         Data: 16418
>     Attribute: TITLE null
>         Type: 1-byte null-terminated UTF-8 string
>
>     Attribute: VERSION scalar
>         Type: 3-byte null-terminated UTF-8 string
>         Data: "2.7"
>     Location: 1:2216
>     Links: 1
>     Chunks: {247} 65455 bytes
>     Storage: 4350770 logical bytes, 899061 allocated bytes, 483.92% utilization
>     Filter-0: shuffle-2 OPT {265}
>     Filter-1: deflate-1 OPT {5}
>     Type: struct {
>         "timetick" +0 native unsigned long long
>         "length" +8 native unsigned short
>         "data" +10 [255] native unsigned char
>     } 265 bytes
>
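
The h5ls listing above shows two filters on /data/packets: a byte shuffle with element size 265 (the full record) followed by deflate (listed with parameter 5, which I take to be the compression level). Whether the shuffle step accounts for the size gap is not established in this thread. As a minimal sketch of what such a filter does, the Python below transposes a buffer of fixed-size records so that byte 0 of every record comes first, then byte 1, and so on, and compares zlib output with and without that step. The records are synthetic (a made-up 99-byte payload with small per-byte noise, zero-padded to 255), and make_records, shuffle, unshuffle and compressed_sizes are hypothetical helper names, not part of PyTables or HDF5.

```python
import random
import struct
import zlib

# 8 (timetick) + 2 (length) + 255 (data), matching the struct in the h5ls output.
RECORD_SIZE = 265


def shuffle(buf: bytes, elem_size: int) -> bytes:
    """Byte shuffle: emit byte 0 of every element, then byte 1, and so on."""
    if len(buf) % elem_size:
        raise ValueError("buffer length must be a multiple of elem_size")
    return b"".join(buf[i::elem_size] for i in range(elem_size))


def unshuffle(buf: bytes, elem_size: int) -> bytes:
    """Inverse of shuffle(): rebuild the original record-by-record layout."""
    if len(buf) % elem_size:
        raise ValueError("buffer length must be a multiple of elem_size")
    n = len(buf) // elem_size
    return b"".join(buf[j::n] for j in range(n))


def make_records(n: int, seed: int = 0) -> bytes:
    """Synthetic stand-in for the packet table (NOT the data from the gist).

    Assumption: each byte position holds a fixed base value plus a little
    noise, so values are similar across records but not within one record.
    """
    rng = random.Random(seed)
    base = [rng.randrange(256) for _ in range(99)]
    out = bytearray()
    for k in range(n):
        payload = bytes((b + rng.randrange(4)) & 0xFF for b in base)
        # Little-endian uint64 timetick and uint16 length, no padding.
        out += struct.pack("<QH", 16 * (k + 1), len(payload))
        # Zero-pad the payload to 255 bytes, as described in the post.
        out += payload.ljust(255, b"\x00")
    return bytes(out)


def compressed_sizes(n: int, level: int = 5) -> tuple[int, int]:
    """Return (plain, shuffled) deflate sizes for n synthetic records."""
    raw = make_records(n)
    plain = len(zlib.compress(raw, level))
    shuffled = len(zlib.compress(shuffle(raw, RECORD_SIZE), level))
    return plain, shuffled


if __name__ == "__main__":
    plain, shuffled = compressed_sizes(2048)
    print(f"deflate only:      {plain} bytes")
    print(f"shuffle + deflate: {shuffled} bytes")
```

Running it prints the two compressed sizes. On this synthetic input the shuffled buffer comes out smaller, but real packet data may behave differently.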