Sorry, I realized I was a bit inarticulate in my reply. I meant the
data page HEADERS (the metadata). The actual encoded structure of the
data pages should be the same in V2 files. But if the Thrift header is,
say, 16 bytes in V1, it's at least 32 bytes in V2.
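
To make the arithmetic concrete, here is a purely illustrative sketch of how a per-page header delta of that size adds up when a column chunk is split into many pages. All numbers are example figures taken from the sentence above, not measurements from the actual files:

```python
# Purely illustrative arithmetic, using the rough header sizes quoted
# above; none of these numbers are measured from the real files.
num_pages = 10_000            # hypothetical number of data pages in the file
v1_header_bytes = 16          # example V1 Thrift header size
v2_header_bytes = 32          # example V2 Thrift header size ("at least")

extra = num_pages * (v2_header_bytes - v1_header_bytes)
print(f"extra header bytes in V2: {extra} ({extra / 1e6:.2f} MB)")
```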
On Mon, May 21, 2018 at 7:10 PM, Wes McKinney wrote:
hi Feras,
Given the very high compression ratio with your data, it's completely
possible that the difference in size is coming from the larger V2 data
pages. Compare DataPageHeader with DataPageHeaderV2 in parquet.thrift
https://github.com/apache/parquet-cpp/blob/master/src/parquet/parquet.thrift#
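
For quick reference, the field lists of the two structs are transcribed below as plain Python data (see the linked parquet.thrift for the authoritative definitions and types); the difference shows where the extra header bytes come from:

```python
# Field names transcribed from parquet.thrift for reference only.
V1_FIELDS = [
    "num_values", "encoding",
    "definition_level_encoding", "repetition_level_encoding",
    "statistics",
]
V2_FIELDS = [
    "num_values", "num_nulls", "num_rows", "encoding",
    "definition_levels_byte_length", "repetition_levels_byte_length",
    "is_compressed", "statistics",
]
print("only in V2:", sorted(set(V2_FIELDS) - set(V1_FIELDS)))
print("only in V1:", sorted(set(V1_FIELDS) - set(V2_FIELDS)))
```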
Hi Wes,
The raw file in CSV is about 1 GB. Gzipped it's about 50 MB; the most I
could compress it with Parquet V1 was 21 MB, and V2 (same settings) came to
about 25 MB. It's quite surprising that how the data is encoded changes
between versions, given that Uwe said "The only difference between the two
versions is that …"
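
For anyone who wants to reproduce the comparison, here is a minimal sketch assuming pyarrow/parquet-cpp as the writer. The table contents are synthetic stand-ins, not the actual dataset, and "2.0" is the version spelling used by 2018-era pyarrow (newer releases spell the V2 format versions "2.4"/"2.6"):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic stand-in data, not the dataset discussed in this thread.
table = pa.table({"col": list(range(1_000_000))})

for version in ("1.0", "2.0"):
    path = f"data_v{version}.parquet"
    pq.write_table(table, path, version=version, compression="gzip")
    print(f"format version {version}: {os.path.getsize(path)} bytes")
```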
hi Feras,
How large are the files? For small files, differences in metadata
could impact the file size more significantly. I would be surprised if
this were the case with larger files, though (I'm not sure what
fraction of a column chunk consists of data page headers vs. actual
data in practice).
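
One way to narrow this down with pyarrow: the footer doesn't break out page headers individually, but comparing per-column-chunk sizes between the V1 and V2 files shows whether the growth sits inside the chunks or elsewhere. A sketch, using a hypothetical file name:

```python
import pyarrow.parquet as pq

# "data_v1.0.parquet" is a hypothetical file; run the same loop over
# the V2 file and compare the per-chunk numbers.
md = pq.ParquetFile("data_v1.0.parquet").metadata
for rg in range(md.num_row_groups):
    col = md.row_group(rg).column(0)
    print(col.total_compressed_size, col.total_uncompressed_size)
```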
Hi Uwe,
I'm quite confused by the findings; I'm attaching a bunch of files
corresponding to the version and the library that generated them.
On the first topic, DELTA_BINARY_PACKED: it seems either it's not well
supported on the Java side or my implementation is off, but I just
copied over …
Hello Feras,
`DELTA_BINARY_PACKED` is at the moment only implemented in parquet-cpp on the
read path; the necessary encoder implementation is still missing.
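
A sketch of what that constraint looks like from pyarrow, assuming a hypothetical file written by parquet-mr with that encoding enabled (reading works; there is simply no writer-side option here to request the encoding):

```python
import pyarrow.parquet as pq

# "java_written.parquet" is a hypothetical file produced by parquet-mr
# with DELTA_BINARY_PACKED enabled; parquet-cpp's read path can decode it.
table = pq.read_table("java_written.parquet")
print(table.num_rows)

# There is no option to request DELTA_BINARY_PACKED when writing via
# parquet-cpp at this point; the encoder is not implemented.
```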
The change in file size is something I also don't understand. The only
difference between the two versions is that …