Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-21 Thread Wes McKinney
Sorry, I realized I was a bit inarticulate in my reply. I meant the data page HEADERS (the metadata). The actual encoded structure of the data pages should be the same in V2 files. But if the Thrift header is, say, 16 bytes in V1, it's at least 32 bytes in V2. On Mon, May 21, 2018 at 7:10 PM, Wes McK

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-21 Thread Wes McKinney
hi Feras, Given the very high compression ratio with your data, it's completely possible that the difference in size is coming from the larger V2 data pages. Compare DataPageHeader with DataPageHeaderV2 in parquet.thrift https://github.com/apache/parquet-cpp/blob/master/src/parquet/parquet.thrift#

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-18 Thread Feras Salim
Hi Wes, The raw file in CSV is about a gigabyte. Gzipped it is about 50 MB, and the most I could compress it with Parquet V1 was 21 MB and with V2 (same settings) about 25 MB. It's quite surprising that it changes how the data is encoded between versions, given that Uwe said "The only difference between the two v

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-18 Thread Wes McKinney
hi Feras, How large are the files? For small files, differences in metadata could impact the file size more significantly. I would be surprised if this were the case with larger files, though (I'm not sure what fraction of a column chunk consists of data page headers vs. actual data in practice)
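A rough way to reason about the question above (what fraction of a column chunk is data page headers) is a back-of-the-envelope calculation. The sketch below is only an illustration: the function name is hypothetical, the 16-byte and 32-byte header sizes are the illustrative figures mentioned elsewhere in this thread rather than measured values, and the per-page payload sizes are made up.

```python
def header_overhead_fraction(header_bytes, payload_bytes_per_page):
    """Fraction of each data page (header + payload) spent on the header.

    Assumes every page in the column chunk has the same header and
    payload size, so the per-page fraction equals the chunk-wide one.
    """
    return header_bytes / (header_bytes + payload_bytes_per_page)


# Illustrative header sizes only (not measured from a real file).
V1_HEADER_BYTES = 16
V2_HEADER_BYTES = 32

# Made-up payload sizes: a tiny, highly compressed page vs. a 1 MiB page.
for payload in (200, 1024 * 1024):
    v1 = header_overhead_fraction(V1_HEADER_BYTES, payload)
    v2 = header_overhead_fraction(V2_HEADER_BYTES, payload)
    print(f"payload={payload:>8} bytes  V1={v1:.4%}  V2={v2:.4%}")
```

Under these assumptions the header share only becomes noticeable when the compressed payload per page is very small; for pages near 1 MiB it is negligible in either version.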

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-14 Thread Feras Salim
Hi Uwe, I'm quite confused by the findings, so I'm attaching a bunch of files corresponding to the version and library generating the files. On the first topic of DELTA_BINARY_PACKED: it seems it's something not well supported on the Java side as well, or my implementation is off, but I just copied ov

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-13 Thread Uwe L. Korn
Hello Feras, `DELTA_BINARY_PACKED` is at the moment only implemented in parquet-cpp on the read path. The necessary encoder implementation for this code is missing at the moment. The change in file size is something I also don't understand. The only difference between the two versions is that