Hi Feras,

How large are the files? For small files, differences in metadata could impact the file size more significantly. I would be surprised if this were the case with larger files, though (I'm not sure what fraction of a column chunk consists of data page headers vs. actual data in practice).
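If it helps narrow things down, here is a minimal pyarrow sketch (untested, and the filename is just a placeholder) that prints the per-column-chunk encodings and compressed/uncompressed sizes, which should show whether the growth is in the data pages or elsewhere:

    import pyarrow.parquet as pq

    # Placeholder filename; point this at one of your v1/v2 files.
    meta = pq.ParquetFile("example_v2.parquet").metadata
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            print(chunk.path_in_schema, chunk.encodings,
                  chunk.total_compressed_size,
                  chunk.total_uncompressed_size)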
- Wes

On Tue, May 15, 2018 at 12:17 AM, Feras Salim <fer...@gmail.com> wrote:
> Hi Uwe,
>
> I'm quite confused by the findings. I'm attaching a bunch of files
> corresponding to the version and library generating each file.
>
> On the first topic of DELTA_BINARY_PACKED: it seems it's not well
> supported on the Java side either, or my implementation is off, but I
> just copied over "CsvParquetWriter.java". I created a sample encoder
> based on parquet-mr, and it seems that when a dictionary is used, it
> falls back to PLAIN instead of DELTA once the dictionary gets too big.
> Regardless of what I do, I can't make it use DELTA for the attached
> schema.
>
> In terms of the size difference, I can see the issue in the resulting
> metadata, but not the root cause. You will see the code is identical
> except for the addition of version="2.0". This changes the output file
> metadata from "ENC:PLAIN_DICTIONARY,PLAIN,RLE" to "ENC:RLE,PLAIN",
> increasing the size quite substantially.
>
> Let me know if there's anything else I can provide to help debug this.
> The second part is not critical, since I can just use v1 for now, but
> it would be good to figure out why the output changes. The first part
> is more pressing for me, since I really want to assess the difference
> between RLE and DELTA_BINARY_PACKED on monotonically increasing values
> like a timestamp ticking at a constant rate.
>
> On Sun, May 13, 2018 at 11:58 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> Hello Feras,
>>
>> `DELTA_BINARY_PACKED` is at the moment only implemented in parquet-cpp
>> on the read path. The necessary encoder implementation is currently
>> missing.
>>
>> The change in file size is something I also don't understand. The only
>> difference between the two format versions is how uint32 columns are
>> stored: in version 1 we encode them as INT64, whereas in version 2 we
>> can encode them as UINT32, a type that was not available in version 1.
>> It would be nice if you could narrow the issue down to, e.g., the
>> column that causes the increase in size. You might also use the Java
>> parquet-tools or parquet-cli to inspect the size statistics of the
>> individual parts of the Parquet file.
>>
>> Uwe
>>
>> On Fri, May 11, 2018, at 3:07 AM, Feras Salim wrote:
>> > Hi, I was wondering if I'm missing something, or whether
>> > `DELTA_BINARY_PACKED` is currently only available for reading
>> > Parquet files. I can't find a way for the writer to encode timestamp
>> > data with `DELTA_BINARY_PACKED`. Furthermore, I seem to get about a
>> > 10% increase in final file size when I change from ver 1 to ver 2
>> > without changing anything else about the schema or data.
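For the size question, a self-contained repro along these lines might also help (a sketch assuming a pyarrow-based writer, which the version="2.0" keyword suggests; the data is illustrative, not your schema). It compares the two format versions directly on the monotonically increasing case described above:

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative data: monotonically increasing int64 values ticking
    # at a constant rate, like the timestamp column described above.
    ts = pa.array(range(0, 1000000, 5), type=pa.int64())
    table = pa.Table.from_arrays([ts], names=["ts"])

    for version in ("1.0", "2.0"):
        path = "repro_v%s.parquet" % version
        pq.write_table(table, path, version=version)
        chunk = pq.ParquetFile(path).metadata.row_group(0).column(0)
        print(version, os.path.getsize(path), chunk.encodings)

If the v2 file comes out bigger here as well, the printed encodings should show whether losing PLAIN_DICTIONARY on that column accounts for the difference.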