Hello Wes/all,

A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without
resorting to other alternatives. Is this something that could be done in
Parquet over the next few months? I have a lot of experience with file
format/storage-layer internals and can contribute to Parquet C++.
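For concreteness, here is a rough sketch of what such a type might look
like on the C++ side, mirroring the existing ByteArray in
cpp/src/parquet/types.h. The name and layout are only one possible shape
for the idea, not an agreed design:

    // Hypothetical 64-bit counterpart to parquet::ByteArray. A real
    // LARGE_BYTE_ARRAY physical type would also need a plain encoding
    // that writes an 8-byte length prefix instead of the current 4-byte one.
    struct LargeByteArray {
      LargeByteArray() : len(0), ptr(NULLPTR) {}
      LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
      uint64_t len;        // 64-bit length; ByteArray's len is uint32_t
      const uint8_t* ptr;  // non-owning pointer to the value bytes
    };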
-Brian

On 4/5/19, 3:44 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

hi Brian,

Just to comment from the C++ side -- the 64-bit issue is a limitation of
the Parquet format itself, not of the C++ implementation. It could be
interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding
(we are discussing doing much the same in Apache Arrow for the in-memory
format).

- Wes

On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> I don't think that's what you would want to do. Parquet will eventually
> compress large values, but not until after making defensive copies and
> attempting to encode them. In the end, it will be a lot more overhead,
> plus the work to make it possible. I think you'd be much better off
> compressing before storing in Parquet if you expect good compression
> rates.
>
> On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <brian.bow...@sas.com> wrote:
>
> > My hope is that these large ByteArray values will encode/compress to
> > a fraction of their original size. FWIW,
> > cpp/src/parquet/column_writer.cc/.h has int64_t offset and length
> > fields all over the place.
> >
> > External file references to BLOBs are doable, but not the elegant,
> > integrated solution I was hoping for.
> >
> > -Brian
> >
> > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
> >
> > Looks like we will need a new encoding for this:
> > https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> > That doc specifies that the plain encoding uses a 4-byte length.
> > That's not going to be a quick fix.
> >
> > Now that I'm thinking about this a bit more, does it make sense to
> > support byte arrays that are larger than 2 GB? That's far larger than
> > the size of a row group, let alone a page. This would completely
> > break memory management in the JVM implementation.
> >
> > Can you solve this problem using a BLOB type that references an
> > external file with the gigantic values? It seems to me that values
> > this large should go in separate files, not in a Parquet file, where
> > they would destroy any benefit of using the format.
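A minimal illustration of the external-reference pattern Ryan describes
above: the Parquet column stores a small fixed-size descriptor instead of
the blob itself, and the blob bytes live in a side file. The struct name
and fields below are hypothetical, not something from this thread:

    // Hypothetical descriptor written to a Parquet column in place of the
    // blob; the actual bytes live in an external file beside the dataset.
    struct BlobRef {
      char file_path[256];  // side file holding the blob bytes
      uint64_t offset;      // byte offset of the blob within that file
      uint64_t length;      // blob length; uint64_t permits sizes over 2^32
    };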
> > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
> >
> > > Hello Ryan,
> > >
> > > Looks like it's limited by both the Parquet implementation and the
> > > Thrift message methods. Am I missing anything?
> > >
> > > From cpp/src/parquet/types.h:
> > >
> > > struct ByteArray {
> > >   ByteArray() : len(0), ptr(NULLPTR) {}
> > >   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
> > >   uint32_t len;  // 32-bit length caps a single value at 2^32 - 1 bytes
> > >   const uint8_t* ptr;
> > > };
> > >
> > > From cpp/src/parquet/thrift.h (note the uint32_t lengths here as well):
> > >
> > > inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len,
> > >                                  T* deserialized_msg)
> > > inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
> > >
> > > -Brian
> > >
> > > On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
> > >
> > > Hi Brian,
> > >
> > > This seems like something we should allow. What imposes the current
> > > limit? Is it in the thrift format, or just the implementations?
> > >
> > > On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:
> > >
> > > > All,
> > > >
> > > > SAS requires support for storing varying-length character and
> > > > binary blobs with a 2^64 max length in Parquet. Currently, the
> > > > ByteArray len field is a uint32_t. It looks like this will require
> > > > incrementing the Parquet file format version and changing ByteArray
> > > > len to uint64_t.
> > > >
> > > > Have there been any requests for this or other Parquet developments
> > > > that require file format versioning changes?
> > > >
> > > > I realize this is a non-trivial ask. Thanks for considering it.
> > > >
> > > > -Brian
>
> --
> Ryan Blue
> Software Engineer
> Netflix
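For reference, a minimal sketch of the compress-before-storing approach
Ryan recommends upthread, assuming zlib; the helper name and library
choice are illustrative, not something the thread settled on. This only
helps when values compress below the 2^32 - 1 byte ByteArray limit:

    #include <cstdint>
    #include <stdexcept>
    #include <vector>
    #include <zlib.h>

    // Compress a blob before handing it to Parquet as a regular ByteArray,
    // so well-compressing values stay under the 4-byte length limit.
    // Note: zlib's uLong may be 32-bit on some platforms, so very large
    // inputs would need to be chunked first.
    std::vector<uint8_t> CompressBlob(const uint8_t* data, size_t size) {
      uLongf dest_len = compressBound(size);  // worst-case compressed size
      std::vector<uint8_t> out(dest_len);
      if (compress(out.data(), &dest_len, data, size) != Z_OK) {
        throw std::runtime_error("zlib compression failed");
      }
      out.resize(dest_len);  // shrink to the actual compressed size
      return out;
    }

The reader then needs to know the column is pre-compressed and call the
matching uncompress() on each value, and the column's own codec can be
left as UNCOMPRESSED to avoid compressing twice.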