Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md
That doc specifies that the plain encoding uses a 4-byte length. That's not
going to be a quick fix.

Now that I'm thinking about this a bit more, does it make sense to support
byte arrays that are more than 2GB? That's far larger than the size of a row
group, let alone a page. This would completely break memory management in the
JVM implementation.

Can you solve this problem using a BLOB type that references an external file
with the gigantic values? Seems to me that values this large should go in
separate files, not in a Parquet file where they would destroy any benefit
from using the format.

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:

> Hello Ryan,
>
> Looks like it's limited by both the Parquet implementation and the Thrift
> message methods. Am I missing anything?
>
> From cpp/src/parquet/types.h
>
> struct ByteArray {
>   ByteArray() : len(0), ptr(NULLPTR) {}
>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>   uint32_t len;
>   const uint8_t* ptr;
> };
>
> From cpp/src/parquet/thrift.h
>
> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len, T* deserialized_msg) {
> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
>
> -Brian
>
> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>
> > Hi Brian,
> >
> > This seems like something we should allow. What imposes the current
> > limit? Is it in the thrift format, or just the implementations?
> >
> > On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:
> >
> > > All,
> > >
> > > SAS requires support for storing varying-length character and binary
> > > blobs with a 2^64 max length in Parquet. Currently, the ByteArray len
> > > field is a uint32_t. Looks like this will require incrementing the
> > > Parquet file format version and changing ByteArray len to uint64_t.
> > >
> > > Have there been any requests for this or other Parquet developments
> > > that require file format versioning changes?
> > >
> > > I realize this is a non-trivial ask. Thanks for considering it.
> > >
> > > -Brian
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
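
For reference, here is a rough sketch of why the 4-byte length is the sticking point. Per Encodings.md, the plain encoding writes a BYTE_ARRAY value as a 4-byte little-endian length followed by the raw bytes. The helpers below (AppendPlainByteArray, ReadPlainLength) are hypothetical names made up for illustration and are not part of parquet-cpp. The sketch also assumes the prefix is treated as unsigned, matching the uint32_t len in ByteArray; an implementation that reads the prefix as a signed 32-bit int would hit the limit at 2GB instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper, not part of parquet-cpp: appends one BYTE_ARRAY value
// in PLAIN layout (4-byte little-endian length, then the raw bytes).
// Returns false when the length does not fit in the 4-byte prefix.
bool AppendPlainByteArray(const uint8_t* data, uint64_t len,
                          std::vector<uint8_t>* out) {
  if (len > std::numeric_limits<uint32_t>::max()) {
    // A longer value would need a wider length field, i.e. a new encoding.
    return false;
  }
  const uint32_t len32 = static_cast<uint32_t>(len);
  // Write the length prefix, least significant byte first.
  for (int shift = 0; shift < 32; shift += 8) {
    out->push_back(static_cast<uint8_t>((len32 >> shift) & 0xFF));
  }
  // Write the value bytes directly after the prefix.
  out->insert(out->end(), data, data + len32);
  return true;
}

// Hypothetical helper: reads the 4-byte little-endian length prefix back.
// The caller must ensure that at least 4 bytes are available in buf.
uint32_t ReadPlainLength(const uint8_t* buf) {
  return static_cast<uint32_t>(buf[0]) |
         (static_cast<uint32_t>(buf[1]) << 8) |
         (static_cast<uint32_t>(buf[2]) << 16) |
         (static_cast<uint32_t>(buf[3]) << 24);
}
```

Widening len to uint64_t in ByteArray alone would not help a reader like ReadPlainLength, since the on-disk prefix is still 4 bytes. That is why this looks like a new encoding rather than a change to the in-memory struct.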