Looks like we will need a new encoding for this:
https://github.com/apache/parquet-format/blob/master/Encodings.md

That doc specifies that the plain encoding of a byte array stores the
value's length in 4 bytes (little endian), followed by the bytes. That's
not going to be a quick fix.
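
To make the constraint concrete, here is a rough sketch of how a single
byte array value is laid out under the plain encoding, as I read that doc.
The helper names (EncodePlainByteArray, DecodePlainLength) are made up for
illustration and are not part of parquet-cpp:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper, not a parquet-cpp API: appends one value as a
// 4-byte little-endian length followed by the raw bytes.
void EncodePlainByteArray(const std::string& value, std::vector<uint8_t>* out) {
  // The length prefix is only 4 bytes wide, so anything longer cannot be
  // represented without changing the encoding.
  if (value.size() > std::numeric_limits<uint32_t>::max()) {
    throw std::length_error("value does not fit in a 4-byte length prefix");
  }
  const uint32_t len = static_cast<uint32_t>(value.size());
  for (int shift = 0; shift < 32; shift += 8) {
    out->push_back(static_cast<uint8_t>((len >> shift) & 0xFF));
  }
  out->insert(out->end(), value.begin(), value.end());
}

// Hypothetical helper: reads the 4-byte little-endian length at `offset`.
uint32_t DecodePlainLength(const std::vector<uint8_t>& buf, size_t offset) {
  if (offset + 4 > buf.size()) {
    throw std::out_of_range("buffer too short for a length prefix");
  }
  uint32_t len = 0;
  for (int i = 0; i < 4; ++i) {
    len |= static_cast<uint32_t>(buf[offset + i]) << (8 * i);
  }
  return len;
}
```

Widening that prefix changes the bytes on disk, so existing readers would
misparse the data. That is why it needs a new encoding rather than a patch
to the struct.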

Now that I'm thinking about this a bit more, does it make sense to support
byte arrays that are more than 2GB? That's far larger than the size of a
row group, let alone a page. This would completely break memory management
in the JVM implementation.

Can you solve this problem using a BLOB type that references an external
file with the gigantic values? Seems to me that values this large should go
in separate files, not in a Parquet file where it would destroy any benefit
from using the format.
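
As a sketch of what I mean, and only that: the Parquet column would hold a
small reference record, and readers would pull the value from the external
file in chunks. Nothing here exists in Parquet today. ExternalBlobRef, its
fields, and ReadBlobChunk are all hypothetical names:

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical reference record stored in place of the huge value.
struct ExternalBlobRef {
  std::string path;  // external file that holds the value
  uint64_t offset;   // byte offset of the value within that file
  uint64_t length;   // value length; 64 bits, so not capped at 4 bytes
};

// Hypothetical reader: returns up to max_bytes of the value, starting at
// chunk_offset within the value, so the whole blob never has to be in memory.
std::string ReadBlobChunk(const ExternalBlobRef& ref, uint64_t chunk_offset,
                          size_t max_bytes) {
  if (chunk_offset > ref.length) {
    throw std::out_of_range("chunk starts past the end of the blob");
  }
  const uint64_t remaining = ref.length - chunk_offset;
  const size_t n =
      remaining < max_bytes ? static_cast<size_t>(remaining) : max_bytes;

  std::ifstream in(ref.path, std::ios::binary);
  if (!in) {
    throw std::runtime_error("cannot open " + ref.path);
  }
  in.seekg(static_cast<std::streamoff>(ref.offset + chunk_offset));

  std::string chunk(n, '\0');
  in.read(&chunk[0], static_cast<std::streamsize>(n));
  if (static_cast<size_t>(in.gcount()) != n) {
    throw std::runtime_error("short read from " + ref.path);
  }
  return chunk;
}
```

The record itself is a few dozen bytes, so pages and row groups stay small
no matter how large the referenced values get.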

On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:

> Hello Ryan,
>
> Looks like it's limited by both the Parquet implementation and the Thrift
> message methods.  Am I missing anything?
>
> From cpp/src/parquet/types.h
>
> struct ByteArray {
>   ByteArray() : len(0), ptr(NULLPTR) {}
>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
>   uint32_t len;
>   const uint8_t* ptr;
> };
>
> From cpp/src/parquet/thrift.h
>
> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len,
>                                  T* deserialized_msg) {
> inline int64_t SerializeThriftMsg(T* obj, uint32_t len, OutputStream* out)
>
> -Brian
>
> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
>
>     Hi Brian,
>
>     This seems like something we should allow. What imposes the current
>     limit? Is it in the thrift format, or just the implementations?
>
>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:
>
>     > All,
>     >
>     > SAS requires support for storing varying-length character and binary
>     > blobs with a 2^64 max length in Parquet. Currently, the ByteArray len
>     > field is a uint32_t. Looks like this will require incrementing the
>     > Parquet file format version and changing ByteArray len to uint64_t.
>     >
>     > Have there been any requests for this or other Parquet developments
>     > that require file format versioning changes?
>     >
>     > I realize this is a non-trivial ask.  Thanks for considering it.
>     >
>     > -Brian
>     >
>
>
>     --
>     Ryan Blue
>     Software Engineer
>     Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix
