Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Brian Bowman
Hello Wes, Thanks for the info! I'm working to better understand Parquet/Arrow design and development processes. No hurry for LARGE_BYTE_ARRAY. -Brian

On 4/26/19, 11:14 AM, "Wes McKinney" wrote: …

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-26 Wes McKinney
Hi Brian, I doubt that such a change could be made on a short time horizon. Collecting feedback and building consensus (if it is even possible) with stakeholders would take some time. The appropriate place to have the discussion is here on the mailing list, though. Thanks

On Mon, Apr 8, 2019 at 1…

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-08 Brian Bowman
Hello Wes/all, A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without resorting to other alternatives. Is this something that could be done in Parquet over the next few months? I have a lot of experience with file formats/storage layer internals and can contribute for Parquet…

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Brian Bowman
Thanks Ryan, After further pondering this, I came to similar conclusions: compress the data before putting it into a Parquet ByteArray, and if that's not feasible, reference it in an external/persisted data structure. Another alternative is to create one or more "shadow columns" to store the over…
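[Editor's sketch] A minimal illustration of the compress-before-store approach discussed in this thread, assuming zlib; the helper name is hypothetical. The application deflates the blob itself and hands the compressed bytes to Parquet as an ordinary ByteArray. Note the result must still land under the 2^32-1 byte cap, which compression makes likely but does not guarantee:

    // Hypothetical helper: deflate a large blob before storing it as a
    // Parquet ByteArray. Assumes the platform's uLong can hold 'len'.
    #include <zlib.h>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    std::vector<uint8_t> CompressForByteArray(const uint8_t* data, uint64_t len) {
      uLongf dest_len = compressBound(static_cast<uLong>(len));
      std::vector<uint8_t> out(dest_len);
      if (compress2(out.data(), &dest_len, data, static_cast<uLong>(len),
                    Z_BEST_COMPRESSION) != Z_OK) {
        throw std::runtime_error("zlib compression failed");
      }
      out.resize(dest_len);  // dest_len now holds the actual compressed size
      return out;
    }

The external-reference and "shadow column" alternatives Brian mentions are the fallbacks for when even the compressed value exceeds the ByteArray cap.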

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Wes McKinney
Hi Brian, Just to comment from the C++ side -- the 64-bit issue is a limitation of the Parquet format itself and not related to the C++ implementation. It could be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing doing much the same in Apache Arrow…
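[Editor's sketch] The Arrow discussion Wes refers to is about widening variable-length binary offsets from 32 to 64 bits (what later shipped in Arrow as LargeBinary). A toy illustration of the two layouts, with hypothetical struct names rather than actual Arrow or Parquet types:

    // Illustration only: variable-length binary columns keep one contiguous
    // data buffer plus an offsets array; value i spans bytes
    // [offsets[i], offsets[i+1]).
    #include <cstdint>
    #include <vector>

    struct BinaryColumn32 {          // hypothetical 32-bit-offset layout
      std::vector<uint8_t> data;     // total data capped near 2^31 bytes
      std::vector<int32_t> offsets;  // length = num_values + 1
    };

    struct BinaryColumn64 {          // the "LARGE" variant under discussion
      std::vector<uint8_t> data;     // individual values may exceed 4 GB
      std::vector<int64_t> offsets;  // 64-bit offsets lift the cap
    };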

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Ryan Blue
I don't think that's what you would want to do. Parquet will eventually compress large values, but not until after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before storing…

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Brian Bowman
My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place. External file references to BLOBs are doable but not the elegant…

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Ryan Blue
Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix. Now that I'm thinking about this a bit more, does it make sense to support byte arrays…
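[Editor's sketch] The 4-byte length Ryan cites is the PLAIN encoding rule for BYTE_ARRAY in Encodings.md: each value is a little-endian 4-byte length followed by the raw bytes, which is exactly where the 2^32-1 per-value ceiling comes from. A minimal sketch of that wire layout (hypothetical helper, not parquet-cpp's actual writer):

    // PLAIN encoding for BYTE_ARRAY per Encodings.md: a 4-byte little-endian
    // length prefix, then the value's bytes. The 32-bit prefix is the cap.
    #include <cstdint>
    #include <vector>

    void PlainEncodeByteArray(const uint8_t* value, uint32_t len,
                              std::vector<uint8_t>* sink) {
      const uint8_t prefix[4] = {
          static_cast<uint8_t>(len),        // least significant byte first
          static_cast<uint8_t>(len >> 8),
          static_cast<uint8_t>(len >> 16),
          static_cast<uint8_t>(len >> 24)};
      sink->insert(sink->end(), prefix, prefix + 4);
      sink->insert(sink->end(), value, value + len);
    }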

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Brian Bowman
Hello Ryan, Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? From cpp/src/parquet/types.h:

    struct ByteArray {
      ByteArray() : len(0), ptr(NULLPTR) {}
      ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
      uint32_t len;
      const uint8_t* ptr;
    };
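[Editor's sketch] A LARGE_BYTE_ARRAY counterpart would presumably just widen that length field; a hypothetical sketch, not an actual parquet-cpp type (nullptr stands in for parquet's NULLPTR macro):

    #include <cstdint>

    struct LargeByteArray {  // hypothetical 64-bit analogue of ByteArray
      LargeByteArray() : len(0), ptr(nullptr) {}
      LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
      uint64_t len;          // uint64_t instead of uint32_t lifts the 4 GB cap
      const uint8_t* ptr;
    };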

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Ryan Blue
Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the Thrift format, or just the implementations?

On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote:
> All,
>
> SAS requires support for storing varying-length character and binary blobs
> with a 2^64 max…