Hello Wes/all,

A new LARGE_BYTE_ARRAY type in Parquet would satisfy SAS' needs without 
resorting to workarounds.  Is this something that could be done in Parquet 
over the next few months?  I have a lot of experience with file formats and 
storage-layer internals and can contribute to the Parquet C++ implementation.
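
For concreteness, here is a minimal sketch of what a 64-bit counterpart to 
the existing ByteArray struct could look like in parquet-cpp.  The 
LargeByteArray name and layout are illustrative assumptions on my part, not 
an existing API:

#include <cstdint>

// Hypothetical 64-bit analogue of parquet::ByteArray; only the len type
// changes from uint32_t to uint64_t.
struct LargeByteArray {
  LargeByteArray() : len(0), ptr(nullptr) {}
  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
  uint64_t len;        // 64-bit length lifts the 2^32 - 1 cap
  const uint8_t* ptr;  // not owned
};

The struct itself is the easy piece; the encoding-level work (64-bit length 
prefixes and page/offset handling) would be the bulk of the effort.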

-Brian

On 4/5/19, 3:44 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

    hi Brian,
    
    Just to comment from the C++ side -- the 64-bit issue is a limitation
    of the Parquet format itself, not of the C++ implementation. It could
    be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset
    encoding (we are discussing much the same for the in-memory format in
    Apache Arrow).
    
    - Wes
    
    On Fri, Apr 5, 2019 at 2:11 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
    >
    > I don't think that's what you would want to do. Parquet will
    > eventually compress large values, but not until after making defensive
    > copies and attempting to encode them. In the end, it will be a lot
    > more overhead, plus the work to make 64-bit values possible. I think
    > you'd be much better off compressing before storing in Parquet if you
    > expect good compression rates.
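    >
    > As a minimal sketch of that approach (assuming zlib here; any codec
    > with a one-shot compress call would do, and CompressBlob is just an
    > illustrative name):
    >
    > #include <cstdint>
    > #include <vector>
    > #include <zlib.h>
    >
    > // Compress a large value up front and store the result in Parquet as
    > // an ordinary BYTE_ARRAY, rather than relying on column compression.
    > std::vector<uint8_t> CompressBlob(const uint8_t* data, size_t size) {
    >   uLongf dest_len = compressBound(size);
    >   std::vector<uint8_t> out(dest_len);
    >   if (compress2(out.data(), &dest_len, data, size,
    >                 Z_DEFAULT_COMPRESSION) != Z_OK) {
    >     return {};  // caller falls back to storing the raw bytes
    >   }
    >   out.resize(dest_len);  // shrink to the actual compressed size
    >   return out;
    > }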
    >
    > On Fri, Apr 5, 2019 at 11:29 AM Brian Bowman <brian.bow...@sas.com> wrote:
    >
    > > My hope is that these large ByteArray values will encode/compress to
    > > a fraction of their original size.  FWIW,
    > > cpp/src/parquet/column_writer.cc/.h has int64_t offset and length
    > > fields all over the place.
    > >
    > > External file references to BLOBs are doable but not the elegant,
    > > integrated solution I was hoping for.
    > >
    > > -Brian
    > >
    > > On Apr 5, 2019, at 1:53 PM, Ryan Blue <rb...@netflix.com> wrote:
    > >
    > > Looks like we will need a new encoding for this:
    > > https://github.com/apache/parquet-format/blob/master/Encodings.md
    > >
    > > That doc specifies that the plain encoding uses a 4-byte length. That's
    > > not going to be a quick fix.
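    > >
    > > For reference, a sketch of that 4-byte length prefix as the plain
    > > encoding lays it out (illustrative only; AppendPlainByteArray is not
    > > a real parquet-cpp function, and this assumes a little-endian host):
    > >
    > > #include <cstdint>
    > > #include <cstring>
    > > #include <vector>
    > >
    > > // PLAIN-encoded BYTE_ARRAY: a 4-byte little-endian length, then the
    > > // raw bytes -- so a single value can never exceed what 4 bytes can
    > > // describe.
    > > void AppendPlainByteArray(std::vector<uint8_t>* out,
    > >                           const uint8_t* data, uint32_t len) {
    > >   uint8_t prefix[4];
    > >   std::memcpy(prefix, &len, sizeof(prefix));  // little-endian host
    > >   out->insert(out->end(), prefix, prefix + sizeof(prefix));
    > >   out->insert(out->end(), data, data + len);
    > > }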
    > >
    > > Now that I'm thinking about this a bit more, does it make sense to
    > > support byte arrays that are more than 2GB? That's far larger than
    > > the size of a row group, let alone a page. This would completely
    > > break memory management in the JVM implementation.
    > >
    > > Can you solve this problem using a BLOB type that references an
    > > external file with the gigantic values? It seems to me that values
    > > this large should go in separate files, not in a Parquet file, where
    > > they would destroy any benefit of using the format.
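    > >
    > > One shape such a reference could take (field names are purely
    > > illustrative, not a proposed spec):
    > >
    > > #include <cstdint>
    > > #include <string>
    > >
    > > // A small, fixed-size descriptor stored in the Parquet column; the
    > > // gigantic value itself lives in a separate file.
    > > struct BlobRef {
    > >   std::string path;  // external file holding the value
    > >   uint64_t offset;   // byte offset of the value in that file
    > >   uint64_t length;   // value length, free to exceed 2^32
    > > };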
    > >
    > > On Fri, Apr 5, 2019 at 10:43 AM Brian Bowman <brian.bow...@sas.com> wrote:
    > >
    > >> Hello Ryan,
    > >>
    > >> Looks like it's limited by both the Parquet implementation and the
    > >> Thrift message methods.  Am I missing anything?
    > >>
    > >> From cpp/src/parquet/types.h
    > >>
    > >> struct ByteArray {
    > >>   ByteArray() : len(0), ptr(NULLPTR) {}
    > >>   ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}
    > >>   uint32_t len;  // 32-bit length caps a single value at 2^32 - 1 bytes
    > >>   const uint8_t* ptr;
    > >> };
    > >>
    > >> From cpp/src/parquet/thrift.h
    > >>
    > >> template <class T>
    > >> inline void DeserializeThriftMsg(const uint8_t* buf, uint32_t* len,
    > >>                                  T* deserialized_msg) { ... }
    > >>
    > >> template <class T>
    > >> inline int64_t SerializeThriftMsg(T* obj, uint32_t len,
    > >>                                   OutputStream* out) { ... }
    > >>
    > >> -Brian
    > >>
    > >> On 4/5/19, 1:32 PM, "Ryan Blue" <rb...@netflix.com.INVALID> wrote:
    > >>
    > >>     EXTERNAL
    > >>
    > >>     Hi Brian,
    > >>
    > >>     This seems like something we should allow. What imposes the
    > >>     current limit? Is it in the Thrift format, or just the
    > >>     implementations?
    > >>
    > >>     On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman <brian.bow...@sas.com> wrote:
    > >>
    > >>     > All,
    > >>     >
    > >>     > SAS requires support for storing varying-length character and
    > >>     > binary blobs with a 2^64 max length in Parquet.  Currently, the
    > >>     > ByteArray len field is a uint32_t.  Looks like this will require
    > >>     > incrementing the Parquet file format version and changing
    > >>     > ByteArray len to uint64_t.
    > >>     >
    > >>     > Have there been any requests for this or other Parquet
    > >>     > developments that require file format versioning changes?
    > >>     >
    > >>     > I realize this is a non-trivial ask.  Thanks for considering it.
    > >>     >
    > >>     > -Brian
    > >>     >
    > >>
    > >>
    > >>     --
    > >>     Ryan Blue
    > >>     Software Engineer
    > >>     Netflix
    > >>
    > >>
    > >>
    > >
    > > --
    > > Ryan Blue
    > > Software Engineer
    > > Netflix
    > >
    > >
    >
    > --
    > Ryan Blue
    > Software Engineer
    > Netflix
    
