Re: [VOTE] Proposed changes to Arrow Flight protocol

2019-04-05 Thread Kouhei Sutou
+1 (binding) In "[VOTE] Proposed changes to Arrow Flight protocol" on Tue, 2 Apr 2019 19:05:27 -0500, Wes McKinney wrote: > Hi, > > David Li has proposed to make the following additions or changes > to the Flight gRPC service definition [1] and general design, as explained in > greater de

Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-05 Thread Kouhei Sutou
+1 (binding) In "[VOTE] Add new DurationInterval Type to Arrow Format" on Wed, 3 Apr 2019 07:59:56 -0700, Jacques Nadeau wrote: > I'd like to propose a change to the Arrow format to support a new duration > type. Details below. Threads on mailing list around discussion. > > > // An absol

Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-05 Thread Micah Kornfield
I think this needs another PMC member to weigh in? Would you mind taking a look? On Wed, Apr 3, 2019 at 9:21 AM Jacques Nadeau wrote: > Yes, copy and paste error: > > +1 to add the new type (binding) > > On Wed, Apr 3, 2019 at 8:36 AM Wes McKinney wrote: > > > +1 (binding) to add the new type > > >

[jira] [Created] (ARROW-5130) Segfault when importing TensorFlow after Pyarrow

2019-04-05 Thread Travis Addair (JIRA)
Travis Addair created ARROW-5130: Summary: Segfault when importing TensorFlow after Pyarrow Key: ARROW-5130 URL: https://issues.apache.org/jira/browse/ARROW-5130 Project: Apache Arrow Issue T

Re: [VOTE] Proposed changes to Arrow Flight protocol

2019-04-05 Thread Wes McKinney
hi, We still need another PMC member to look at the 4 proposals, since 2 of them do not have the requisite votes. Thanks On Thu, Apr 4, 2019 at 1:28 PM Wes McKinney wrote: > > Could some other PMC members have a look at these proposals? 2 out of > the 4 have the requisite 3 votes, while 2 need another

[DRAFT] Apache Arrow ASF Board Report April 2019

2019-04-05 Thread Wes McKinney
## Description: Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Thanks Ryan, After further pondering this, I came to similar conclusions. Compress the data before putting it into a Parquet ByteArray and, if that’s not feasible, reference it in an external/persisted data structure. Another alternative is to create one or more “shadow columns” to store the over

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Wes McKinney
hi Brian, Just to comment from the C++ side -- the 64-bit issue is a limitation of the Parquet format itself and not related to the C++ implementation. It could be interesting to add a LARGE_BYTE_ARRAY type with 64-bit offset encoding (we are discussing doing much the same in Apache Arrow

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
I don't think that's what you would want to do. Parquet will eventually compress large values, but not after making defensive copies and attempting to encode them. In the end, it will be a lot more overhead, plus the work to make it possible. I think you'd be much better off compressing before stori
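
A minimal Python sketch of the "compress before storing" idea raised in this message. It uses zlib from the standard library purely as a stand-in codec. The helper name `prepare_blob` and the `UINT32_MAX` check are assumptions for illustration, not part of any Parquet API. The check mirrors the 32-bit length limit discussed in this thread.

```python
import zlib

# Largest length representable in an unsigned 32-bit field.
UINT32_MAX = 2**32 - 1

def prepare_blob(raw: bytes, level: int = 6) -> bytes:
    """Hypothetical helper: compress a value before handing it to a writer.

    Raises ValueError if even the compressed form would not fit in a
    32-bit length field.
    """
    compressed = zlib.compress(raw, level)
    if len(compressed) > UINT32_MAX:
        raise ValueError("compressed value still exceeds the 32-bit length limit")
    return compressed

# Highly repetitive input compresses to a small fraction of its size.
original = b"a" * 1_000_000
blob = prepare_blob(original)
restored = zlib.decompress(blob)
```

The reader of such a column would need to know that the stored bytes are zlib-compressed. How that is recorded (for example in column metadata) is left open here.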

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
My hope is that these large ByteArray values will encode/compress to a fraction of their original size. FWIW, cpp/src/parquet/column_writer.cc/.h has int64_t offset and length fields all over the place. External file references to BLOBS are doable but not the elegant

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
Looks like we will need a new encoding for this: https://github.com/apache/parquet-format/blob/master/Encodings.md That doc specifies that the plain encoding uses a 4-byte length. That's not going to be a quick fix. Now that I'm thinking about this a bit more, does it make sense to support byte a
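
To make the 4-byte length concrete, here is a minimal Python sketch of a length-prefixed value as the linked Encodings.md describes for plain-encoded byte arrays: a 4-byte little-endian length followed by the raw bytes. The function names are hypothetical and do not come from any Parquet library. The length is treated as unsigned here, which is an assumption chosen to match the `uint32_t len` field quoted from cpp/src/parquet/types.h in this thread.

```python
import struct

# Largest length a 4-byte unsigned prefix can hold.
MAX_PLAIN_LEN = 2**32 - 1

def encode_plain_byte_array(value: bytes) -> bytes:
    """Hypothetical helper: 4-byte little-endian length, then the raw bytes."""
    if len(value) > MAX_PLAIN_LEN:
        raise ValueError("value too long for a 4-byte length prefix")
    return struct.pack("<I", len(value)) + value

def decode_plain_byte_array(buf: bytes, offset: int = 0):
    """Return (value, next_offset) for one length-prefixed value in buf."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length

encoded = encode_plain_byte_array(b"arrow")
value, next_offset = decode_plain_byte_array(encoded)
```

Because the prefix is fixed at 4 bytes, supporting longer values means changing the on-disk layout, which is why a new encoding (rather than an implementation patch) comes up in this message.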

[jira] [Created] (ARROW-5129) Column writer bug: check dictionary encoder when adding a new data page

2019-04-05 Thread Ivan Sadikov (JIRA)
Ivan Sadikov created ARROW-5129: --- Summary: Column writer bug: check dictionary encoder when adding a new data page Key: ARROW-5129 URL: https://issues.apache.org/jira/browse/ARROW-5129 Project: Apache A

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
Hello Ryan, Looks like it's limited by both the Parquet implementation and the Thrift message methods. Am I missing anything? From cpp/src/parquet/types.h struct ByteArray { ByteArray() : len(0), ptr(NULLPTR) {} ByteArray(uint32_t len, const uint8_t* ptr) : len(len), ptr(ptr) {} uint32_

Re: Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Ryan Blue
Hi Brian, This seems like something we should allow. What imposes the current limit? Is it in the thrift format, or just the implementations? On Fri, Apr 5, 2019 at 10:23 AM Brian Bowman wrote: > All, > > SAS requires support for storing varying-length character and binary blobs > with a 2^64 m

Need 64-bit Integer length for Parquet ByteArray Type

2019-04-05 Thread Brian Bowman
All, SAS requires support for storing varying-length character and binary blobs with a 2^64 max length in Parquet. Currently, the ByteArray len field is a uint32_t. It looks like this will require incrementing the Parquet file format version and changing ByteArray len to uint64_t. Have there

[jira] [Created] (ARROW-5128) [Packaging][CentOS][Conda] Numpy not found in nightly builds

2019-04-05 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-5128: -- Summary: [Packaging][CentOS][Conda] Numpy not found in nightly builds Key: ARROW-5128 URL: https://issues.apache.org/jira/browse/ARROW-5128 Project: Apache Arrow

[jira] [Created] (ARROW-5127) [Rust] [Parquet] Add page iterator

2019-04-05 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5127: - Summary: [Rust] [Parquet] Add page iterator Key: ARROW-5127 URL: https://issues.apache.org/jira/browse/ARROW-5127 Project: Apache Arrow Issue Type: Sub-task

[jira] [Created] (ARROW-5126) [Rust] [Parquet] Convert parquet column desc to arrow data type

2019-04-05 Thread Renjie Liu (JIRA)
Renjie Liu created ARROW-5126: - Summary: [Rust] [Parquet] Convert parquet column desc to arrow data type Key: ARROW-5126 URL: https://issues.apache.org/jira/browse/ARROW-5126 Project: Apache Arrow

[jira] [Created] (ARROW-5125) [Python] Cannot roundtrip extreme dates through pyarrow

2019-04-05 Thread Max Bolingbroke (JIRA)
Max Bolingbroke created ARROW-5125: -- Summary: [Python] Cannot roundtrip extreme dates through pyarrow Key: ARROW-5125 URL: https://issues.apache.org/jira/browse/ARROW-5125 Project: Apache Arrow