Re: [Discuss][C++] Hashing floating point numbers

2019-02-25 Thread Tim Armstrong
Hi Micah, We have run into some of these issues on Impala in various guises, including hash tables and min/max stats in parquet. Treating +0/-0 as indistinguishable for purposes of equality and grouping makes the most sense and avoids most pitfalls. NaN is messier. I don't think there's necessar
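The +0/-0 point above can be sketched in a few lines: map each double to a canonical 64-bit pattern before hashing, so that values that should land in one group share one key. This is only an illustration, not Impala's or Arrow's implementation. The names `canonical_bits`, `hash_double`, and `CANONICAL_NAN_BITS` are hypothetical, and collapsing every NaN into a single group is an assumption made for the sketch; the message itself only says that NaN is messier.

```python
import math
import struct

# Hypothetical constant: the quiet-NaN bit pattern with an empty payload.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def canonical_bits(x: float) -> int:
    """Return the 64-bit pattern used as the hash/grouping key for x."""
    if math.isnan(x):
        # Assumption for this sketch: every NaN, whatever its sign or
        # payload, is placed in the same group.
        return CANONICAL_NAN_BITS
    if x == 0.0:
        # +0.0 and -0.0 compare equal, so both map to the +0.0 pattern.
        return 0
    # All other values keep their IEEE 754 bit pattern unchanged.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def hash_double(x: float) -> int:
    """Hash a double through its canonical bit pattern."""
    return hash(canonical_bits(x))
```

Hashing the raw bytes of a double without this step would put -0.0 and 0.0 in different buckets even though they compare equal, which is the pitfall the canonicalization avoids.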

Re: Question about pyarrow array representation.

2019-02-25 Thread Wes McKinney
hi Peng, Here is a minimal reproduction of the issue you're having:
In [38]: arr = np.empty(2, dtype=object)
In [39]: arr[0] = np.array([1, 2])
In [40]: arr[1] = np.array([2, 3])
In [41]: arr2 = np.empty(2, dtype=object)
In [42]: arr2[0] = arr
In [43]: arr2[1] = arr
In [45]: pa.array(arr2)

[Discuss][Java] Codebase Housekeeping?

2019-02-25 Thread Micah Kornfield
Hi Java Arrow-Developers, I've been looking more into the Java code base and I was wondering if people think any of the following might be worthwhile (or are strictly against them). My Java infrastructure knowledge is a little stale, so if a suggestion I make is absolutely ridiculous, I apologize.

[jira] [Created] (ARROW-4681) [Rust] [DataFusion] Implement parallel query execution using threads

2019-02-25 Thread Andy Grove (JIRA)
Andy Grove created ARROW-4681: - Summary: [Rust] [DataFusion] Implement parallel query execution using threads Key: ARROW-4681 URL: https://issues.apache.org/jira/browse/ARROW-4681 Project: Apache Arrow

[Flight] A few questions

2019-02-25 Thread Micah Kornfield
I apologize that I'm a little late in chiming in on Flight, but I have some questions/comments for which a quick search of the mailing list didn't turn up anything, and I didn't see them addressed in comments on the initial pull request [1]. 1. What is meant by "sidecar patterns" [2] on the data buffer bytes? 2. Was usi

[jira] [Created] (ARROW-4680) [CI] [Rust] Travis CI builds fail with latest Rust 1.34.0-nightly (2019-02-25)

2019-02-25 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-4680: - Summary: [CI] [Rust] Travis CI builds fail with latest Rust 1.34.0-nightly (2019-02-25) Key: ARROW-4680 URL: https://issues.apache.org/jira/browse/ARROW-4680 Projec

[Discuss][C++] Hashing floating point numbers

2019-02-25 Thread Micah Kornfield
Implementing compute kernels that depend on hashing has raised a couple of edge cases that are worth discussing. The following points need to be resolved (I opened a JIRA [1] to track the fixes): 1. How to handle -0.0 and 0.0? - Option 1: Collapse to a single value

Re: [C++] Help with windows build failure

2019-02-25 Thread Micah Kornfield
The issue I'm blocked on is getting boost installed properly. I've included all of the steps I've run below. If anyone has some thoughts or the magical script to build and install the boost libraries appropriate for the Static_Crt_Build, I would greatly appreciate it. With a Windows 10

[jira] [Created] (ARROW-4679) [Rust] [DataFusion] Implement in-memory DataSource

2019-02-25 Thread Andy Grove (JIRA)
Andy Grove created ARROW-4679: - Summary: [Rust] [DataFusion] Implement in-memory DataSource Key: ARROW-4679 URL: https://issues.apache.org/jira/browse/ARROW-4679 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-4678) [Rust] Minimize unstable feature usage

2019-02-25 Thread Steven Fackler (JIRA)
Steven Fackler created ARROW-4678: - Summary: [Rust] Minimize unstable feature usage Key: ARROW-4678 URL: https://issues.apache.org/jira/browse/ARROW-4678 Project: Apache Arrow Issue Type: Imp

[jira] [Created] (ARROW-4677) [Python] serialization does not consider ndarray endianness

2019-02-25 Thread Gabe Joseph (JIRA)
Gabe Joseph created ARROW-4677: -- Summary: [Python] serialization does not consider ndarray endianness Key: ARROW-4677 URL: https://issues.apache.org/jira/browse/ARROW-4677 Project: Apache Arrow

Re: Passing user-defined "extension" types in the Arrow protocol

2019-02-25 Thread Wes McKinney
On Mon, Feb 25, 2019 at 5:36 PM Antoine Pitrou wrote: > > Does it also roundtrip through e.g. Pandas conversion? No. Any Arrow metadata is lost when you call to_pandas() (because pandas objects don't have the ability to preserve any column-level metadata, only the physical data type). The metadat

Re: Passing user-defined "extension" types in the Arrow protocol

2019-02-25 Thread Antoine Pitrou
Le 26/02/2019 à 00:32, Wes McKinney a écrit : > hi folks, > > I recently wrote a patch to propose a C++ API for user-defined "extension" > types > > https://github.com/apache/arrow/pull/3694 > > The idea is that an extension type wraps a pre-existing Arrow type. > For example a UUIDType can b

Passing user-defined "extension" types in the Arrow protocol

2019-02-25 Thread Wes McKinney
hi folks, I recently wrote a patch to propose a C++ API for user-defined "extension" types https://github.com/apache/arrow/pull/3694 The idea is that an extension type wraps a pre-existing Arrow type. For example a UUIDType can be represented as FixedSizeBinary(16). The intent is that Arrow cons
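The idea of an extension type wrapping a pre-existing storage type can be illustrated without any Arrow API. The sketch below uses only the Python standard library and is not the proposed C++ interface: `UuidColumn` and the `"example.uuid"` name are hypothetical, and the flat byte string merely stands in for a FixedSizeBinary(16) column.

```python
import uuid


class UuidColumn:
    """Hypothetical logical UUID column over fixed-width 16-byte storage."""

    STORAGE_WIDTH = 16               # bytes per value, as in FixedSizeBinary(16)
    EXTENSION_NAME = "example.uuid"  # hypothetical name identifying the type

    def __init__(self, values):
        # Storage layer: the values packed back to back, 16 bytes each.
        self.storage = b"".join(u.bytes for u in values)

    def __len__(self):
        return len(self.storage) // self.STORAGE_WIDTH

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = i * self.STORAGE_WIDTH
        # Logical layer: reinterpret one 16-byte slot as a UUID.
        return uuid.UUID(bytes=self.storage[start:start + self.STORAGE_WIDTH])
```

A reader that does not know the extension name can still treat `storage` as plain fixed-width binary data; only the logical interpretation is lost.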

Nightly binary packages

2019-02-25 Thread Krisztián Szűcs
Hi, We currently have nightly package builds under my GitHub account, which is not really visible. It would be great to make them available for developer purposes, and additionally it'd test the binary scripts too. The nightly packages are produced the same way as is documented in

[jira] [Created] (ARROW-4676) [C++] Add support for debug build with MinGW

2019-02-25 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-4676: --- Summary: [C++] Add support for debug build with MinGW Key: ARROW-4676 URL: https://issues.apache.org/jira/browse/ARROW-4676 Project: Apache Arrow Issue Type: I

[jira] [Created] (ARROW-4675) [Python] Error serializing bool ndarray in py2 and deserializing in py3

2019-02-25 Thread Gabe Joseph (JIRA)
Gabe Joseph created ARROW-4675: -- Summary: [Python] Error serializing bool ndarray in py2 and deserializing in py3 Key: ARROW-4675 URL: https://issues.apache.org/jira/browse/ARROW-4675 Project: Apache Arr

[jira] [Created] (ARROW-4674) [JS] Update arrow2csv to new Row API

2019-02-25 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4674: -- Summary: [JS] Update arrow2csv to new Row API Key: ARROW-4674 URL: https://issues.apache.org/jira/browse/ARROW-4674 Project: Apache Arrow Issue Type: Bug

Parquet Shared Library Versioning

2019-02-25 Thread Hatem Helal
Hi all, I’d like to discuss the versioning of the parquet shared libs that are built when you use -DARROW_PARQUET=ON. My observation is that back when parquet-cpp was a separate project the shared libs were versioned using the parquet-cpp version number (e.g 1.4.0). Since moving to a single r

Re: [jira] [Created] (ARROW-4673) [C++] Implement AssertDatumEquals

2019-02-25 Thread Micah Kornfield
It might be nice to do this as a Gmock matcher instead of a separate macro On Monday, February 25, 2019, Francois Saint-Jacques (JIRA) wrote: > Francois Saint-Jacques created ARROW-4673: > - > > Summary: [C++] Implement AssertDatumEquals >

[jira] [Created] (ARROW-4673) [C++] Implement AssertDatumEquals

2019-02-25 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-4673: - Summary: [C++] Implement AssertDatumEquals Key: ARROW-4673 URL: https://issues.apache.org/jira/browse/ARROW-4673 Project: Apache Arrow Issu

Re: [DISCUSSION] Representing Map datatype using ValueVectors

2019-02-25 Thread Igor Guzenko
Thanks for the quick response, I'll update the discussion in case of progress. On Mon, Feb 25, 2019 at 6:01 PM Wes McKinney wrote: > > hi Igor, > > We have Map as a top-level logical data type in the columnar metadata: > > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L55 > > There is

Re: Developing a "dataset" API / framework for Arrow C++ users

2019-02-25 Thread Wes McKinney
hi Joel and Uwe, yes, feedback from the Iceberg community would be useful about what kinds of APIs are required to be able to interact well with table formats like Iceberg. As Uwe says, the objective of the C++ code I am proposing to develop is to have appropriate C++ APIs for interacting with dif

Re: [DISCUSSION] Representing Map datatype using ValueVectors

2019-02-25 Thread Ravindra Pindikura
> On Feb 25, 2019, at 8:02 PM, Ihor Huzenko wrote: > > Hello Arrow Team, > > My name is Igor Guzenko. I'm currently working on task related to > complex types in Apache Drill [1], and bumped into an issue that Drill > hasn't > appropriate vector for representing canonical (java-like) Map data

Re: [DISCUSSION] Representing Map datatype using ValueVectors

2019-02-25 Thread Wes McKinney
hi Igor, We have Map as a top-level logical data type in the columnar metadata: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L55 There isn't anything more than this right now. We have not implemented container types in Java or C++ yet, for the Map type, but I don't view it to be

[DISCUSSION] Representing Map datatype using ValueVectors

2019-02-25 Thread Ihor Huzenko
Hello Arrow Team, My name is Igor Guzenko. I'm currently working on a task related to complex types in Apache Drill [1], and bumped into an issue: Drill doesn't have an appropriate vector for representing the canonical (Java-like) Map datatype [2]. So I'm looking for inspiration on how the efficient columnar ma

[RESULT][VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-25 Thread Uwe L. Korn
With +6 (+4 binding) the vote passes. I will upload the artifacts soon. On Mon, Feb 25, 2019, at 11:28 AM, Hatem Helal wrote: > +1 (non-binding) > > Built on macOS 10.13 and ran unittests. > > > On 2/24/19, 1:43 PM, "Wes McKinney" wrote: > > +1 (binding) > > Verified release can

Re: [VOTE] Release Apache Arrow 0.12.1 RC0

2019-02-25 Thread Hatem Helal
+1 (non-binding) Built on macOS 10.13 and ran unittests. On 2/24/19, 1:43 PM, "Wes McKinney" wrote: +1 (binding) Verified release candidate with Windows 10 MSVC 2015 On Fri, Feb 22, 2019 at 4:14 PM Kouhei Sutou wrote: > > +1 (binding) > > I ran the follo

Re: Developing a "dataset" API / framework for Arrow C++ users

2019-02-25 Thread Uwe L. Korn
Hello, this should definitely be shared with the Apache Iceberg community (cc'ed). The title of the document may be a bit confusing. What is proposed in there is actually constructing the building blocks in C++ that are required for supporting Python/C++/.. implementations for things like Icebe

Re: Developing a "dataset" API / framework for Arrow C++ users

2019-02-25 Thread Joel Pfaff
Hello, Thanks for the write-up. Have you considered sharing this document with the Apache Iceberg community? My feeling is that there are some shared goals here between the two projects. And while their implementation is in Java, their spec is language agnostic. Regards, Joel On Sun, Feb 24,