Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-30 Thread Wes McKinney
hi Hatem, Thanks for commenting. I am not sure your solution will work reliably because code is written against arrow::DictionaryType with the presumption that the dictionary is known and static, and can be obtained by invoking DictionaryType::dictionary. In the variable dictionary case, the dict

[DISCUSS][C++][Proposal] Threading engine for Arrow

2019-04-30 Thread Malakhov, Anton
Hi dear Arrow developers, Antoine, I'd like to kick off the discussion of the threading engine that Arrow can use underneath for implementing multicore parallelism for execution nodes, kernels, and/or all the functions, which can be optimized this way. I've documented some ideas on Arrow's Confl

Re: How about inet4/inet6/macaddr data types?

2019-04-30 Thread Wes McKinney
hi Kohei, I'm awaiting community feedback about the approach to implementing extension types, whether the approach that I've used (using the following keys in custom_metadata [1]) is the one that we want to use longer-term. This certainly seems like a good time to have that discussion. If there is

Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-30 Thread Micah Kornfield
OK, I think https://github.com/apache/arrow/pull/3644 is no ready to review. It includes Java implementation of DurationInterval and C++ implementations of DurationInterval and the original interval types. I added documentation to Schema.fbs regarding the original interval types (TL;DR; YEAR_MONT

Re: [VOTE] Add new DurationInterval Type to Arrow Format

2019-04-30 Thread Micah Kornfield
Sorry for the typo. OK, I think https://github.com/apache/arrow/pull/3644 is now ready to review. On Tue, Apr 30, 2019 at 4:56 PM Micah Kornfield wrote: > OK, I think https://github.com/apache/arrow/pull/3644 is no ready to > review. > > It includes Java implementation of DurationInterval and C++

Re: How about inet4/inet6/macaddr data types?

2019-04-30 Thread Kohei KaiGai
Hello Wes, @ktou also introduced me to your work. As long as the custom_metadata format to declare the custom datatype is well defined in the specification or document somewhere, independent from the library implementation, it looks sufficient to me. Does your UUID example use FixedSizeBinary raw-dat

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-30 Thread Hatem Helal
Hi Wes, Thanks for the detailed writeup and I think this is an important problem to solve. I spent some time thinking about this when working on ARROW-3769 and came to a similar conclusion that the current dictionary type was limiting when doing partial reads of parquet files. I'm not sure if th

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-30 Thread Parth Chandra
FWIW, in Drill's Value Vector code, we found that bounds checking was a major performance bottleneck in operators that wrote to vectors. Scans, as a result, were particularly affected. Another bottleneck was the zeroing of vectors. There were many unnecessary bounds checks. For example in a varchar v

[jira] [Created] (ARROW-5246) [Go] use Go-1.12 in CI

2019-04-30 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5246: -- Summary: [Go] use Go-1.12 in CI Key: ARROW-5246 URL: https://issues.apache.org/jira/browse/ARROW-5246 Project: Apache Arrow Issue Type: Bug Com

[jira] [Created] (ARROW-5245) [C++][CI] Unpin cmake_format

2019-04-30 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5245: -- Summary: [C++][CI] Unpin cmake_format Key: ARROW-5245 URL: https://issues.apache.org/jira/browse/ARROW-5245 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-5244) [C++] Review experimental / unstable APIs

2019-04-30 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5244: - Summary: [C++] Review experimental / unstable APIs Key: ARROW-5244 URL: https://issues.apache.org/jira/browse/ARROW-5244 Project: Apache Arrow Issue Type:

Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-30 Thread Wes McKinney
hi Shawn, The first step would be to write a more formalized requirements / specification document for discussion, but it is definitely no small project. Ultimately as they say "code settles arguments" so creating implementations based on a design document will help move along the process. I'd li

Re: How about inet4/inet6/macaddr data types?

2019-04-30 Thread Wes McKinney
hi Kohei, Since the introduction of arrow::ExtensionType in ARROW-585 [1] we have a well-defined method of creating new data types without having to manually interact with the custom_metadata Schema information. Can you have a look at that and see if it meets your requirements? This can be a usefu

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-30 Thread Bobby Evans
I wanted to give everyone a heads up that I have updated the SPIP at https://issues.apache.org/jira/browse/SPARK-27396 Please take a look and add any comments you might have to the JIRA. I reduced the scope of the SPIP to just the non-controversial parts. In the background, I will be trying to wo

[jira] [Created] (ARROW-5243) [Java][Gandiva] Add test for decimal compare functions

2019-04-30 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5243: - Summary: [Java][Gandiva] Add test for decimal compare functions Key: ARROW-5243 URL: https://issues.apache.org/jira/browse/ARROW-5243 Project: Apache Arrow