Re: Use arrow as a general data serialization framework in distributed stream data processing

2019-04-29 Thread Shawn Yang
Hi Micah, Thank you for your information about the in-memory row-oriented standard. After days of work, I find that it is exactly the thing we need now. I looked into the discussion you mentioned. It seems no one has taken up the work. Is there anything I can do to speed up us having in-memory row-oriented st

Re: How about inet4/inet6/macaddr data types?

2019-04-29 Thread Kohei KaiGai
Hello, It is a proposal to add new logical types to the Apache Arrow data format. As Melik-Adamyan said, it is quite easy to convert a 5-byte FixedSizeBinary to PostgreSQL's inet data type with the Arrow_Fdw module (a PostgreSQL extension responsible for data conversion), however, it is not

Re: How about inet4/inet6/macaddr data types?

2019-04-29 Thread Micah Kornfield
Hi KaiGai Kohei, Can you clarify if you are looking for advice on modelling these types or proposing to add new logical types to the Arrow specification? Thanks, Micah On Monday, April 29, 2019, Kohei KaiGai wrote: > Hello folks, > > How about your opinions about network address types support i

RE: How about inet4/inet6/macaddr data types?

2019-04-29 Thread Melik-Adamyan, Areg
If you want to store and manipulate it, the best format is integers (or binary): it allows all the fast operations of masking, subnet querying, etc., but text representation will require conversion. It highly depends on the use-case, but conversion to pgSQL's inet or cidr from integer is ver
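
The point about integer storage can be illustrated with a minimal sketch using Python's standard `ipaddress` module. The sample address, the subnet, and the helper name `in_subnet` are made up for illustration; this is not Arrow or Arrow_Fdw code. The address is kept as a 32-bit integer, the subnet test is a single bitwise AND against the netmask, and text conversion happens only when a readable form is needed.

```python
import ipaddress


def in_subnet(addr_int: int, network: ipaddress.IPv4Network) -> bool:
    """Return True if the integer-encoded IPv4 address lies inside `network`."""
    mask = int(network.netmask)
    # Masking off the host bits must yield the network's base address.
    return (addr_int & mask) == int(network.network_address)


# Hypothetical sample data: store the address as an integer, not as text.
addr = int(ipaddress.IPv4Address("192.168.10.37"))
net = ipaddress.IPv4Network("192.168.10.0/24")

inside = in_subnet(addr, net)

# Converting back to text is a separate step, needed only for display.
text = str(ipaddress.IPv4Address(addr))
```

The same masking idea carries over to IPv6 with 128-bit integers, which in a fixed-width columnar layout would likely be held as binary rather than as a native integer.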

[jira] [Created] (ARROW-5242) Arrow doesn't compile cleanly with Visual Studio 2017 Update 9 or later due to narrowing

2019-04-29 Thread Billy Robert O'Neal III (JIRA)
Billy Robert O'Neal III created ARROW-5242: Summary: Arrow doesn't compile cleanly with Visual Studio 2017 Update 9 or later due to narrowing Key: ARROW-5242 URL: https://issues.apache.org/jira/browse/AR

How about inet4/inet6/macaddr data types?

2019-04-29 Thread Kohei KaiGai
Hello folks, What are your opinions on network address type support in the Apache Arrow data format? Network addresses always appear in the network logs generated in large volumes by network facilities, and they are significant information when people analyze their past logs. I'm working on Apache Ar

Re: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Wes McKinney
AFAIK no one has been employing systematic IP scanning tools; generally when there is code reuse in a pull request it is fairly obvious. It would be interesting to know how large, mature Apache projects (Apache Hadoop, Apache Spark, etc.) have approached this problem. On Mon, Apr 29, 2019 at 5:13

RE: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Melik-Adamyan, Areg
Hi Wes, thanks for the reply. How do the committers and PMC check the IP currently? Is there any standard tool for it that you use? > -Original Message- > From: Wes McKinney [mailto:wesmck...@gmail.com] > Sent: Monday, April 29, 2019 4:39 PM > To: dev@arrow.apache.org > Subject: Re: [Cont

[jira] [Created] (ARROW-5241) [Python] Add option to disable writing statistics

2019-04-29 Thread Deepak Majeti (JIRA)
Deepak Majeti created ARROW-5241: Summary: [Python] Add option to disable writing statistics Key: ARROW-5241 URL: https://issues.apache.org/jira/browse/ARROW-5241 Project: Apache Arrow Issue

Re: [Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Wes McKinney
hi Areg, I think this is a question for ASF Legal and not Apache Arrow directly. Some contributors submit an ICLA or CCLA to the project, but broadly it is the responsibility of the Committers and PMC members to steward IP in the project, and one of the parts of the release process is to verify tha

[jira] [Created] (ARROW-5240) [C++][CI] cmake_format 0.5.0 appears to fail the build

2019-04-29 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5240: Summary: [C++][CI] cmake_format 0.5.0 appears to fail the build Key: ARROW-5240 URL: https://issues.apache.org/jira/browse/ARROW-5240 Project: Apache Arrow

[Contribution][Proposal] Use Contributors file and Signed-Off-By Process for Arrow

2019-04-29 Thread Melik-Adamyan, Areg
To avoid contaminating the Arrow code with incorrectly licensed code (including GPL code) that can be accidentally included in Arrow, and to track contributions, maintainers need to check whether a committer has actually signed the ICLA or CCLA and is listed in the contributors file - which we do n

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
On Mon, Apr 29, 2019 at 2:59 PM Micah Kornfield wrote: > > > > > > * The _actual_ dictionary values for a particular Array must be stored > > > somewhere and lifetime managed. I propose to put these as a single > > > entry in ArrayData::child_data [4]. An alternative to this would be to > > > modi

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Micah Kornfield
> > > * The _actual_ dictionary values for a particular Array must be stored > > somewhere and lifetime managed. I propose to put these as a single > > entry in ArrayData::child_data [4]. An alternative to this would be to > > modify ArrayData to have a dictionary field that would be unused > > exc

Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Antoine Pitrou
Hi Wes, On 29/04/2019 at 20:10, Wes McKinney wrote: > > * Receiving a record batch schema without the dictionaries attached > (e.g. in Arrow Flight), see also experimental patch [2] Note that this was finally done in a separate PR, and only required changes in the IPC implementation. > Here

[jira] [Created] (ARROW-5239) Add support for interval types in javascript

2019-04-29 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5239: Summary: Add support for interval types in javascript Key: ARROW-5239 URL: https://issues.apache.org/jira/browse/ARROW-5239 Project: Apache Arrow Issue T

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-29 Thread Wes McKinney
I'm also curious which APIs are particularly problematic for performance. In ARROW-1833 [1] and some related discussions there was the suggestion of adding methods like getUnsafe, so this would be like get(i) [2] but without checking the validity bitmap. [1]: https://issues.apache.org/jira/browse/
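
The difference between a checked `get(i)` and a `getUnsafe`-style accessor can be sketched with a toy model. This is not the Arrow Java API: the class `ToyIntVector`, its method names, and the choice of `ValueError` are all assumptions made for illustration. The only Arrow-specific detail relied on is that validity bitmaps use least-significant-bit ordering, so slot `i` lives in byte `i // 8` at bit `i % 8`.

```python
class ToyIntVector:
    """Toy stand-in for a nullable integer vector (hypothetical, not Arrow's class)."""

    def __init__(self, values, validity):
        self.values = values      # raw slot values; null slots hold arbitrary data
        self.validity = validity  # bytes; bit i set means slot i is non-null

    def is_valid(self, i):
        # Locate byte i // 8, then test bit i % 8 (LSB-first ordering).
        return (self.validity[i >> 3] >> (i & 7)) & 1 == 1

    def get(self, i):
        # Checked access: pays for a bitmap lookup on every call.
        if not self.is_valid(i):
            raise ValueError(f"value at index {i} is null")
        return self.values[i]

    def get_unsafe(self, i):
        # Unchecked access: skips the bitmap, so the caller must already
        # know that slot i is valid or must tolerate arbitrary data.
        return self.values[i]


# Slots 0 and 2 are valid, slot 1 is null (bitmap 0b101).
vec = ToyIntVector([10, 0, 30], bytes([0b00000101]))
```

In this model the unchecked accessor is only safe when validity has been established some other way, for example by scanning the bitmap once per batch instead of once per element.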

[DISCUSS][C++] Static versus variable Arrow dictionary encoding

2019-04-29 Thread Wes McKinney
hi all, There have been many discussions in passing on various issues and JIRA tickets over the last months and years about how to manage dictionary-encoded columnar arrays in-memory in C++. Here's a list of some problems we have encountered: * Dictionaries that may differ from one record batch t

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-04-29 Thread Micah Kornfield
Thanks for the design. Personally, I'm not a huge fan of creating parallel classes for every vector type; this ends up being confusing for developers and adds a lot of boilerplate. I wonder if you could use an approach similar to the one the memory module uses for turning bounds checking on/off [1].

[jira] [Created] (ARROW-5238) [Python] Improve usability of pyarrow.dictionary function

2019-04-29 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5238: Summary: [Python] Improve usability of pyarrow.dictionary function Key: ARROW-5238 URL: https://issues.apache.org/jira/browse/ARROW-5238 Project: Apache Arrow

Re: [DISCUSS] C++ Filesystem abstraction

2019-04-29 Thread Wes McKinney
hi Antoine, Thank you for starting this discussion. I left some comments on the PR. I had been looking previously at TensorFlow's file system APIs ([1], and various implementations) for some possible guidance around this, though since Arrow is intended as a development platform / reusable set of li

[DISCUSS] C++ Filesystem abstraction

2019-04-29 Thread Antoine Pitrou
Hello, For the datasets project (*), one requirement is for Arrow to grow a filesystem abstraction. The aim is to access various kinds of storage systems (local filesystem, S3, HadoopFS...) with a single API. Hopefully, the API can be made good enough to avoid inefficiencies. I've pushed a dra

[jira] [Created] (ARROW-5237) [Python] pandas_version key in pandas metadata no longer populated

2019-04-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5237: Summary: [Python] pandas_version key in pandas metadata no longer populated Key: ARROW-5237 URL: https://issues.apache.org/jira/browse/ARROW-5237 Proj

[jira] [Created] (ARROW-5236) hdfs.connect() is trying to load libjvm in windows

2019-04-29 Thread Kamaraju (JIRA)
Kamaraju created ARROW-5236: Summary: hdfs.connect() is trying to load libjvm in windows Key: ARROW-5236 URL: https://issues.apache.org/jira/browse/ARROW-5236 Project: Apache Arrow Issue Type: Bug

[jira] [Created] (ARROW-5235) [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda)

2019-04-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5235: Summary: [C++] RAPIDJSON_INCLUDE_DIR not set (Windows + Anaconda) Key: ARROW-5235 URL: https://issues.apache.org/jira/browse/ARROW-5235 Project: Apache Arrow

[jira] [Created] (ARROW-5234) [Rust] [DataFusion] Create Python bindings for DataFusion

2019-04-29 Thread Andy Grove (JIRA)
Andy Grove created ARROW-5234: Summary: [Rust] [DataFusion] Create Python bindings for DataFusion Key: ARROW-5234 URL: https://issues.apache.org/jira/browse/ARROW-5234 Project: Apache Arrow Issu

[jira] [Created] (ARROW-5233) [Go] migrate to new flatbuffers-v0.11.0

2019-04-29 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5233: Summary: [Go] migrate to new flatbuffers-v0.11.0 Key: ARROW-5233 URL: https://issues.apache.org/jira/browse/ARROW-5233 Project: Apache Arrow Issue Type:

[jira] [Created] (ARROW-5232) [Java] value vector size increases rapidly in case of clear/setSafe loop

2019-04-29 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-5232: Summary: [Java] value vector size increases rapidly in case of clear/setSafe loop Key: ARROW-5232 URL: https://issues.apache.org/jira/browse/ARROW-5232 Proj