Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread shyam narayan singh
Thanks Micah and Wes. Will try to submit a PR in a day or two. Regards Shyam On Mon, May 20, 2019 at 10:46 PM Wes McKinney wrote: > Those instructions are a bit out of date after the monorepo merge, see > > > https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parq

Re: [Discuss][Format] Zero size record batches

2019-05-20 Thread Ravindra Pindikura
On Tue, May 21, 2019 at 10:35 AM Micah Kornfield wrote: > Today, the format docs are ambiguous on whether zero-sized batches are > supported. Wes opened a PR [1] for empty record batches that shows C++ > handles them but Java and JavaScript fail to handle them. > > > I'd like to propose: > 1. M

Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Micah Kornfield
Hi Wes, It looks like comments are turned off on the doc; is this intentional? Thanks, Micah On Mon, May 20, 2019 at 3:49 PM Wes McKinney wrote: > hi folks, > > I'm interested in starting to build a so-called "data frame" interface > as a moderately opinionated, higher-level usability layer for >

[Discuss][Format] Zero size record batches

2019-05-20 Thread Micah Kornfield
Today, the format docs are ambiguous on whether zero-sized batches are supported. Wes opened a PR [1] for empty record batches that shows C++ handles them but Java and JavaScript fail to handle them. I'd like to propose: 1. Make it explicit in the format docs that 0 size record batches are sup

Re: [DISCUSS][C++] Unaligned memory accesses (undefined behavior)

2019-05-20 Thread Micah Kornfield
Created https://jira.apache.org/jira/browse/ARROW-5380 to track fixing and turning on unaligned access warnings in UBSan. https://jira.apache.org/jira/browse/ARROW-5365 tracks turning on ASAN and UBSAN in CI. Thanks, Micah On Fri, May 17, 2019 at 1:48 PM Antoine Pitrou wrote: > > Le 17

[jira] [Created] (ARROW-5380) [C++] Fix and enable UBSan for unaligned accesses.

2019-05-20 Thread Micah Kornfield (JIRA)
Micah Kornfield created ARROW-5380: Summary: [C++] Fix and enable UBSan for unaligned accesses. Key: ARROW-5380 URL: https://issues.apache.org/jira/browse/ARROW-5380 Project: Apache Arrow I

[Discuss][Format][Java] Finalizing Union Types

2019-05-20 Thread Micah Kornfield
In the past [1] there hasn't been agreement on the final requirements for union types. Briefly the two approaches that are currently advocated: 1. Limit unions to only contain one field of each individual type (e.g. you can't have two separate int32 fields). Java takes this approach. 2. General

Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Micah: Thanks for the response. According to our benchmark, cpp-orc is on average 1% to 10% slower than java-orc, while the on-heap to off-heap memory conversion overhead can easily outweigh such a performance difference. And we are currently also working on some performance improveme

[DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

2019-05-20 Thread Wes McKinney
hi folks, I'm interested in starting to build a so-called "data frame" interface as a moderately opinionated, higher-level usability layer for interacting with Arrow-based chunked in-memory data. I've had numerous discussions (mostly in-person) over the last few years about this and it feels to me

[jira] [Created] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas

2019-05-20 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5379: Summary: [Python] support pandas' nullable Integer type in from_pandas Key: ARROW-5379 URL: https://issues.apache.org/jira/browse/ARROW-5379 Project:

Re: [Discuss] [Python] protocol for conversion to pyarrow Array

2019-05-20 Thread Joris Van den Bossche
Hi Wes, That indeed seems like a good fit for the pandas ExtensionArray <-> Arrow conversion. I will look into it starting this week. Joris Op vr 17 mei 2019 om 00:28 schreef Wes McKinney : > hi Joris, > > Somewhat related to this, I want to also point out that we have C++ > extension types [1].

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Wes McKinney
Those instructions are a bit out of date after the monorepo merge, see https://github.com/apache/arrow/blob/master/docs/source/developers/cpp.rst#apache-parquet-development On Mon, May 20, 2019 at 8:33 AM Micah Kornfield wrote: > > Hi Shyam, > https://github.com/apache/parquet-testing contains s

[jira] [Created] (ARROW-5378) [C++] Add local FileSystem implementation

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5378: Summary: [C++] Add local FileSystem implementation Key: ARROW-5378 URL: https://issues.apache.org/jira/browse/ARROW-5378 Project: Apache Arrow Issue Type:

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread Micah Kornfield
Hi Shyam, https://github.com/apache/parquet-testing contains stand alone test files. https://github.com/apache/arrow/blob/master/cpp/src/parquet/bloom_filter-test.cc is an example of how this is used (search for get_data_dir). https://github.com/apache/parquet-cpp/blob/master/README.md#testing

[jira] [Created] (ARROW-5377) [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization

2019-05-20 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5377: Summary: [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization Key: ARROW-5377 URL: https://

[jira] [Created] (ARROW-5376) [C++] Compile failure on gcc 5.4.0

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5376: Summary: [C++] Compile failure on gcc 5.4.0 Key: ARROW-5376 URL: https://issues.apache.org/jira/browse/ARROW-5376 Project: Apache Arrow Issue Type: Bug

Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Micah Kornfield
Hi Yurui, This is cool, I will try to leave some comments tonight. The JIRA references the conversion from on-heap to off-heap memory as the performance issue. Now that Arrow Java can point at arbitrary memory, do you know the performance delta between java-orc and cpp-orc? (I'm wo

Re: [pyarrow] Parquet page header size limit

2019-05-20 Thread shyam narayan singh
Hi Wes, Sorry, this fell off my radar. I went ahead and dug into the problem and filed the issue. Can we track the error message as part of a different bug? Now, I have a parquet file that can be read by the Java reader but not pyarrow. I have the fix f

[jira] [Created] (ARROW-5375) [C++] Try to move out of public headers

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5375: Summary: [C++] Try to move out of public headers Key: ARROW-5375 URL: https://issues.apache.org/jira/browse/ARROW-5375 Project: Apache Arrow Issue Type: W

[jira] [Created] (ARROW-5374) [Python] pa.read_record_batch() doesn't work

2019-05-20 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5374: Summary: [Python] pa.read_record_batch() doesn't work Key: ARROW-5374 URL: https://issues.apache.org/jira/browse/ARROW-5374 Project: Apache Arrow Issue Typ

ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Guys: I just created a PR with WIP changes that add a JNI interface for reading ORC files. All the major changes have been done and I would like some early feedback from the community. Feel free to take a look and leave your feedback. https://github.com/apache/arrow/pull/4348 Some clean u