[jira] [Created] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2018-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2628: --- Summary: [Python] parquet.write_to_dataset is memory-hungry on large DataFrames Key: ARROW-2628 URL: https://issues.apache.org/jira/browse/ARROW-2628 Project: Apache Ar

[jira] [Created] (ARROW-2627) [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points

2018-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2627: --- Summary: [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points Key: ARROW-2627 URL: https://issue

[jira] [Created] (ARROW-2626) pandas ArrowInvalid message should include failing column name

2018-05-21 Thread Louis Potok (JIRA)
Louis Potok created ARROW-2626: -- Summary: pandas ArrowInvalid message should include failing column name Key: ARROW-2626 URL: https://issues.apache.org/jira/browse/ARROW-2626 Project: Apache Arrow

[jira] [Created] (ARROW-2625) [Python] Serialize timedelta64 values from pandas to Arrow interval types

2018-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2625: --- Summary: [Python] Serialize timedelta64 values from pandas to Arrow interval types Key: ARROW-2625 URL: https://issues.apache.org/jira/browse/ARROW-2625 Project: Apache

[jira] [Created] (ARROW-2624) [Python] Random schema and data generator for Arrow conversion and Parquet testing

2018-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2624: --- Summary: [Python] Random schema and data generator for Arrow conversion and Parquet testing Key: ARROW-2624 URL: https://issues.apache.org/jira/browse/ARROW-2624 Projec

[jira] [Created] (ARROW-2623) [Doc] Add example of List with nested child type in format specification documents

2018-05-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2623: --- Summary: [Doc] Add example of List with nested child type in format specification documents Key: ARROW-2623 URL: https://issues.apache.org/jira/browse/ARROW-2623 Projec

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-21 Thread Wes McKinney
Sorry, I realized I was a bit inarticulate in my reply. I meant the data page HEADERS (the metadata). The actual encoded structure of the data pages should be the same in V2 files. But if the Thrift header is, say, 16 bytes in V1, it's at least 32 bytes in V2. On Mon, May 21, 2018 at 7:10 PM, Wes McK
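
To make the arithmetic in the message above concrete, here is a minimal sketch of how per-page header overhead could dominate file size when the page payloads compress very well. The 16-byte and 32-byte header sizes are the illustrative figures from the message ("say 16 bytes"), not measured values; the page count, the payload sizes, and the helper names `data_pages_size` and `header_share` are assumptions made up for this example, and the model ignores everything in a Parquet file other than data pages.

```python
# Illustrative only: header sizes echo the "say 16 bytes in V1, at least
# 32 bytes in V2" figures from the message; all other numbers are assumed.
V1_HEADER_BYTES = 16
V2_HEADER_BYTES = 32


def data_pages_size(num_pages, payload_bytes, header_bytes):
    """Bytes used by data pages: each page is one header plus its payload."""
    return num_pages * (header_bytes + payload_bytes)


def header_share(payload_bytes, header_bytes):
    """Fraction of a single page that is header rather than encoded data."""
    return header_bytes / (header_bytes + payload_bytes)


NUM_PAGES = 1000          # assumed page count
TINY_PAYLOAD = 20         # assumed: data that compresses extremely well
LARGE_PAYLOAD = 100_000   # assumed: data that barely compresses

# With tiny payloads, doubling the header changes the total noticeably.
tiny_v1 = data_pages_size(NUM_PAGES, TINY_PAYLOAD, V1_HEADER_BYTES)
tiny_v2 = data_pages_size(NUM_PAGES, TINY_PAYLOAD, V2_HEADER_BYTES)
tiny_growth = tiny_v2 / tiny_v1

# With large payloads, the same header change is lost in the noise.
large_v1 = data_pages_size(NUM_PAGES, LARGE_PAYLOAD, V1_HEADER_BYTES)
large_v2 = data_pages_size(NUM_PAGES, LARGE_PAYLOAD, V2_HEADER_BYTES)
large_growth = large_v2 / large_v1

if __name__ == "__main__":
    print(f"tiny payloads:  V1={tiny_v1} B, V2={tiny_v2} B, growth={tiny_growth:.3f}x")
    print(f"large payloads: V1={large_v1} B, V2={large_v2} B, growth={large_growth:.5f}x")
    print(f"header share of a tiny V2 page: {header_share(TINY_PAYLOAD, V2_HEADER_BYTES):.1%}")
```

Under these assumed numbers, the encoded payload is identical in both cases and only the header term changes, which is the effect the message describes for data with a very high compression ratio.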

Re: PyArrow and Parquet DELTA_BINARY_PACKED

2018-05-21 Thread Wes McKinney
hi Feras, Given the very high compression ratio with your data, it's completely possible that the difference in size is coming from the larger V2 data pages. Compare DataPageHeader with DataPageHeaderV2 in parquet.thrift https://github.com/apache/parquet-cpp/blob/master/src/parquet/parquet.thrift#

Re: Language-independent and cross-language docs

2018-05-21 Thread Wes McKinney
Among other things, the columnar format specification files should probably make their way into this new documentation project. On Mon, May 21, 2018 at 5:19 PM, Wes McKinney wrote: > I don't think we should attempt to create a documentation "super > project" that includes the generated API refere

Re: Language-independent and cross-language docs

2018-05-21 Thread Wes McKinney
I don't think we should attempt to create a documentation "super project" that includes the generated API reference for all the libraries in Apache Arrow. I do think that creating a documentation "hub" project (with the low-level API docs being the "spokes") is a good idea. Currently, the Jekyll pr

Re: Proposed Arrow Graph representations

2018-05-21 Thread Wes McKinney
hi Josh, Yes, the standard process for importing externally-developed code is the Incubator IP clearance: http://incubator.apache.org/ip-clearance/. As an example, we recently received a Go codebase donation from InfluxData where there was a combination of ICLAs from the contributors and a softwar

Re: Proposed Arrow Graph representations

2018-05-21 Thread Joshua Patterson
Hi Wes, I'm sure we're going to run into this with libgdf/pygdf as well. Is there a systematic way we could do a transfer of IP? On 5/20/18, 7:05 PM, "Wes McKinney" wrote: hi Paul, This is a great discussion to get started. I will review the patch in some more detail and send

[jira] [Created] (ARROW-2622) [C++] Array methods IsNull and IsValid are not complementary

2018-05-21 Thread Thomas Buhrmann (JIRA)
Thomas Buhrmann created ARROW-2622: -- Summary: [C++] Array methods IsNull and IsValid are not complementary Key: ARROW-2622 URL: https://issues.apache.org/jira/browse/ARROW-2622 Project: Apache Arrow

[jira] [Created] (ARROW-2621) [Python/CI] Use pep8speaks for

2018-05-21 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2621: -- Summary: [Python/CI] Use pep8speaks for Key: ARROW-2621 URL: https://issues.apache.org/jira/browse/ARROW-2621 Project: Apache Arrow Issue Type: Task