Re: [Format] Feature field in Schema

2020-07-17 Thread Wes McKinney
My position is that: * Features only needs to be set with the Schema message, it wouldn't be necessary or useful to set it for other message types * The metadata version may serve a purpose beyond indicating features (and it has in the past already) * Thus, it isn't necessarily inconsistent to hav

Re: [Format] Feature field in Schema

2020-07-17 Thread Micah Kornfield
I think this was overlooked. Schema made more sense to me because I was intending it to be at most once per stream. If we can come to agreement I can open a PR to change it. But we would need a new release candidate (this can't wait until the next release). On Friday, July 17, 2020, Antoine Pitrou

Re: [Format] Feature field in Schema

2020-07-17 Thread Wes McKinney
Any of the dependent message types constituting an IPC stream or file are not interpretable without the Schema message, so the Schema basically "governs" the other messages. So it is sufficient to examine the features only once at schema resolution time, and attaching features to a RecordBatch mess

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Neal Richardson
+1 (binding) In addition to the usual verification on https://github.com/apache/arrow/pull/7787, I've successfully staged the R binary artifacts on Windows ( https://github.com/r-windows/rtools-packages/pull/126), macOS ( https://github.com/autobrew/homebrew-core/pull/12), and Linux ( https://gith

[Format] Feature field in Schema

2020-07-17 Thread Antoine Pitrou
Hello, A bit too late, I noticed the new "features" field is defined on the Schema table, while the "version" field is defined on the Message table. Since both fields have closely related purposes (notify the reader of the conventions used in the stream), I'm a bit surprised that they're presen

Re: [jira] [Created] (ARROW-9516) [Rust][DataFusion] Refactor physical expressions to not care about their names nor indexes

2020-07-17 Thread Andy Grove
Thanks for the detailed write up. That all makes good sense to me. I am not sure that I had a good reason for having physical expressions determine their names. On Fri, Jul 17, 2020, 12:50 PM Jorge (Jira) wrote: > Jorge created ARROW-9516: > > > Summary:

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Wes McKinney
I see the JS failures as well. I think it is a failure localized to newer Node versions since our JavaScript CI works fine. I don't think it should block the release given the lack of development activity in JavaScript [1] -- if any JS devs are concerned about publishing an artifact then we can ski

Re: Writing very large rowgroups to Apache Parquet

2020-07-17 Thread Micah Kornfield
I did a quick search in Parquet-MR and found at least one place where different files are explicitly forbidden [1]. I don't know if this blocks all reading or is a specific case (I'm not sure if writing is allowed for multiple columns). Like I said, it makes sense, but is potentially a big change

Re: Writing very large rowgroups to Apache Parquet

2020-07-17 Thread Jacques Nadeau
I believe the formal Parquet standard already allows a file per column. At least I remember it being discussed when the spec was first implemented. If you look at the thrift spec it actually allows for this: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L771

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Ryan Murray
I've tested Java and it looks good. However, the verify script keeps bailing with protobuf-related errors: 'cpp/build/orc_ep-prefix/src/orc_ep-build/c++/src/orc_proto.pb.cc' and friends can't find protobuf definitions. A bit odd, as cmake can see protobuf headers and builds directly off master work

[NIGHTLY] Arrow Build Report for Job nightly-2020-07-17-0

2020-07-17 Thread Crossbow
Arrow Build Report for Job nightly-2020-07-17-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-17-0 Failed Tasks: - gandiva-jar-osx: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-17-0-travis-gandiva-jar-osx - gandiva-jar-x

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Antoine Pitrou
+1 (binding). I tested on Ubuntu 18.04. * Wheels verification went fine. * Source verification went fine with CUDA enabled and TEST_INTEGRATION_JS=0 TEST_JS=0. I didn't test the binaries. Regards Antoine. Le 17/07/2020 à 03:41, Krisztián Szűcs a écrit : > Hi, > > I would like to propose t

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Krisztián Szűcs
On Fri, Jul 17, 2020 at 10:32 AM Sutou Kouhei wrote: > > +0 (binding) > > I ran the following on Debian GNU/Linux sid: > > * TEST_JS=0 \ > TEST_INTEGRATION_JS=0 \ > JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \ > CUDA_TOOLKIT_ROOT=/usr \ > ARROW_CMAKE_OPTIONS="-DgRPC_SOU

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Krisztián Szűcs
+1 (binding) Locally verified the source release, binaries and wheels on macOS 10.15.5. Everything has passed. Note: I had to use an older version of NodeJS v12.18.2 because the JS tests were failing with NodeJS 14.5.0. Also ran the crossbow verification jobs, and everything seems to work fine. S

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Antoine Pitrou
On 17/07/2020 at 10:32, Sutou Kouhei wrote: > > * Python 3.8 wheel's tests failed. 3.5, 3.6 and 3.7 > passed. It seems that -larrow and -larrow_python for > Cython failed. > > > /tmp/arrow-1.0.0.NlcPX/test-miniconda/envs/_verify_wheel-3.8/compiler_compat/ld: > ca

Re: [VOTE] Release Apache Arrow 1.0.0 - RC1

2020-07-17 Thread Sutou Kouhei
+0 (binding) I ran the following on Debian GNU/Linux sid: * TEST_JS=0 \ TEST_INTEGRATION_JS=0 \ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \ CUDA_TOOLKIT_ROOT=/usr \ ARROW_CMAKE_OPTIONS="-DgRPC_SOURCE=BUNDLED -DBoost_NO_BOOST_CMAKE=ON" \ dev/release/verify-rele