Re: pyarrow kafka support

2020-07-21 Thread Micah Kornfield
Nothing exists in Arrow core to do this. You will need to manually decide on how to batch and serialize data into and out of Kafka. The recent discussion [1] on user@ on transferring data into and out of Redis provides some pointers on how to do this. Note that "fetch_pandas_all" is somethi

Re: Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Micah Kornfield
> Was there any particular reason for not writing Java Arrow as a JNI binding for CPP Arrow?
The Java code base originated from Apache Drill and was the first implementation of Arrow. There is value in having a pure Java implementation separate from any C++ code base (JNI cannot be used in a

Re: Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Ji Liu
Hi Chathura, https://lists.apache.org/thread.html/5bf70a6f1a3fa3e543a92b3217e64465a3b761ca307e8114550f9d8b@%3Cdev.arrow.apache.org%3E has the relevant pointers. Thanks, Ji Liu On Wed, Jul 22, 2020 at 3:03 AM Chathura Widanage wrote: > Hi all, > > Was there any particular reason for not writing Java Arrow

Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Sutou Kouhei
Hi, +1 (binding) I ran the following on Debian GNU/Linux sid: * INSTALL_NODE=0 \ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \ CUDA_TOOLKIT_ROOT=/usr \ ARROW_CMAKE_OPTIONS="-DgRPC_SOURCE=BUNDLED -DBoost_NO_BOOST_CMAKE=ON" \ dev/release/verify-release-candidate.sh sou

Dummy Scalar in Filter and pass the scalar through Evaluate.

2020-07-21 Thread Gopinath Jaganmohan
Hi, I would like to compile once and run many times using a Gandiva Filter, where only the scalar changes for each run. Currently I have to recreate the entire filter before each evaluate. If there were a way to pass the scalar to evaluate, it would drastically reduce the compile time and allow reusing the same c

Java Arrow to C++ Arrow and vice versa

2020-07-21 Thread Chathura Widanage
Hi all, Was there any particular reason for not writing Java Arrow as a JNI binding for CPP Arrow? What is the most straightforward and efficient way to convert a Java Arrow schema/table to a JNI-backed C++ Arrow schema/table? Regards, Chathura

pyarrow kafka support

2020-07-21 Thread Mehul Batra
Hi Arrow Community, Do we have any API to ingest and process Apache Kafka data fast using pyarrow/Python, just as we have fetch_pandas_all to ingest and process Snowflake data fast? Thanks, Mehul Batra [Pitney Bowes]

Introducing Cylon

2020-07-21 Thread Niranda Perera
Hi all, We would like to introduce Cylon to the Arrow community. It is an open-source, lean distributed data processing library using the Arrow data format underneath. It is developed in C++ with bindings to Java and Python. It has an in-memory Table API that integrates with the PyArrow Table API. Cy

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-07-21 Thread Robert Nishihara
Hi all, Regarding Plasma, you're right, we should have started this conversation earlier! The way it's being developed in Ray currently isn't useful as a standalone project. We realized that tighter integration with Ray's object lifetime tracking could be important, and removing IPCs and making it

Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Antoine Pitrou
+1 (binding) I tested the sources on Ubuntu 18.04, with CUDA enabled and TEST_INTEGRATION_JS=0 TEST_JS=0. Regards Antoine. On 21/07/2020 at 04:07, Krisztián Szűcs wrote: > Hi, > > I would like to propose the following release candidate (RC2) of Apache > Arrow version 1.0.0. This is a rele

Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Ryan Murray
+0 (non-binding) I verified source, release, binaries, integration tests for Python, C++, Java. All went fine except for a failed test in C++ Gandiva: [ FAILED ] TestProjector.TestDateTime Not sure if this is known or expected? On Tue, Jul 21, 2020 at 1:32 PM Andy Grove wrote: > +1 (bindi

Re: 1.0 release announcement blog post: help needed

2020-07-21 Thread Andy Grove
I created a PR to add Rust notes. https://github.com/nealrichardson/arrow-site/pull/5 On Mon, Jul 20, 2020 at 3:35 PM Andy Grove wrote: > I'll put something together for Rust today. > > On Mon, Jul 20, 2020 at 3:27 PM Sutou Kouhei wrote: > >> Hi, >> >> Sorry. I've filled the Ruby part. >> Than

Re: [VOTE] Release Apache Arrow 1.0.0 - RC2

2020-07-21 Thread Andy Grove
+1 (binding) on testing the Rust implementation only. I did notice that the release script is not updating all the versions correctly and I filed a JIRA [1]. This shouldn't prevent the release though since this one version number can be updated manually when we publish the crates. [1] https://is

[NIGHTLY] Arrow Build Report for Job nightly-2020-07-21-0

2020-07-21 Thread Crossbow
Arrow Build Report for Job nightly-2020-07-21-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0 Failed Tasks: - conda-win-vs2017-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-07-21-0-azure-conda-win-vs2017-py36 - co

Re: [DISCUSS] Using direct memory size as a limit of populated off-heap buffers in Java

2020-07-21 Thread Hongze Zhang
Thanks for the input, Micah. So it's clearer now whether we need to use Bits.java or not. If Netty is considered optional, maybe it's more acceptable to just use Bits.java, since the Dataset module is built-in? This way we can treat all built-in off-heap memory allocation as direct

[DISCUSS] Execute dataset scan tasks in distributed system

2020-07-21 Thread Hongze Zhang
Hi all, Has anyone ever tried using the Arrow Dataset API in a distributed system? E.g. create scan tasks on machine 1, then send and execute these tasks on machines 2, 3, 4. So far I think a possible workaround is to: 1. Create Dataset on machine 1; 2. Call Scan(), collect all scan tasks from sca