Re: data-source UDFs

2022-06-03 Thread David Li
Thanks for the overview of the different extension points, it's nice to see this laid out. (It would be great to find a place in the docs for this, IMO, or possibly as a blog post?) Just to chime in quickly here: For databases/Flight, my hope is that integrating ADBC into Arrow Datasets will t

Re: data-source UDFs

2022-06-03 Thread Weston Pace
Efficiently reading from a data source involves a fair amount of complexity (parsing files, connecting to remote data sources, managing parallel reads, etc.). Ideally we don't want users to have to reinvent these things as they go. The datasets module in Arrow-C++ has a lot of code here alrea

Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-06-03 Thread Wes McKinney
There's a relevant Jira issue here (maybe some others), if someone wants to pick it up and write a kernel for it https://issues.apache.org/jira/browse/ARROW-4097 I think having an improved experience around this dictionary conformance/normalization problem would be valuable. On Tue, May 31, 2022
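For readers unfamiliar with the dictionary conformance problem mentioned above, here is a minimal plain-Python sketch of the idea: several dictionary-encoded batches, each with its own dictionary, are rewritten so that they all share one dictionary. The function name `unify_dictionaries` and the (dictionary, indices) list representation are assumptions made for illustration; this is not the kernel the Jira issue asks for, and null handling is left out.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Hypothetical representation: (dictionary values, integer indices into them).
EncodedBatch = Tuple[Sequence[Hashable], Sequence[int]]


def unify_dictionaries(
    batches: Sequence[EncodedBatch],
) -> Tuple[List[Hashable], List[List[int]]]:
    """Merge per-batch dictionaries into one and remap every batch's indices."""
    unified: List[Hashable] = []
    position: Dict[Hashable, int] = {}  # value -> index in the unified dictionary
    remapped: List[List[int]] = []
    for dictionary, indices in batches:
        # Translation table from this batch's old index to the unified index.
        mapping: List[int] = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            mapping.append(position[value])
        remapped.append([mapping[i] for i in indices])
    return unified, remapped


unified, remapped = unify_dictionaries(
    [(["a", "b"], [0, 1, 0]), (["b", "c"], [0, 1, 1])]
)
```

Once every batch refers to the same dictionary, the dictionary only needs to be written once and later batches carry indices alone.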

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Wes McKinney
Thanks Sasha — this is helpful. I'm going to take a college try at just the scalar kernels and see what I can accomplish over the next few days — will attempt to get a PR up for review with the C++ tests passing. I'm expecting assorted workarounds for the various kernels that do zero-copy optimizat

Re: data-source UDFs

2022-06-03 Thread Li Jin
Actually, "UDF" might be the wrong terminology here - This is more of a "custom Python data source" than "Python user defined functions". (Although under the hood it can probably reuse lots of the UDF logic to execute the custom data source) On Fri, Jun 3, 2022 at 2:49 PM Li Jin wrote: > What Ya

Re: data-source UDFs

2022-06-03 Thread Li Jin
What Yaron is going for is really something similar to a custom data source in Spark (https://levelup.gitconnected.com/easy-guide-to-create-a-custom-read-data-source-in-apache-spark-3-194afdc9627a) that allows using existing Python APIs that know how to read a data source as a stream of record ba

Re: data-source UDFs

2022-06-03 Thread Li Jin
> At the moment as we are not exposing the execution engine primitives to Python user, are you expecting to expose them by this approach.
From our side, these APIs are not directly exposed to the end user; rather, they are primitives that we can build on top of. The end user would just do sth li

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Sasha Krassovsky
Hi all, I’ve been thinking about some sort of refactoring of this registry for a while now, and I’ve written down some thoughts, please leave your comments. https://docs.google.com/document/d/1LAN9I_Y9cZaG2a84j1wLY8jSlK3gDXYMle-VtyFCAE8/edit?usp=sharing

Re: data-source UDFs

2022-06-03 Thread Vibhatha Abeykoon
First of all, this is a nice discussion, but I have a question regarding the simplicity of things. At the moment we are not exposing the execution engine primitives to Python users; are you expecting to expose them with this approach? On Fri, Jun 3, 2022 at 9:02 PM Yaron Gvili wr

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
On Friday, 2022-06-03 at 09:32 -0700, Micah Kornfield wrote:
> > Thinking about compatibility with existing software, RLE could possibly even be made an Extension Type that follows the layout of a struct of int32 and the encoded value type. I'm wondering whether this would be

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-03 Thread Dewey Dunnington
Hi all, Based on the points raised above and a few adventures implementing some of this in related projects, I put together a brief design document proposing a scope and structure to perhaps solidify a few of these discussions: https://docs.google.com/document/d/11n7ICVZO8exZ-z3GRlI26VLzKPXlYlEz5x

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Micah Kornfield
> Thinking about compatibility with existing software, RLE could possibly even be made an Extension Type that follows the layout of a struct of int32 and the encoded value type. I'm wondering whether this would be better for compatibility.
I might be misunderstanding this proposal, but I don'
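To make the "struct of int32 and the encoded value type" layout from the quoted proposal concrete, here is a minimal plain-Python sketch that stores a column as two parallel lists, one of run lengths and one of run values. It assumes the int32 field holds the length of each run; the thread excerpt does not say whether run lengths or run end offsets are meant, and the function names `rle_encode` and `rle_decode` are hypothetical.

```python
from typing import List, Sequence, Tuple


def rle_encode(values: Sequence[object]) -> Tuple[List[int], List[object]]:
    """Return (run_lengths, run_values), the two children of the assumed struct."""
    run_lengths: List[int] = []
    run_values: List[object] = []
    for value in values:
        if run_values and run_values[-1] == value:
            run_lengths[-1] += 1  # extend the current run
        else:
            run_values.append(value)  # start a new run
            run_lengths.append(1)
    return run_lengths, run_values


def rle_decode(run_lengths: Sequence[int], run_values: Sequence[object]) -> List[object]:
    """Expand the runs back into the original flat sequence."""
    decoded: List[object] = []
    for length, value in zip(run_lengths, run_values):
        decoded.extend([value] * length)
    return decoded


lengths, values = rle_encode(["a", "a", "b", "a"])
```

A consumer that only understands structs would see two ordinary child arrays here, which is the compatibility argument made in the quoted text.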

data-source UDFs

2022-06-03 Thread Yaron Gvili
Hi, I'm working on support for data-source UDFs and would like to get feedback about the design I have in mind for it. By support for data-source UDFs, at a basic level, I mean enabling a user to define, using PyArrow APIs, a record-batch-generating function implemented in Python that would be e
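As a rough illustration of what a "record-batch-generating function" could look like, here is a dependency-free sketch in which a factory returns a generator function that yields batches with a fixed set of columns. Plain dicts of lists stand in for PyArrow record batches, and the names `make_range_source`, `num_rows` and `batch_size` are hypothetical; none of this is the API proposed in the thread.

```python
from typing import Callable, Dict, Iterator, List

# Assumption: a dict of column name -> values stands in for a PyArrow record batch.
Batch = Dict[str, List[int]]


def make_range_source(num_rows: int, batch_size: int) -> Callable[[], Iterator[Batch]]:
    """Build a zero-argument source function that streams batches on demand."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    def source() -> Iterator[Batch]:
        for start in range(0, num_rows, batch_size):
            ids = list(range(start, min(start + batch_size, num_rows)))
            # Every batch has the same columns, i.e. a fixed schema.
            yield {"id": ids, "squared": [i * i for i in ids]}

    return source


batches = list(make_range_source(num_rows=5, batch_size=2)())
```

Because the source is a generator, a consumer can pull one batch at a time instead of materializing the whole data set.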

Re: [C++] Adding Run-Length Encoding to Arrow

2022-06-03 Thread Tobias Zagorni
> Well, Arrow C++ does not have a notion of encoding distinct from the data type. Adding such a notion would risk breaking compatibility for all existing software that hasn't been upgraded to dispatch based on encoding.
Thinking about compatibility with existing software, RLE could possibl

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-03 Thread Hannes Mühleisen
Hello List, we at DuckDB are happy users of the Arrow C Data Interface and use it to feed SQL queries and also use it to provide query results in Arrow format again. It is particularly appealing to us that the interface is merely a (C) header file that we just ship with our source code [1]. Intern

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-03 Thread Jonathan Keane
cc Hannes Mühleisen from DuckDB Labs -Jon On Tue, May 31, 2022 at 5:03 PM Wes McKinney wrote: > I'm also supportive of having a small vendorable C/C++ "Arrow > middleware" that provides: > > * Schemas and types > * Columnar data structures and minimal APIs to build them and iterate over > them

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Weston Pace
That approach looks great and very much in line with some of the stuff we have in light_array.h so I think it's very compatible. If you have the time to push this refactoring through then go for it. Don't let anything I'm saying deter any ongoing efforts. I'm just advocating that we be open to a

Re: [DISC] (Python) Dropping support for manylinux2010

2022-06-03 Thread Raul Cumplido Dominguez
Hi, I don't think we followed this up. I've created a JIRA ticket to track it: https://issues.apache.org/jira/browse/ARROW-16747 Thanks, Raúl On Mon, May 9, 2022 at 2:53 PM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > +1 as well > > Joris > > On Thu, 5 May 2022 at 22:29, Sutou