Re: [DISC] Improving Arrow's database support

2022-05-31 Thread Wes McKinney
I think spinning up a new repository while this exploratory work progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or similar (the name can always be changed later). That would bubble up discussions in a way that's easier for people to follow (watching your fork isn't ideal!). If it

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Weston Pace
> I don't think replacing Scalar compute paths with dedicated paths for > RLE-encoded data would ever be a simplification. Also, when a kernel > hasn't been upgraded with a native path for RLE data, former Scalar > Datums would now be expanded to the full RLE-decoded version before > running the ke

Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-05-31 Thread Weston Pace
I don't think you are missing anything. The parquet encoding is baked into the data on the disk so re-encoding at some stage is inevitable. Re-encoding in python like you are doing is going to be inefficient. I think you will want to do the re-encoding in C++. Unfortunately, I don't think we have
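
To make the re-encoding step concrete, here is a minimal pure-Python sketch of remapping dictionary-encoded indices onto a fixed, shared dictionary. It deliberately avoids the Arrow and Parquet APIs; `remap_indices` is a hypothetical helper, and raising an error for values missing from the target dictionary is an assumption, not behavior described in the thread.

```python
def remap_indices(indices, source_dict, target_dict):
    """Re-express indices into source_dict as indices into target_dict.

    Hypothetical helper for illustration only. None entries (nulls) are
    passed through unchanged.
    """
    # Position of each value in the shared (target) dictionary.
    position = {value: i for i, value in enumerate(target_dict)}
    # Translation table: source index -> target index.
    # Raises KeyError if a source value is absent from target_dict (assumption).
    translate = [position[value] for value in source_dict]
    return [None if i is None else translate[i] for i in indices]
```

The translation table is built once per source dictionary, so the per-row work is a single list lookup; doing this row by row in Python is still the kind of overhead the message warns about.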

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Wes McKinney
I haven't had a chance to look at the branch in detail, but if you can provide a pointer to a specification or other details about the proposed memory format for RLE (basically: what would be added to the columnar documentation as well as the Flatbuffers schema files), it would be helpful so it can

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-05-31 Thread Wes McKinney
I'm also supportive of having a small vendorable C/C++ "Arrow middleware" that provides:
* Schemas and types
* Columnar data structures and minimal APIs to build them and iterate over them
* C data interface
* Minimal validation (at the level of Validate but not ValidateFull)
I don't think it's g

Re: Merge a pull request with GitHub API

2022-05-31 Thread Sutou Kouhei
Hi, There are no objections. I've merged this: https://github.com/apache/arrow/pull/13184 Thanks, -- kou In <20220525.061541.194737838528371525@clear-code.com> "Re: Merge a pull request with GitHub API" on Wed, 25 May 2022 06:15:41 +0900 (JST), Sutou Kouhei wrote: > Hi, > > Do you

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi, On Tuesday, 31.05.2022 at 21:12 +0200, Antoine Pitrou wrote: > > Hi, > > On 31/05/2022 at 20:24, Tobias Zagorni wrote: > > Hi, I'm currently working on adding Run-Length encoding to arrow. I > > created a function to dictionary-encode arrays here (currently only > > for > > fixed le

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
On 31/05/2022 at 21:41, Micah Kornfield wrote: > > > I'm currently working on adding Run-Length encoding to arrow. > Nice > > What are the intended use cases for this: > > - external engines want to provide run-length encoded data to work on using arrow? > It is more than just external engines. Many p

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Micah Kornfield
> > I'm currently working on adding Run-Length encoding to arrow. Nice > What are the intended use cases for this: > - external engines want to provide run-length encoded data to work on > using arrow? > It is more than just external engines. Many popular file formats support RLE encoding. Bei

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Antoine Pitrou
Hi, On 31/05/2022 at 20:24, Tobias Zagorni wrote: Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general

RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-05-31 Thread Niklas Bivald
Hi, Background: I need to optimize read speed for few-column lookups in large datasets. Currently I keep the data in Plasma for fast reads, but Plasma is cumbersome to manage when the data changes frequently (and it “locks” the RAM). Instead I’m trying to figure out a fast-enough a

[C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Tobias Zagorni
Hi, I'm currently working on adding Run-Length encoding to arrow. I created a function to dictionary-encode arrays here (currently only for fixed length types): https://github.com/apache/arrow/compare/master...zagto:rle?expand=1 The general idea is that RLE data will be a nested data type, with a
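
As a rough illustration of run-length encoding itself, here is a minimal pure-Python sketch. It is not the memory layout proposed in the linked branch, which this excerpt does not show; the names `rle_encode` and `rle_decode` and the choice of two parallel lists (run values and run lengths) are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (run_values, run_lengths).

    Hypothetical helper for illustration; not Arrow's API.
    """
    run_values, run_lengths = [], []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    return run_values, run_lengths


def rle_decode(run_values, run_lengths):
    """Expand (run_values, run_lengths) back into the original sequence."""
    decoded = []
    for value, length in zip(run_values, run_lengths):
        decoded.extend([value] * length)
    return decoded
```

Encoding then decoding returns the original sequence, and the encoded form is only smaller when the input contains long runs of repeated values.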

Re: [DISC] Improving Arrow's database support

2022-05-31 Thread David Li
Some updates: The proposal is being updated based on feedback from contributors to DuckDB and DBI. We've been using GitHub issues on the fork to discuss the API design and how to implement data ingestion/bound parameters: https://github.com/lidavidm/arrow/issues If anyone has suggestions/idea

Re: C++ Helpers for Row and Arrow conversions

2022-05-31 Thread Will Jones
For those interested, the PR for this new API is ready for review here: https://github.com/apache/arrow/pull/12775 On Wed, Apr 6, 2022 at 11:17 AM Will Jones wrote: > Hello, > > I've fleshed out the ideas in the doc in this draft PR: > https://github.com/apache/arrow/pull/12775 > > Feedback on t

Re: Arrow C-Data and DuckDB

2022-05-31 Thread Antoine Pitrou
For the record, https://github.com/apache/arrow/pull/13115 was merged with the proposed change. Regards Antoine. On Fri, 13 May 2022 17:48:21 +0200 Antoine Pitrou wrote: > I don't think this needs a vote, there is no functional change in the > spec, it's just an additional technical recomm