Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-05-31 Thread Wes McKinney
I'm also supportive of having a small vendorable C/C++ "Arrow middleware" that provides: * Schemas and types * Columnar data structures and minimal APIs to build them and iterate over them * C data interface * Minimal validation (at the level of Validate but not ValidateFull) I don't think it's g

Re: [C++] Adding Run-Length Encoding to Arrow

2022-05-31 Thread Wes McKinney
I haven't had a chance to look at the branch in detail, but if you can provide a pointer to a specification or other details about the proposed memory format for RLE (basically: what would be added to the columnar documentation as well as the Flatbuffers schema files), it would be helpful so it can

Re: [DISC] Improving Arrow's database support

2022-05-31 Thread Wes McKinney
individually leverage the Arrow libraries). Of course, maintaining a parallel > build system, setting up releases, etc. is also a lot of work. > > -David > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote: > > I don't have major new things to add on this topic except that I

Re: [DISC] Improving Arrow's database support

2022-06-01 Thread Wes McKinney
I went ahead and created https://github.com/apache/arrow-adbc I directed issue comments / PRs to issues@ On Tue, May 31, 2022 at 8:49 PM Wes McKinney wrote: > > I think spinning up a new repository while this exploratory work > progresses is a fine idea — perhaps apache/arrow-dbc / a

Re: [Dev] Switch to token authentication for archery & merge script

2022-06-01 Thread Wes McKinney
hi Jacob — this sounds very reasonable and fixes a rough edge for maintainers running into captcha issues. Thanks Wes On Wed, Jun 1, 2022 at 6:44 AM Jacob Wujciak wrote: > > Hello Everyone, > > I would like to propose that we switch from basic authentication with JIRA > in the merge script and a

Re: [C++] Kernel function registry evolution

2022-06-02 Thread Wes McKinney
On this topic, I actually have started prototyping a new ScalarKernel exec interface that uses a non-owning, shared_ptr-free "ArraySpan" data structure based on some prior conversations https://github.com/wesm/arrow/blob/711fd5e5665c280540bbaf48a48ca1eca1b91bff/cpp/src/arrow/compute/exec.h#L163 ht

Re: [C++] Kernel function registry evolution

2022-06-03 Thread Wes McKinney
o, if we know we are also going to want to tweak the output > > interface (I don't know for sure if we will) then maybe it makes sense > > to pick a small set of kernels and incrementally improve that small > > set until we think we've made all the changes we are going to

Re: RecordBatchFileWriter with DictionaryType: Making sure the dictionary stays the same

2022-06-03 Thread Wes McKinney
There's a relevant Jira issue here (maybe some others), if someone wants to pick it up and write a kernel for it https://issues.apache.org/jira/browse/ARROW-4097 I think having an improved experience around this dictionary conformance/normalization problem would be valuable. On Tue, May 31, 2022

Re: [C++] Kernel function registry evolution

2022-06-05 Thread Wes McKinney
ould say it's a couple days away from being review-ready: https://github.com/apache/arrow/compare/master...wesm:lightweight-exec-batch I'll post a PR when I have something closer to a green build. We probably won't want to let this PR linger since it will cause conflicts with any

Re: [C++] Kernel function registry evolution

2022-06-06 Thread Wes McKinney
This is definitely only the first stage of cleanup and streamlining — I anticipate multiple rounds of refactoring (maybe not as invasive and painful as this one), and this patch I'm not sure will do a lot to alleviate bottom line expression evaluation overhead but it creates the environment (i.e.

Re: [C++] Kernel function registry evolution

2022-06-06 Thread Wes McKinney
ly as I can to have my initial patch ARROW-16756 ready which will unblock the next few projects here On Mon, Jun 6, 2022 at 10:35 AM Wes McKinney wrote: > > This is definitely only the first stage of cleanup and streamlining — > I anticipate multiple rounds of refactoring (maybe not a

Re: [C++] Kernel function registry evolution

2022-06-09 Thread Wes McKinney
I'm making good progress getting my branch PR-ready -- working through the compute-scalar-test suite and fixing the little things I broke. I hope I'll have it done by the end of the week. On Mon, Jun 6, 2022 at 3:21 PM Wes McKinney wrote: > > I created https://issues.apache.org/j

Re: [C++] Kernel function registry evolution

2022-06-10 Thread Wes McKinney
PR is up: https://github.com/apache/arrow/pull/13364 Look forward to getting this in since there's a bunch of follow on work that I'd like to get started on ASAP! On Thu, Jun 9, 2022 at 7:34 AM Wes McKinney wrote: > > I'm making good progress getting my branch PR-ready --

Re: [C++] Kernel function registry evolution

2022-06-13 Thread Wes McKinney
ll help us delete a lot of code I'll attach related Jiras to this umbrella issue: https://issues.apache.org/jira/browse/ARROW-16755 On Fri, Jun 10, 2022 at 12:56 PM Wes McKinney wrote: > > PR is up: https://github.com/apache/arrow/pull/13364 > > Look forward to getting this in since t

Re: [C++] Kernel function registry evolution

2022-06-29 Thread Wes McKinney
can address follow-on improvements like rewriting expression evaluation to utilize the span data structures to yield performance gains. On Mon, Jun 13, 2022 at 12:37 PM Wes McKinney wrote: > > I merged the PR a little while ago — thanks for David, Sasha for > helping review. If you have more com

Re: [C++] Kernel function registry evolution

2022-06-29 Thread Wes McKinney
te: > > > > > > Does boxing a scalar into an array actually build a buffer with the > > repeated value, or is it more efficient than that? > > > > > > Le 29/06/2022 à 17:57, Wes McKinney a écrit : > > > I'm working on my next PR which addresses the

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-07-06 Thread Wes McKinney
). A lightweight, dependency-free library to help > >>>> constructing those would certainly be appreciated. What would also help > >>>> a > >>>> lot is validation code, Arrow structures are very delicate and one wrong > >>>> pointer

Re: Problem reading parquet written with pyarrow=2.0.0 using pyarrow=8.0.0 (when using use_dictionary with ParquetWriter)

2022-07-06 Thread Wes McKinney
hi — did you ever resolve this issue? We should try to identify what is causing this failure and see if it can be fixed for the 9.0.0 release. On Tue, Jun 14, 2022 at 8:18 AM Niklas Bivald wrote: > > Hi, > > I’m experiencing problem reading parquet files written with the > `use_dictionary=[]` op

Re: [C++] Adding Run-Length Encoding to Arrow

2022-07-08 Thread Wes McKinney
pache/arrow/pull/13330 > >> Encode/Decode functions for (currently fixed width types only) > >> > >> - https://github.com/apache/arrow/pull/1 > >> For updating docs > >> > >> Best, > >> Tobias > >> > >> Am Dienstag, d

[C++] Help with Parquet backward compatibility regression between 2.0.0 and 3.0.0

2022-07-17 Thread Wes McKinney
This patch caused Parquet files written with 2.0.0 to be unreadable in 3.0.0 onward https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cbadc1ae6580789e69 This was reported on June 14 on dev@ and I git-bisected to the root cause: https://lists.apache.org/thread/wtbqozdhj2hwn6f0sps2j70lr

Re: [C++] Help with Parquet backward compatibility regression between 2.0.0 and 3.0.0

2022-07-17 Thread Wes McKinney
Jira issue for this: https://issues.apache.org/jira/browse/ARROW-17100 On Sun, Jul 17, 2022 at 8:54 PM Wes McKinney wrote: > > This patch caused Parquet files written with 2.0.0 to be unreadable in > 3.0.0 onward > > https://github.com/apache/arrow/commit/ef0feb2c9c959681d8a105cba

Re: Problem reading parquet written with pyarrow=2.0.0 using pyarrow=8.0.0 (when using use_dictionary with ParquetWriter)

2022-07-17 Thread Wes McKinney
hi -- I git-bisected and found the backwards-compat regression, and reported here https://issues.apache.org/jira/browse/ARROW-17100 On Wed, Jul 6, 2022 at 4:16 PM Wes McKinney wrote: > > hi — did you ever resolve this issue? We should try to identify what > is causing this failure and

Re: [C++] Help with Parquet backward compatibility regression between 2.0.0 and 3.0.0

2022-07-18 Thread Wes McKinney
On Mon, Jul 18, 2022 at 2:35 AM Antoine Pitrou wrote: > > > Le 18/07/2022 à 03:54, Wes McKinney a écrit : > > This patch caused Parquet files written with 2.0.0 to be unreadable in > > 3.0.0 onward > > > > https://github.com/apache/arrow/commit/ef0feb2

[C++] Moving from -O3 to -O2 optimization level in release builds

2022-07-20 Thread Wes McKinney
hi all, Antoine and I were digging into a weird issue where gcc in -O3 generated ~40KB of optimized code for a function which was less than 2KB in -O2, and where a "leaner" implementation (in PR 13654) was yet faster and smaller. You can see some of the discussion at https://github.com/apache/arr

Re: [C++] Moving from -O3 to -O2 optimization level in release builds

2022-07-21 Thread Wes McKinney
es selectively that > can be demonstrated to benefit from it (if anyone actually spends the time to > look into it). > > Sasha > > > On Jul 20, 2022, at 2:10 PM, Wes McKinney wrote: > > > > hi all, > > > > Antoine and I were digging into a weird issue w

Re: Proposal: renaming the 'master' branch to 'main'

2022-07-25 Thread Wes McKinney
hi all, Do you think we could make a push to make this happen after the 9.0.0 release goes out? Thanks Wes On Tue, Feb 15, 2022 at 2:32 PM Fiona La wrote: > > Thank you Antoine for bringing up the engineering work that is required to > enable this. And thank you Neal for sharing the link to th

Re: Help needed with PR #13659: Fixing build/unit test issues in msvc/win32

2022-07-25 Thread Wes McKinney
Suppressing the warnings on 32-bit MSVC sounds like a reasonable compromise. Is there an open PR for this (and what is the corresponding Jira issue so we don't lose track of it)? On Fri, Jul 22, 2022 at 1:23 PM Arkadiy Vertleyb (BLOOMBERG/ 120 PARK) wrote: > > Or live with the warnings. Or cast

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Wes McKinney
We had an e-mail thread about this in 2018 https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of re

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
This seems like a common-enough data type that having a first-class logical type would be a good idea (perhaps even more so than UUID!). Compute engines would be able to implement kernels that provide manipulations of JSON data similar to what you can do with jq or GraphQL. On Fri, Jul 29, 2022 at

Re: [ARROW-17255] Logical JSON type in Arrow

2022-07-29 Thread Wes McKinney
his (Disclaimer I'm a > > colleague of Padeep's) > > > > [1] https://arrow.apache.org/docs/format/Columnar.html#extension-types > > > > > > On Fri, Jul 29, 2022 at 3:19 PM Wes McKinney wrote: > > > > > This seems like a common-enoug

[DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-07-29 Thread Wes McKinney
hi all, Since we've been recently discussing adding new data types, memory formats, or data encodings to Arrow, I wanted to bring up a more "big picture" question around how we could support data whose encodings may change throughout the lifetime of a data stream sent via the IPC format (e.g. over

Re: [DISCUSS][Format] Dynamic data encodings in the IPC format and C ABI

2022-07-30 Thread Wes McKinney
is V5, so if we added a new batch type allowing for encodings, sparseness, etc., then we would need to bump the MetadataVersion to V6, but libraries implementing V6 metadata should be able to operate in V5 compatibility mode (sending non-encoded data in the current IPC format). > > [1]

[DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-07-30 Thread Wes McKinney
hi folks, I'm interested to start doing some work to implement the "StringView" memory layout that we previously discussed late last year [1] with supporting document [2]. Since there's quite a few details to work out, my objective would be to do the work in a feature branch focused on a few thin

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-01 Thread Wes McKinney
On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou wrote: > > > Hi Wes, > > Le 31/07/2022 à 00:02, Wes McKinney a écrit : > > > > I understand there are still some aspects of this project that cause > > some squeamishness (like having arbitrary memory addresses embed

Re: [ARROW-17255] Logical JSON type in Arrow

2022-08-02 Thread Wes McKinney
I should add that since Parquet has JSON, BSON, and UUID types, that while UUID is just a simple fixed sized binary, that having the extension types so that the metadata flows through accurately to Parquet would be net beneficial: https://github.com/apache/parquet-format/blob/master/src/main/thrif

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-02 Thread Wes McKinney
On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou wrote: > > > Le 01/08/2022 à 19:13, Wes McKinney a écrit : > > > > If we start placing restrictions on how the out-of-line string buffers > > are managed and externalized, it risks undermining the zero-copy > > int

Re: [FlightSQL][JDBC] Additional changes to the JDBC driver

2022-08-05 Thread Wes McKinney
If you want to merge the cleared IP into a new branch rather than master, that is fine, too. It's not necessary to land it in the main branch On Tue, Aug 2, 2022 at 4:18 PM David Li wrote: > > Would it be OK to get what's there into the main branch first? i.e., open a > PR from the apache/flight

Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

2022-08-05 Thread Wes McKinney
e at the very least some intermediate copies can be > skipped. > > Thanks, > Gosh > > On Tue, Aug 2, 2022, 2:49 PM Wes McKinney wrote: > > > On Tue, Aug 2, 2022 at 1:02 AM Antoine Pitrou wrote: > > > > > > > > > Le 01/08/2022 à 19:13, Wes McKinney a é

Re: [C++] Purpose of C++ bundled dependencies

2022-08-05 Thread Wes McKinney
The current libarrow_bundled_dependencies.a was created to address the problem of libarrow.a being "useless" (unable to be used to link with application code) if any dependencies were built by the Arrow build system (notably: this is the case when using the default allocator jemalloc). I'm not sure wh

Re: Apache Software Foundation community survey 2022

2022-09-06 Thread Wes McKinney
hi Antoine — thank you for circulating this survey. Even though it takes a few minutes to complete I encourage community members to take the time to participate since data about community participation helps the ASF do better in the future. Thanks, Wes On Thu, Aug 25, 2022 at 2:10 AM Antoine Pitr

Re: [C++] Read Flight data source into Acero

2022-09-07 Thread Wes McKinney
This seems like something where there should be ready-to-go code in the Arrow codebase to feed any RecordBatchReader into Acero On Thu, Aug 18, 2022 at 12:15 PM Li Jin wrote: > > Thanks all. I will try this out. > > On Thu, Aug 18, 2022 at 9:06 AM Rok Mihevc wrote: > > > +1 for adding this eithe

Re: [ANNOUNCE] New Arrow PMC member: L. C. Hsieh

2022-09-07 Thread Wes McKinney
Congrats! On Mon, Sep 5, 2022 at 2:05 PM Raul Cumplido Dominguez wrote: > > Congratulations! > > El lun, 5 sept 2022, 20:05, Ian Joiner escribió: > > > Congrats L.C.! > > > > On Sat, Sep 3, 2022 at 5:39 PM Sutou Kouhei wrote: > > > > > The Project Management Committee (PMC) for Apache Arrow has

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-08 Thread Wes McKinney
Congrats Weston!! On Tue, Sep 6, 2022 at 8:21 PM Krisztián Szűcs wrote: > > Congrats Weston! > > On Wed, Sep 7, 2022 at 1:41 AM Percy Camilo Triveño Aucahuasi > wrote: > > > > Great news! Congratulations Weston! > > > > On Tue, Sep 6, 2022 at 1:42 PM Andy Grove wrote: > > > > > Congrats Weston!

Re: Arrow Flight usage with graph databases

2022-09-08 Thread Wes McKinney
hi Bill — you can unsubscribe by e-mailing dev-unsubscr...@arrow.apache.org On Tue, Sep 6, 2022 at 2:40 PM Bill Zhao wrote: > > unsubscribe > > Valentyn Kahamlyk 于2022年7月18日周一 16:56写道: > > > > Hi All, > > > > I'm investigating the possibility of using Arrow Flight with graph > > databases, and

Re: DISCUSS: [Format] Rules and procedures for Canonical extension types

2022-09-08 Thread Wes McKinney
+1 to this proposal. It would be great to use the JSON type as a crash dummy to work out the kinks in the process, but I think there are meaningful benefits (Parquet round-tripping) to getting this work under way. On Wed, Aug 24, 2022 at 11:22 AM Antoine Pitrou wrote: > > > Le 17/08/2022 à 18:45,

Re: [VOTE] Substrait for Flight SQL

2022-09-09 Thread Wes McKinney
+1 (binding) On Thu, Sep 8, 2022 at 9:12 PM Jacques Nadeau wrote: > > My vote continues to be +1 > > On Thu, Sep 8, 2022 at 11:44 AM Neal Richardson > wrote: > > > +1 > > > > Neal > > > > On Thu, Sep 8, 2022 at 2:15 PM Ashish wrote: > > > > > +1 (non-binding) > > > > > > On Thu, Sep 8, 2022 at

Re: [ANNOUNCE] New Arrow committer: Remzi Yang

2022-09-10 Thread Wes McKinney
Congratulations! On Sat, Sep 10, 2022 at 7:12 AM Andrew Lamb wrote: > > On behalf of the Arrow PMC, I'm happy to announce that Remzi Yang > has accepted an invitation to become a committer on Apache > Arrow. Welcome, and thank you for your contributions! > > Andrew

Re: [ANNOUNCE] New Arrow PMC member: Raphael Taylor-Davies

2022-09-20 Thread Wes McKinney
Congratulations! On Tue, Sep 20, 2022 at 12:37 PM Ashish wrote: > > Congratulations !! > > On Tue, Sep 20, 2022 at 10:17 AM Ian Joiner wrote: > > > Congrats Raphael! > > > > On Mon, Sep 19, 2022 at 9:56 PM Sutou Kouhei wrote: > > > > > The Project Management Committee (PMC) for Apache Arrow has

Re: [Discuss] Deprecating Plasma

2022-09-26 Thread Wes McKinney
+1 On Thu, Sep 22, 2022 at 11:59 PM Sutou Kouhei wrote: > > +1 > > In > "[Discuss] Deprecating Plasma" on Thu, 22 Sep 2022 17:38:27 +0200, > Antoine Pitrou wrote: > > > > > Hello, > > > > The Plasma object store (*) hasn't received significant maintenance > > since at least 2020. The origin

Re: [DISCUSS] Python Wheel Size

2022-10-10 Thread Wes McKinney
We've discussed this in the past, I think. In addition to having many optional components enabled, the pyarrow wheel also includes the unit tests directory which is of growing size. I think if we made a pyarrow-slim wheel with support only for core Arrow (IPC, etc.) and Parquet file reading, it mig

Re: [ANNOUNCE] New Arrow PMC member: Nicola Crane

2022-10-27 Thread Wes McKinney
Congratulations! On Wed, Oct 26, 2022 at 4:56 PM Ian Joiner wrote: > > Congrats Nic! > > Ian > > On Tuesday, October 25, 2022, Sutou Kouhei wrote: > > > The Project Management Committee (PMC) for Apache Arrow has invited > > Nicola Crane to become a PMC member and we are pleased to announce > >

Re: Apache Arrow filesystem question

2022-10-27 Thread Wes McKinney
I definitely think it would be a good thing to have a C++ ADLS filesystem interface that is on par in quality with our S3 and GCS C++ interfaces — these should also provide material performance benefits to Python users over a pure-Python interface (I'm not sure if pyarrow's S3 interface via C++ has

Re: [DISCUSS][C++] Raw pointer string views

2023-09-28 Thread Wes McKinney
hi all, I'm just catching up on this thread after having taken a look at the format PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02 from having spent a great deal less time on this project than others. The original motivation I had for bringing up the idea of adding the S

Re: CIDR 2024

2023-12-05 Thread Wes McKinney
I will also be there. On Mon, Dec 4, 2023 at 12:58 PM Tony Wang wrote: > I am > > Get Outlook for Android > > From: Curt Hagenlocher > Sent: Monday, December 4, 2023 12:53:00 PM > To: dev@arrow.apache.org > Subject: CIDR 2024 > > Who's g

Re: [DISCUSS] Conventions for transporting Arrow data over HTTP

2024-01-08 Thread Wes McKinney
hi all — I was just catching up on e-mail threads and wanted to give a few historical comments on this. When we were assembling the Arrow PMC and committing to do the project in 2015, standardizing Arrow-over-REST was always something that was on the TODO list — at that time we didn't have the IPC

Re: [VOTE] Accept donation of Comet Spark native engine

2024-01-27 Thread Wes McKinney
+1 (binding) On Sat, Jan 27, 2024 at 12:26 PM Micah Kornfield wrote: > +1 Binding > > On Sat, Jan 27, 2024 at 10:21 AM David Li wrote: > > > +1 (binding) > > > > On Sat, Jan 27, 2024, at 13:03, L. C. Hsieh wrote: > > > +1 (binding) > > > > > > On Sat, Jan 27, 2024 at 8:10 AM Andrew Lamb > > wr

Re: [DISCUSS] Status and future of @ApacheArrow Twitter account

2024-01-29 Thread Wes McKinney
Is there a different tool other than TweetDeck available that can synchronize posts that go out on different social channels (LinkedIn, Twitter, Mastodon, etc.)? I've heard of things like Hootsuite but that's pretty expensive and definitely overkill for an open source project, but perhaps there is

Re: [DISCUSS] Donation of a Spark native engine based on DataFusion & Arrow

2024-02-11 Thread Wes McKinney
Congrats all! It's great to see the Arrow+DataFusion ecosystem expand in this way and to bring the work under the ASF umbrella. On Sun, Feb 11, 2024 at 5:02 AM Andrew Lamb wrote: > As a follow up here the acceptance vote [1] has passed, the IP Clearance > Process is complete [2] and the code PR

Re: [VOTE] Protocol for Dissociated Arrow IPC Transports

2024-02-27 Thread Wes McKinney
Have there been efforts to proactively reach out to other third parties that might have an interest in this or be a potential user at some point? There are a lot of interested parties in Arrow that may not actively follow the mailing list. Seems like folks from the Dask, Ray, RAPIDS (especially fo

Re: [DISCUSS][RFC] Draft Proposal for new Top Level Project for DataFusion

2024-02-28 Thread Wes McKinney
I'd be happy to help. I think we will have to participate in PMC matters infrequently (should there be a difficult issue in the future, we could offer some perspective from cases in the past). On Wed, Feb 28, 2024 at 2:13 PM Andrew Lamb wrote: > Wes brought up a great point on the document[1] th

Re: [VOTE] Move Arrow DataFusion Subproject to new Top Level Apache Project

2024-03-01 Thread Wes McKinney
D, that the office of "Vice President, Apache DataFusion" be > > >> > and hereby is created, the person holding such office to > > >> > serve at the direction of the Board of Directors as the chair > > >> > of the Apache DataFusion Project, and to have primary re

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-18 Thread Wes McKinney
Congrats! On Mon, Mar 18, 2024 at 12:15 PM James Duong wrote: > Congratulations Bryce! > > From: Dane Pitkin > Date: Monday, March 18, 2024 at 7:28 AM > To: dev@arrow.apache.org > Subject: Re: [ANNOUNCE] New Arrow committer: Bryce Mecum > Congratulations, Bryce!! > > On Mon, Mar 18, 2024 at 9:

Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread Wes McKinney
Congrats! On Mon, Apr 1, 2024 at 11:01 AM Andrew Lamb wrote: > Congratulations Joel. > > On Mon, Apr 1, 2024 at 11:53 AM Raúl Cumplido > wrote: > > > Congratulations and welcome Joel! > > > > > > El lun, 1 abr 2024, 17:18, Kevin Gurney > > escribió: > > > > > Congratulations, Joel! > > > > > >

Re: Unsupported/Other Type

2024-04-10 Thread Wes McKinney
In the past we have discussed adding a canonical type for UUID and JSON. I still think this is a good idea and could improve ergonomics in downstream language bindings (e.g. by exposing JSON querying function or automatically boxing UUIDs in built-in UUID types, like the Python uuid library). Has a

Re: Fwd: PyArrow Using Parquet V2

2024-04-24 Thread Wes McKinney
I think there is confusion about the Parquet "V2" (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same unfortunately. I don't think the V2 metadata structures (the data pages in particular, and new column encoding) is widely ad

Re: [VOTE][Format] JSON canonical extension type

2024-05-06 Thread Wes McKinney
+1 On Tue, Apr 30, 2024 at 4:03 PM Antoine Pitrou wrote: > +1 (binding) for the current proposal, i.e. with the RFC 8259 > requirement and the 3 current String types allowed. > > Regards > > Antoine. > > > Le 30/04/2024 à 19:26, Rok Mihevc a écrit : > > Hi all, thanks for the votes and comments

Re: [VOTE][Format] UUID canonical extension type

2024-05-06 Thread Wes McKinney
+1 On Tue, Apr 30, 2024 at 4:03 PM Antoine Pitrou wrote: > +1 (binding) > > > Le 19/04/2024 à 22:22, Rok Mihevc a écrit : > > Hi all, > > > > Following initial requests [1][2] and recent tangential ML discussion > [3] I > > would like to propose a vote to add language for UUID canonical extensio

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Wes McKinney
+1 (binding for Arrow and Parquet) On Wed, May 29, 2024 at 12:13 PM Raúl Cumplido wrote: > +1 (binding for Arrow) > > El mié, 29 may 2024, 18:15, Andy Grove escribió: > > > +1 (binding for Arrow). > > > > Thanks, > > > > Andy. > > > > On Wed, May 29, 2024 at 9:48 AM Alenka Frim > .invalid> > >

Re: Understanding possible synergies between arrow & zarr communities?

2024-07-10 Thread Wes McKinney
hi Carl, I agree that cross-collaboration and knowledge/tools sharing could be very helpful. Even though we've done a lot of engineering on low-level IO and memory management, there are probably still many aspects of the Parquet C++ reader (what powers pyarrow.parquet) that could be improved to do

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Wes McKinney
From a historical perspective, if we had had extension types / canonical extension types, it would have made more sense to have the millisecond dates as an extension type. The goal of having the extra type was to avoid an unnecessary serialization in systems where there is a benefit to moving dat

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
I am also under the impression that the file format is supposed to support deltas, but not replacements. Is this not implemented in C++? On Thu, Mar 18, 2021 at 9:57 PM Nate Bauernfeind wrote: > If dictionary replacements were supported, then the IPC file format > couldn't guarantee random acces
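The delta-versus-replacement point in this snippet can be illustrated with a minimal pure-Python sketch. It is not the Arrow IPC implementation: the batch layout and the names `apply_delta`, `decode`, and `batches` are hypothetical, chosen only to show why append-only deltas keep every batch decodable on its own while a replacement would not.

```python
def apply_delta(dictionary, delta):
    """Append new entries; indices handed out earlier keep their meaning."""
    return dictionary + delta


def decode(dictionary, indices):
    """Look up each index in the dictionary."""
    return [dictionary[i] for i in indices]


# Hypothetical file contents: each batch carries indices plus any delta entries.
batches = [
    {"delta": ["a", "b"], "indices": [0, 1, 0]},
    {"delta": ["c"], "indices": [2, 0]},
    {"delta": [], "indices": [1, 2]},
]

# Accumulate all deltas once, in order.
final_dictionary = []
for batch in batches:
    final_dictionary = apply_delta(final_dictionary, batch["delta"])

# Random access: any single batch decodes correctly against the final dictionary.
first = decode(final_dictionary, batches[0]["indices"])

# A replacement (assumed values) reassigns the meaning of old indices,
# so the first batch can no longer be read without its original dictionary.
replaced_dictionary = ["x", "y", "z"]
wrong = decode(replaced_dictionary, batches[0]["indices"])
```

With deltas, a reader only needs the accumulated dictionary to jump to any batch. With replacements, it would have to know which dictionary was current when each batch was written.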

Re: [DISCUSS] How to encode table_pivot information state in Arrow

2021-03-19 Thread Wes McKinney
> It seems that the schema changes to arrow is a custom solution for just > Perspective and it might be prudent to wait for Arrow 4 that will have a > standard way of representing this information. Arrow 4.0.0 is not going to have the pivot table structures you are looking for (speaking as the o

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
dictionaries” that existed at the time that the file was written are not recoverable, but that seems like an acceptable compromise. On Fri, Mar 19, 2021 at 10:34 AM Antoine Pitrou wrote: > > Le 19/03/2021 à 13:37, Wes McKinney a écrit : > > I am also under the impression that the file format

Re: No replacement dictionaries supported in pyarrow?

2021-03-19 Thread Wes McKinney
really matters? Intuitively, it seems to me > that if your data is really large, you may be better off with a more > space-optimized format such as Parquet. > > > Le 19/03/2021 à 19:49, Wes McKinney a écrit : > > Okay, let’s open an issue then to address that at some poi

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-19 Thread Wes McKinney
> I might be misunderstanding, but I think Weld [1] is another project > targeting the lower level components? Weld IR is _really_ low level (not an expert, but have read the papers), see [1] for more > Also, I think there was a little bit of effort to come up with a common > expression represe

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-20 Thread Wes McKinney
"MINOR:" might be a better marker since "trivial" carries the connotation of "unimportant" to me (the dictionary says "of little value or importance"). On Sat, Mar 20, 2021 at 2:21 PM Micah Kornfield wrote: > > Opened https://issues.apache.org/jira/browse/ARROW-12034 for trivial PRs. > > I think

Re: [DISCUSS] Improving Contributor Guidelines

2021-03-20 Thread Wes McKinney
like the confluence site[1] has > had most of its content migrated into the project docs [2] already. > > [1] https://cwiki.apache.org/confluence/display/ARROW > [2] https://arrow.apache.org/docs/developers/contributing.html > > On Sat, Mar 20, 2021 at 2:31 PM Wes McKinne

Re: [VOTE] Accept donation of Rust Ballista project

2021-03-21 Thread Wes McKinney
+1 (binding) Since Ballista has ~40 contributors but AFAIK no corporation that needs to make a Software Grant, it might be worth consulting with the Incubator folks to see what kind of due diligence needs to be done so we are covering our bases. I doubt it will be practical (or even possible) to c

Re: sparse data array

2021-03-24 Thread Wes McKinney
The SparseTensor stuff is something else entirely (that's matrices where the entries are mostly 0) There isn't anything to help you right now aside from dictionary encoding — if your dictionary has 256 elements or less, you can use uint8 index type and thus have 1 byte per value. We've discussed i
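As a rough illustration of the dictionary-encoding suggestion above, here is a minimal pure-Python sketch. It does not use the Arrow libraries; `dictionary_encode` and `dictionary_decode` are hypothetical helper names, null handling is ignored, and the standard-library `array` type with code "B" stands in for a uint8 index buffer.

```python
from array import array


def dictionary_encode(values):
    """Return (dictionary, indices); indices are unsigned 8-bit integers."""
    dictionary = []
    positions = {}
    indices = array("B")  # one byte per value
    for value in values:
        idx = positions.get(value)
        if idx is None:
            idx = len(dictionary)
            if idx > 255:
                # uint8 can address at most 256 distinct dictionary entries
                raise ValueError("more than 256 distinct values")
            positions[value] = idx
            dictionary.append(value)
        indices.append(idx)
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Expand indices back into the original values."""
    return [dictionary[i] for i in indices]


# Mostly repeated data: three distinct values spread over eight slots.
values = [0.0, 0.0, 3.5, 0.0, 0.0, 3.5, 7.25, 0.0]
dictionary, indices = dictionary_encode(values)
```

Each 8-byte float in `values` is stored once in `dictionary`, and every slot costs a single index byte, which is the "1 byte per value" trade-off described in the message.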

Re: [C++] Dataset API simplification

2021-03-26 Thread Wes McKinney
I agree with making the decomposition of a fragment into tasks an internal detail of the scan implementation. It seems that we want to be moving toward a world of consuming a stream of Future> and not pushing the complexity of concurrency management (necessarily) onto the consumer. The nature of mu

Re: sparse data array

2021-03-27 Thread Wes McKinney
such encoding would address DataFusion's issue of > > > representing scalars / constant arrays: a constant array would be > > > represented as a repetition. Currently we just unpack (i.e. allocate) a > > > constant array when we want to transfer through a RecordBatch.

Re: [R][Rust][IPC] Attempting to pass RecordBatch from R to Rust via C ABI

2021-03-29 Thread Wes McKinney
If you are looking for true zero-copy R/Rust interop, then using the C interface is the way to go. You shouldn't need to depend on Python to have this, so we may need to refactor some things on the R side to compartmentalize anything relating to Python specifically. On Sun, Mar 28, 2021 at 10:04

Status of Arrow Julia implementation?

2021-03-30 Thread Wes McKinney
hi folks, I was very surprised today to learn that the Julia Arrow implementation has continued operating more or less like an independent open source project since the code donation last November: https://github.com/JuliaData/Arrow.jl/commits/main There may have been a misunderstanding about wh

Re: Status of Arrow Julia implementation?

2021-03-30 Thread Wes McKinney
t channels is not compatible. Building healthy open source communities is hard, but this way has been shown to work well, which is why I've spent the last 6 years working hard to bring people together to build this project and ecosystem! If you want to maintain a test harness here to verify a

Re: Status of Arrow Julia implementation?

2021-03-30 Thread Wes McKinney
independently because there wasn't enough development activity to justify it. [1]: https://www.mail-archive.com/dev@arrow.apache.org/msg05971.html On Tue, Mar 30, 2021 at 4:54 PM Wes McKinney wrote: > > hi Jacob, > > On Tue, Mar 30, 2021 at 4:18 PM Jacob Quinn wrote: > > > >

Re: [RESULT] [VOTE] Accept donation of Rust Ballista project

2021-03-31 Thread Wes McKinney
hi Andy — before you start an IP clearance vote, you need to add an entry on https://incubator.apache.org/ip-clearance/ and run through the clearance checklist, let me know if you have trouble and I can help you. Thanks On Wed, Mar 31, 2021 at 8:47 AM Andy Grove wrote: > > CLAs have been submitt

Re: Arrow sync call March 31 at 12:00 US/Eastern, 16:00 UTC

2021-03-31 Thread Wes McKinney
The Google Meet link is on dremio.com, so there must not be someone from the org to let people in. What do folks think about moving to Zoom for future meetings (which shouldn't have this problem)? On Wed, Mar 31, 2021 at 11:07 AM Jonathan Keane wrote: > > I'm experiencing the same here. > > On We

Re: Arrow sync call March 31 at 12:00 US/Eastern, 16:00 UTC

2021-03-31 Thread Wes McKinney
It does, but I would suggest that someone volunteer to host the call each week and send out a Zoom link for that week's call On Wed, Mar 31, 2021 at 11:11 AM Antoine Pitrou wrote: > > > I'm fine with Zoom. But doesn't need it a host as well? > > > Le 31/03/2

Re: [RESULT] [VOTE] Accept donation of Rust Ballista project

2021-04-01 Thread Wes McKinney
n we could just add this functionality under the > DataFusion branding perhaps. > > Thanks, > > Andy. > > [1] https://tmsearch.uspto.gov/bin/showfield?f=doc&state=4802:oty9vz.2.11 > [2] https://tmsearch.uspto.gov/bin/showfield?f=doc&state=4802:oty9vz.2.3 > [3] htt

Re: Status of Arrow Julia implementation?

2021-04-02 Thread Wes McKinney
ike, for example, showing how Julia integrates with the > archery test suite, once the work there is done. > > Best, > > -Jacob > > > > On Tue, Mar 30, 2021 at 4:10 PM Wes McKinney wrote: > > > Also, on the issue that there are no Julia-focused PMC members — note

Re: [Format][RFC] Introduce COMPLEX type for IntervalUnit

2021-04-04 Thread Wes McKinney
>> > >>>> > >>> Micah > >>>> > >>> > >>>> > >>> On 2021/02/18 04:30:55 Micah Kornfield wrote: > >>>> > >>>>> > >>>> > >>>>> I didn’t find any page/docu

Re: Status of Arrow Julia implementation?

2021-04-07 Thread Wes McKinney
> work well going forward, and has, until this thread started and it was > pointed out that this process isn't viable. The pain points were discussed > with the initial code donation, but in my mind were resolved with the > development process that was decided upon. > >

Re: Rust sync meeting

2021-04-07 Thread Wes McKinney
I'm sorry to be the PMC worry wart around here, but I'm curious what is the plan (if any) with these repositories https://github.com/jorgecarleitao/arrow2 https://github.com/jorgecarleitao/parquet2 I understand that large new projects like this are sometimes necessary, but what some Apache projec

Re: Rust sync meeting

2021-04-07 Thread Wes McKinney
ble to participate in the call (because of time zones or English fluency) and so it's important to have written discussions to explain. On Wed, Apr 7, 2021 at 6:27 AM Wes McKinney wrote: > > I'm sorry to be the PMC worry wart around here, but I'm curious what > is th

Re: [C++] (Eventually) merging asynchronous datasets feature

2021-04-07 Thread Wes McKinney
I would also lean in the direction of progress to get user feedback sooner — if our test suite passes stably then it is probably okay to merge, and if it's possible (without great hardship) to have a fallback to the non-async version (so there's a workaround if there end up being show-stopping bugs

Re: Rust sync meeting

2021-04-08 Thread Wes McKinney
With both what has occurred with the Julia project and what may possibly be occurring (what appears to me to be occurring) with these Rust overhaul projects, is that the community's expectations with regards to Openness aren't being followed. If a change is significant and will affect other develo

Re: Rust sync meeting

2021-04-08 Thread Wes McKinney
On Thu, Apr 8, 2021 at 7:49 AM Wes McKinney wrote: > > With both what has occurred with the Julia project and what may > possibly be occurring (what appears to me to be occurring) with these > Rust overhaul projects, is that the communities expectations with > regards to Openne

Re: Rust sync meeting

2021-04-08 Thread Wes McKinney
ts with memory. > > " I do think that I was vocal enough. At some point > the interactions here started to affect my wellbeing and I thus decided to > scale down by efforts." > > On Thu, Apr 8, 2021 at 6:03 AM Wes McKinney wrote: > > > On Thu, Apr 8, 2021 at 7:49

Re: Rust sync meeting

2021-04-08 Thread Wes McKinney
] proposal to redesign > > Arrow crate to resolve safety violations" on 7th February before any commit > > in arrow2 (resulting in zero discussion or any objection)? > > > > Best regards, > > Adam Lippai > > > > On Thu, Apr 8, 2021 at 4:41 PM Wes McK

Re: [ANNOUNCE] [Rust] Ballista donation has been merged

2021-04-08 Thread Wes McKinney
Congrats Andy! I know this was a lot of work, but I think it speaks to a bright future for the Arrow ecosystem. Once we can sort out the release and code management concerns to suit the needs of the Rust ecosystem, I trust you will all be on a good path. On Thu, Apr 8, 2021 at 6:05 PM Andy Grove
