Re: [DISCUSS] Split Swift to separated repository

2025-05-19 Thread Antoine Pitrou
+1, I am supportive of this change as well. Regards Antoine. Le 16/05/2025 à 10:48, Sutou Kouhei a écrit : Hi, This is a similar discussion to the "[DISCUSS] Split Go release process" thread[1], the "[DISCUSS] Split Java release process" thread[2], the "[DISCUSS] Split R release process" t

[DISCUSS][C++] Switch to C++20

2025-05-19 Thread Antoine Pitrou
Hello, I am proposing that we switch Arrow C++ to require C++20. C++20 will offer support for more C++ language and standard library features, such as: - concepts - generic lambdas with explicit type parameters - designated initializers - calendar and timezone functions (currently, our Wind

Re: [DISCUSS] Arrow Variant Extension Type

2025-05-12 Thread Antoine Pitrou
Le 12/05/2025 à 18:20, Matt Topol a écrit : > It's not just Parquet Variant, it's also Iceberg (which has > standardized on this) and Spark in-memory (where this encoding scheme > originated). Ok, but it's called Parquet Variant now, since that's where the binary spec lives: https://github.co

Re: [DISCUSS] Re-enabling s390x CI

2025-05-12 Thread Antoine Pitrou
Hello, I'm sure the technical details can be ironed out, but the question is more whether someone is willing to do the maintenance work required to keep Arrow working on big-endian platforms, and if possible enable it for more components (most of us don't have access to such a platform). If

Re: [DISCUSS] Arrow Variant Extension Type

2025-05-11 Thread Antoine Pitrou
Hi Matt, Thanks for putting this together. I think we should make clear that this extension type is for transporting Parquet Variants. If we were to design a Variant type specifically for Arrow, it would probably look a bit different (in particular, we would make a better use of validity bi

Re: [Call For Volunteer] Apache Arrow Summit and Selection Committee

2025-05-10 Thread Antoine Pitrou
ith selection! --Matt On Fri, May 9, 2025, 8:26 AM Raúl Cumplido wrote: Hi, I already plan to attend PyData Paris so I would like to volunteer too. Thanks, Raúl El vie, 9 may 2025 a las 13:33, Antoine Pitrou () escribió: Hi JB, I'm volunteer too. Regards Antoine. Le 09/05/2025

Re: [Call For Volunteer] Apache Arrow Summit and Selection Committee

2025-05-09 Thread Antoine Pitrou
Hi JB, I'm volunteer too. Regards Antoine. Le 09/05/2025 à 13:00, Jean-Baptiste Onofré a écrit : Hi everyone, The Arrow PMC is pleased to announce Arrow Summit 25. The Arrow Summit 2025 is a community event, organised by a Selection Committee. The event’s focus is to build community arou

Re: [VOTE] Split JS implementation and Release Process

2025-05-07 Thread Antoine Pitrou
+1 (binding) Le 07/05/2025 à 10:48, Raúl Cumplido a écrit : Hi, I would like to propose splitting the JS implementation and the corresponding release process to its own repository. Motivation: * We want to reduce needless major releases to avoid unnecessary user burden. * We want to avoid

[C++] Deprecate Skyhook?

2025-05-05 Thread Antoine Pitrou
Hello, "Skyhook" is a little-known C++ component that interfaces Arrow with the Ceph distributed filesystem. It received its last non-trivial change in 2022: https://github.com/apache/arrow/commit/546c3771a209cbcac5e03cf26e07bcd8c9601d5a You won't find much documentation for it except for an

Re: [VOTE] Flight SQL: support remarks field in xDBC column metadata

2025-04-29 Thread Antoine Pitrou
Hello, +1 from me (binding). Side question: is the FlightSQL spec versioned? Regards Antoine. Le 27/04/2025 à 11:45, David Li a écrit : Hello, Mateusz Rzeszutek has proposed adding a "remarks" field in xDBC column metadata in Flight SQL [1]. This better aligns Flight SQL with existing A

Re: [C++][DISCUSS] FileSystem construction from URIs and secrets

2025-04-23 Thread Antoine Pitrou
Hi Ben and all, Sorry for chiming in late. I do find the URI-and-kv-pairs interface attractive. That said, some filesystem options can't reasonably be expressed as strings. For example, `S3Options` has a `std::shared_ptr<const KeyValueMetadata> default_metadata` and a `std::shared_ptr`. So, p

Re: [Discuss][C++] Deprecate precompiled headers option?

2025-04-22 Thread Antoine Pitrou
er hygiene (compared to other codebases I've worked on). With a little bit more effort we can probably eliminate long header include chains. -- Felipe On Wed, Oct 2, 2024 at 6:53 AM Antoine Pitrou wrote: Hello, Long ago, we added a ARROW_USE_PRECOMPILED_HEADERS to the Arrow C++ CMake

[DISCUSS][C++] Adding new IPC option 'ensure_memory_alignment'

2025-04-05 Thread Antoine Pitrou
As I suggested in https://github.com/apache/arrow/pull/44279#issuecomment-2757128297 , do we want to make this a `enum` for a more future-proof API? i.e., instead of: ``` bool ensure_alignment = false; ``` have: ``` enum Alignment { kNoAlignment, kNaturalAlignment }; Alignment ensure_ali
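The enum-versus-bool point above can be sketched as follows. This is only an illustration under stated assumptions: `IpcReadOptionsSketch`, `RequiredAlignment`, `IsAligned` and the `kCacheLineAlignment` value are hypothetical names that do not come from the thread or from Arrow's API, and "natural alignment" is assumed here to mean alignment to the size of the value type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names throughout; this is not Arrow's actual IPC API.
enum class Alignment {
  kNoAlignment,        // accept buffers as they are
  kNaturalAlignment,   // align each buffer to the size of its value type
  kCacheLineAlignment  // hypothetical later addition: 64-byte alignment
};

// Hypothetical options struct holding the enum instead of a bool.
struct IpcReadOptionsSketch {
  Alignment ensure_alignment = Alignment::kNoAlignment;
};

// Alignment in bytes that a buffer of `value_size`-byte values must satisfy.
constexpr std::size_t RequiredAlignment(Alignment mode, std::size_t value_size) {
  switch (mode) {
    case Alignment::kNoAlignment: return 1;
    case Alignment::kNaturalAlignment: return value_size;
    case Alignment::kCacheLineAlignment: return 64;
  }
  return 1;  // unreachable for valid enum values
}

// True if `ptr` is a multiple of `alignment` (which must be non-zero).
inline bool IsAligned(const void* ptr, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}
```

With a `bool`, a third mode like the hypothetical `kCacheLineAlignment` could only be added through a second option or an API break; with the enum it is one more enumerator and one more `case`.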

[ANNOUNCE] New Arrow PMC member: Rok Mihevc

2025-03-28 Thread Antoine Pitrou
Hello all, The Project Management Committee (PMC) for Apache Arrow has invited Rok Mihevc to become a PMC member and we are pleased to announce that Rok has accepted. Regards Antoine.

Re: Request for comments on adding new IPC option 'ensure_memory_alignment'

2025-03-27 Thread Antoine Pitrou
Le 27/03/2025 à 18:14, Raphael Taylor-Davies a écrit : It's obviously preferable to be zero-copy but it's certainly not mandatory, especially as the data being shared is assumed to be read-only in most use cases. In which case we should probably remove the comment about alignment from the C i

Re: Request for comments on adding new IPC option 'ensure_memory_alignment'

2025-03-27 Thread Antoine Pitrou
Hello, Le 27/03/2025 à 17:53, Raphael Taylor-Davies a écrit : The current ambiguity, however, makes it hard to set reasonable defaults, as it isn't clear if FFI should be zero-copy and therefore have alignment restrictions or not. It's obviously preferable to be zero-copy but it's certainl

Re: Arrow Flight Endpoint Location URLs

2025-03-27 Thread Antoine Pitrou
Indeed, it doesn't sound like a terrific use of Arrow maintainer time... Especially as there's a growing feeling that Flight was not very well designed, and should perhaps be slowly obsoleted in favor of more focussed initiative (such as the Arrow-over-HTTP effort that's still not finished :-

Re: [ANNOUNCE] New Arrow PMC member: Jacob Wujciak

2025-03-17 Thread Antoine Pitrou
Congratulations Jacob :) Le 17/03/2025 à 18:28, Jacob Wujciak a écrit : Thank you everyone! Bryce Mecum schrieb am Mo., 17. März 2025, 17:25: Congrats! On Sun, Mar 16, 2025 at 10:23 PM Sutou Kouhei wrote: The Project Management Committee (PMC) for Apache Arrow has invited Jacob Wujciak

Re: [DISCUSS] Do we want to enable GitHub Discussions for apache/arrow?

2025-03-16 Thread Antoine Pitrou
We can start with users@ and, if the experience is subpar, switch to something else. Regards Antoine. Le 16/03/2025 à 15:04, Weston Pace a écrit : +1 A possible reason for hesitation is that it provides us yet another stream that requires maintainer attention I had been lukewarm on di

Re: [DISCUSS] Apache Arrow Meetup in Europe

2025-03-06 Thread Antoine Pitrou
Hi JB, This is a great idea, I like it. +1 for doing this in Europe, also less risky these days given the geopolitical context. Half a day is probably too short given the breadth of topics. Though, of course, the longer the more difficult to organize (and the more expensive). Regards An

Re: [DISCUSS] Split R release process

2025-03-03 Thread Antoine Pitrou
I agree with Neal that the decoupling is less obviously desirable on the R side. About the number of R-related CI jobs, is there still a need for testing so many different configurations? Le 03/03/2025 à 15:32, Neal Richardson a écrit : Thanks for raising this, Kou. I'm personally torn on

Re: Docs AI bot

2025-02-17 Thread Antoine Pitrou
Hi Nic, Le 17/02/2025 à 12:18, Nic Crane a écrit :> It'd give us useful insights into where we have gaps in our docs where we can improve things, or what are common things that users struggle with. Would Kapa provide us with stats about usage of their service? What do folks think of the i

Re: [MONOREPO] Can we improve our PR template?

2025-02-17 Thread Antoine Pitrou
Hi Kou, Le 17/02/2025 à 07:43, Sutou Kouhei a écrit : Here are some ideas to improve our PR template: 1. Remove them entirely: [...] 2. Keep minimal notes as a normal text not a comment something like: I think our template is useful (it forces us to better describe PRs), so I'm in

Re: [DISCUSS] Proposal to remove unmaintained conda-recipes from Arrow repository

2025-02-10 Thread Antoine Pitrou
+1. I don't think it makes sense to keep them since we are not able to produce nightly builds anymore. Regards Antoine. Le 07/02/2025 à 12:06, Raúl Cumplido a écrit : Hi, In the past we used to run nightly jobs for our conda recipes. The CI jobs were turned off approximately 6 months ago

[ANNOUNCE] New Arrow PMC member: Bryce Mecum

2025-02-05 Thread Antoine Pitrou
Hello, The Project Management Committee (PMC) for Apache Arrow has invited Bryce Mecum to become a PMC member and we are pleased to announce that Bryce has accepted. Congratulations and welcome! Regards Antoine.

Re: [ANNOUNCE] New Arrow committer: Ed Seidl (etseidl)

2025-01-31 Thread Antoine Pitrou
Congratulations and welcome, Ed! Le 29/01/2025 à 11:18, Andrew Lamb a écrit : On behalf of the Arrow PMC, I'm happy to announce that Ed Seidl has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions! Andrew

Re: Arrow board report due January 8

2025-01-07 Thread Antoine Pitrou
Hi Neal, I've tweaked the wording in the community health section a bit, please review! Regards Antoine. Le 06/01/2025 à 22:45, Neal Richardson a écrit : Thanks Andrew, and thanks to everyone else who has added stuff. I went through the dev mailing list to look for notable discussions/vo

Re: [INFO] Arrow 19.0.0 feature freeze - January 6th, 2025

2024-12-19 Thread Antoine Pitrou
Hi Bryce, This sounds good to me. Thanks Antoine. Le 18/12/2024 à 19:08, Bryce Mecum a écrit : Hello all, I'd like to propose a feature freeze date of Monday, January 6th, 2025 for the upcoming 19.0.0 release of Arrow. Please take a look through the milestone [1] to ensure it includes the

Re: When is bit width 0 in dictionary encoded parquet files?

2024-12-19 Thread Antoine Pitrou
Yes, exactly. There's actually a 0-bitwidth example for DELTA_BINARY_PACKED in the spec (see "Example 1"): https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5 Regards Antoine. Le 19/12/2024 à 05:04, Micah Kornfield a écrit : I seem to re
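To see why a bit width of 0 is legitimate in DELTA_BINARY_PACKED, here is a minimal sketch of the width computation for a single block. It is a simplification under stated assumptions: the real encoding splits blocks into miniblocks, each with its own bit width, and the function names `BitWidth` and `DeltaBlockBitWidth` are illustrative, not taken from any Parquet implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (0 when v == 0).
inline int BitWidth(std::uint64_t v) {
  int bits = 0;
  while (v != 0) {
    ++bits;
    v >>= 1;
  }
  return bits;
}

// Simplified: treats all values as one block with a single bit width.
// Steps: compute consecutive deltas, subtract the minimum delta from each,
// and return the width of the largest remaining (non-negative) value.
inline int DeltaBlockBitWidth(const std::vector<std::int64_t>& values) {
  if (values.size() < 2) return 0;
  std::vector<std::int64_t> deltas;
  for (std::size_t i = 1; i < values.size(); ++i) {
    deltas.push_back(values[i] - values[i - 1]);
  }
  const std::int64_t min_delta = *std::min_element(deltas.begin(), deltas.end());
  std::uint64_t max_relative = 0;
  for (std::int64_t d : deltas) {
    // Unsigned subtraction avoids signed overflow for extreme inputs.
    const std::uint64_t rel =
        static_cast<std::uint64_t>(d) - static_cast<std::uint64_t>(min_delta);
    max_relative = std::max(max_relative, rel);
  }
  return BitWidth(max_relative);
}
```

For the values 1, 2, 3, 4, 5 every delta equals the minimum delta, so all relative deltas are 0 and the function returns 0: nothing needs to be stored per value beyond the block header.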

Re: [RUST] Use of Panics

2024-12-18 Thread Antoine Pitrou
Hi, I'm not a Rust user, but I would expect invalid input files to return regular errors, not panic. Unlike API usage errors, invalid input files are not a bug in the calling code. This is also much nicer for bindings in high-level languages such as Python. Regards Antoine. On Tue, 17 Dec 2

Re: [C++] Bump required CMake version

2024-12-09 Thread Antoine Pitrou
My vote is on CMake 3.25. Best regards Antoine. Le 09/12/2024 à 22:08, Sutou Kouhei a écrit : Hi, Currently, we require CMake 3.16 or later: https://github.com/apache/arrow/blob/e0f8c5e8e6f8b328a997f7e21bc6fd1a01b3b3fd/cpp/CMakeLists.txt#L18 cmake_minimum_required(VERSION 3.16) We want

Re: [VOTE] Statistics through the C data interface

2024-12-05 Thread Antoine Pitrou
I don't think a second implementation is strictly necessary because this is just defining a schema and some conventions around it. Though of course a second implementation is always better to have. Regards Antoine. Le 05/12/2024 à 17:47, Matt Topol a écrit : * I implemented this proposal

Re: [VOTE] Statistics through the C data interface

2024-12-05 Thread Antoine Pitrou
Hi, While I'm generally in favor of accepting this soon, I'm -1 on accepting it right now because it seems the PR hasn't had enough review attention on it (I posted some comments). A spec is an important document that will bind us for years, so let's make sure we write something that will

Re: [C++] Arrow S3 filesystem init/finalize

2024-12-02 Thread Antoine Pitrou
Hi, Le 02/12/2024 à 08:12, Jerry Adair a écrit : Hi Weston, Thank you for the reply. IIRC, this is a limitation given to us by the AWS C++ SDK. See [1]. The AWS C++ SDK has static state and they do not manage it with static local variables. As a result, the initialization and finalizati

Re: [ANNOUNCE] New Arrow committer: Laurent Goujon

2024-11-25 Thread Antoine Pitrou
Welcome to the team Laurent! Le 25/11/2024 à 10:39, Raúl Cumplido a écrit : Thanks and welcome Laurent! El lun, 25 nov 2024 a las 10:36, David Li () escribió: On behalf of the Arrow PMC, I'm happy to announce that Laurent Goujon has accepted an invitation to become a committer on Apache A

Re: [VOTE] Split Java release process

2024-11-22 Thread Antoine Pitrou
+1 (binding) Le 22/11/2024 à 02:31, Sutou Kouhei a écrit : Hi, I would like to propose splitting Java release process. Motivation: * We want to reduce needless major releases because major releases require users' change * We want to reduce apache/arrow's release cost Approach: 1. Extract

Re: [DISCUSS] Split Java release process

2024-11-18 Thread Antoine Pitrou
Hi Kou, Thanks a lot for bringing this. I'm +1 on the principle, both for splitting the Java release process and moving the Java implementation into another repository. We do need to find more maintainers for Arrow Java, but that is true regardless of whether the Java implementation stays

Re: [VOTE] Add Async C Data Interface

2024-10-25 Thread Antoine Pitrou
+1, with the same comments as Felipe and Dewey. Just at one condition from me: the API should be marked experimental. Regards Antoine. Le 24/10/2024 à 23:17, Felipe Oliveira Carvalho a écrit : +1 from me. I reviewed the PR some time ago and it's not a trivial protocol, but the complexity

Re: [VOTE] Release Apache Arrow 18.0.0 - RC0

2024-10-25 Thread Antoine Pitrou
I also agree that letting conda-forge carry the patch until 19.0.0 is a reasonable solution. It's much more light-weight than having us issue a new RC just for it, unfortunately. Regards Antoine. Le 24/10/2024 à 17:07, Raúl Cumplido a écrit : El jue, 24 oct 2024 a las 0:14, Sutou Kouhei

Re: [ANNOUNCE] New Arrow committer: Rossi Sun

2024-10-22 Thread Antoine Pitrou
Welcome Rossi, and thanks a lot for all your contributions, past and future! Le 22/10/2024 à 21:02, Weston Pace a écrit : On behalf of the Arrow PMC, I'm happy to announce that Rossi Sun has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contribut

Re: Community Over Code NA next week - Data Engineering track (with Security twist)

2024-10-04 Thread Antoine Pitrou
I see that there's a European variant of that event which seems more adapted for at least some of the Arrow development community: https://eu.communityovercode.org/ Le 04/10/2024 à 10:50, Raúl Cumplido a écrit : Hi Jarek, It seems really interesting, I won't be able to attend. Do you know

[Discuss][C++] Deprecate precompiled headers option?

2024-10-02 Thread Antoine Pitrou
Hello, Long ago, we added a ARROW_USE_PRECOMPILED_HEADERS to the Arrow C++ CMake options in the hope of speeding up builds by reducing C++ header parsing time. However, we later started to use a concurrent (*) solution added in CMake itself: CMAKE_UNITY_BUILD, which merges batches of sourc

Re: [ANNOUNCE] New Arrow committer: Will Ayd

2024-10-01 Thread Antoine Pitrou
Hello Will, and thanks a lot for your involvement! Le 01/10/2024 à 18:55, Dewey Dunnington a écrit : On behalf of the Arrow PMC, I'm happy to announce that Will Ayd has accepted an invitation to become a committer on Apache Arrow. Welcome, and thank you for your contributions! -dewey

Re: [DISCUSS][C++] Can we use "0E+1" not "0.E+1" for decimal for broader compatibility?

2024-10-01 Thread Antoine Pitrou
Hi Kou, That sounds fine to me. Regards Antoine. Le 01/10/2024 à 03:55, Sutou Kouhei a écrit : Hi, The current decimal implementation omits the fractional part if the fractional part is 0. For example: "0.E+1" not "0.0E+1" Most environments such as Python, Node.js, PostgreSQL and MySQL a

Re: [CROWDSOURCING] Arrow board report due October 9

2024-09-30 Thread Antoine Pitrou
*they receive Le 30/09/2024 à 11:57, Antoine Pitrou a écrit : There might be a misunderstanding, but this is a report for the Apache Software Foundation (they recent reports from hundreds of projects). It's not really useful to copy our release notes there. Regards Antoine. Le

Re: [CROWDSOURCING] Arrow board report due October 9

2024-09-30 Thread Antoine Pitrou
There might be a misunderstanding, but this is a report for the Apache Software Foundation (they recent reports from hundreds of projects). It's not really useful to copy our release notes there. Regards Antoine. Le 30/09/2024 à 11:46, Vibhatha Abeykoon a écrit : Hi Andy, Thanks for sha

Re: [DISCUSS][Flight] Improved Arrow Flight as alternative to Iceberg for DB--engine interop

2024-09-13 Thread Antoine Pitrou
se's (Databend, Doris, Druid, DeepLake, Firebolt, Lance, Oxla, Pinot, QuestDB, SingleStore, etc.) native at-rest partition file formats. On Fri, 13 Sept 2024 at 16:43, Antoine Pitrou wrote: Hello, I'm perplexed by this discussion. If you want to send highly-compressed files over

Re: [DISCUSS][Flight] Improved Arrow Flight as alternative to Iceberg for DB--engine interop

2024-09-13 Thread Antoine Pitrou
Hello, I'm perplexed by this discussion. If you want to send highly-compressed files over the network that is already possible: just send Parquet files via HTTP(S) (or another protocol of choice). Arrow Flight is simply a *streaming* protocol that allows sending/requesting the Arrow format over

Re: [DISCUSS][C++] Should we disallow storage account key in Azure file system URL?

2024-09-12 Thread Antoine Pitrou
Hi, I sympathize with the security argument. If no other library allows for embedding the Azure password directly in the URL, then I would be ok for deprecating it. Regards Antoine. Le 10/09/2024 à 03:24, Sutou Kouhei a écrit : Hi, The current Azure file system URI accepts account key

Re: [DISCUSS] Monorepo GitHub workflow: allow one issue with multiple PRs

2024-09-12 Thread Antoine Pitrou
Hi, I don't have a specific opinion on this, but as a data point, this already happens from time to time (though rarely). Regards Antoine. Le 11/09/2024 à 17:32, Joris Van den Bossche a écrit : Hi all, This is a discussion specifically for the GitHub development workflow we use in the m

Re: [VOTE] Allow Decimal32 and Decimal64 bitwidths in Arrow Format

2024-09-05 Thread Antoine Pitrou
+1 (binding). Can you open a PR with the spec updates? Regards Antoine. Le 04/09/2024 à 23:17, Matt Topol a écrit : Based on various discussions among the ecosystem and to continue expanding the zero-copy interoperability for Arrow to be used with different libraries and databases (such as

Re: [DISCUSS][C++] Indent #if (preprocessor directives)

2024-08-28 Thread Antoine Pitrou
Is there a way to ensure this is done automatically? Regards Antoine. On Wed, 28 Aug 2024 10:05:45 +0900 (JST) Sutou Kouhei wrote: > Hi, > > How about indenting preprocessor directives for readability? > > Issue: https://github.com/apache/arrow/issues/43796 > PR: https://github.com/apache

Re: [VOTE] Split Go release process

2024-08-26 Thread Antoine Pitrou
+1 (binding) Le 26/08/2024 à 04:37, Sutou Kouhei a écrit : Hi, I would like to propose splitting Go release process. Motivation: * We want to reduce needless major releases because major releases require users' change Approach: 1. Extract go/ in apache/arrow to apache/arrow-go like a

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Le 22/08/2024 à 17:08, Curt Hagenlocher a écrit : (I also happen to want a canonical Arrow representation for variant data, as this type occurs in many databases but doesn't have a great representation today in ADBC results. That's why I filed [Format] Consider adding an official variant type

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
u, Aug 22, 2024 at 3:51 PM Antoine Pitrou wrote: Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to unde

Re: [DISCUSS] Variant Spec Location

2024-08-22 Thread Antoine Pitrou
Hi Gang, Sorry, but can you give a pointer to the start of this discussion thread in a readable format (for example a mailing-list archive)? It appears that dev@arrow wasn't cc'ed from the start and that can make it difficult to understand what this is about. Regards Antoine. Le 22/08/2

Re: [VOTE][Format] Bool8 Canonical Extension Type

2024-08-05 Thread Antoine Pitrou
Binding +1 (but posted one minor comment on the format PR). Thank you Joel! Regards Antoine. Le 05/08/2024 à 14:59, Joel Lubinitsky a écrit : Hello Devs, I would like to propose a new canonical extension type: Bool8 The prior mailing list discussion thread can be found at [1]. The format

Re: [DISCUSS][Acero] Upgrading to 64-bit row offsets in row table

2024-08-05 Thread Antoine Pitrou
I don't have any concrete data to test this against, but using 64-bit offsets sounds like an obvious improvement to me. Regards Antoine. Le 01/08/2024 à 13:05, Ruoxi Sun a écrit : Hello everyone, We've identified an issue with Acero's hash join/aggregation, which is currently limited to

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-22 Thread Antoine Pitrou
Le 22/07/2024 à 21:25, Joel Lubinitsky a écrit : If Canonical Extensions had existed at the time, I think there's a chance we may have ended up with int32 Date as a first class type and int64 MillisecondDate as a Canonical Extension type. Agreed. Are there any lessons we've learned from im

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
I can't > find now that new types should be implemented as extension types if > possible for these (and perhaps other) reasons. > > > On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou wrote: > > > > > > Agreed with Felipe. This is meant for communicating with no

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-19 Thread Antoine Pitrou
out any provisions on the specification that might make this impossible. -dewey [1] https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < anto...@python.org>

Re: [DISCUSS] Split Go release process

2024-07-18 Thread Antoine Pitrou
Hi Kou, Le 18/07/2024 à 11:33, Sutou Kouhei a écrit : Here is my idea how to proceed this: 1. Extract go/ in apache/arrow to apache/arrow-go like apache/arrow-rs * Filter go/ related commits from apache/arrow and create apache/arrow-go with them like we did for apache/arrow-rs

Re: [Discuss][C++] Switch to mimalloc by default?

2024-07-16 Thread Antoine Pitrou
Hello, Thanks all for this discussion. Given that there was no strong argument against doing this, I decided to move forward and the change was made in https://github.com/apache/arrow/pull/40875 Regards Antoine. On Wed, 5 Jun 2024 17:18:36 +0200 Antoine Pitrou wrote: > Hello, > >

Re: Understanding possible synergies between arrow & zarr communities?

2024-07-16 Thread Antoine Pitrou
Hi Carl, Le 08/07/2024 à 18:43, Carl Boettiger a écrit : As an observer to both communities, I'm interested in if there is or might be more communication between the Pangeo community's focus on Zarr serialization with what the Arrow team has done with Parquet. I recognize that these are diff

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

2024-07-16 Thread Antoine Pitrou
Hi Joel, This looks good to me on the principle. Can you split the spec and the implementation(s) into separate PRs? Regards Antoine. Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : Hi Arrow devs, I'm working on adding an extension type for 8-bit booleans, and wanted to start a discuss

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-16 Thread Antoine Pitrou
[1]: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html # -- # Aldrin https://github.com/drin/ https://gitlab.com/octalene https://keybase.io/octalene On Monday, July 15th, 2024 at 07:59, Antoine Pitrou wrote: No, because these marke

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-15 Thread Antoine Pitrou
No, because these markers also communicate the information to other implementations of S3 abstractions. An example of this is: https://docs.cyberduck.io/protocols/s3/#folders Regards Antoine. Le 13/07/2024 à 07:15, Aldrin a écrit : ...then I still expect the directory /foo to exist Rig

Re: [DISCUSS][C++] Empty directory marker creation in S3FileSystem

2024-07-12 Thread Antoine Pitrou
Hi, Le 12/07/2024 à 12:21, Hyunseok Seo a écrit : *### Why Maintain Empty Directory Markers?* From what I understand, object stores like S3 do not have a concept of directories. The motivation behind maintaining these markers could be to manage the object store as if it were a traditional fi

Re: [DISCUSS] Statistics through the C data interface

2024-07-01 Thread Antoine Pitrou
Hmmm, I strive to understand why a `(int32, utf8)` tuple for statistic keys would be any simpler to implement than either `int32` *or* `utf8` *or* `dictionary(int32, utf8)`. Let's keep in mind that we would like to keep things simple for consumers and producers of statistics. We should al

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
Is this UDF implementation based on DataFusion? If so, it makes sense for it to be part of the DataFusion project. OTOH, if it can work with any data in the Arrow format, then it would sound weird to maintain it in the DataFusion repo IMHO. Regards Antoine. Le 28/06/2024 à 21:52, Andrew

Re: [DISCUSS] Donation of a User-Defined Function Framework for Apache Arrow

2024-06-28 Thread Antoine Pitrou
I'll note that PyArrow also allows defining user-defined functions and they are vectorized (the function arguments can be PyArrow arrays or scalars, depending on the context in which a function is being executed): https://arrow.apache.org/docs/python/compute.html#user-defined-functions My vo

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-12 Thread Antoine Pitrou
Le 12/06/2024 à 04:45, Sutou Kouhei a écrit : It seems that we need to disable MI_OVERRIDE explicitly to not define malloc() in libmimalloc.so: https://github.com/microsoft/mimalloc/blob/03020fbf81541651e24289d2f7033a772a50f480/CMakeLists.txt#L10 Yes, that's what we do when building the bund

Re: Unsupported/Other Type

2024-06-11 Thread Antoine Pitrou
Sorry, I had forgotten to comment on this. I think this is generally a good idea, but it would obviously need more eyes on it :-) Can other people go and take a look at David's PR below? Le 25/05/2024 à 04:47, David Li a écrit : I've put up a draft PR here: https://github.com/apache/arrow/

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
Le 11/06/2024 à 10:35, Sutou Kouhei a écrit : Hi, In <2a32f61c-dd22-4f3f-bc98-822dcb6b0...@python.org> "Re: [Discuss][C++] Switch to mimalloc by default?" on Tue, 11 Jun 2024 10:21:12 +0200, Antoine Pitrou wrote: I was thinking about find_package(). Good to know

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-11 Thread Antoine Pitrou
Le 11/06/2024 à 10:01, Sutou Kouhei a écrit : 2. Is it OK that we add support for system mimalloc? Hmm... that sounds legitimate, but with the caveat that a system mimalloc can override the standard malloc/free functions. Would that affect an application using Arrow C++? Are you saying th

Re: [Discuss][C++] Switch to mimalloc by default?

2024-06-10 Thread Antoine Pitrou
Hi Kou, Le 09/06/2024 à 09:16, Sutou Kouhei a écrit : Questions: 1. Do we need to keep jemalloc support? Compatibility? Can we drop support for jemalloc to decrease maintenance cost? I'm not sure there's much maintenance cost. I expect some people might prefer jemalloc, and perhaps it

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
Le 09/06/2024 à 08:33, Sutou Kouhei a écrit : Fields: | Name | Type | Comments | ||---| | | column | utf8 | (2) | | key| utf8 not null | (3) | 1. Should the key be

Re: [DISCUSS] Statistics through the C data interface

2024-06-09 Thread Antoine Pitrou
Le 09/06/2024 à 09:01, Sutou Kouhei a écrit : Hi, One thing that a plain integer makes more difficult is representing non-standard statistics. For example some engine might want to expose elaborate quantile-based statistics even if it not officially defined here. With a `utf8` or `dictionary(

Re: [DISCUSS] Statistics through the C data interface

2024-06-08 Thread Antoine Pitrou
Le 07/06/2024 à 18:30, Felipe Oliveira Carvalho a écrit : On Fri, Jun 7, 2024 at 6:24 AM Antoine Pitrou wrote: Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit : I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by

Re: [DISCUSS] Statistics through the C data interface

2024-06-07 Thread Antoine Pitrou
Le 07/06/2024 à 04:27, Felipe Oliveira Carvalho a écrit : I've been thinking about how to encode statistics on Arrow arrays and how to keep the set of statistics known by both producers and consumers (i.e. standardized). The statistics array(s) could be a map< // the column index or n

Re: [DISCUSS] Statistics through the C data interface

2024-06-06 Thread Antoine Pitrou
Hi Kou, Thanks for pushing for this! Le 06/06/2024 à 11:27, Sutou Kouhei a écrit : 4. Standardize Apache Arrow schema for statistics and transmit statistics via separated API call that uses the C data interface [...] I think that 4. is the best approach in these candidates. I agr

[Discuss][C++] Switch to mimalloc by default?

2024-06-05 Thread Antoine Pitrou
Hello, Arrow C++ features a MemoryPool abstraction that allows using different allocators interchangeably. Several MemoryPool implementations are provided with Arrow C++ (though one can also build their own): - a jemalloc-based implementation, currently the default on Linux - a mimalloc-bas

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-06-04 Thread Antoine Pitrou
(Gang Wu, Antoine Pitrou, Wes McKinney) 9x +1 non-binding (Micah Kornfield, Felipe Oliveira Carvalho, Fokko Driesprong, Alenka Frim, Andy Grove, Raúl Cumplido, Sutou Kouhei, Jiashen Zhang, Rok Mihevc) Arrow: 6x +1 binding (Micah Kornfield, Antoine Pitrou, Andy Grove, Raúl Cumplido, Wes McKinney

Re: [C++] Thread deadlock in ObjectOutputStream

2024-05-29 Thread Antoine Pitrou
Hi Li! Sorry for the delay. It seems the problem lies here: https://github.com/apache/arrow/blob/9f5899019d23b2b1eae2fedb9f6be8827885d843/cpp/src/arrow/filesystem/s3fs.cc#L1858 The Future is marked finished with the ObjectOutputStream's mutex taken, and the Future's callback then triggers a c
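The general hazard described above, completing a future-like object while a mutex is held so that its callback runs under that lock, can be sketched generically. This is not Arrow's `Future` or `ObjectOutputStream` code; `Notifier` and its members are hypothetical names, and the sketch only shows the usual remedy of releasing the lock before invoking callbacks.

```cpp
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object that owns a mutex and runs callbacks
// on completion. Not Arrow's actual Future or ObjectOutputStream.
class Notifier {
 public:
  void AddCallback(std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.push_back(std::move(cb));
  }

  // Safe variant: update state under the lock, then run callbacks after
  // releasing it. Running them inside the locked scope would deadlock as
  // soon as a callback calls finished() or AddCallback(), because
  // std::mutex is not recursive.
  void MarkFinished() {
    std::vector<std::function<void()>> to_run;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      finished_ = true;
      to_run.swap(callbacks_);
    }
    for (auto& cb : to_run) cb();  // mutex_ is no longer held here
  }

  bool finished() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return finished_;
  }

 private:
  mutable std::mutex mutex_;
  bool finished_ = false;
  std::vector<std::function<void()>> callbacks_;
};
```

A callback that re-enters the object, for example one that reads `finished()`, completes normally with this ordering, whereas the same callback invoked inside the locked scope would block forever.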

Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Antoine Pitrou
+1 (binding). Thanks for taking this up, Rok! Regards Antoine. Le 29/05/2024 à 16:14, Rok Mihevc a écrit : # sending this to both dev@arrow and dev@parquet Hi all, Following the ML discussion [1] I would like to propose a vote for parquet-cpp issues to be moved from Parquet Jira [2] to Arr

Re: [DISCUSS] Apache Arrow LinkedIn page

2024-05-24 Thread Antoine Pitrou
Is it somehow possible to be a "member" of this account to indicate that we have PMC status, or is that not possible within the LinkedIn membership/permissions model? Le 24/05/2024 à 18:04, Ian Cook a écrit : Following the discussion [1] earlier this year about the status of the Apache Ar

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
> 2. We'll provide pre-defined keys such as "max", "min", "byte_width" and "distinct_count" but users can also use application-specific keys. > 3. If true, then the value is approximate or best-effort.

Re: [DISCUSS] Statistics through the C data interface

2024-05-23 Thread Antoine Pitrou
Le 23/05/2024 à 16:09, Felipe Oliveira Carvalho a écrit : Protocols that produce/consume statistics might want to use the C Data Interface as a primitive for passing Arrow arrays of statistics. This is also my opinion. I think what we are slowly converging on is the need for a spec to desc

Re: [DISCUSS] Statistics through the C data interface

2024-05-22 Thread Antoine Pitrou
Hi Kou, I agree with Dewey that this is overstretching the capabilities of the C Data Interface. In particular, stuffing a pointer as metadata value and decreeing it immortal doesn't sound like a good design decision. Why not simply pass the statistics ArrowArray separately in your produce

Re: [DISCUSSION] New Flags for Arrow C Interface Schema

2024-05-14 Thread Antoine Pitrou
I think these flags should be advisory and consumers should be free to ignore them. However, some consumers apparently would benefit from them to more faithfully represent the producer's intention. For example, in Arrow C++, we could perhaps have an ImportDatum function whose actual return t

Re: [VOTE][Format] UUID canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) Le 19/04/2024 à 22:22, Rok Mihevc a écrit : Hi all, Following initial requests [1][2] and recent tangential ML discussion [3] I would like to propose a vote to add language for UUID canonical extension type to CanonicalExtensions.rst as in PR [4] and written below. A draft C++ and

Re: [VOTE][Format] JSON canonical extension type

2024-04-30 Thread Antoine Pitrou
+1 (binding) for the current proposal, i.e. with the RFC 8259 requirement and the 3 current String types allowed. Regards Antoine. Le 30/04/2024 à 19:26, Rok Mihevc a écrit : Hi all, thanks for the votes and comments so far. I've amended [1] the proposed language with the RFC-8259 requiremen

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
mes, and so we could use this in that context). I think that I would still prefer a canonical extension type (with storage type null) over a new dedicated type. On Wed, Apr 17, 2024 at 5:39 AM Antoine Pitrou wrote: Ah! Well, I think this could be an interesting proposal, but someone should

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
Ah! Well, I think this could be an interesting proposal, but someone should put together a more formal proposal, perhaps as a draft PR. Regards Antoine. Le 17/04/2024 à 11:57, David Li a écrit : For an unsupported/other extension type. On Wed, Apr 17, 2024, at 18:32, Antoine Pitrou wrote: What

Re: AW: Personal feedback on your last release on Apache Arrow ADBC 0.11.0

2024-04-17 Thread Antoine Pitrou
Out of curiosity, did you notice this by chance or do you have some kind of script that processes ASF mailing-list archives for possible voting irregularities? Regards Antoine. Le 17/04/2024 à 10:44, Christofer Dutz a écrit : When looking at whimsy, I can’t see any person named Sutou Kou

Re: Unsupported/Other Type

2024-04-17 Thread Antoine Pitrou
eation of one-off nominal types for very specific use-cases? — Felipe On Thu, 11 Apr 2024 at 05:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards A

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
:06 Antoine Pitrou wrote: Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a

Re: Unsupported/Other Type

2024-04-11 Thread Antoine Pitrou
Yes, JSON and UUID are obvious candidates for new canonical extension types. XML also comes to mind, but I'm not sure there's much of a use case for it. Regards Antoine. Le 10/04/2024 à 22:55, Wes McKinney a écrit : In the past we have discussed adding a canonical type for UUID and JSON.

Re: [RFC] Enabling data frames in disaggregated shared memory

2024-04-10 Thread Antoine Pitrou
Hello John, Arrow IPC files can be backed quite naturally by shared memory, simply by memory-mapping them for reading. So if you have some pieces of shared memory containing Arrow IPC files, and they are reachable using a filesystem mount point, you're pretty much done. You can see an exam
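
A minimal sketch of the memory-mapping idea above, using only the Python standard library. The file written here is a stand-in blob, not a valid Arrow IPC file, and the temporary directory stands in for a shared-memory mount point; `write_blob`, `map_readonly` and `peek` are hypothetical helper names. With pyarrow installed, the corresponding step would presumably be `pyarrow.memory_map(path)` followed by `pyarrow.ipc.open_file(...)`.

```python
import mmap
import os
import tempfile

# Stand-in payload: real Arrow IPC files begin with the "ARROW1" magic,
# but the rest of this blob is NOT a valid IPC file.
PAYLOAD = b"ARROW1\x00\x00" + b"stand-in body, not real record batches"


def write_blob(path, payload):
    """Producer side: write the bytes to a path under the shared mount."""
    with open(path, "wb") as f:
        f.write(payload)


def map_readonly(path):
    """Consumer side: map the file read-only instead of reading it into memory."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def peek(path, n):
    """Return the first n bytes of the mapped file (slicing an mmap copies them)."""
    mapped = map_readonly(path)
    try:
        return mapped[:n]
    finally:
        mapped.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as mount_point:
        path = os.path.join(mount_point, "table.arrow")
        write_blob(path, PAYLOAD)
        print(peek(path, 6))
```

Only the pages that are actually touched get faulted in, which is why a reader process can open a large file on a shared mount without copying it up front.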

Re: [DISCUSS] Versioning and releases for apache/arrow components

2024-04-09 Thread Antoine Pitrou
It seems that perhaps this discussion should be rebooted for each individual component, one at a time? Let's start with something simple and obvious, with some frequent contribution activity, such as perhaps Go? Le 09/04/2024 à 14:27, Joris Van den Bossche a écrit : I am also in favor o
