Re: Turn a vector of Scalar to an Array/ArrayData of the same datatype

2023-06-15 Thread Jin Shang
Hi Li, I've faced this issue before, and I ended up using a generic ArrayBuilder, for example: ```cpp auto type = int32(); std::vector<std::shared_ptr<Scalar>> scalars = {MakeScalar(1), MakeScalar(2)}; ARROW_ASSIGN_OR_RAISE(std::unique_ptr<ArrayBuilder> builder, MakeBuilder(type)); ARROW_RETURN_NOT_OK(builder->AppendScalars(scalars)

Re: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Gang Wu
+ dev@parquet On Fri, Jun 16, 2023 at 7:43 AM Jacob Wujciak-Jens wrote: > +1 on the update but also on properly communicating the change to avoid > surprising issues :) > > On Thu, Jun 15, 2023 at 7:53 PM Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > > > On Thu, 15 Jun 2023 at

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Gang Wu
Hi Ben, The posted benchmark [1] looks pretty good to me. However, I want to raise a possible issue from the perspective of parquet-cpp. Parquet-cpp uses a customized parquet::ByteArray type [2] for string/binary, I would expect some regression of conversions between parquet reader/writer and the

Re: pyarrow Table.from_pylist doesn't release memory

2023-06-15 Thread Weston Pace
Note that you can ask pyarrow how much memory it thinks it is using with the pyarrow.total_allocated_bytes[1] function. This can be very useful for tracking memory leaks. I see that memory-profiler now has support for different backends. Sadly, it doesn't look like you can register a custom backe

Re: [DISCUSS][C++] Can we require CMake 3.16+ since 13.0.0?

2023-06-15 Thread Sutou Kouhei
Hi, Ah, sorry. I should have written it in the original e-mail. If we can require CMake 3.16+: * We can always use the precompiled headers feature that reduces build time: https://github.com/apache/arrow/pull/35921/files#diff-1bba462ab050e89360fd88110a689e85ee037749cea091a1848ab574381d3795L

[VOTE] Release Apache Arrow ADBC 0.5.0 - RC0

2023-06-15 Thread David Li
Hello, I would like to propose the following release candidate (RC0) of Apache Arrow ADBC version 0.5.0. This is a release consisting of 36 resolved GitHub issues [1]. This release candidate is based on commit: ac0e0ef8bd83787f65e53d421fce6ad490d9a37d [2] The source release rc0 is hosted at [

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Jacob Wujciak-Jens
> Even if ListView is rarely used for interoperability (if it never gains wide adoption), some of the arrow implementations could use ListView to offer faster computation kernels, which I think has real value This is an important point, thanks for the clear phrasing Andrew! On Thu, Jun 15, 2023 a

Re: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Jacob Wujciak-Jens
+1 on the update but also on properly communicating the change to avoid surprising issues :) On Thu, Jun 15, 2023 at 7:53 PM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > On Thu, 15 Jun 2023 at 19:08, Ian Cook wrote: > > > > It will still be possible to write files using Parquet

Re: [DISCUSS][C++] Can we require CMake 3.16+ since 13.0.0?

2023-06-15 Thread Jacob Wujciak-Jens
+1 on 3.16 and dropping Amazon Linux 2 (as that is recommended by AWS). @Antoine: 3.14+ has a number of improvements to FetchContent that we could use to vastly improve our bundled dependency system. There are also improvements to precompiled headers etc. An overview of some of the changes in each

Proposal to move the benchmark executables into a separate build subdir

2023-06-15 Thread Anja
Hello! The benchmark executables are placed in the same directory as the other test executables: https://github.com/apache/arrow/blob/b4ac585ecb4da610cc64e346e564ca86594aec53/cpp/cmake_modules/BuildUtils.cmake#L614. This means that if somebody builds the benchmarks with `ARROW_BUILD_BENCHMARK=ON

Re: [DISCUSS][C++] Can we require CMake 3.16+ since 13.0.0?

2023-06-15 Thread Antoine Pitrou
Hi, I'd ask the question differently: what do we gain from requiring 3.16 rather than 3.13? On 15/06/2023 at 23:19, Sutou Kouhei wrote: Hi, We require CMake 3.5+ now because Ubuntu 18.04 ships 3.5. We dropped support for Ubuntu 18.04 because it reached EOL. Can we require CMake 3.16+ i

Turn a vector of Scalar to an Array/ArrayData of the same datatype

2023-06-15 Thread Li Jin
Hi, I find myself in need of a function to turn a vector of Scalar to an Array of the same datatype. The data type is known at runtime. e.g. shared_ptr<Array> concat_scalars(vector<shared_ptr<Scalar>> values, shared_ptr<DataType> type); I wonder if I need to use sth like Scalar::Accept(ScalarVisitor*) or is there an easier/bett

[DISCUSS][C++] Can we require CMake 3.16+ since 13.0.0?

2023-06-15 Thread Sutou Kouhei
Hi, We require CMake 3.5+ now because Ubuntu 18.04 ships 3.5. We dropped support for Ubuntu 18.04 because it reached EOL. Can we require CMake 3.16+ in Apache Arrow C++ 13.0.0? Here are CMake versions of our supported platforms: * Ubuntu 20.04: CMake 3.16 * CentOS 7: CMake 3.17 * Debian GNU/Lin

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Cool. Thanks for doing that! On Thu, Jun 15, 2023 at 12:40 Benjamin Kietzman wrote: > I've added https://github.com/apache/arrow/issues/36112 to track > deduplication of buffers on write. > I don't think it would require modification of the IPC format. > > Ben > > On Thu, Jun 15, 2023 at 1:30 PM

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
I've added https://github.com/apache/arrow/issues/36112 to track deduplication of buffers on write. I don't think it would require modification of the IPC format. Ben On Thu, Jun 15, 2023 at 1:30 PM Matt Topol wrote: > Based on my understanding, in theory a buffer *could* be shared within a > b

Re: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Joris Van den Bossche
On Thu, 15 Jun 2023 at 19:08, Ian Cook wrote: > > It will still be possible to write files using Parquet 2.4 by > explicitly specifying the 2.4 version to the Parquet writer, correct? > If yes, that provides a simple workaround for users who encounter > compatibility issues. Indeed. Using the pya

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Matt Topol
Based on my understanding, in theory a buffer *could* be shared within a batch since the flatbuffers message just uses an offset and length to identify the buffers. That said, I don't believe any current implementation actually does this or takes advantage of this in any meaningful way. --Matt O

RE: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread wish maple
On 2023/06/15 16:24:44 Joris Van den Bossche wrote: > Hi all, > > Bringing up https://github.com/apache/arrow/issues/35746 to the > mailing list: this issue proposes to bump the default Parquet version > we use for writing to Parquet files in the C++ library (and in the > various bindings including

Re: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Ian Cook
It will still be possible to write files using Parquet 2.4 by explicitly specifying the 2.4 version to the Parquet writer, correct? If yes, that provides a simple workaround for users who encounter compatibility issues. However we should take care to document this as a potentially breaking change,

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Will Jones
Hi Ben, It's exciting to see this move along. The buffers will be duplicated. If buffer duplication becomes a concern, > I'd prefer to handle > that in the ipc writer. Then buffers which are duplicated could be detected > by checking > pointer identity and written only once. Question: to be
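
The idea quoted in the preview above (detect duplicated buffers by pointer identity and write them only once) can be sketched roughly as follows. This is a minimal illustration and not Arrow's IPC writer: `Buffer`, `BufferSlot`, `WrittenBatch`, and `WriteDeduplicated` are hypothetical names, and the "body" is just an in-memory byte vector. It also assumes every input pointer is non-null.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a data buffer; this is NOT arrow::Buffer.
using Buffer = std::vector<std::uint8_t>;

// Where one input buffer lives inside the written body.
struct BufferSlot {
  std::size_t offset;
  std::size_t length;
};

struct WrittenBatch {
  std::vector<std::uint8_t> body;  // bytes actually written
  std::vector<BufferSlot> slots;   // one entry per input buffer, in input order
};

// Writes each distinct buffer once. A buffer counts as a duplicate only if
// the same pointer appears again (pointer identity, not content equality).
// Assumes all pointers in `buffers` are non-null.
inline WrittenBatch WriteDeduplicated(
    const std::vector<std::shared_ptr<Buffer>>& buffers) {
  WrittenBatch out;
  std::unordered_map<const Buffer*, BufferSlot> seen;
  for (const auto& buf : buffers) {
    auto it = seen.find(buf.get());
    if (it == seen.end()) {
      // First time this pointer is seen: append its bytes to the body.
      BufferSlot slot{out.body.size(), buf->size()};
      out.body.insert(out.body.end(), buf->begin(), buf->end());
      it = seen.emplace(buf.get(), slot).first;
    }
    // Repeated pointers reuse the offset/length recorded the first time.
    out.slots.push_back(it->second);
  }
  return out;
}
```

Because each slot is only an offset and a length, two slots can point at the same bytes without any change to how a slot is described; whether a real reader or writer would accept that is a question for the thread itself.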

[Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Joris Van den Bossche
Hi all, Bringing up https://github.com/apache/arrow/issues/35746 to the mailing list: this issue proposes to bump the default Parquet version we use for writing to Parquet files in the C++ library (and in the various bindings including pyarrow and R arrow) from the current default of "2.4" to "2.6

Re: pyarrow Table.from_pylist doesn't release memory

2023-06-15 Thread Antoine Pitrou
Hi Alex, I think you're misinterpreting the results. Yes, the RSS memory (as reported by memory_profiler) doesn't seem to decrease. No, it doesn't mean that Arrow doesn't release memory. It's actually common for memory allocators (such as jemalloc, or the system allocator) to keep deallocat

Re: [DISCUSS][Format] Draft implementation of string view array format

2023-06-15 Thread Benjamin Kietzman
Hello again all, The PR [1] to add string view to the format and the C++ implementation is hovering around passing CI and has been undrafted. Furthermore, there is now also a PR [2] to add string view to the Go implementation. Code review is underway for each PR and I'd like to move toward a vote

Re: pyarrow Table.from_pylist doesn't release memory

2023-06-15 Thread Jerald Alex
Hi Experts, I have come across the memory pool configurations using an environment variable *ARROW_DEFAULT_MEMORY_POOL* and I tried to make use of them and test it. I could observe improvements on macOS with the *system* memory pool but no change on Linux. I have captured more details on GH is

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Felipe Oliveira Carvalho
On Wed, Jun 14, 2023 at 5:07 PM Raphael Taylor-Davies wrote: > Even something relatively straightforward becomes a huge implementation > effort when multiplied by a large number of codebases, users and > datasets. Parquet is a great source of historical examples of the > challenges of incremental

[RESULT][VOTE][RUST][DataFusion] Release DataFusion Python Bindings 26.0.0 RC1

2023-06-15 Thread Andy Grove
On Thu, Jun 15, 2023 at 7:19 AM Andy Grove wrote: > The vote passes with 4 +1 votes (3 binding). Thanks, everyone. > > Source: > https://dist.apache.org/repos/dist/release/arrow/arrow-datafusion-python-26.0.0 > > PyPi: https://pypi.org/project/datafusion/26.0.0/ > > On Mon, Jun 12, 2023 at 6:26 A

Re: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 26.0.0 RC1

2023-06-15 Thread Andy Grove
The vote passes with 4 +1 votes (3 binding). Thanks, everyone. Source: https://dist.apache.org/repos/dist/release/arrow/arrow-datafusion-python-26.0.0 PyPi: https://pypi.org/project/datafusion/26.0.0/ On Mon, Jun 12, 2023 at 6:26 AM Jeremy Dyer wrote: > +1 (non-binding) > > Verified using veri

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-15 Thread Andrew Lamb
I want to be clear, insofar that ListView makes using the arrow libraries more attractive to system developers, I am in favor of adding it. Arrow the specification is focused on interoperability. Arrow the libraries (specifically the compute kernels included in many implementations) also offer fas