Problem reading parquet written with pyarrow=2.0.0 using pyarrow=8.0.0 (when using use_dictionary with ParquetWriter)

2022-06-14 Thread Niklas Bivald
Hi, I’m experiencing a problem reading parquet files written with the `use_dictionary=[]` option in pyarrow 2.0.0. If I write a parquet file in 2.0.0, reading it in 8.0.0 gives: >>> pd.read_parquet('dataset.parq') > Traceback (most recent call last): > File "", line 1, in > File > "/Library/Fra

[RESULT][VOTE] Mark C Stream Interface as Stable

2022-06-14 Thread Will Jones
With 6 binding +1s (and 2 non-binding), we've approved marking the C stream interface as stable. I will move forward with the pull requests to update the documentation. On Thu, Jun 9, 2022 at 2:19 PM Neal Richardson wrote: > +1 > > On Wed, Jun 8, 2022 at 7:44 PM Sutou Kouhei wrote: > > > +1 > >

Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
Hello, This comment is regarding installation with `apt` on Ubuntu 18.04 ... `libarrow-dev/bionic,now 8.0.0-1 amd64` I'm a bit confused about the memory pool situation: * I run with `ARROW_DEFAULT_MEMORY_POOL=system` and check that `arrow::default_memory_pool()->backend_name() == arrow::system_m

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
A code review has demonstrated that Arrow uses posix_memalign ... I do believe mimalloc preload is "catching" this but I didn't tool it with my customization. Still interested in any guidance on the other points raised, and sorry for some of this being noise. -John On Tue, Jun 14, 2022 at 9:06 A

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Weston Pace
I can try and give a more detailed answer later in the week but the gist of it is that Arrow manages all "buffer allocations" with a memory pool. These are the allocations for the actual data in the arrays. These are the allocations that use the memory pool configured by ARROW_DEFAULT_MEMORY_POOL

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Weston Pace
Sorry, that should have said "when Arrow builds jemalloc". Here is the command we send down (from ThirdPartyToolchain.cmake): ``` JEMALLOC_CONFIGURE_COMMAND "--prefix=${JEMALLOC_PREFIX}" "--libdir=${JEMALLOC_LIB_DIR}" "--with-jemalloc-prefix=je_arrow_" "--with-private-namespace=je_arrow_private_"

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
I take that back... the preload is not intercepting memory_pool.cc -> SystemAllocator -> AllocateAligned -> posix_memalign (if indeed this is the system allocator path), although it is intercepting posix_memalign from a different .so On Tue, Jun 14, 2022 at 10:27 AM John Muehlhausen wrote: > A c

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
I'm using ARROW_DEFAULT_MEMORY_POOL=system. Based on a review of memory_pool.cc, I expect this to become posix_memalign calls on Linux. When I call posix_memalign in a .so that I created and linked with my app, using LD_PRELOAD=/usr/local/lib/libmimalloc.so to run the app, these calls get forwarded
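
For readers following the thread, here is a minimal sketch of the kind of aligned allocation discussed above, made through posix_memalign. It is not Arrow's code: `AllocateAlignedSketch` and `IsAligned` are hypothetical helpers, and the 64-byte default alignment is an assumption chosen for illustration. When such a call is resolved dynamically, a preloaded allocator library can usually intercept it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>  // posix_memalign (POSIX), std::free

// Hypothetical helper, not taken from Arrow: allocate `size` bytes aligned
// to `alignment`. The 64-byte default is an assumption for illustration.
// Returns nullptr if the allocation fails.
void* AllocateAlignedSketch(std::size_t size, std::size_t alignment = 64) {
  // posix_memalign requires a power-of-two alignment that is at least
  // sizeof(void*).
  assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
  void* out = nullptr;
  if (posix_memalign(&out, alignment, size) != 0) {
    return nullptr;
  }
  return out;
}

// Hypothetical helper: true if `p` is a multiple of `alignment`.
bool IsAligned(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Memory obtained this way is released with `std::free`, for example `void* p = AllocateAlignedSketch(1000); /* use p */ std::free(p);`.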

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
My best guess at this moment is that the Arrow lib I'm using was built with a compiler that had something like __builtin_posix_memalign in effect ?? I say this because deploying __builtin_malloc has the same deleterious effect on my own .so On Tue, Jun 14, 2022 at 10:53 AM John Muehlhausen wrote

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
A minimal build using the following seems to have solved my problem. The various no-builtin params are guesswork based largely on alloc-override.c from mimalloc. It would be nice if someone documented somewhere how to turn off classes of builtins for each popular compiler or if this received comp

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Sutou Kouhei
Hi, posix_memalign() in memory_pool.cc of libarrow-dev uses jemalloc's posix_memalign() (je_posix_memalign()) because it's built with ARROW_JEMALLOC=ON (default) and JEMALLOC_MANGLE: https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L53 . So we can't use mimalloc with LD_PRE

Re: [C++] Can we remove cpp/src/arrow/dbi/hiveserver2?

2022-06-14 Thread Sutou Kouhei
Hi, There is no objection. I'll remove cpp/src/arrow/dbi/hiveserver2/: https://issues.apache.org/jira/browse/ARROW-16832 Thanks, -- kou In <20220607.145634.286204450295433958@clear-code.com> "Re: [C++] Can we remove cpp/src/arrow/dbi/hiveserver2?" on Tue, 07 Jun 2022 14:56:34 +0900 (JS

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
Thanks for the reply. I had disabled jemalloc via ARROW_DEFAULT_MEMORY_POOL so that was not the issue. The issue was (I think) that the arrow lib I was using was built with compiler builtins (such as __builtin_posix_memalign) so that even the system default allocator wasn't able to be intercepted

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Sutou Kouhei
Hi, I think that compiler builtins aren't related. Could you try only with -DARROW_JEMALLOC=OFF? Thanks, -- kou In "Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool" on Tue, 14 Jun 2022 18:32:00 -0500, John Muehlhausen wrote: > Thanks for the reply

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread Sutou Kouhei
Hi, Could you try https://github.com/apache/arrow/pull/13373 ? This will work with -DARROW_JEMALLOC=ON because it doesn't override posix_memalign() in the system memory pool even when -DARROW_JEMALLOC=ON is specified. Thanks, -- kou In <20220615.083854.1117478143326800670@clear-code.com>

Re: Existence/name/scope for minimal C/C++ Arrow C Data interface helpers

2022-06-14 Thread Dewey Dunnington
Hi all, I drafted a second PR [1] with a design for storing parsed information obtained from a struct ArrowSchema (i.e., parsing the format string into usable C structures). There are some unsolved problems that could use a fresh perspective... all comments welcome! [1] https://github.com/pale