RE: [DISCUSS][C++][Parquet] Expose the API to customize the compression parameter

2023-04-23 Thread wish maple
On 2023/04/23 09:38:02 "Yang, Yang10" wrote: > Hi, > > As discussed in this issue: https://github.com/apache/arrow/issues/35287, currently Arrow only supports one parameter: compression_level to be customized. We would like to make more compression parameters (such as window_bits) customizable when
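To illustrate the kind of parameter under discussion, here is a minimal sketch that uses Python's standard zlib module rather than Arrow itself. It sets a compression level and a window size independently (wbits is zlib's counterpart of window_bits). The helper name `compress_with` is hypothetical, and nothing here reflects the API that Arrow exposes.

```python
import zlib


def compress_with(data: bytes, level: int, wbits: int) -> bytes:
    """Compress data with an explicit level and window size (hypothetical helper)."""
    # wbits in 9..15 selects a zlib-wrapped stream with a 2**wbits byte window.
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"arrow parquet " * 1000
small_window = compress_with(payload, level=6, wbits=9)
large_window = compress_with(payload, level=6, wbits=15)

# Both streams decode to the same bytes; only the encoder settings differ.
assert zlib.decompress(small_window, wbits=15) == payload
assert zlib.decompress(large_window, wbits=15) == payload
```

The point of the sketch is only that level and window size are separate knobs, which is why a single compression_level option cannot express both.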

RE: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-25 Thread wish maple
I think the ArrayVector can have the benefits above: 1. Converting a Batch in Velox or another system to an Arrow array could be much more lightweight. 2. Modifying, filtering and copying an array or string could be much more lightweight. Velox can make a Vector mutable; it seems that an Arrow array cannot. Seems it m

RE: [DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread wish maple
I have two Parquet-related bug fixes and I wonder if we can release them in 12.0.1: 1. https://github.com/apache/arrow/pull/35428 2. https://github.com/apache/arrow/pull/35520 Patch 1 fixes a bug that can make BYTE_STREAM_SPLIT data unreadable if the previous Parquet page is larger than the incoming one. Patch 2

RE: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread wish maple
On 2023/06/15 16:24:44 Joris Van den Bossche wrote: > Hi all, > > Bringing up https://github.com/apache/arrow/issues/35746 to the > mailing list: this issue proposes to bump the default Parquet version > we use for writing to Parquet files in the C++ library (and in the > various bindings including

Question about nested columnar validity

2023-06-28 Thread wish maple
Hi, Looking at the Arrow standard, for nested structures like StructArray[1] or FixedListArray[2], when the parent is not valid, the corresponding child is left "undefined". If it's a BinaryArray, when its parent is not valid, would a validity member point to an undefined address? And
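The question above can be made concrete with a small pure-Python sketch (not Arrow code). It assumes, per my reading of the columnar format, that a struct child keeps its own validity bitmap and that a slot is logically non-null only when both the parent and the child mark it valid. The function name and the sample data are hypothetical.

```python
from typing import List, Optional


def logical_child_values(
    parent_valid: List[bool],
    child_valid: List[bool],
    child_values: List[int],
) -> List[Optional[int]]:
    """Resolve a struct child: a slot is non-null only if parent and child are both valid."""
    return [
        value if (p and c) else None
        for p, c, value in zip(parent_valid, child_valid, child_values)
    ]


parent_valid = [True, False, True]
child_valid = [True, True, False]
# 999 sits under a null parent: it is physically present but must not be interpreted.
child_values = [10, 999, 0]

print(logical_child_values(parent_valid, child_valid, child_values))
# prints [10, None, None]
```

A reader that only consults the child's bitmap would surface 999, which is exactly the "undefined" content the question is concerned with.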

RE: Question about nested columnar validity

2023-06-29 Thread wish maple
/c6frlr9gcxy8qdhbmv8cn3rdjbrqxb1v [4] https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps Thanks, Xuwei Fu On 2023/06/28 15:03:11 wish maple wrote: > Hi, > > By looking at the arrow standard, when it comes to nested structure, like > StructArray[1] or FixedListArray[2], when parent is no

Question about nested columnar validity

2023-06-29 Thread wish maple
ity = true`, their offset might point to an invalid position. Am I right? On 2023/06/29 12:10:52 Antoine Pitrou wrote: > > On 29/06/2023 at 13:42, wish maple wrote: > > Thanks all! > > So, in general: > > 1. For our Binary Like [1] format, and List formats [2], if the

Question about TypeHolder in arrow

2023-07-04 Thread wish maple
Hi, Looking into the code of arrow compute, I found that it uses `TypeHolder` [1], and an expression might call `GetTypes` to get the input or output types. The documentation for `TypeHolder` says that it's a container for dynamically created `shared_ptr`. However, my view is: 1. It's widely used, an

RE: C++: State of parquet 2.x / nanosecond support

2023-07-14 Thread wish maple
Hi Li, Parquet 2.6 has been supported for a long time, and recently, in Parquet C++ and Python, Parquet 2.6 has been set as the default version of the Parquet writer [1] [2]. So I think you can just use it! However, I don't know whether nanoarrow supports it. Best, Xuwei Fu [1] https://lists.apache.o

Re: [VOTE][Format] Add Utf8View Arrays to Arrow Format

2023-08-21 Thread wish maple
+1 (non-binding) It would help a lot when processing UTF-8 related data! Xuwei Andrew Lamb wrote on Tue, Aug 22, 2023 at 00:11: > +1 > > This is a great example of collaboration > > On Sat, Aug 19, 2023 at 4:10 PM Chao Sun wrote: > > > +1 (non-binding)! > > > > On Fri, Aug 18, 2023 at 12:59 PM Felipe Olive

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
I've run into lots of Parquet Dataset issues. The main problem is that we currently have 2 sets of APIs and they have different scan options. Sometimes different interfaces, like `to_batches()` or others, enable different scan options. I think [2] is similar to your problem. 1-4 are some issues

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
rmation (perhaps > metadata) per file scanned? > > On Wed, Sep 6, 2023 at 12:10 PM wish maple wrote: > > > I've met lots of Parquet Dataset issues. The main problem is that > currently > > we have 2 sets or API > > and they have different scan-options. And sometimes diff

Re: [C++] Potential cache/memory leak when reading parquet

2023-09-06 Thread wish maple
By the way, you can try to use a memory profiler like [1] and [2]. It would help to find how the memory is used. Best, Xuwei Fu [1] https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Heap-Profiling [2] https://google.github.io/tcmalloc/gperftools.html Felipe Oliveira Carvalho 于2023年9月7日周

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread wish maple
+1 LGTM, thanks! Ian Cook wrote on Sat, Sep 30, 2023 at 00:49: > +1 (non-binding) > > Thanks very much Felipe for your persistence and your commitment to > addressing the numerous questions and comments that have been raised > since the beginning of the discussion on this in April. > > On Fri, Sep 29, 2023 a

Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread wish maple
Congratulations! Raúl Cumplido wrote on Sun, Oct 15, 2023 at 20:48: > Congratulations and welcome! > > On Sun, 15 Oct 2023, 13:57, Ian Cook wrote: > > > Congratulations Curt! > > > > On Sun, Oct 15, 2023 at 05:32 Andrew Lamb wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Curt Ha

Re: Apache Arrow file format

2023-10-17 Thread wish maple
The Arrow IPC file format is great: it focuses on in-memory representation and direct computation. Basically, it can support compression and dictionary encoding, and the file can be deserialized to the in-memory Arrow format with zero copies. Parquet provides some strong functionality, like Statistics, which could help prunin

Re: Apache Arrow file format

2023-10-22 Thread wish maple
he format affords. It is comparatively > > > expensive > > > > > to encode and decode, and instead relies on index structures and > > > > > statistics to accelerate access. > > > > > > > > > > Both are therefore perfectly viable options d

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread wish maple
Thanks kou and every kind person in the Arrow community! I've learned a lot while contributing to Arrow and Parquet. Thanks for everyone's help. Hope we can bring more fancy features in the future! Best, Xuwei Fu Sutou Kouhei wrote on Mon, Oct 23, 2023 at 12:48: > On behalf of the Arrow PMC, I'm

Re: [ANNOUNCE] New Arrow PMC member: Raúl Cumplido

2023-11-13 Thread wish maple
Congrats Raul! Best, Xuwei Fu Andrew Lamb wrote on Tue, Nov 14, 2023 at 03:28: > The Project Management Committee (PMC) for Apache Arrow has invited > Raúl Cumplido to become a PMC member and we are pleased to announce > that Raúl Cumplido has accepted. > > Please join me in congratulating them. > > Andre

Re: C++: Code that read parquet into Arrow Arrays?

2023-11-17 Thread wish maple
Hi, The Parquet reader is divided into an Arrow part and a Parquet part. 1. The lowest layer of the Parquet part is the Parquet decoder, in [1]. Floating-point columns might use the PLAIN, RLE_DICTIONARY or BYTE_STREAM_SPLIT encoding. 2. parquet::ColumnReader sits on top of the decoder; each row-group might have one or tw
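As a rough illustration of one of the encodings named above, here is a pure-Python sketch of BYTE_STREAM_SPLIT for 32-bit floats, based on my understanding of the Parquet encoding: byte k of every value goes into stream k, and the streams are concatenated. The function names are hypothetical; this is not the parquet-cpp decoder referenced in [1].

```python
import struct
from typing import List

FLOAT32_WIDTH = 4  # bytes per value, hence number of byte streams


def byte_stream_split_encode(values: List[float]) -> bytes:
    """Scatter byte k of each little-endian float32 into stream k, then concatenate."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return b"".join(raw[k::FLOAT32_WIDTH] for k in range(FLOAT32_WIDTH))


def byte_stream_split_decode(data: bytes) -> List[float]:
    """Gather byte k of value i from position k * count + i and rebuild the floats."""
    count = len(data) // FLOAT32_WIDTH
    raw = bytes(
        data[k * count + i] for i in range(count) for k in range(FLOAT32_WIDTH)
    )
    return list(struct.unpack(f"<{count}f", raw))


values = [1.0, -2.5, 0.15625]  # exactly representable in float32
encoded = byte_stream_split_encode(values)
assert byte_stream_split_decode(encoded) == values
```

The encoding does not shrink the data by itself; grouping similar bytes together is meant to help a general-purpose compressor applied afterwards.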

Re: [ANNOUNCE] New Arrow PMC chair: Andy Grove

2023-11-27 Thread wish maple
Congrats Andy! Best, Xuwei Fu Andrew Lamb wrote on Mon, Nov 27, 2023 at 20:47: > I am pleased to announce that the Arrow Project has a new PMC chair and VP > as per our tradition of rotating the chair once a year. I have resigned and > Andy Grove was duly elected by the PMC and approved unanimously by the >

Re: [ANNOUNCE] New Arrow committer: Felipe Oliveira Carvalho

2023-12-07 Thread wish maple
Congrats Felipe!!! Best, Xuwei Fu Benjamin Kietzman wrote on Thu, Dec 7, 2023 at 23:42: > On behalf of the Arrow PMC, I'm happy to announce that Felipe Oliveira > Carvalho > has accepted an invitation to become a committer on Apache > Arrow. Welcome, and thank you for your contributions! > > Ben Kietzman >

Re: [VOTE] Release Apache Arrow 14.0.2 - RC3

2023-12-14 Thread wish maple
+1 (binding) Verified C++ and Python on my M1 MacOS Best, Xuwei Fu Jean-Baptiste Onofré wrote on Fri, Dec 15, 2023 at 00:19: > +1 (non binding) > > I checked: > - hash and signature are OK > - build is OK as soon as submodule are added (see the discussion on > another thread) > - LICENSE and NOTICE look go

[DISCUSS] Proposal: Efficient filtering in parquet-cpp

2023-12-29 Thread wish maple
Hi, all. We're proposing Page Filtering in the parquet-cpp implementation[1]. Currently, parquet-cpp and arrow only support RowGroup/ColumnChunk level pruning. Now we can support filtering with the Parquet PageIndex[2]. The interface can also be used to help implement the iceberg positional delete f
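The idea of page-level pruning can be sketched in a few lines of plain Python. This is only a conceptual model, assuming each page records its row range plus min/max statistics; the `PageStats` class and `pages_to_read` function are hypothetical names and do not correspond to the proposed parquet-cpp interface.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    first_row: int   # index of the first row stored in the page
    num_rows: int    # number of rows in the page
    min_value: int   # smallest value in the page
    max_value: int   # largest value in the page


def pages_to_read(pages: List[PageStats], lo: int, hi: int) -> List[Tuple[int, int]]:
    """Return (first_row, num_rows) for pages whose [min, max] overlaps [lo, hi]."""
    return [
        (p.first_row, p.num_rows)
        for p in pages
        if p.max_value >= lo and p.min_value <= hi
    ]


pages = [
    PageStats(first_row=0, num_rows=100, min_value=1, max_value=50),
    PageStats(first_row=100, num_rows=100, min_value=51, max_value=120),
    PageStats(first_row=200, num_rows=100, min_value=121, max_value=300),
]

# A filter of 60 <= x <= 110 can only match rows in the second page.
print(pages_to_read(pages, 60, 110))
# prints [(100, 100)]
```

Row-group level pruning applies the same overlap test to a whole column chunk; doing it per page simply narrows the row ranges that have to be decoded.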

Re: [VOTE] Release Apache Arrow 15.0.1 - RC0

2024-03-05 Thread wish maple
+1 verified C++ and Python on M1 MacOS Best, Xuwei Fu Raúl Cumplido wrote on Mon, Mar 4, 2024 at 17:05: > Hi, > > I would like to propose the following release candidate (RC0) of Apache > Arrow version 15.0.1. This is a release consisting of 37 > resolved GitHub issues[1]. > > This release candidate is based

Re: [C++][Parquet] Add support for writing bloom filter to Parquet file

2024-03-16 Thread wish maple
I was working on this previously [1], but lost the context for it. I'll move this forward now. [1] https://github.com/apache/arrow/pull/37400 Best regards, Xuwei Fu Andrei Lazăr wrote on Sun, Mar 17, 2024 at 03:14: > Hi, > > I would like proposing extending the C++ library to add support for writing > blo

Re: [ANNOUNCE] New Arrow committer: Bryce Mecum

2024-03-17 Thread wish maple
Congrats! Best, Xuwei Fu Nic Crane wrote on Mon, Mar 18, 2024 at 10:24: > On behalf of the Arrow PMC, I'm happy to announce that Bryce Mecum has > accepted an invitation to become a committer on Apache Arrow. Welcome, and > thank you for your contributions! > > Nic >

Re: [ANNOUNCE] New Committer Joel Lubinitsky

2024-04-01 Thread wish maple
Congrats Joel! Best, Xuwei Fu Matt Topol wrote on Mon, Apr 1, 2024 at 22:59: > On behalf of the Arrow PMC, I'm happy to announce that Joel Lubinitsky has > accepted an invitation to become a committer on Apache Arrow. Welcome, and > thank you for your contributions! > > --Matt >

Re: [VOTE] Bulk ingestion support for Flight SQL (vote #2)

2024-04-06 Thread wish maple
+1 (non binding) Best, Xuwei Fu David Li wrote on Fri, Apr 5, 2024 at 16:38: > Hello, > > Joel Lubinitsky has proposed adding bulk ingestion support to Arrow Flight > SQL [1]. This provides a path for uploading an Arrow dataset to a Flight > SQL server to create or append

Parquet: Legacy timestamp "adjustToUtc" conversion change in arrow 16.0

2024-04-10 Thread wish maple
The issue [1] mentions the syntax change in arrow parquet. In general, when reading from a Parquet file with a legacy timestamp not written by arrow, isAdjustedToUTC would be ignored during read. And when filtering a file like this, filtering would not work. When casting from a "deprecated

Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore

2024-04-11 Thread wish maple
Congrats! Best, Xuwei Fu Kevin Gurney wrote on Thu, Apr 11, 2024 at 23:22: > Congratulations, Sarah!! Well deserved! > > From: Jacob Wujciak > Sent: Thursday, April 11, 2024 11:14 AM > To: dev@arrow.apache.org > Subject: Re: [ANNOUNCE] New Arrow committer: Sarah Gilmore > >

Re: [ANNOUNCE] New Arrow committer: Dane Pitkin

2024-05-07 Thread wish maple
Congrats! Best, Xuwei Fu Joris Van den Bossche wrote on Tue, May 7, 2024 at 21:53: > On behalf of the Arrow PMC, I'm happy to announce that Dane Pitkin has > accepted an invitation to become a committer on Apache Arrow. Welcome, > and thank you for your contributions! > > Joris >

Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-09 Thread wish maple
+1 (binding) TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1 Release candidate 16.1.0 works well on my M1 MacOS Best, Xuwei Fu David Li wrote on Fri, May 10, 2024 at 09:30: > +1 (binding) > > Tested sources with Conda on Debian 12/x86_64 (binaries failed due to > download flakiness) > > On

Re: [VOTE] Release Apache Arrow 16.1.0 - RC1

2024-05-10 Thread wish maple
Ah, only PMC members can cast binding votes. Please regard my vote as non-binding. Best, Xuwei Fu wish maple wrote on Fri, May 10, 2024 at 10:39: > +1 (binding) > > TEST_DEFAULT=0 TEST_CPP=1 ./verify-release-candidate.sh 16.1.0 1 > Release candidate 16.1.0 works well on my M1 MacOS > > Best, > Xuwei Fu >

Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

2024-06-13 Thread wish maple
Some configs, like use_thread, would be true in Python but false in C++. Maybe we can fill in all configs explicitly with the same values. Best, Xuwei Fu J N wrote on Thu, Jun 13, 2024 at 13:32: > Hello, > We all know that there inherent overhead in Python, and we wanted to > compare the performance of reading d

Re: [VOTE][Format] Opaque canonical extension type

2024-07-24 Thread wish maple
+1 (non-binding) Checked spec change and C++ impl. Best, Xuwei Fu Gang Wu wrote on Wed, Jul 24, 2024 at 20:51: > +1 (non-binding) > > Checked spec change and C++ impl. > > On Wed, Jul 24, 2024 at 6:52 PM Joel Lubinitsky > wrote: > > > +1 (non-binding) > > > > Go implementation LGTM > > > > On Wed, Jul 24,

Re: [VOTE][Format] Bool8 Canonical Extension Type

2024-08-05 Thread wish maple
+1 (non-binding) Best, Xuwei Fu David Li wrote on Tue, Aug 6, 2024 at 10:20: > +1 (binding) > > On Tue, Aug 6, 2024, at 10:17, Sutou Kouhei wrote: > > +1 (binding) > > > > In > > "[VOTE][Format] Bool8 Canonical Extension Type" on Mon, 5 Aug 2024 > > 08:59:42 -0400, > > Joel Lubinitsky wrote: > > > >> H

Re: [VOTE] Split Go release process

2024-08-26 Thread wish maple
+1 (non-binding) Best, Xuwei Fu Raúl Cumplido wrote on Mon, Aug 26, 2024 at 15:48: > +1 (binding) > > On Mon, 26 Aug 2024, 6:23, Matt Topol wrote: > > > +1 (binding) > > > > On Mon, Aug 26, 2024, 12:08 AM Ruoxi Sun wrote: > > > > > +1 non-binding > > > > > > > > > *Regards,* > > > *Rossi SUN* > > > > >

Re: [DISCUSS][C++] Indent #if (preprocessor directives)

2024-08-27 Thread wish maple
+1 (non-binding) LGTM in Parquet part Best, Xuwei Fu Sutou Kouhei wrote on Wed, Aug 28, 2024 at 09:07: > Hi, > > How about indenting preprocessor directives for readability? > > Issue: https://github.com/apache/arrow/issues/43796 > PR: https://github.com/apache/arrow/pull/43798 > > For example: > > Before:

Re: [ANNOUNCE] New Arrow committer: Will Ayd

2024-10-09 Thread wish maple
Congrats! Best, Xuwei Fu David Li wrote on Sat, Oct 5, 2024 at 19:15: > Welcome, Will! > > On Wed, Oct 2, 2024, at 23:25, Gang Wu wrote: > > Congrats and welcome! > > > > Best regards, > > Gang > > > > On Wed, Oct 2, 2024 at 10:16 PM Vibhatha Abeykoon > > wrote: > > > >> Congratulations, Will! > >> > >> On

Re: [ANNOUNCE] New Arrow committer: Rossi Sun

2024-10-22 Thread wish maple
Congrats Ruoxi! Best, Xuwei Fu Felipe Oliveira Carvalho wrote on Wed, Oct 23, 2024 at 08:18: > Great news! Congratulations. > > -- > Felipe > > On Tue, 22 Oct 2024 at 16:03 Weston Pace wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Rossi Sun has > > accepted an invitation to become a co

Re: [VOTE] Add Async C Data Interface

2024-10-28 Thread wish maple
+1 (non-binding) Best, Xuwei Fu David Li wrote on Tue, Oct 29, 2024 at 07:51: > +1 (binding) for me > > On Sat, Oct 26, 2024, at 10:39, Ian Cook wrote: > > Oh ok, thanks Matt, I understand. > > > > In that case I am +1 on the proposal but I would like to see notes added > to > > the documentation to make th

Re: [ANNOUNCE] New Arrow PMC chair: Neil Richardson

2024-10-30 Thread wish maple
Thanks Andy and congrats Neal! Andrew Lamb wrote on Wed, Oct 30, 2024 at 19:28: > I am pleased to announce that the Arrow Project has a new PMC chair and VP > as per our tradition of rotating the chair once a year. Andy Grove has > resigned and > Neil Richardson was duly elected by the PMC and approved unani

Re: [ANNOUNCE] New Arrow committer: Adam Reeve

2024-11-18 Thread wish maple
Congrats Adam! Best, Xuwei Fu Sutou Kouhei wrote on Tue, Nov 19, 2024 at 08:31: > On behalf of the Arrow PMC, I'm happy to announce that Adam Reeve > has accepted an invitation to become a committer on Apache > Arrow. Welcome, and thank you for your contributions! > > -- > kou > >

Re: [ANNOUNCE] New Arrow committer: Laurent Goujon

2024-11-25 Thread wish maple
Congrats! Best, Xuwei Fu David Li wrote on Mon, Nov 25, 2024 at 17:35: > On behalf of the Arrow PMC, I'm happy to announce that Laurent Goujon has > accepted an invitation to become a committer on Apache Arrow. Welcome, and > thank you for your contributions! > > -- > David >

Re: [VOTE] Statistics through the C data interface

2024-12-04 Thread wish maple
+1 (non-binding) Best, Xuwei Fu Sutou Kouhei wrote on Thu, Dec 5, 2024 at 10:58: > Hi, > > I would like to propose standardizing how to pas statistics > through the C data interface. > > Motivation: > > * We want to pass not only Apache Arrow data but also > statistics of them through the C data interface

Re: [ANNOUNCE] New Arrow PMC member: Gang Wu

2024-12-03 Thread wish maple
Congrats! Best, Xuwei Fu Sutou Kouhei wrote on Wed, Dec 4, 2024 at 05:20: > The Project Management Committee (PMC) for Apache Arrow has invited > Gang Wu to become a PMC member and we are pleased to announce > that Gang Wu has accepted. > > Congratulations and welcome! >

Re: [ANNOUNCE] New Arrow PMC member: Bryce Mecum

2025-02-05 Thread wish maple
Congrats! Best, Xuwei Fu Raúl Cumplido wrote on Thu, Feb 6, 2025 at 15:47: > Congrats Bryce! > > On Thu, 6 Feb 2025, 6:22, Weston Pace wrote: > > > Congrats Bryce! > > > > On Wed, Feb 5, 2025 at 8:35 PM Saurabh Singh > > wrote: > > > > > Congratulations Bryce. > > > > > > On Thu, 6 Feb 2025 at 07:41, Ga

Re: [ANNOUNCE] New Arrow committer: Ed Seidl (etseidl)

2025-01-29 Thread wish maple
Congratulations Ed! Well deserved! Best, Xuwei Fu Weston Pace wrote on Wed, Jan 29, 2025 at 20:19: > Congratulations Ed! > > On Wed, Jan 29, 2025 at 2:20 AM Andrew Lamb > wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Ed Seidl > > has accepted an invitation to become a committer on Apac

Re: [C++] Bump required CMake version

2024-12-09 Thread wish maple
+1 on 3.25 Best, Xuwei Fu Ruoxi Sun wrote on Tue, Dec 10, 2024 at 08:36: > I would +1 on 3.25. > > Thanks kou for driving this. > > *Regards,* > *Rossi SUN* > > > On Tue, Dec 10, 2024 at 7:41 AM Jacob Wujciak-Jens > wrote: > > > +1 on 3.25 > > > > Thanks for the summary kou. > > > > On Mon, 9 Dec 2024 at

Re: [ANNOUNCE] New Arrow PMC member: Rok Mihevc

2025-03-19 Thread wish maple
Congratulations Rok! Best, Xuwei Fu Antoine Pitrou wrote on Thu, Mar 20, 2025 at 03:10: > > Hello all, > > The Project Management Committee (PMC) for Apache Arrow has invited > Rok Mihevc to become a PMC member and we are pleased to announce that > Rok has accepted. > > Regards > > Antoine. >

Re: [DISCUSS] Turtle canonical extension type

2025-04-01 Thread wish maple
Out of curiosity, is this turtle type like an array containing the info of arrow stream ipc batches? Do binary values have some alignas rule? And are `label` and `value` both non-nullable? Best, Xuwei Fu Weston Pace wrote on Wed, Apr 2, 2025 at 02:52: > I've written a draft at [1] but for simplicity's sake I

Re: [ANNOUNCE] New Arrow PMC member: Ian Cook

2025-04-04 Thread wish maple
Congrats Ian! Best, Xuwei Fu Sutou Kouhei wrote on Thu, Mar 20, 2025 at 16:04: > The Project Management Committee (PMC) for Apache Arrow has invited > Ian Cook to become a PMC member and we are pleased to announce > that Ian Cook has accepted. > > Congratulations and welcome! >

Re: [DISCUSS][C++] Switch to C++20

2025-05-19 Thread wish maple
+1 (non-binding) Best, Xuwei Fu Antoine Pitrou wrote on Tue, May 20, 2025 at 00:14: > > Hello, > > I am proposing that we switch Arrow C++ to require C++20. > > C++20 will offer support for more C++ language and standard library > features, such as: > > - concepts > - generic lambdas with explicit type param

Re: [DISCUSS] Arrow Variant Extension Type

2025-05-21 Thread wish maple
When I went through the Parquet variant spec, I found that an Arrow extension type might be a must because decoding Parquet row by row is so inefficient. I've drafted a decoding tool in Parquet C++, and it is ready for review now [1] [1] https://github.com/apache/arrow/pull/46372 Best, Xuwei Fu Matt

Re: Blog post about recent improvements to hash join in Arrow C++

2025-07-14 Thread wish maple
+1 Best, Xuwei Fu Alenka Frim wrote on Mon, Jul 14, 2025 at 20:41: > +1 from me too! > > I really like that this topic is being shared in the form of a blog post - > it's well written and nice to read. I especially like the introduction! > I’d also be happy to read a bit more about hash joins here, as Nic >

Re: [ANNOUNCE] New Arrow PMC member: Alenka Frim

2025-07-01 Thread wish maple
Congrats, Alenka! Best, Xuwei Fu Krisztián Szűcs wrote on Tue, Jul 1, 2025 at 17:13: > Congrats Alenka! > > > On 2025. Jul 1., at 9:38, Raúl Cumplido wrote: > > > > The Project Management Committee (PMC) for Apache Arrow has invited > Alenka > > Frim to become a PMC member and we are pleased to announce tha