from:"Ian Cook"

Re: [RESULT][VOTE] Release Apache Arrow 8.0.0 - RC3

2022-05-28 Thread Ian Cook

> 16. [todo:ianmcook] update vcpkg port

Done

On Sat, May 7, 2022 at 07:20 Krisztián Szűcs wrote:

> Updated status of post-release tasks:
>
> 1. [done] make the released version as "RELEASED" on JIRA
> 2. [done] start the new version on JIRA
> 3. [done] merge changes on release branch to maintenance branch for
> patch releases
> 4. [done] upload source
> 5. [done] upload binaries
> 6. [done] update website
> 7. [done] update Homebrew packages
> 8. [done] update MSYS2 package
> 9. [todo:kou] upload RubyGems
> 10. [done] upload JS packages
> 11. [done] upload C# packages
> 12. [todo:xhochy?] update conda recipes
> 13. [done] upload wheels/sdist to pypi
> 14. [done] publish Maven artifacts
> 15. [todo:nealrichardson] update R packages
> 16. [todo:ianmcook] update vcpkg port
> 17. [done] bump versions
> 18. [done] update tags for Go modules
> 19. [done] update docs
> 20. [done] announce release
> 21. [done] remove old release candidates
>
> On Sat, May 7, 2022 at 3:40 AM Neal Richardson
>  wrote:
> >
> > I will handle the R submission to CRAN.
> >
> > Neal
> >
> > On Fri, May 6, 2022 at 6:14 PM Sutou Kouhei  wrote:
> >
> > > > 9. [todo:kou?] upload RubyGems
> > >
> > > I'll do it once Homebrew and MSYS2 packages are updated.
> > >
> > > > In
> > > >   "Re: [RESULT][VOTE] Release Apache Arrow 8.0.0 - RC3" on Fri, 6 May 2022 23:37:58 +0200,
> > > >   Krisztián Szűcs wrote:
> > >
> > > > Current status of the post-release tasks:
> > > >
> > > > 1. [done] make the released version as "RELEASED" on JIRA
> > > > 2. [done] start the new version on JIRA
> > > > 3. [done] merge changes on release branch to maintenance branch for
> > > > patch releases
> > > > 4. [done] upload source
> > > > 5. [done] upload binaries
> > > > 6. [done] update website
> > > > 7. [in-pr] update Homebrew packages
> > > > 8. [in-pr] update MSYS2 package
> > > > 9. [todo:kou?] upload RubyGems
> > > > 10. [done] upload JS packages
> > > > 11. [done] upload C# packages
> > > > 12. [todo:xhochy?] update conda recipes
> > > > 13. [done] upload wheels/sdist to pypi
> > > > 14. [done] publish Maven artifacts
> > > > 15. [todo:nealrichardson?] update R packages
> > > > 16. [todo:ianmcook?] update vcpkg port
> > > > 17. [done] bump versions
> > > > 18. [done] update tags for Go modules
> > > > 19. [in-pr] update docs
> > > > 20. [todo:kszucs] announce release
> > > >
> > > > On Fri, May 6, 2022 at 10:18 PM Krisztián Szűcs
> > > >  wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> The vote carries with 4 +1 binding votes, 4 +1 non-binding votes and
> > > >> no -1 votes.
> > > >> I'm starting to work on the post-release tasks and keep this thread
> > > >> updated about the current status.
> > > >>
> > > >> Thanks everyone!
> > > >>
> > > >> - Krisztian
> > > >>
> > > >> On Fri, May 6, 2022 at 8:33 PM Krisztián Szűcs
> > > >>  wrote:
> > > >> >
> > > >> > +1 (binding)
> > > >> >
> > > >> > Verified on macOS 12 arm64.
> > > >> > The crossbow verification tasks were also successful [1].
> > > >> >
> > > >> > [1]: https://github.com/apache/arrow/pull/13057
> > > >> >
> > > >> > On Thu, May 5, 2022 at 4:02 PM Dewey Dunnington <de...@voltrondata.com> wrote:
> > > >> > >
> > > >> > > +1 (non-binding)
> > > >> > >
> > > >> > > I ran:
> > > >> > > TEST_DEFAULT=0 TEST_CPP=1 dev/release/verify-release-candidate.sh 8.0.0 3
> > > >> > >
> > > >> > > I also ran R CMD check locally on that commit, and only got the usual NOTE
> > > >> > > about a large libs directory.
> > > >> > >
> > > >> > > I ran into an OSError (too many open files) when trying with TEST_PYTHON=1,
> > > >> > > but I assume this is something to do with my local setup (this is the first
> > > >> > > release I've tried to verify).
> > > >> > >
> > > >> > > On Thu, May 5, 2022 at 7:00 AM Raul Cumplido <r...@voltrondata.com> wrote:
> > > >> > >
> > > >> > > > +1 (non-binding)
> > > >> > > >
> > > >> > > > I ran:
> > > >> > > > TEST_DEFAULT=0 TEST_CPP=1 TEST_GLIB=1 TEST_PYTHON=1 TEST_GO=1 TEST_JAVA=1
> > > >> > > > TEST_JS=1 TEST_RUBY=1 TEST_CSHARP=1 dev/release/verify-release-candidate.sh
> > > >> > > > 8.0.0 3
> > > >> > > >
> > > >> > > > on arch linux (5.17.5-arch1-1), x86_64 with:
> > > >> > > > gcc version 11.2.0 (GCC)
> > > >> > > > openjdk version "11.0.15" 2022-04-19
> > > >> > > > python 3.10.4
> > > >> > > > go 1.18.1
> > > >> > > > ruby 3.0.4
> > > >> > > >
> > > >> > > > On Thu, May 5, 2022 at 8:24 AM Bryan Cutler <cutl...@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > +1 (non-binding)
> > > >> > > > >
> > > >> > > > > I ran:
> > > >> > > > > TEST_DEFAULT=0 TEST_INTEGRATION_CPP=1 TEST_INTEGRATION_JAVA=1
> > > >> > > > > ARROW_GANDIVA=OFF ARROW_PLASMA=OFF dev/release/verify-release-candidate.sh
> > > >> > > > > 8.0.0 3
> > > >> > > > >
> > > >> > > > > On Wed, May 4, 2022 at 3:23 PM Sutou Kouhei <k...@clear-code.com> wrote:
> > > >> > > > >
> > > >> > > > > > +1
> > > >> > > > > >
> > > >> > >

Re: June 23 virtual conference to highlight work in the Arrow ecosystem

2022-06-07 Thread Ian Cook

The deadline for applying to speak at the June 23 event has passed. We
plan to do more events like this in the future so I expect there will
be more opportunities to speak.

Ian

On Mon, Jun 6, 2022 at 10:28 PM Niranda Perera  wrote:
>
> Hi,
>
> Is the event still open for submitting talks? If so, I'd like to talk about 
> how we use Arrow in Cylon for distributed memory parallel computations.
>
> Best
>
> On Fri, May 13, 2022 at 7:21 PM Wes McKinney  wrote:
>>
>> hi all,
>>
>> My employer (Voltron Data) is organizing a free virtual conference on
>> June 23 to highlight development work and usage of Apache Arrow — you
>> can register for this or apply to give a talk here:
>>
>> https://thedatathread.com/
>>
>> We are especially interested in hearing from users (as opposed to only
>> project developers/contributors!) about how they are using Arrow in
>> their downstream applications. If you would be interested in speaking
>> (talks will be pre-recorded, so you don't need to be available on June
>> 23), please apply to give a short talk (~15 min) on the website!
>>
>> Thanks,
>> Wes
>
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44
>

Arrow sync call June 8 at 12:00 US/Eastern, 16:00 UTC

2022-06-08 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call June 8 at 12:00 US/Eastern, 16:00 UTC

2022-06-08 Thread Ian Cook

Attendees:

Ian Cook
Raúl Cumplido
Alenka Frim
Ian Joiner
Will Jones
Jorge Leitão
David Li
Rok Mihevc
Ashish Paliwal
Matthew Topol
Jacob Wujciak


Discussion:

Recent changes to the merge script for apache/arrow PRs
- Now uses a personal access token (PAT) to authenticate to the ASF Jira
- Now requires the GitHub PAT to have workflow scope
- See discussion about this on Zulip [1]

Stabilizing the C Stream interface
- It has been 20 months since its introduction, with no changes
- See the ML discussion [2] about this
- Will Jones has put up two PRs [3][4] and started a vote [5] about
this on the mailing list

Changes to release management guide
- Most of the content from the release management guide has been moved
[6] from Confluence [7] to the Arrow repo [8] where it is built as
part of the Arrow docs site [9]

Proposed changes to release process
- Raúl has proposed [10] a change to the release process to simplify
creation of release candidates and has opened a PR [11] to update the
release management guide to reflect this change

Substrait project
- There is more collaboration happening between the Arrow and Substrait projects
- There is a Substrait Community page [12] with details about how to
get involved in Substrait

Proposal to Dockerize the integration tests
- Jorge opened a PR proposing this [13] that Raúl and Jacob are reviewing

[1] 
https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Merge.20script.20with.20API.20keys/near/285049925
[2] https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55
[3] https://github.com/apache/arrow/pull/13345
[4] https://github.com/apache/arrow-rs/pull/1821
[5] https://lists.apache.org/thread/5bvk6m3y3wl0m4jdsnyhdylt1w5j288k
[6] https://github.com/apache/arrow/pull/13272
[7] https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
[8] 
https://github.com/apache/arrow/blob/master/docs/source/developers/release.rst
[9] https://arrow.apache.org/docs/dev/developers/release.html
[10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
[11] https://github.com/apache/arrow/pull/13308
[12] https://substrait.io/community/
[13] https://github.com/apache/arrow/pull/12407

On Wed, Jun 8, 2022 at 10:44 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: Arrow sync call June 8 at 12:00 US/Eastern, 16:00 UTC

2022-06-08 Thread Ian Cook

Hi Gavin,

There was no detailed discussion in the meeting about this, just some
general comments, but I'll share a few areas of collaboration that I'm
aware of:
- There is work ongoing to enable the Arrow C++ compute engine (aka
"Acero") to consume Substrait plans, change them into ExecPlans, and
execute them. Work started on this late last year [1] and has
continued since then [2].
- There are plans to adopt Substrait in DataFusion [3] and Ballista [4]

There are also several other Substrait-related projects not directly in
Arrow repos that engineers at Voltron Data are working on:
- Creating a Substrait compiler for Ibis [5], to allow Python users to
write code in a convenient analytics DSL and have it execute on
engines that can consume Substrait
- Creating a Substrait compiler for dplyr [6], to allow R users to
write dplyr code that can execute on engines that can consume
Substrait
- Creating a Substrait plan validator [7]
- Planning for "ADBC" to support Substrait [8]
- Defining more functions in the Substrait specification [9] <-- This
is an area where we could use more help

Thanks,
Ian

[1] https://github.com/apache/arrow/pull/11707
[2] 
https://github.com/apache/arrow/pulls?q=is%3Apr+substrait+label%3Alang-c%2B%2B
[3] https://github.com/apache/arrow-datafusion/issues/2646
[4] https://github.com/apache/arrow-ballista/issues/32
[5] https://github.com/ibis-project/ibis-substrait/
[6] https://github.com/voltrondata/substrait-r
[7] http://github.com/substrait-io/substrait-validator
[8] 
https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/
[9] https://github.com/substrait-io/substrait/tree/main/extensions



On Wed, Jun 8, 2022 at 5:41 PM Gavin Ray  wrote:
>
> Thanks Ian -- can I ask whether there was any discussion of note that
> happened around Arrow + Substrait stuff?
>
>
> On Wed, Jun 8, 2022 at 5:31 PM Ian Cook  wrote:
>
> > Attendees:
> >
> > Ian Cook
> > Raúl Cumplido
> > Alenka Frim
> > Ian Joiner
> > Will Jones
> > Jorge Leitão
> > David Li
> > Rok Mihevc
> > Ashish Paliwal
> > Matthew Topol
> > Jacob Wujciak
> >
> >
> > Discussion:
> >
> > Recent changes to the merge script for apache/arrow PRs
> > - Now uses a personal access token (PAT) to authenticate to the ASF Jira
> > - Now requires the GitHub PAT to have workflow scope
> > - See discussion about this on Zulip [1]
> >
> > Stabilizing the C Stream interface
> > - It has been 20 months since its introduction, with no changes
> > - See the ML discussion [2] about this
> > - Will Jones has put up two PRs [3][4] and started a vote [5] about
> > this on the mailing list
> >
> > Changes to release management guide
> > - Most of the content from the release management guide has been moved
> > [6] from Confluence [7] to the Arrow repo [8] where it is built as
> > part of the Arrow docs site [9]
> >
> > Proposed changes to release process
> > -  Raúl has proposed [10] a change to the release process to simplify
> > creation of release candidates and has opened a PR [11] to update the
> > release management guide to reflect this change
> >
> > Substrait project
> > - There is more collaboration happening between the Arrow and Substrait
> > projects
> > - There is a Substrait Community page [12] with details about how to
> > get involved in Substrait
> >
> > Proposal to Dockerize the integration tests:
> > - Jorge opened a PR proposing this [13] that Raúl and Jacob are reviewing
> >
> > [1]
> > https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Merge.20script.20with.20API.20keys/near/285049925
> > [2] https://lists.apache.org/thread/0y604o9s3wkyty328wv8d21ol7s40q55
> > [3] https://github.com/apache/arrow/pull/13345
> > [4] https://github.com/apache/arrow-rs/pull/1821
> > [5] https://lists.apache.org/thread/5bvk6m3y3wl0m4jdsnyhdylt1w5j288k
> > [6] https://github.com/apache/arrow/pull/13272
> > [7]
> > https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> > [8]
> > https://github.com/apache/arrow/blob/master/docs/source/developers/release.rst
> > [9] https://arrow.apache.org/docs/dev/developers/release.html
> > [10] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
> > [11] https://github.com/apache/arrow/pull/13308
> > [12] https://substrait.io/community/
> > [13] https://github.com/apache/arrow/pull/12407
> >
> > On Wed, Jun 8, 2022 at 10:44 AM Ian Cook  wrote:
> > >
> > > Hi all,
> > >
> > > Our biweekly sync call is today at 12:00 noon Eastern time.
> > >
> > > The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> > > https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> > >
> > > Alternatively, enter this information into the Zoom website or app to
> > > join the call:
> > > Meeting ID: 876 4903 3008
> > > Passcode: 958092
> > >
> > > Thanks,
> > > Ian
> >

Arrow sync call June 22 at 12:00 US/Eastern, 16:00 UTC

2022-06-22 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: [ANNOUNCE] New Arrow committers: Dewey Dunnington, Alenka Frim, and Rok Mihevc

2022-06-22 Thread Ian Cook

Congratulations Dewey, Alenka, and Rok!!!

On Wed, Jun 22, 2022 at 1:13 PM Neal Richardson
 wrote:
>
> On behalf of the Arrow PMC, I'm happy to announce that
>
> Dewey Dunnington
> Alenka Frim
> Rok Mihevc
>
> have all accepted invitations to become committers on Apache Arrow!
> Welcome, thank you for all your contributions so far, and we look forward
> to continuing to drive Apache Arrow forward to an even better place in the
> future.
>
> Neal

Re: Arrow sync call June 22 at 12:00 US/Eastern, 16:00 UTC

2022-06-22 Thread Ian Cook

Attendees:

- Ian Cook
- Will Jones
- David Li
- Rok Mihevc
- Dragoș Moldovan-Grünfeld
- Aldrin Montana
- Matthew Topol
- Jacob Wujciak


Discussion:

Book about Apache Arrow released this week
- Matt Topol's book "In-Memory Analytics with Apache Arrow" goes on
sale this Friday June 24 [1]

Conference about Apache Arrow happening this week
- This Thursday June 23, Voltron Data is hosting "The Data Thread", a
virtual conference focused on the Apache Arrow ecosystem [2]

Discussion about patch releases
- Prompted by recent reports of a vulnerability in a dependency of the
Arrow Go package [3] and a missing feature in one of the PyArrow
wheels [4], there was an inconclusive discussion about our current
process for creating patch releases for packages maintained in the
apache/arrow monorepo and whether this meets the needs of the
different language libraries.
- This led to a discussion about whether the Arrow maintainers should
provide patch releases for older major versions of Arrow, which would
fix critical bugs without introducing any breaking changes. It was
agreed that this would likely be burdensome because of the
unpredictable level of difficulty in backporting fixes and the
predictably high level of difficulty in producing releases, verifying
them, packaging them, and distributing them.

Discussion about versioning conventions
- There was an inconclusive discussion about whether we should
consider moving away from the current Arrow release versioning
conventions in which each quarterly release of the packages maintained
in the apache/arrow monorepo increments the major version number. As
some implementations of Arrow become more stable, it is unclear to
some whether we should always be incrementing the major version
number, which under the semantic versioning scheme indicates that the
release includes breaking changes to the API. Always incrementing
major version numbers creates some difficulty for package maintainers,
for example for the Go package.
- Note that the current semantic versioning convention was explained
in the Arrow 1.0.0 release blog post [5]
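
To make the versioning point concrete, here is a minimal sketch of how a
semantic-versioning bump can be classified. The helper names
(parse_version, bump_kind) are hypothetical and are not part of any Arrow
tooling; the sketch assumes plain MAJOR.MINOR.PATCH strings with no
pre-release suffixes.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Split a plain 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Return 'major', 'minor', or 'patch' for a move from old to new."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    if new_parts <= old_parts:
        raise ValueError(f"{new} is not newer than {old}")
    if new_parts[0] != old_parts[0]:
        return "major"  # under semver this signals breaking API changes
    if new_parts[1] != old_parts[1]:
        return "minor"  # backwards-compatible additions
    return "patch"      # backwards-compatible fixes only


print(bump_kind("8.0.0", "9.0.0"))
print(bump_kind("8.0.0", "8.0.1"))
```

Under the current convention every quarterly release falls into the
"major" branch above, whether or not a given language library actually
broke its API.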

Upcoming 9.0.0 release targeted for mid-July
- Please see the email about this upcoming release [6]
- There have been some changes to the release process that will take
effect with this release, as described on the mailing list [7] and in
the updated release management guide [8]. There was another PR merged
just today with some additional changes to the release management
guide [9]. Please review the updated processes if you will be
participating in the 9.0.0 release.
- It is anticipated that fewer problems will be identified during the
release verification process because of the increased visibility and
attention on fixing nightly CI failures, for example using the
Crossbow nightly report [10]


[1] https://lists.apache.org/thread/gnlby6hs2jl4fhtk6wlx0zmw3ox3lqyj
[2] https://thedatathread.com
[3] https://lists.apache.org/thread/zyhl1r1nkp82lr6wtyz6h2z5knoly73q
[4] https://issues.apache.org/jira/browse/ARROW-16779
[5] https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
[6] https://lists.apache.org/thread/8b7yyzgmtb6mq7od0jbntvfflm0vv72o
[7] https://lists.apache.org/thread/g6mqpyq2hc11xbgrq2pf653njzy53plt
[8] https://arrow.apache.org/docs/dev/developers/release.html
[9] https://github.com/apache/arrow/pull/13308
[10] https://crossbow.voltrondata.com



On Wed, Jun 22, 2022 at 10:31 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Arrow sync call July 6 at 12:00 US/Eastern, 16:00 UTC

2022-07-05 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call July 6 at 12:00 US/Eastern, 16:00 UTC

2022-07-06 Thread Ian Cook

Attendees:

- Ian Cook
- Raul Cumplido
- James Duong
- Will Jones
- David Li
- Ashish Paliwal
- Matt Topol
- Jacob Wujciak


Discussion:

Donation of Flight SQL drivers to Arrow project
- The vote to accept the donation of the Flight SQL JDBC driver has
passed [1]. We are awaiting some administrative steps before the
donation is complete.
- Dremio/Bit Quill has also created a Flight SQL ODBC driver [2] but
there are currently some pieces of this driver that are incompatible
with the Apache 2.0 license. There are also some additional features
needed to make it generally usable outside of Dremio applications.
James Duong is looking into the feasibility of resolving these issues
so that it could potentially be donated to the Arrow project.

Build issues with MSVC for Win32
- There is a discussion thread [3] and a Jira ticket [4] with several
open PRs to fix a problem building with MSVC for Win32. One of the PRs
was closed awaiting a corporate approval process, but James Duong has
been working to resolve this issue independently of that PR and is
hoping to open a separate PR to fix it soon.

Discussion about good resources for getting oriented/started as a new
Arrow user/developer
- Suggestions included the Arrow Cookbook [5], Matt Topol's new book
"In-Memory Analytics with Apache Arrow" [6], and the videos from The
Data Thread [7]

Reminder about the upcoming 9.0.0 release
- Plan is to start preparing the release around July 25

[1] https://lists.apache.org/thread/vbb5k9v2h4c5ct8zffyl6b767frz6wzt
[2] https://github.com/dremio/flightsql-odbc
[3] https://lists.apache.org/thread/sqtb98ohqplmlg93wkow2v2c44sc4pqx
[4] https://issues.apache.org/jira/browse/ARROW-16778
[5] https://arrow.apache.org/cookbook/
[6] 
https://www.amazon.com/Memory-Analytics-Apache-Arrow-hierarchical/dp/1801071039/
[7] https://www.youtube.com/c/TheDataThread


On Tue, Jul 5, 2022 at 10:06 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Arrow sync call July 20 at 12:00 US/Eastern, 16:00 UTC

2022-07-19 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2022-08-02 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call August 3 at 12:00 US/Eastern, 16:00 UTC

2022-08-06 Thread Ian Cook

With apologies for the delay.

Attendees:

Ian Cook
Raúl Cumplido
Dewey Dunnington
James Duong
Todd Farmer
Will Jones
David Li
Rok Mihevc
Ivan Ogasawara
Ashish Paliwal
Niranda Perera
Antoine Pitrou
Matt Topol
Jacob Wujciak


Discussion:

9.0.0 release
- Vote on RC2 has passed
- Post-release tasks are underway
- Please contribute to 9.0.0 release blog post [1]

Discussions about adding new memory layouts to Arrow [2]
- Work is underway to add a run-length encoded layout [3]
- There are plans to start work soon on a StringView layout [4]
- Several attendees discussed points that have been raised in the
discussions above and in comments in the linked documents—in
particular the challenges that the StringView layout creates for
garbage-collected languages
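
For readers unfamiliar with run-length encoding, the general idea can be
sketched in a few lines of Python. This is only an illustration: storing
each run as a cumulative end position plus a value is an assumption made
here for the example, and the actual layout is whatever the linked
proposal [3] specifies. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def rle_encode(values: Sequence[T]) -> Tuple[List[int], List[T]]:
    """Collapse consecutive equal values into (run_ends, run_values).

    run_ends[i] is the exclusive end index of run i in the original data.
    """
    run_ends: List[int] = []
    run_values: List[T] = []
    for index, value in enumerate(values):
        if run_values and run_values[-1] == value:
            run_ends[-1] = index + 1  # extend the current run
        else:
            run_ends.append(index + 1)  # start a new run
            run_values.append(value)
    return run_ends, run_values


def rle_decode(run_ends: Sequence[int], run_values: Sequence[T]) -> List[T]:
    """Expand (run_ends, run_values) back into the original sequence."""
    decoded: List[T] = []
    start = 0
    for end, value in zip(run_ends, run_values):
        decoded.extend([value] * (end - start))
        start = end
    return decoded


ends, vals = rle_encode(["a", "a", "b", "c", "c", "c"])
print(ends, vals)
print(rle_decode(ends, vals))
```

The encoding pays off when the data has long runs of repeated values; for
data with no repeats it stores one run per element and saves nothing.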

Flight SQL JDBC driver
- James and David discussed changes to the JDBC driver and the
timeline for moving this work out of the flight-sql-jdbc branch and
into the main branch of the apache/arrow repo [5]

Build issues with MSVC for Win32
- James has a PR up for this [6] that needs review
- David will review it

Discussion about the API of the Arrow Java library
- The Java library currently does not provide access to Acero
- The Java API provides a lower-level abstraction than PyArrow
- The Table object in other Arrow APIs does not map 1:1 onto any one
object in the Java API
- It would be nice to have more resources explaining these things

nanoarrow [7]
- Dewey described this Arrow subproject which is focused on making it
easier to create Arrow arrays in C code to export them to languages
that integrate with the Arrow C APIs
- Useful in cases where you need a lightweight way to build Arrow data
- Is intended to be vendored or statically linked, never used as a
shared library
- Is intended to be minimal but easy to extend


[1] https://github.com/apache/arrow-site/pull/227
[2] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[3] https://lists.apache.org/thread/djy8xn28p264vhj8y5rqbgkgwss6oyo1
[4] https://lists.apache.org/thread/dccj1qrozo88qsxx133kcy308qwfwpfm
[5] https://lists.apache.org/thread/om3yt3w6mngdlck3ghhc6m3k4848wfvk
[6] https://github.com/apache/arrow/pull/13532
[7] http://github.com/apache/arrow-nanoarrow


On Tue, Aug 2, 2022 at 6:15 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Arrow sync call August 17 at 12:00 US/Eastern, 16:00 UTC

2022-08-17 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Usage of the name Feather?

2022-08-29 Thread Ian Cook

+1 We should explicitly discourage further use of “Feather” to refer to
Arrow IPC files.

In this spirit of simplifying terminology: Does the “IPC” in the term
“Arrow IPC files” serve a truly necessary purpose? Is there another type of
“Arrow file” that the “IPC” serves to disambiguate? If not, can we simply
refer to these files as “Arrow files” in most places in the documentation
and website? (In a few important places we should clarify that when we say
“Arrow file” we are referring to a file that uses the Arrow IPC file
format.)
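
As an illustration of why the names matter, here is a rough sketch of
telling the two on-disk formats apart by their leading bytes rather than
by file extension. It assumes, from my understanding of the formats and
not from this thread, that Arrow IPC files begin with the magic string
"ARROW1" and that Feather V1 files begin with "FEA1". The function name
sniff_format is hypothetical.

```python
import io
from typing import BinaryIO

# Assumed magic strings; see the lead-in for the caveat.
ARROW_FILE_MAGIC = b"ARROW1"
FEATHER_V1_MAGIC = b"FEA1"


def sniff_format(stream: BinaryIO) -> str:
    """Guess the file format from the first few bytes of a binary stream."""
    head = stream.read(len(ARROW_FILE_MAGIC))
    if head == ARROW_FILE_MAGIC:
        # Covers both ".arrow" files and files written as "Feather V2".
        return "arrow-ipc-file"
    if head[: len(FEATHER_V1_MAGIC)] == FEATHER_V1_MAGIC:
        return "feather-v1"
    return "unknown"


print(sniff_format(io.BytesIO(b"ARROW1\x00\x00rest-of-file")))
print(sniff_format(io.BytesIO(b"FEA1rest-of-file")))
```

Under these assumptions a ".feather" file written as Feather V2 and a
".arrow" file are indistinguishable by content, which is the point of
calling both simply "Arrow files".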

Ian

On Mon, Aug 29, 2022 at 17:33 Sutou Kouhei  wrote:

> +1 for 1.
>
> Thanks,
> --
> kou
>
> In 
>   "Re: Usage of the name Feather?" on Mon, 29 Aug 2022 20:18:37 +0200,
>   Jorge Cardoso Leitão  wrote:
>
> > I agree.
> >
> > I suspect that the most widely used API with "feather" is Pandas'
> > read_feather.
> >
> >
> >
> > On Mon, 29 Aug 2022, 19:55 Weston Pace,  wrote:
> >
> >> I agree as well.  I think most lingering uses of the term "feather"
> >> are in pyarrow and R however, so it might be good to hear from some of
> >> those maintainers.
> >>
> >>
> >>
> >> On Mon, Aug 29, 2022 at 9:35 AM Antoine Pitrou wrote:
> >> >
> >> >
> >> > I agree with this as well.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >> >
> >> >
> >> > On Mon, 29 Aug 2022 11:29:45 -0400
> >> > Andrew Lamb wrote:
> >> > > In the Rust implementation we use the term "Arrow IPC" and I support your
> >> > > option 1:
> >> > >
> >> > > > The name Feather V2 is deprecated. Only the extension ".arrow" will be
> >> > > > used for IPC files.
> >> > >
> >> > > Andrew
> >> > >
> >> > > On Mon, Aug 29, 2022 at 11:21 AM Matthew Topol wrote:
> >> > >
> >> > > > When I wrote "In-Memory Analytics with Apache Arrow" I definitely
> >> > > > treated "Feather" as deprecated and mentioned it only in passing
> >> > > > specifically indicating "Arrow IPC" as the terminology to use. I only
> >> > > > even mentioned "Feather" at all because there are still methods in
> >> > > > pyarrow that reference it by name.
> >> > > >
> >> > > > That's just my opinion though...
> >> > > >
> >> > > > On Mon, Aug 29 2022 at 11:08:53 AM -0400, David Li wrote:
> >> > > > > This has come up before, e.g. see [1] [2] [3].
> >> > > > >
> >> > > > > I would say "Feather" is effectively deprecated and we are using
> >> > > > > "Arrow IPC" now but I am not sure what others think. (From that
> >> > > > > GitHub link, it seems to be mixed.) And ".arrow" is the official
> >> > > > > extension now (since it is registered as part of our MIME type). But
> >> > > > > there's existing documentation and not everything has been updated to
> >> > > > > be consistent (as you saw).
> >> > > > >
> >> > > > > [1]: <https://lists.apache.org/thread/0s6lgvd3g56ymd60vl5lgzhf4ro6hts5>
> >> > > > > [2]: <https://arrow.apache.org/faq/#what-about-the-feather-file-format>
> >> > > > > [3]: <https://stackoverflow.com/questions/67910612/arrow-ipc-vs-feather/67911190#67911190>
> >> > > > >
> >> > > > > -David
> >> > > > >
> >> > > > > On Mon, Aug 29, 2022, at 10:50, 島 達也 wrote:
> >> > > > >>  Hi all.
> >> > > > >>
> >> > > > >>  I know the documentation (mainly pyarrow documentation) sometimes refers
> >> > > > >>  to IPC files as Feather files, but are there any guidelines for when to
> >> > > > >>  refer to an IPC file as a Feather file and when to refer to it as an IPC
> >> > > > >>  file?
> >> > > > >>  I believe that calling the same file an Arrow IPC file at times and a
> >> > > > >>  Feather file at other times is confusing to those unfamiliar with Apache
> >> > > > >>  Arrow (myself included).
> >> > > > >>  Surprisingly, these files may even have completely different extensions,
> >> > > > >>  ".arrow" and ".feather", which are not similar.
> >> > > > >>
> >> > > > >>  Perhaps there are several options for future use of the name Feather,
> >> > > > >>  such as
> >> > > > >>
> >> > > > >>   1. The name Feather V2 is deprecated. Only the extension ".arrow" will
> >> > > > >>  be used for IPC files.
> >> > > > >>   2. In some contexts(?), IPC files are referred to as Feather; only
> >> > > > >>  ".arrow" is used for the IPC file extension to clearly distinguish
> >> > > > >>  it from Feather V1's ".feather".
> >> > > > >>   3. When an IPC file is called Feather by some rule, extension
> >> > > > >>  ".feather" is used, and when an IPC file is not called Feather,
> >> > > > >>  extension ".arrow" is used.
> >> > > > >>
> >> > > > >>  I mistakenly thought the current status was 2, but according to the
> >> > > > >>  discussion in this PR
> >> > > > >> (),
> >> > > > >>  apparently the current status seems 3. (However,

Arrow sync call August 31 at 12:00 US/Eastern, 16:00 UTC

2022-08-30 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call August 31 at 12:00 US/Eastern, 16:00 UTC

2022-08-31 Thread Ian Cook

Attendees:

- Ian Cook
- Raúl Cumplido
- Dewey Dunnington
- James Duong
- Ian Joiner
- Will Jones
- Jonathan Keane
- David Li
- Rok Mihevc
- Niranda Perera
- Antoine Pitrou
- Gavin Ray
- Kae Suarez
- Matt Topol
- Brent Gardner
- Dalton Modlin


Discussion:

Proposal to switch to C++ 17 as the baseline supported version [1]
- The vote has passed [2]
- Antoine is working to diagnose and solve test failures [3]

Proposed rules for registering canonical extension types [4]
- The vote has passed [5] with the change that names will start with
"arrow." not "org.apache.arrow."
- Next step is a PR with docs updates

Status of JDBC driver for Arrow Flight SQL [6]
- James and David are working to get tests to pass

Suggestion for Arrow to add a fraction type
- AKA rational numbers, like what Julia has [7]
- Enables exact ratios to be represented as integers divided by integers
- Ian Joiner intends to start a Jira or ML discussion about this
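
As a rough illustration of the idea only (this is not a proposed Arrow API or memory layout), Python's standard-library `fractions.Fraction` stores a value as an integer numerator and an integer denominator, so ratios stay exact where binary floating point does not:

```python
from fractions import Fraction

# A rational value is a pair of integers: numerator / denominator.
one_third = Fraction(1, 3)

# Arithmetic stays exact, and results are reduced to lowest terms.
total = one_third + one_third + one_third
print(total == 1)  # True

# The equivalent decimal sum in binary floating point picks up rounding error.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# The same sum expressed as integer ratios is exact.
exact_total = Fraction(1, 10) + Fraction(2, 10)
print(exact_total == Fraction(3, 10))  # True
```

How such a type would be laid out in Arrow (for example, as a pair of integer fields) is an open question for the planned discussion.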

Proposed support for float16 in Arrow and Parquet
- AKA half-precision
- There is a new discussion [8] and PR [9] proposing support in Parquet
- There is a new discussion about support in Arrow [10]
- There was some discussion about whether and how Arrow could use an
off-spec way of reading and writing float16 columns in Parquet files
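
For readers unfamiliar with half-precision, here is a minimal sketch using Python's standard `struct` module, whose `e` format code encodes IEEE 754 binary16 values. It only shows the 2-byte encoding and its limited precision; it says nothing about how Arrow or Parquet would store such columns, and the helper names are made up for this example.

```python
import struct


def to_float16_bytes(value: float) -> bytes:
    """Encode a Python float as 2 little-endian IEEE 754 binary16 bytes."""
    return struct.pack("<e", value)


def from_float16_bytes(raw: bytes) -> float:
    """Decode 2 little-endian binary16 bytes back to a Python float."""
    return struct.unpack("<e", raw)[0]


# 1.5 is exactly representable in half precision.
encoded = to_float16_bytes(1.5)
print(len(encoded))  # 2

# 0.1 is not exactly representable, so the round trip loses precision.
round_tripped = from_float16_bytes(to_float16_bytes(0.1))
print(round_tripped)
```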

Proposal for Flight SQL to support Substrait
- Discussion about this started a while ago [11] and there is a PR
open with discussion [12]
- David has opened a vote about this plus two other proposed
extensions to Flight SQL [13]

Arrow Database Connectivity (ADBC)
- Discussion about how we should handle ADBC in terms of the Arrow
specification; for example should the C header files [14] and Java
interfaces be made a part of the Arrow specification like we did with
the C data interface [15]
- Depending on this discussion, David will start a vote to decide
whether to do this
- For more context/background about ADBC, see the recent Voltron Data
blog post [16]


[1] https://lists.apache.org/thread/9g14n3odhj6kzsgjxr6k6d3q73hg2njr
[2] https://lists.apache.org/thread/dod96gbqtfz7pf096vhlczq6f5hv81z8
[3] https://github.com/apache/arrow/pull/13991
[4] https://lists.apache.org/thread/sxd5fhc42hb6svs79t3fd79gkqj83pfh
[5] https://lists.apache.org/thread/dzognw3o1ozvyn65d9gf8t2r3g8qc7sc
[6] https://github.com/apache/arrow/pull/13800
[7] 
https://docs.julialang.org/en/v1/manual/complex-and-rational-numbers/#Rational-Numbers
[8] https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5
[9] https://github.com/apache/parquet-format/pull/184
[10] https://issues.apache.org/jira/browse/ARROW-17464
[11] https://lists.apache.org/thread/ldtlrqzs4qmv7vm64j3g97q5fhzfo9hf
[12] https://github.com/apache/arrow/pull/13492
[13] https://lists.apache.org/thread/3k3np6314dwb0n7n1hrfwony5fcy7kzl
[14] https://github.com/apache/arrow-adbc/blob/main/adbc.h
[15] https://lists.apache.org/thread/rxxgdrfph95zn3clg76posjw2vchclb5
[16] 
https://voltrondata.com/news/simplifying-database-connectivity-with-arrow-flight-sql-and-adbc/


On Tue, Aug 30, 2022 at 4:38 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: [ANNOUNCE] New Arrow PMC member: Weston Pace

2022-09-05 Thread Ian Cook

Congratulations Weston!

On Mon, Sep 5, 2022 at 01:56 Sutou Kouhei  wrote:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Weston Pace to become a PMC member and we are pleased to announce
> that Weston Pace has accepted.
>
> Congratulations and welcome!
>

Arrow sync call September 14 at 12:00 US/Eastern, 16:00 UTC

2022-09-13 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Arrow sync call September 28 at 12:00 US/Eastern, 16:00 UTC

2022-09-28 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Arrow sync call October 12 at 12:00 US/Eastern, 16:00 UTC

2022-10-11 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

I expect this meeting might run shorter than usual because the
attendees from Voltron Data will need to leave to join another call at
12:30 pm Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call October 12 at 12:00 US/Eastern, 16:00 UTC

2022-10-13 Thread Ian Cook

Attendees:

- Vibhatha Abeykoon
- Anja Boskovic
- Ian Cook
- Will Jones
- David Li
- Rok Mihevc
- Percy Triveño Aucahuasi
- Joris Van den Bossche
- Jacob Wujciak


Discussion:

DELTA_BINARY_PACKED encoder for Parquet
- Rok looking for help debugging the PR [1]
- Tests are failing on some architectures but it is unclear why

Arrow 10.0.0 Release
- We were planning to cut an RC on Monday of next week, but there are
several blockers [2]
- We do not intend to open a vote on an RC until these are resolved;
the timeframe for getting them closed is unclear and so we will likely
have a small delay
- Jacob, Rok, Vibhatha, Percy, and Weston are actively working to
resolve some of these blockers (or determine that they are not true
blockers)
- Percy noted that the C++ library segmentation fault blocker [3]
seems to be happening because of flakiness similar to what he helped
with in another PR recently [4]
- The release blog post PR [5] has been up for two weeks; thanks to
everyone who already contributed; we are well ahead of where we were
with the blog post in the previous release

Crossbow Nightly Report
- Note that the hosting for the Crossbow Nightly Report website [6]
has been moved from GitHub Pages to an S3 bucket
- Note that the URL will no longer open with the https:// prefix; you
will need to use http://
- There should be no other visible changes

Build Breaking PRs
- As discussed in the previous biweekly call, there were some recent
unannounced changes to the build system that broke builds; see the email
from Anja for details [7]
- Anja added a Python developer docs entry explaining how to resolve
this by deleting stale build artifacts [8]
- We should mention this in the release blog post


[1] https://github.com/apache/arrow/pull/14191
[2] https://cwiki.apache.org/confluence/display/ARROW/Arrow+10.0.0+Release
[3] https://issues.apache.org/jira/browse/ARROW-17292
[4] https://github.com/apache/arrow/pull/14298
[5] https://github.com/apache/arrow-site/pull/242
[6] http://crossbow.voltrondata.com
[7] https://lists.apache.org/thread/o9hjqt6zfxxr3xhf3jg61csrkmp84z33
[8] 
https://arrow.apache.org/docs/dev/developers/python.html#deleting-stale-build-artifacts

On Tue, Oct 11, 2022 at 5:28 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> I expect this meeting might run shorter than usual because the
> attendees from Voltron Data will need to leave to join another call at
> 12:30 pm Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: [ANNOUNCE] New Arrow PMC member: Nicola Crane

2022-10-25 Thread Ian Cook

Congratulations Nic!!!

On Tue, Oct 25, 2022 at 17:06 Sutou Kouhei  wrote:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Nicola Crane to become a PMC member and we are pleased to announce
> that Nicola Crane has accepted.
>
> Congratulations and welcome!
>

Arrow sync call October 26 at 12:00 US/Eastern, 16:00 UTC

2022-10-26 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: [DISCUSS] Move issue tracking to

2022-10-26 Thread Ian Cook

Todd said:
>User identifiers differ between systems...

To help with this, I created a couple of small scripts [1] that
extract details from resolved Arrow Jira issues and the associated
merged GitHub PRs, then perform some cleaning, to generate a list [2]
of matching Jira user identifiers and GitHub user identifiers.

This is an incomplete list: it does not include users who have created
Arrow Jira issues but never created an Arrow PR that was merged.

Please do some sanity-checking on this list before using it.
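
For anyone who wants the general shape of this matching without opening the gist, here is a simplified sketch. It is not the code in [1]: the sample records, field names, and the assumption that a Jira issue's assignee is the author of the merged PR whose title starts with that issue key are all illustrative.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample records; real data would come from the Jira and GitHub APIs.
jira_issues = [
    {"key": "ARROW-1", "assignee": "jdoe"},
    {"key": "ARROW-2", "assignee": "jdoe"},
    {"key": "ARROW-3", "assignee": "asmith"},
]
merged_prs = [
    {"title": "ARROW-1: [C++] Fix a bug", "author": "jane-doe"},
    {"title": "ARROW-2: [Python] Add a feature", "author": "jane-doe"},
    {"title": "ARROW-3: [R] Update docs", "author": "a-smith"},
    {"title": "MINOR: [Docs] Fix typo", "author": "someone-else"},
]

# PR titles are assumed to start with the Jira issue key.
ISSUE_KEY = re.compile(r"^(ARROW-\d+)")


def map_users(issues, prs):
    """Return {jira_user: github_login}, choosing the most frequent pairing."""
    assignee_by_key = {i["key"]: i["assignee"] for i in issues if i["assignee"]}
    votes = defaultdict(Counter)
    for pr in prs:
        match = ISSUE_KEY.match(pr["title"])
        if not match:
            continue  # PRs without an issue key cannot be matched
        jira_user = assignee_by_key.get(match.group(1))
        if jira_user:
            votes[jira_user][pr["author"]] += 1
    return {user: counts.most_common(1)[0][0] for user, counts in votes.items()}


user_map = map_users(jira_issues, merged_prs)
print(user_map)
```

Taking the most frequent pairing is one possible cleaning step; it can still be wrong when someone else merged or authored the fix, which is why the list needs sanity-checking.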

[1] https://gist.github.com/ianmcook/0f1538ebc8268a88cd4e0a0a61445287
[2] 
https://docs.google.com/spreadsheets/d/1C8Pxmad93TJjIeN1lX9wkw1KORKocNSLYtUUPlxHmG0/

Ian

On Wed, Oct 26, 2022 at 8:12 AM Neal Richardson
 wrote:
>
> Thanks Todd! That's really helpful.
>
> Neal
>
> On Tue, Oct 25, 2022 at 7:47 PM Todd Farmer 
> wrote:
>
> > I'm beginning to track decisions needing to be made and problems needing to
> > be solved in the Arrow project GitHub issues, prefixing all such issues
> > with "MIGRATION: ..". They can be found here:
> >
> > https://github.com/apache/arrow/issues?q=is%3Aissue+is%3Aopen+MIGRATION
> >
> > I feel this will be a better forum to talk through specific challenges or
> > decisions if a future vote supports migration to GitHub issues.  Please
> > feel free to comment or to create an issue to track a decision/concern
> > that's not already defined.
> >
> > On Tue, Oct 25, 2022 at 6:37 AM Neal Richardson <
> > neal.p.richard...@gmail.com>
> > wrote:
> >
> > > I'll start a vote on this in the next day or so since it seems like we
> > have
> > > consensus on the main issue (moving from Jira to GitHub Issues) and are
> > > working out the finer points on how we'll migrate and how we'll map Jira
> > > concepts to Issues.
> > >
> > > Speaking of the migration: now would be an *excellent* time for folks to
> > go
> > > through the Jira backlog and close out some old, stale, invalid issues so
> > > that we don't have to import a bunch of stuff we don't need.
> > >
> > > Neal
> > >
> > >
> > > On Tue, Oct 25, 2022 at 5:33 AM Alessandro Molina
> > >  wrote:
> > >
> > > > On Tue, Oct 25, 2022 at 1:55 AM Joris Van den Bossche <
> > > > jorisvandenboss...@gmail.com> wrote:
> > > >
> > > > >
> > > > > I think the main thing we will miss are the Links (relation between
> > > > > issues), but we can try to promote some consistent usage of adding
> > > > > "Duplicate of #...", "Related to #..." in top post of an issue when
> > > > > appropriate.
> > > > >
> > > >
> > > > If we plan to migrate to GitHub, I think we should add an Issue
> > Template
> > > (
> > > >
> > > >
> > >
> > https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository
> > > > ) to make sure we don't proliferate too many ways to do the same thing
> > in
> > > > terms of categorizing issues properly
> > > >
> > >
> >

Re: [DISCUSS] Move issue tracking to

2022-10-26 Thread Ian Cook

I updated this user mapping list to include both the "key" and "name"
fields for each Jira account. These fields sometimes do not match, and
the "key" field sometimes contains unfriendly values like
JIRAUSER123456. To follow further discussion about this, please see
https://github.com/apache/arrow/issues/14510

Ian

On Wed, Oct 26, 2022 at 1:30 PM Ian Cook  wrote:
>
> Todd said:
> >User identifiers differ between systems...
>
> To help with this, I created a couple of small scripts [1] that
> extract details from resolved Arrow Jira issues and the associated
> merged GitHub PRs, then perform some cleaning, to generate a list [2]
> of matching Jira user identifiers and GitHub user identifiers.
>
> This is an incomplete list: it does not include users who have created
> Arrow Jira issues but never created an Arrow PR that was merged.
>
> Please do some sanity-checking on this list before using it.
>
> [1] https://gist.github.com/ianmcook/0f1538ebc8268a88cd4e0a0a61445287
> [2] 
> https://docs.google.com/spreadsheets/d/1C8Pxmad93TJjIeN1lX9wkw1KORKocNSLYtUUPlxHmG0/
>
> Ian
>
> On Wed, Oct 26, 2022 at 8:12 AM Neal Richardson
>  wrote:
> >
> > Thanks Todd! That's really helpful.
> >
> > Neal
> >
> > On Tue, Oct 25, 2022 at 7:47 PM Todd Farmer 
> > wrote:
> >
> > > I'm beginning to track decisions needing to be made and problems needing 
> > > to
> > > be solved in the Arrow project GitHub issues, prefixing all such issues
> > > with "MIGRATION: ..". They can be found here:
> > >
> > > https://github.com/apache/arrow/issues?q=is%3Aissue+is%3Aopen+MIGRATION
> > >
> > > I feel this will be a better forum to talk through specific challenges or
> > > decisions if a future vote supports migration to GitHub issues.  Please
> > > feel free to comment or to create an issue to track a decision/concern
> > > that's not already defined.
> > >
> > > On Tue, Oct 25, 2022 at 6:37 AM Neal Richardson <
> > > neal.p.richard...@gmail.com>
> > > wrote:
> > >
> > > > I'll start a vote on this in the next day or so since it seems like we
> > > have
> > > > consensus on the main issue (moving from Jira to GitHub Issues) and are
> > > > working out the finer points on how we'll migrate and how we'll map Jira
> > > > concepts to Issues.
> > > >
> > > > Speaking of the migration: now would be an *excellent* time for folks to
> > > go
> > > > through the Jira backlog and close out some old, stale, invalid issues 
> > > > so
> > > > that we don't have to import a bunch of stuff we don't need.
> > > >
> > > > Neal
> > > >
> > > >
> > > > On Tue, Oct 25, 2022 at 5:33 AM Alessandro Molina
> > > >  wrote:
> > > >
> > > > > On Tue, Oct 25, 2022 at 1:55 AM Joris Van den Bossche <
> > > > > jorisvandenboss...@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > I think the main thing we will miss are the Links (relation between
> > > > > > issues), but we can try to promote some consistent usage of adding
> > > > > > "Duplicate of #...", "Related to #..." in top post of an issue when
> > > > > > appropriate.
> > > > > >
> > > > >
> > > > > If we plan to migrate to GitHub, I think we should add an Issue
> > > Template
> > > > (
> > > > >
> > > > >
> > > >
> > > https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository
> > > > > ) to make sure we don't proliferate too many ways to do the same thing
> > > in
> > > > > terms of categorizing issues properly
> > > > >
> > > >
> > >

Re: [ANNOUNCE] New Arrow committer: Will Jones

2022-10-27 Thread Ian Cook

Congratulations Will!

On Thu, Oct 27, 2022 at 19:56 Sutou Kouhei  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Will Jones
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> kou
>

Arrow sync call November 9 at 12:00 US/Eastern, 17:00 UTC

2022-11-09 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call November 9 at 12:00 US/Eastern, 17:00 UTC

2022-11-09 Thread Ian Cook

Attendees:

- Ian Cook
- Raúl Cumplido
- Bin Deng
- Sean Gallagher
- Will Jones
- David Li
- Ashish Paliwal
- Niranda Perera
- Matthew Topol


Discussion:

10.0.1 release

- Primary reason for this patch release is that we do not have PyArrow
10.0.0 wheels for Python 3.11, because the Arrow 10.0.0 and Python
3.11 releases were very close in time.
- Our CI and nightly test matrix does not currently include
environments with Python 3.11. We want to add Python 3.11 to this matrix
and see that all tests are passing before we do the 10.0.1 release.
- In the meantime, users of Python 3.11 can use the latest nightly
builds which include Python 3.11 wheels. This was implemented in [1].
Installation instructions are at [2].
- We also want to include some fixes for other Arrow libraries
including Go and Java in the 10.0.1 release
- We do not yet have an estimated time for the release. The release
manager has not yet been determined.

Migrating issue reporting from Jira to GitHub

- As described at [3]: "ASF Infra has announced that, due to spam
account creation, it will no longer be possible for people to sign
themselves up for a Jira account to report issues as of November 6.
Instead, the PMC will have to request the creation of Jira accounts."
- Todd Farmer created a set of GitHub issues to capture outstanding
questions and tasks associated with the Jira --> GitHub Issues
migration [4]
- We will also need to modify the PR merge script, release automation
scripts, Archery and Crossbow code, etc. as a part of this migration.
These are not all captured in the GitHub issues above.
- The November 6 date has passed, but the ASF Jira account signup page
seems to still be available.
- If the migration to GitHub Issues is not expected to be complete
until weeks after Jira account signup has closed, what should we do in
the interim? Should we (a) ask new participants to use GitHub issues
for their reports, or (b) offer to create Jira accounts for new
participants? Input on this question is welcome.

[1] https://github.com/apache/arrow/pull/14499
[2] 
https://arrow.apache.org/docs/developers/python.html#installing-nightly-packages
[3] https://lists.apache.org/thread/l545m95xmf3w47oxwqxvg811or7b93tb
[4] https://github.com/apache/arrow/issues?q=is%3Aissue+is%3Aopen+MIGRATION

On Wed, Nov 9, 2022 at 9:01 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Arrow sync call November 23 at 12:00 US/Eastern, 17:00 UTC

2022-11-22 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call November 23 at 12:00 US/Eastern, 17:00 UTC

2022-11-24 Thread Ian Cook

Attendees:

- Percy T. Aucahuasi
- Ian Cook
- Raúl Cumplido
- James Duong
- Todd Farmer
- Alenka Frim
- Stephanie Hazlitt
- Ian Joiner
- David Li
- Rok Mihevc
- Matthew Topol
- Joris Van den Bossche
- Jacob Wujciak

Discussion:

Migration from Jira to GitHub issues

- ASF Infra has disabled creation of new Jira accounts [1]
- The Arrow PMC has voted to move issue tracking to GitHub Issues [2]
- There is ongoing work to migrate existing Jira issues to GitHub
Issues and to improve the user and developer experience with GitHub
Issues [3][4]
- There was some discussion about whether we should stop new Jira
issue creation now and begin to use GitHub Issues for all new issues
(even before the existing Jira issues are migrated)
- An alternative approach would be to have Arrow maintainers open a
Jira issue to represent each GitHub Issue (if there is a fix PR) until
the next release (11.0.0) at which point we can fully migrate off of
Jira. This might make the release process more straightforward but
would create extra work for maintainers in the interim.
- The general consensus on the call was that we should stop new Jira
issue creation now (pending a vote); Todd started a vote on the ML
after the call [5]
- This creates an immediate need to modify the PR merge script; Raúl
opened an issue for this after the call [6]; this also raises the
question of whether we still need the PR merge script or whether
committers can use the "Squash and merge" button in the GitHub web UI
instead
- There was a discussion about whether we should still require
contributors to open Issues before they open PRs; the general
consensus was that this will be unnecessary in many cases (because
GitHub PRs have all the same fields as GitHub Issues); however, changes
that require community discussion before implementation should still
be opened as Issues first
- The consensus was that we should put into practice a policy asking
people to create meaningful descriptions in their PRs; at the next
release when we are finalizing the migration we could consider
adopting a standard convention for PR messages, such as Conventional
Commits [7]; Jacob will send out an ML post about this
- Todd expects that the mechanism for importing Jira issues to GitHub
Issues should be mostly ready in about a week
- Communications and docs updates will be required to inform
contributors and maintainers of the changes; these are listed in [3]
- We will need to do a lot of communications and docs updates
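
To make the Conventional Commits idea [7] concrete, here is a minimal sketch of checking a PR title against the header shape `type(scope)!: description`. The regular expression is a simplified approximation written for this example, not an official validator, and the sample titles are invented.

```python
import re

# Simplified header pattern: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(title):
    """Return the parsed parts of a title, or None if it does not match."""
    match = HEADER.match(title)
    return match.groupdict() if match else None


print(parse_header("fix(parquet): handle empty row groups"))
print(parse_header("feat!: drop old API"))
print(parse_header("Fixed some stuff"))  # None
```

A check like this could run in CI on PR titles, but whether to adopt the convention at all is still to be discussed on the ML.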


Proposal for catalog support in Flight SQL

- James started a discussion on the ML about improving support for
catalogs in Flight SQL [8]
- There are open questions in the ML discussion about how this should
be implemented; additional comments are welcome


Flight SQL name

- Some people outside the core Arrow developer community have reported
confusion about the differences between Flight SQL and ADBC, despite
our explanations of the differences (for example: [9])
- Some people have also reported confusion about the Flight SQL name
because it supports Substrait, not only SQL
- There was some discussion about whether we might consider using
"ADBC" as an umbrella name encompassing the client API, client
driver, and wire protocol; there were some concerns about whether this
makes logical sense and about investments in the current "Flight SQL"
name; more ideas and discussion welcome


[1] https://infra.apache.org/blog/jira-public-signup-disabled.html
[2] https://lists.apache.org/thread/8pmlx3186b32hm36fkqxfj6vp2ltwkf7
[3] 
https://docs.google.com/document/d/1UaSJs-oyuq8QvlUPoQ9GeiwP19LK5ZzF_5-HLfHDCIg/
[4] https://github.com/apache/arrow/issues?q=is%3Aissue+is%3Aopen+MIGRATION
[5] https://lists.apache.org/thread/v9sjwx8mdg0bfssbrlqz7c0wxwc8dx49
[6] https://github.com/apache/arrow/issues/14720
[7] https://www.conventionalcommits.org/
[8] https://lists.apache.org/thread/fd6r1n7vt91sg2c7fr35wcrsqz6x4645
[9] 
https://voltrondata.com/resources/update/2022/08/25/simplifying-database-connectivity-with-arrow-flight-sql-and-adbc

On Tue, Nov 22, 2022 at 5:56 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: [ANNOUNCE] New Arrow committer: Raúl Cumplido

2022-12-06 Thread Ian Cook

Congratulations Raúl!

On Tue, Dec 6, 2022 at 10:43 AM Matt Topol  wrote:
>
> Congrats Raúl!!
>
> On Tue, Dec 6, 2022 at 9:53 AM Dewey Dunnington
>  wrote:
>
> > Congrats! Welcome!
> >
> > On Tue, Dec 6, 2022 at 10:35 AM Larry White  wrote:
> >
> > > Congrats, Raúl!
> > >
> > > On Tue, Dec 6, 2022 at 9:20 AM David Li  wrote:
> > >
> > > > Welcome Raúl!
> > > >
> > > > On Tue, Dec 6, 2022, at 08:41, Neal Richardson wrote:
> > > > > Congratulations!
> > > > >
> > > > >
> > > > >
> > > > >> On Dec 6, 2022, at 6:11 AM, vin jake  wrote:
> > > > >>
> > > > >> Congratulations Raúl !!
> > > > >>
> > > > >>> On Tue, Dec 6, 2022 at 7:11 PM Rok Mihevc 
> > > > wrote:
> > > > >>>
> > > > >>> Congrats Raul!!
> > > > >>>
> > > >  On Tue, Dec 6, 2022 at 12:04 PM Andrew Lamb  > >
> > > > wrote:
> > > > 
> > > >  Congratulations Raúl
> > > > 
> > > >  On Tue, Dec 6, 2022 at 2:17 AM Vibhatha Abeykoon <
> > > vibha...@gmail.com>
> > > >  wrote:
> > > > 
> > > > > Congratulations Raul!!!
> > > > >
> > > > > On Tue, Dec 6, 2022 at 11:38 AM Alenka Frim <
> > > ale...@voltrondata.com
> > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > >> Congratulations Raul!! 🎉
> > > > >>
> > > > >> On Tue, Dec 6, 2022 at 6:54 AM Benson Muite <
> > > >  benson_mu...@emailplus.org>
> > > > >> wrote:
> > > > >>
> > > > >>> On 12/6/22 05:53, Sutou Kouhei wrote:
> > > >  On behalf of the Arrow PMC, I'm happy to announce that Raúl
> > > >  Cumplido has accepted an invitation to become a committer on
> > > >  Apache Arrow. Welcome, and thank you for your contributions!
> > > > 
> > > > >>> Congratulations Raúl
> > > > >>>
> > > > >>
> > > > > --
> > > > > Vibhatha Abeykoon
> > > > >
> > > > 
> > > > >>>
> > > >
> > >
> >

Arrow sync call December 7 at 12:00 US/Eastern, 17:00 UTC

2022-12-06 Thread Ian Cook

Hi all,

Our biweekly sync call is today at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call December 7 at 12:00 US/Eastern, 17:00 UTC

2022-12-08 Thread Ian Cook

Attendees:

- Ian Cook
- Raúl Cumplido
- Dewey Dunnington
- Ian Joiner
- Will Jones
- David Li
- Rok Mihevc
- Antoine Pitrou
- Matt Topol
- Joris Van den Bossche


Discussion:

Maintenance policy

There is a discussion [1] on the mailing list about whether we should
define a maintenance policy.
- The central question is whether we should commit to doing patch
releases for older major versions of the Arrow implementations in the
monorepo to fix security vulnerabilities and perhaps other critical
issues such as bugs that silently produce incorrect data.
- The need this would address is that some Arrow implementation users
do not want to update to each new major version quarterly (which
carries the chance of breaking changes each time) and also do not want
to accept the risk that a security vulnerability or other critical
issue could at any time force them to do an unplanned major version
update.
- There are several interrelated factors that complicate this:
  - We do not have a rigorous process in place for designating changes
as breaking or non-breaking.
  - We nominally follow Semantic Versioning [2] (MAJOR.MINOR.PATCH)
but in practice we simply assume for convenience that each regular
quarterly release contains breaking changes, so we simply increment
the major version in each regular quarterly release. In practice we
never do minor releases . Beginning to do minor releases and reducing
the frequency of major releases could be an alternative or
complementary way of addressing the same need.
  - In practice we depend on the major version updates to enforce
interoperability between some of the implementations in the monorepo,
and it is unclear how we will ensure this synchronization if we start
doing minor versions.
- Further discussion is required
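
For reference, the Semantic Versioning rule [2] that we nominally follow can be sketched as below. The `bump` function and its change labels ("breaking", "feature", "fix") are invented for this example and are not part of any Arrow release tooling.

```python
def bump(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        # Incompatible API change: increment MAJOR, reset MINOR and PATCH.
        return f"{major + 1}.0.0"
    if change == "feature":
        # Backward-compatible functionality: increment MINOR, reset PATCH.
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        # Backward-compatible bug fix: increment PATCH.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump("10.0.0", "fix"))       # 10.0.1
print(bump("10.0.1", "feature"))   # 10.1.0
print(bump("10.1.0", "breaking"))  # 11.0.0
```

Our current practice amounts to treating every quarterly release as "breaking", so the "feature" branch is never exercised. Using it would require the process for classifying changes that we currently lack.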


ADBC 0.1 Release

- David has been implementing a release process for ADBC with help
from Jacob and Kou [3]
- David intends to ship this release before end of year
- We plan for PyArrow 11.0.0 to include Flight SQL support


Migrating issue reporting from Jira to GitHub

- There is a mailing list thread about the current state [4]
- Joris started a discussion there about whether we should track
website-related issues in the GitHub issues of arrow-site repo
- There is a way to provide non-committers with permission to triage
issues (assign, edit, close) as documented at [5] but this is limited
to 20 user accounts
- The release scripts still need to be updated. We plan to update them
in time for the 11.0.0 release. If there is a 10.0.2 patch release, we
might decide as a one-time measure to create a Jira issue representing
each GitHub issue that is resolved in the patch release so that we can
use the existing release script.


[1] https://lists.apache.org/thread/7mgwr02h1f7zvghym1kljyb32s50vx1o
[2] https://semver.org
[3] https://github.com/apache/arrow-adbc/pull/174
[4] https://lists.apache.org/thread/nkmbvhrydm9ftf2zh8ydvsthdr3cdxxq
[5] 
https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-AssigningexternalcollaboratorswiththetriageroleonGitHub



On Wed, Dec 7, 2022 at 12:09 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is today at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: [DISCUSS] Maintenance policy

2022-12-08 Thread Ian Cook

This topic was discussed in the Arrow sync call this week. See the
notes from that call here:
https://lists.apache.org/thread/gbywpzbvpfydq24m1c0w6jgybnsrf9xm

Ian

On Wed, Nov 23, 2022 at 7:36 AM Benson Muite  wrote:
>
> On 10/19/22 20:47, Will Jones wrote:
> > One particular type of defect we might want to consider backporting to
> > supported versions are ones that silently produce incorrect data. Unlike
> > ones that cause a crash, it's not easy for a user to know they are affected.
> >
> > Here are a few examples:
> >
> >   * ARROW-17453: [Go][C++][Parquet] Inconsistent Data with Repetition Levels
> > [1] (fixed in 10.0.0)
> >   * ARROW-17995: [C++] Fix json decimals not being rescaled based on the
> > explicit schema [2] (fixed in 10.0.0)
> >   * ARROW-14523: [C++] Fix potential data loss in S3 multipart upload [3]
> > (fixed in 7.0.0)
> >
> > Also, I know we have high release costs for new versions, but is that also
> > true for backporting fixes? Unlike new releases, if we were creating a
> > bugfix release, we are presumably starting from a much more stable point,
> > right?
> >
> > Thanks,
> >
> > Will Jones
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-17453
> > [2] https://issues.apache.org/jira/browse/ARROW-17995
> > [3] https://issues.apache.org/jira/browse/ARROW-14523
> >
> > On Wed, Oct 19, 2022 at 9:32 AM Todd Farmer 
> > wrote:
> >
> >> Hi,
> >>
> >> I've been thinking a lot about maintenance and lifecycle policies and
> >> defect classification recently - I'm very grateful this is being raised. I
> >> believe establishing such policies will prove instrumental to enable
> >> adoption of Arrow for a number of use cases that prioritize stability over
> >> innovation.
> >>
> >> On Wed, Oct 19, 2022 at 5:42 AM Antoine Pitrou  wrote:
> >>
> >>>
> >>> Hi Kou,
> >>>
> >>> Le 19/10/2022 à 06:29, Sutou Kouhei a écrit :
> 
>  My proposal: We maintain the last major release:
>  * We maintain 9.Y.Z when the latest major release is 9.0.0
>  * We may release 9.Y.Z when we find a problem such as a
>  security vulnerability in 9.Y.Z
>  * We drop support for 9.Y.Z when we release 10.0.0
> >>>
> >>> That sounds ok to me, but is there a more precise criterion than "we
> >>> find a problem"?
> For most users, backwards compatibility and supported platforms are
> likely more important than the version number.  If there are many
> breaking API changes, this increases the cost of using Arrow, so
> supporting easy continuous use of Arrow should be the goal.
> >>>
> >>> In the past, we have from time to time done maintenance releases based
> >>> on annoying bugs/regressions. But not always.
> >>>
> >>
> >> I very much agree, and actually think there are multiple questions to
> >> answer here:
> >>
> >> 1. Which class of defects should be allowed to be merged into a maintenance
> >> branch?
> >> 2. Which class of defects must be fixed in a supported maintenance branch?
> >> 3. Which class of defects should trigger a maintenance release once a fix
> >> is made to the branch?
> >> 4. Which versions should be targeted in backporting a defect fix?  How long
> >> will a release receive maintenance support?
> >> 5. Which class of defects can be batched into a future maintenance release,
> >> and which need immediate release?
> >> 6. What delivery artifacts are needed for maintenance releases? Can some
> >> things be source-only?
> >>
> >> Today, any fix may be a candidate for backporting to a maintenance branch
> >> if there's support for doing so in a vote. I believe it might be useful to
> >> more formally triage defects in part to establish policy answering these
> >> questions. For example:
> >>
> >> * How severe is the defect?  Does it produce wrong results? Cause crashes?
> >> Or is it an annoying spelling error in a log message?
> >> * How widespread is the impact? Is everybody who uses Arrow going to be
> >> affected by this? Or is it only triggered by some very obscure use case?
> >> * How accessible is any workaround?
> >> * How much risk is involved in a fix?
> >>
> >> Having a common framework to classify those elements above would enable
> >> policy that clearly defines which defects can (or should, eventually) get
> >> what attention.
> >>
> >> If there is interest in the community, I'll continue a draft proposal I'm
> >> working on to formalize triage to capture these aspects. Any such triage
> >> process would be entirely optional for work done against master/main, but
> >> could be required for assessing potential backports as needed.
> >>
> >> I'll also note that I recognize Arrow may not currently see a need to
> >> answer all the questions about maintenance/lifecycle policy today, or may
> >> not have the resources needed to deliver what may be desired. It takes a
> >> lot of work to generate a release today. I think it's completely
> >> appropriate to commit only to what can be delivered today, with an eye
> >> towards incremental improvement. For example, an en

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-14 Thread Ian Cook

Thank you Matt, Tobias, and others for the great work on this.

I am -0.5 on this proposal in its current form because (pardon the
pedantry) what we have implemented here is not run-length encoding; it
is run-end encoding. Based on community input, the choice was made to
store run ends instead of run lengths because this enables O(log(N))
random access as opposed to O(N). This is a sensible choice, but it
comes with some trade-offs including limitations in array length
(which may not really be a problem in practice) and lack of bit-for-bit
equivalence with RLE encodings that use run lengths like Velox's
SequenceVector encoding (which I think is a more serious problem in
practice).

I believe that we should either:
(a) rename this to "run-end encoding"
(b) change this to a parameterized type called "run encoding" that
takes a Boolean parameter specifying whether run lengths or run ends
are stored.
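
To make the trade-off concrete, here is a rough sketch in plain Python. It is not the Arrow C++ or Go implementation; the function names and the sample data are made up for this illustration. It shows the same logical array stored with run lengths and with run ends, and why only the run-end form supports binary search for random access.

```python
from bisect import bisect_right
from itertools import accumulate

# Logical array: a a a b b c c c c  (9 elements, 3 runs)
values = ["a", "b", "c"]
run_lengths = [3, 2, 4]                    # run-length encoding
run_ends = list(accumulate(run_lengths))   # run-end encoding: cumulative lengths


def get_from_run_lengths(values, run_lengths, i):
    """O(N) in the number of runs: walk the runs, summing lengths as we go."""
    if i < 0:
        raise IndexError(i)
    end = 0
    for value, length in zip(values, run_lengths):
        end += length
        if i < end:
            return value
    raise IndexError(i)


def get_from_run_ends(values, run_ends, i):
    """O(log N): binary search for the first run end greater than i."""
    if not run_ends or not 0 <= i < run_ends[-1]:
        raise IndexError(i)
    return values[bisect_right(run_ends, i)]


def decode(values, run_ends):
    """Expand a run-end encoded array back into a plain list (lossless)."""
    out, start = [], 0
    for value, end in zip(values, run_ends):
        out.extend([value] * (end - start))
        start = end
    return out
```

Because the last run end equals the logical length of the array, the integer width chosen for the run ends also bounds the array length, which is the length limitation mentioned above. Converting between the two forms is a cumulative sum in one direction and adjacent differences in the other, so the data is equivalent but the buffers are not bit-for-bit identical.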

Ian

On Wed, Dec 14, 2022 at 11:27 AM Matt Topol  wrote:
>
> Hello,
>
> I'd like to propose adding the RLE type based on earlier discussions[1][2]
> to the Arrow format:
> - Columnar Format description:
> https://github.com/apache/arrow/pull/1/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> - Flatbuffers changes:
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
>
> There is a proposed implementation available in both C++ (written by Tobias
> Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> implemented and were tested to be compatible over IPC with an archery test.
> In both cases, the implementations are split out among several Draft PRs so
> that they can be easily reviewed piecemeal if the vote is approved, with
> each Draft PR including the changes of the one before it. The links
> provided are the Draft PRs with the entirety of the changes included.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 add the proposed RLE type to the Apache Arrow format
> [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> because...
>
> Thanks much, and please let me know if any more information or links are
> needed (I've never proposed a vote before on here!)
>
> --Matt
>
> [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> [3] https://github.com/apache/arrow/pull/14179
> [4] https://github.com/apache/arrow/pull/14223

Re: [ANNOUNCE] New Arrow committer: Jacob Wujciak

2022-12-15 Thread Ian Cook

Congratulations, Jacob!

On Thu, Dec 15, 2022 at 6:56 PM Rok Mihevc  wrote:
>
> Congrats Jacob!!
>
> Rok
>
> On Fri, Dec 16, 2022 at 12:52 AM Vibhatha Abeykoon 
> wrote:
>
> > Congratulations Jacob!!!
> >
> > On Fri, Dec 16, 2022 at 5:09 AM Raúl Cumplido 
> > wrote:
> >
> > > Congratulations Jacob!
> > >
> > > On Fri, Dec 16, 2022 at 0:34, Weston Pace wrote:
> > >
> > > > Congratulations Jacob!
> > > >
> > > > On Thu, Dec 15, 2022 at 3:27 PM David Li  wrote:
> > > > >
> > > > > Congrats & welcome Jacob!
> > > > >
> > > > > On Thu, Dec 15, 2022, at 18:14, Nic Crane wrote:
> > > > > > On behalf of the Arrow PMC, I'm happy to announce that Jacob
> > Wujciak
> > > > has
> > > > > > accepted an invitation to become a committer on Apache Arrow.
> > > Welcome,
> > > > and
> > > > > > thank you for your contributions!
> > > > > >
> > > > > > Nic
> > > >
> > >
> > --
> > Vibhatha Abeykoon
> >

Re: [VOTE] Add RLE Arrays to Arrow Format

2022-12-19 Thread Ian Cook

@Matt Topol: Yes, a change of the name to "run-end encoding" changes
my (non-binding) vote to a +1.

On Mon, Dec 19, 2022 at 3:32 PM Matthew Topol
 wrote:
>
> Okay, slight edit to my previous email: It was brought to my attention that
> we need at least 3 +1 binding votes, so this vote is still open for the
> moment.
>
> @IanCook: With the change of the name to RunEndEncoding is that sufficient
> to change your vote to a +1?
>
> On Mon, Dec 19, 2022 at 12:57 PM Matt Topol  wrote:
>
> > That leaves us with a total vote of +1.5 so the vote carries with the
> > caveat of changing the name to be Run End Encoded rather than Run Length
> > Encoded (unless this means I need to do a new vote with the changed name?
> > This is my first time doing one of these so please correct me if I need to
> > do a new vote!)
> >
> > Thanks everyone for your feedback and comments!
> >
> > I'm going to go update the Go and Format specific PRs to make them regular
> > PR's (instead of drafts) and get this all moving. Thanks in advance to
> > anyone who reviews the upcoming PRs!
> >
> > --Matt
> >
> > On Fri, Dec 16, 2022 at 8:24 PM Weston Pace  wrote:
> >
> > > +1
> > >
> > > I agree that run-end encoding makes more sense but also don't see it
> > > as a deal breaker.
> > >
> > > The most compelling counter-argument I've seen for new types is to
> > > avoid a schism where some implementations do not support the newer
> > > types.  However, for the type proposed here I think the risk is low
> > > because data can be losslessly converted to existing formats for
> > > compatibility with any system that doesn't support the type.
> > >
> > > Another argument I've seen is that we should introduce a more formal
> > > distinction between "layouts" and "types" (with dictionary and
> > > run-end-encoding being layouts).  However, this seems like an
> > > impractical change at this point.  In addition, given that we have
> > > dictionary as an array type the cat is already out of the bag.
> > > Furthermore, systems and implementations are still welcome to make
> > > this distinction themselves.  The spec only needs to specify what the
> > > buffer layouts should be.  If a particular library chooses to group
> > > those layouts into two different categories I think that would still
> > > be feasible.
> > >
> > > -Weston
> > >
> > > On Fri, Dec 16, 2022 at 1:42 PM Andrew Lamb 
> > wrote:
> > > >
> > > > +1 on the proposal as written
> > > >
> > > > I think it makes sense and offers exciting opportunities for faster
> > > > computation (especially for cases where parquet files can be decoded
> > > > directly into such an array and avoid unpacking. RLE encoded dictionary
> > > are
> > > > quite compelling)
> > > >
> > > > I would prefer to use the term Run-End-Encoding (which would also
> > follow
> > > > the naming of the internal fields) but I don't view that as a deal
> > > blocker.
> > > >
> > > > Thank you for all your work in this matter,
> > > > Andrew
> > > >
> > > > On Wed, Dec 14, 2022 at 5:08 PM Matt Topol 
> > > wrote:
> > > >
> > > > > I'm not at all opposed to renaming it as `Run-End-Encoding` if that
> > > would
> > > > > be preferable. Hopefully others will chime in with their feedback.
> > > > >
> > > > > --Matt
> > > > >
> > > > > On Wed, Dec 14, 2022 at 12:09 PM Ian Cook 
> > > wrote:
> > > > >
> > > > > > Thank you Matt, Tobias, and others for the great work on this.
> > > > > >
> > > > > > I am -0.5 on this proposal in its current form because (pardon the
> > > > > > pedantry) what we have implemented here is not run-length encoding;
> > > it
> > > > > > is run-end encoding. Based on community input, the choice was made
> > to
> > > > > > store run ends instead of run lengths because this enables
> > O(log(N))
> > > > > > random access as opposed to O(N). This is a sensible choice, but it
> > > > > > comes with some trade-offs including limitations in array length
> > > > > > > (which may not really be a problem in practice) and lack of
> >

Arrow sync call December 21 at 12:00 US/Eastern, 17:00 UTC

2022-12-20 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Arrow sync call January 4 at 12:00 US/Eastern, 17:00 UTC

2023-01-03 Thread Ian Cook

Hi all,

Our biweekly sync call is tomorrow at 12:00 noon Eastern time.

The Zoom meeting URL for this and other biweekly Arrow sync calls is:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09

Alternatively, enter this information into the Zoom website or app to
join the call:
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call January 4 at 12:00 US/Eastern, 17:00 UTC

2023-01-05 Thread Ian Cook

Attendees:

Ian Cook
Dewey Dunnington
Ian Joiner
Will Jones
David Li
Bryce Mecum
Rok Mihevc
Eduardo Ponce
Matthew Topol
Jacob Wujciak


Discussion:

ADBC 0.1.0 release vote

- The vote is open [1]
- David is looking for more review and votes from PMC members and others


Jira to GitHub migration

- Work is underway to migrate existing issues from Jira to GitHub
- Rok is making progress using the scripts started by Todd
- Rok intends to start a discussion on Zulip about this, then test the
process, then do the migration next week
- Rok is in communication with GitHub about whether we can retain
issue authors and comment authors by using GitHub Importer; using
“mannequin users” will apparently not work; Jacob is also looking into
this and will contact ASF Infra if needed


Proposal to move sync call meeting notes into a Google Doc

- Will proposed that we share notes from sync calls in a publicly
viewable Google Doc instead of in emails to the mailing list [2]
- There was a discussion about whether managing edit access to this
Google Doc would be difficult and whether we should consider
alternatives such as GitHub or Confluence, but the consensus seemed to
be that a Google Doc would be best
- Further discussion welcome; we tentatively plan to begin using a
Google Doc in the next meeting


Upcoming 11.0.0 release

- We are targeting January 16 for code freeze
- Current plan is for Raúl to serve as the release manager and Kou to
do the source signing and other tasks that require PMC membership
- Some details of the release will depend on whether the Jira to
GitHub migration is complete before the release


Future directions for the Arrow R package

- In the first Arrow R Package dev sync call in December, there was a
discussion about R package development priorities after the 11.0.0
release [3]
- Ideas included: making it easier to contribute without building the
Arrow C++ library from source; using Substrait to represent the query
plan that is passed to Acero for execution
- Ideas and engagement welcome in the #r-chat Zulip channel and in
future R package dev sync calls; the next one will be on January 12


Extending the columnar types specification

- As previously discussed [4][5][6][7], work is ongoing to propose and
implement new columnar memory layouts in the Arrow specification,
based on learnings from other projects such as DuckDB and Velox
- The vote to add run-end encoded (REE) arrays to the Arrow format has
passed [8]
- Ben Kietzman is doing work in a branch of the monorepo [9] to
implement a columnar type similar to Velox's StringView; Ben is in
discussion with some of the Velox maintainers about the advantages of
using offsets instead of pointers


[1] https://lists.apache.org/thread/vl9v32341xtmdy2x1n151gll4wgskboy
[2] https://lists.apache.org/thread/n4tm2nphoy1qgfbbll8174znkhtfpy3x
[3] https://docs.google.com/document/d/1nSIfJw8mfqtvScqvSVqmktpWff80pFmkqiZT7nTtiDo/
[4] https://lists.apache.org/thread/pb3v5p1yzw8y2qqyy224lmog9po39xzp
[5] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[6] https://lists.apache.org/thread/djy8xn28p264vhj8y5rqbgkgwss6oyo1
[7] https://lists.apache.org/thread/dccj1qrozo88qsxx133kcy308qwfwpfm
[8] https://lists.apache.org/thread/539scy67qom5t2fkkd1m6fvh5htvwo3s
[9] https://github.com/apache/arrow/tree/feature/format-string-view

On Tue, Jan 3, 2023 at 10:03 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly sync call is tomorrow at 12:00 noon Eastern time.
>
> The Zoom meeting URL for this and other biweekly Arrow sync calls is:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
>
> Alternatively, enter this information into the Zoom website or app to
> join the call:
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: Arrow sync call January 4 at 12:00 US/Eastern, 17:00 UTC

2023-01-06 Thread Ian Cook

> If a Google Doc is used, can it be configured to send out notifications of
the summary to the list?

Not as far as I know, but I think we can continue to send a copy of the
notes to the mailing list after each biweekly meeting, copied and pasted
from the Google Doc.

On Fri, Jan 6, 2023 at 21:40 Benson Muite 
wrote:

>
> > Proposal to move sync call meeting notes into a Google Doc
> >
> > - Will proposed that we share notes from sync calls in a publicly
> > viewable Google Doc instead of in emails to the mailing list [2]
> > - There was a discussion about whether managing edit access to this
> > Google Doc would be difficult and whether we should consider
> > alternatives such as GitHub or Confluence, but the consensus seemed to
> > be that a Google Doc would be best
> > - Further discussion welcome; we tentatively plan to begin using a
> > Google Doc in the next meeting
> If a Google Doc is used, can it be configured to send out notifications
> of the summary to the list?
> >
> >
> >
>

Arrow sync call January 18 at 17:00 UTC

2023-01-17 Thread Ian Cook

Hi all,

Our biweekly Arrow sync call is tomorrow at 17:00 UTC / 12:00 EST.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow sync call January 18 at 17:00 UTC

2023-01-20 Thread Ian Cook

Beginning this week, we are using Google Docs to capture the notes
from the biweekly Arrow sync call. The notes for this and future
instances of this call will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/

Going forward, I intend to refer to this biweekly meeting as the
"Arrow community meeting" which seems more welcoming and inclusive.
The title and heading of the Google Doc use that name.

A copy of the notes from this week's call is also included below in this email:

2023-01-18

Attendees:

- Anja Boskovic
- Rusty Conover
- Ian Cook
- Raúl Cumplido
- Ian Joiner
- Will Jones
- Sean Kelly
- David Li
- Bryce Mecum
- Rok Mihevc
- Matthew Topol
- Jacob Wujciak


Discussion:

11.0.0 release

- Code freeze was on Monday
- Release candidate was created today
- Packaging and validation is underway
- We encountered a few small problems but so far we have not needed to
create a new release candidate
- Help verifying the release is welcome once the RC is shared
  - We test on a large matrix of environments in our CI, but
verification in real-world environments is helpful

Jira to GitHub post-migration tasks (if any)

- The migration from Jira to GitHub issues is now complete
  - Thank you to Rok, Todd, and all others who contributed to this
- Anyone notice any problems?
  - There was a problem with selecting components, but Jacob fixed this
  - Some issues were removed from the 11.0.0 milestone but were not
assigned a new milestone. As a general rule, should we assign a new
milestone corresponding to the next major release when an issue is
bumped out of release scope?
- Probably not, because in the past, many of the issues that were
assigned Fixed In versions in Jira were very old and had just been
repeatedly bumped from one major release to the next
- It might be more meaningful to track coarser-grained
initiatives/projects for roadmap planning, rather than fine-grained
issues
  - We could use the Projects feature in GitHub for this
- Note that each GitHub Issue can only have one milestone associated
with it, so we cannot assign bugfixes to a maintenance release and to
the next major release like we often did in Jira
- Note that GitHub Issues can have multiple assignees, unlike in Jira
- We need to have a broader discussion about our maintenance policy and
versioning conventions
  - See the notes from our discussion about that in the December 7 sync call [1]
- Keeping "GH-" as the issue prefix for PRs?
  - Changing this would break the auto-linking that GitHub does
between issues and PRs, so we should probably keep it for now
  - Other repos like CPython also use the "GH-" prefix
- Appropriate use of "wip" tag in GitHub Issues (if any)?
  - This is one of GitHub's built-in issue tags
  - Usually, assigning the issue to yourself signals that you are
working on it, so using the "wip" tag does not seem necessary
  - Jacob and Raúl are looking at how to improve the developer
experience, for example with a bot that removes the assignee for stale
issues
- Do we need to disable the Jira bots?
  - Jacob will do this
- Can we delete the migration dry-run repos?
  - Rok will do this and ask Todd to do it
- If you were previously subscribed to receive notifications on Jira
issues, you will not automatically receive notifications from GitHub
  - Use the script provided by Rok to re-subscribe to notifications [2]
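
As an illustration of why a stable "GH-" prefix is convenient for tooling, here is a minimal sketch that extracts issue numbers from PR titles carrying that prefix. The regular expression, the function name, and the sample titles are assumptions made for this example; this is not the project's actual automation.

```python
import re

# Matches references such as "GH-12345"; the exact pattern is an
# assumption for this sketch, not a rule taken from the Arrow repository.
GH_REF = re.compile(r"\bGH-(\d+)\b")


def referenced_issues(text):
    """Return the issue numbers referenced with a "GH-" prefix, in order."""
    return [int(number) for number in GH_REF.findall(text)]


# Hypothetical PR titles, used only to exercise the function.
print(referenced_issues("GH-33925: [C++] Example title"))
print(referenced_issues("ARROW-17453: title using an old Jira key"))
```

A title that still uses a Jira-style key yields no matches, which is one way a bot could flag PRs that are not linked to a GitHub issue.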

Question about how best for contributors to communicate ideas before
they have a PR ready
- Open an issue
  - Appropriate for bugfixes, minor enhancements, etc.
- Send an email to the dev mailing list
  - Appropriate for larger-scale feature proposals, spec changes, etc.

[1] https://lists.apache.org/thread/gbywpzbvpfydq24m1c0w6jgybnsrf9xm
[2] https://github.com/rok/arrow-migration/blob/main/transfer_arrow_subscriptions.py


On Tue, Jan 17, 2023 at 9:35 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly Arrow sync call is tomorrow at 17:00 UTC / 12:00 EST.
>
> Zoom meeting URL:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Re: [Discuss] C++ query builder/execution API

2023-01-27 Thread Ian Cook

Hi Shoumyo,

This is exciting—thank you for the thoughtfulness you have put into
this proposal.

This topic of a C++ dataframe API for Arrow-native engine(s) has come
up in the past [3], but the bulk of the previous discussion about this
predated Substrait. With the Substrait project now quickly gaining
momentum, it seems an excellent time to revisit this topic and to
incorporate Substrait into it, as you have done.

I strongly believe that this work should happen in a repository that
is outside of the Arrow project. Many of the exciting developments in
Arrow-land these days are happening in the broader ecosystem around
Arrow. The proposed API could be used independently of Arrow libraries
(for example, it could be used with DuckDB). For projects like this, I
think our hope as Arrow maintainers is to "let a hundred flowers
bloom" around Arrow (all with excellent interoperability based on Arrow
standards) rather than centralizing the work inside Arrow
repositories. We can use resources including the "Powered by Arrow" [4]
and "Powered by Substrait" [5] pages to highlight the project.

Thank you,
Ian

[3] https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/
[4] https://arrow.apache.org/powered_by/
[5] https://substrait.io/community/powered_by/

On Wed, Jan 25, 2023 at 1:02 PM Shoumyo Chakravorti (BLOOMBERG/ 731
LEX)  wrote:
>
> Hi Arrow developers!
>
> This is my first time posting on this mailing list, so please let me know if 
> this post belongs elsewhere.
>
> I and a few colleagues plan to implement a C++ interface for building 
> read-only queries and executing those against Arrow Datasets through 
> Substrait consumers like Acero, DuckDB, and Velox. Since we hope to build 
> this out in the open, I have outlined the kind of interface that we intend to 
> build in this Google doc [0].
>
> I'm making this post for a few reasons:
>
> - To gauge whether the community feels like this work would be worth 
> pursuing as an open-source project
> - To receive feedback on the proposed interface and ensure that we would 
> be able to accommodate a wide variety of use-cases (please feel free to leave 
> comments directly on the doc)
> - To connect with developers who might be interested in collaborating on 
> this effort
>
> Relatedly, I would like to get the Arrow developers' thoughts on whether it 
> would make sense to pursue this work as an official Arrow project (e.g. in an 
> experimental repo) or if it would be better as a standalone project. I 
> understand that pursuing this as an Arrow project would have its downsides 
> (like increased review/maintenance burden) and risks confusing new users as 
> to what the official Arrow libraries aim to solve [1]. On the other hand, 
> making such an interface readily available alongside `libarrow` could 
> increase the adoption of Arrow among certain developers (e.g. in 
> finance/fintech). Regardless of your opinion, I'd love to hear your thoughts 
> on which approach makes more sense.
>
> Please feel free to reply here on the mailing list or leave comments on the 
> linked Google Doc!
>
> [0]: https://docs.google.com/document/d/1_ktKxtOFW1grD-VcbBNc0FaP4g5j7vSx9bO2ht59JFA
> [1]: https://www.datawill.io/posts/apache-arrow-2022-reflection/#who-is-libarrows-and-aceros-audience

Arrow community meeting February 1 at 17:00 UTC

2023-01-31 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 17:00 UTC / 12:00 EST.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Thanks,
Ian

Re: Arrow community meeting February 1 at 17:00 UTC

2023-02-01 Thread Ian Cook

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
A copy of the notes from this week's meeting is also included below:

2023-02-01

Attendees:

- Ian Cook
- Nic Crane
- Raúl Cumplido
- Dewey Dunnington
- Will Jones
- David Li
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Dane Pitkin
- Soumya Sanyal
- Matthew Topol
- Jacob Wujciak

Discussion:

Mailing list label/tag guidance for new contributors (Bryce Mecum)
- Should we use tags like “[DISCUSS]” and “[RFC]” in addition to the
language tags in the subject line of emails?
- There is currently no documentation of what practices we should use
to tag/label emails to the mailing lists, even for commonly used tags
- Other common mailing list conventions (like saying whether your vote
is binding or non-binding) are also not formally documented anywhere
- For some users, it is not immediately obvious that they should label
their emails with the language implementation
- The consensus seems to be that it is worth documenting this on the
Arrow Community page of the website [1]
- Bryce will open a PR

Should Rust ADBC libraries be in apache/arrow-adbc? (Will Jones)
- Should the Rust ADBC libraries be released per the Rust library
release schedule or the ADBC library release schedule?
- Considerations include: whether it will be used within the Rust
ecosystem (or as a standalone tool that uses Rust); which component it
should have tighter integration testing with; what is most convenient
for development

Known alternatives to Plasma [2] that we can point users to? (Will Jones)
- For context: Plasma was added to Arrow C++ by Ray developers, but
has no active maintainers any longer and is deprecated and planned for
removal in 12.0.0 [3]
- Plasma continues to exist as an internal utility in Ray [4]
- Weston Pace has been considering how we might solve some of the
problems that Plasma solves, but by building on existing Arrow
interfaces instead of taking a general-purpose approach like Plasma

Release 11.0.0 status (Raúl Cumplido)
- Arrow 11.0.0 has been released
- There are some post-release tasks still in progress, including
downstream packaging and distribution tasks
- Raúl will merge the blog post PR and make an announcement on the
mailing list soon

PR workflow automation (Raúl Cumplido)
- Raúl has proposed to implement some automation to improve the PRs
and issues workflows; feedback is welcome in the mailing list thread
[5]

Canonical TensorArray extension type [6] (Rok Mihevc)
- This would be the first canonical extension type since we adopted
the framework for that
- Looking for input from users/developers who are familiar with
working with tensor/multidimensional array data

nanoarrow release process (Dewey Dunnington)
- Dewey is hoping to do a 0.1 release candidate in the next couple of weeks

Jira to GitHub migration (Ian Cook)
- There was a discussion in the previous biweekly meeting about how
with GitHub Issues we cannot associate bug issues with two
milestones—one representing the next (possible/actual) maintenance
release and one representing the next major release—like we used to
with Jira; the newly proposed “backport candidate” label provides a solution
to this [7]
- The migration dry-run repos discussed in the previous meeting have
been deleted
- Some users have reported that Jira offered richer options for
filtering issues than GitHub does

Can we better promote this and other Arrow community meetings? (Ian Cook)
- Information about this meeting and the Arrow R developers meeting is
shared in biweekly emails to the Arrow dev mailing list
- The Arrow Rust community used to have a sync meeting but stopped
having regular dedicated meetings in 2021
- Do any other Arrow language sub-communities hold regular meetings?
- We could better promote these biweekly meetings, not just on the mailing lists
- Ian will open a PR to add information about these meetings to the
Arrow Community page of the website [1]

[1] https://arrow.apache.org/community/
[2] https://arrow.apache.org/docs/python/plasma.html
[3] https://lists.apache.org/thread/nw232k2lzmg9kcl8ts475m9ybl34j81p
[4] https://discuss.ray.io/t/plasma-store-apis/5421/6
[5] https://lists.apache.org/thread/1rhsd8ovy4bfr8hcdohn0vh65frw0ggk
[6] https://github.com/apache/arrow/pull/33925
[7] https://lists.apache.org/thread/38xsz3ycr6jghv6h0d4bsb2y0z093lkf




On Tue, Jan 31, 2023 at 11:00 PM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly Arrow community meeting is tomorrow at 17:00 UTC / 12:00 EST.
>
> Zoom meeting URL:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> Thanks,
> Ian

Arrow community meeting February 15 at 17:00 UTC

2023-02-15 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is today at 17:00 UTC / 12:00 EST.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend the meeting today, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting February 15 at 17:00 UTC

2023-02-20 Thread Ian Cook

Below is a summary of the notes from last week's meeting:

Attendees:

- Anja Boskovic
- Ian Cook
- Dewey Dunnington
- Ian Joiner
- Will Jones
- David Li
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Dane Pitkin
- Matthew Topol
- Jacob Wujciak


Discussion:

Fixed-shape tensor canonical ExtensionType proposal
- Alenka, Joris, and Rok are preparing a proposal and vote for a
canonical tensor extension type
- See the related mailing list discussion [1], specification and docs
PR [2], C++ implementation PR [3], and Python implementation PR [4]
- This is useful for applications including deep learning
- We hope to call a vote soon
- Please give input as soon as possible
- Members of the Hugging Face, Ray, and PyTorch communities have given
input and some of it was incorporated
- It would be good to have input from some other companies and project
communities including Lance, NumPy, Posit, MATLAB, DLPack,
CUDA/RAPIDS, Arrow Rust, Xarray, Julia, Fortran, TensorFlow, LinkedIn

Flight RPC/Flight SQL/ADBC proposals
- Several changes are proposed [5]
- Input welcome

nanoarrow release
- Dewey plans to propose a nanoarrow 0.1 release soon
- Dewey might need some help with signing keys
- There are no binary packages being distributed, just a source tarball

Improving the web of trust for signing
- Discussion about whether there are opportunities to improve this to
enable more committers to help with releases

[1] https://lists.apache.org/thread/oxvx0z0no2yyqsffzdc6nyjh6j4o6krs
[2] https://github.com/apache/arrow/pull/33925
[3] https://github.com/apache/arrow/pull/8510
[4] https://github.com/apache/arrow/pull/33948
[5] https://lists.apache.org/thread/247z3t06mf132nocngc1jkp3oqglz7jp

On Wed, Feb 15, 2023 at 8:11 AM Ian Cook  wrote:
>
> Hi all,
>
> Our biweekly Arrow community meeting is today at 17:00 UTC / 12:00 EST.
>
> Zoom meeting URL:
> https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
> Meeting ID: 876 4903 3008
> Passcode: 958092
>
> The notes for this and future instances of this meeting will be
> captured in this Google Doc:
> https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
> If you plan to attend the meeting today, you are welcome to edit the
> document to add the topics that you would like to discuss.
>
> Thanks,
> Ian

Arrow community meeting March 1 at 17:00 UTC

2023-02-28 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 17:00 UTC / 12:00 EST.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend the meeting tomorrow, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting March 1 at 17:00 UTC

2023-03-01 Thread Ian Cook

Below is a summary of the notes from this week's meeting

Attendees:

 - Ian Cook
 - Raúl Cumplido
 - Dewey Dunnington
 - Ian Joiner
 - Will Jones
 - David Li
 - Bryce Mecum
 - Rok Mihevc
 - Sri Nadukudy
 - Weston Pace
 - Dane Pitkin


Discussion:

Fixed Shape Tensor canonical ExtensionType proposal
 - It seems like we have converged to the final state at this point,
but we are waiting for a few conversations to conclude
 - Alenka called a vote [1] but this sparked some additional feedback
so she plans to give it a few more days then open a new vote next week


PR automation Workflow
 - Proposal discussed on mailing list [2] has been implemented
 - There are a few hiccups that Raúl is working out
 - Feedback welcome


Self-hosted arm64 runners [3]
 - Raúl has been working with ASF Infra and has set up a GitHub
integration to add self-hosted runners at the organization level,
which allows us to use them from multiple arrow repos in the apache
organization on GitHub
 - This will allow us to retire some Travis CI jobs, but Travis CI
will continue to be used for some Crossbow jobs, e.g. for s390x
(big-endian)


Initial nanoarrow release candidate [4]
 - We are looking for people to verify the RC


Default Parquet row group size change [5]
 - This is specific to the Arrow C++ implementation and its bindings
 - Before this change, the default row group size was 64 million rows;
this was based on a misunderstanding and is much too large
 - Weston has changed the default to 1 million rows
 - There was some discussion about whether this should be something
smaller e.g. 100K rows, but overall there were no objections
 - This change caused a regression in write performance,
which Weston is investigating [6]
 - Is it possible to set the row group size based on bytes instead of
rows? Not yet, but a recent change should enable this
[7]
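
As a rough illustration of why the default matters, the sketch below
counts how many row groups a file would get under the old and new
defaults. It takes the round figures from these notes (64 million and
1 million rows) as assumptions; the exact constants in Arrow C++ may
differ, and row_group_count is a hypothetical helper, not a library
function.

```python
import math

# Round figures taken from the meeting notes; treated here as assumptions.
OLD_DEFAULT_ROWS = 64_000_000
NEW_DEFAULT_ROWS = 1_000_000


def row_group_count(total_rows: int, max_rows_per_group: int) -> int:
    """Number of row groups needed when each holds at most max_rows_per_group rows."""
    if total_rows < 0 or max_rows_per_group <= 0:
        raise ValueError("total_rows must be >= 0 and max_rows_per_group > 0")
    return math.ceil(total_rows / max_rows_per_group)


if __name__ == "__main__":
    total = 10_000_000  # hypothetical table size, for illustration only
    print(row_group_count(total, OLD_DEFAULT_ROWS))  # 1
    print(row_group_count(total, NEW_DEFAULT_ROWS))  # 10
```

With the old default, a table of this hypothetical size lands in a
single row group, which leaves readers little room to skip or
parallelize at the row-group level.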


[1] https://lists.apache.org/thread/3cj0cr44hg3t2rn0kxly8td82yfob1nd
[2] https://lists.apache.org/thread/1rhsd8ovy4bfr8hcdohn0vh65frw0ggk
[3] https://lists.apache.org/thread/mskpqwpdq65t1wpj4f5klfq9217ljodw
[4] https://lists.apache.org/thread/slomdw52n9j7jq8zwl5v8cb4v8yfk9sj
[5] https://github.com/apache/arrow/pull/34281
[6] https://github.com/apache/arrow/issues/34374
[7] https://github.com/apache/arrow/pull/33897




Re: [ANNOUNCE] New Arrow PMC member: Will Jones

2023-03-13 Thread Ian Cook

Congratulations Will!

On Mon, Mar 13, 2023 at 1:58 PM Andrew Lamb  wrote:
>
> The Project Management Committee (PMC) for Apache Arrow has invited
> Will Jones to become a PMC member and we are pleased to announce
> that Will Jones has accepted.
>
> Congratulations and welcome!

Arrow community meeting March 15 at 16:00 UTC

2023-03-15 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is today at 16:00 UTC / 12:00
EDT. For attendees in countries that have not yet switched to Daylight
Saving Time, please note that the time is one hour earlier than usual
in your local time zone.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend the meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting March 15 at 16:00 UTC

2023-03-15 Thread Ian Cook

Below is a summary of the notes from this week's meeting

Attendees:

- Ian Cook
- Raúl Cumplido
- Alenka Frim
- Will Jones
- David Li
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Weston Pace
- Dane Pitkin
- Matthew Topol
- Joris Van den Bossche

Discussion:

Arrow 12.0.0 Release

- Planned date for code freeze April 11th.
- Plan is for Raúl and Kou to collaborate to manage the release
similar to the 11.0.0 release
- Raúl indicated that under this arrangement, the release process
itself has not been too burdensome, but it has been challenging to
complete all the post-release tasks in a timely manner. For example,
there have been delays updating the conan and vcpkg packages. We could
use help from community members willing to volunteer to perform some
of these post-release tasks
- Please tag any issues that should block the release with:
label:"Priority: Blocker" and milestone:"12.0.0"
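
For anyone who wants a link to the current blocker list, the sketch
below builds a GitHub issue search URL from the label and milestone
above. It assumes the same URL shape as the milestone search links
used elsewhere in these notes; issue_search_url is a hypothetical
helper name.

```python
from urllib.parse import quote_plus

ISSUES_URL = "https://github.com/apache/arrow/issues"


def issue_search_url(query: str) -> str:
    """Return a GitHub issue search URL for the given search query string."""
    # quote_plus encodes ':' as %3A, '"' as %22, and spaces as '+'.
    return f"{ISSUES_URL}?q={quote_plus(query)}"


if __name__ == "__main__":
    blockers = 'is:issue is:open label:"Priority: Blocker" milestone:"12.0.0"'
    print(issue_search_url(blockers))
```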


Issues we hope to resolve before the code freeze

- Acero ordered execution [1]
- Incorrect hash join results [2]
- Acero scanner work (if time permits)
- Performance regression from default Parquet row group size change [3]
- Slow Parquet reads [4]
- Fixed-shape tensor implementation PR [5]
- Performance regression in writing FixedSizeList to Parquet with
default row group size (if feasible) [6]
- PyArrow bindings to Acero (and removing custom Cython ExecPlan usage) [7]


Removing Plasma

- Plasma has been deprecated as of the 10.0.0 release, and its removal
is scheduled to happen in the 12.0.0 release [8] because its original
authors stopped contributing to the Arrow community and forked their
own code for internal use inside another project
- There is ongoing investigation into which alternatives we should
recommend to replace Plasma in different use cases
- The maintainers intend to proceed as per this plan
- Will Jones has sent an email to the dev@ and user@ lists to provide
a reminder about this and seek recommendations about alternatives [9]

[1] https://github.com/apache/arrow/issues/32991
[2] https://github.com/apache/arrow/issues/34474
[3] https://github.com/apache/arrow/issues/34374
[4] https://github.com/apache/arrow/issues/34319
[5] https://github.com/apache/arrow/pull/8510
[6] https://github.com/apache/arrow/issues/34510
[7] https://github.com/apache/arrow/pull/34401
[8] https://lists.apache.org/thread/nw232k2lzmg9kcl8ts475m9ybl34j81p
[9] https://lists.apache.org/thread/1mrx0qg8dflshc4k0fv7g5qm775yr282



Arrow community meeting March 29 at 16:00 UTC

2023-03-28 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

I expect that this meeting might run shorter than usual because all
the attendees from Voltron Data will need to leave to join another
meeting at 12:30 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting March 29 at 16:00 UTC

2023-04-01 Thread Ian Cook

Below is a summary of the notes from this week's meeting:

Attendees:

 - Ian Cook
 - Will Jones
 - David Li
 - Rok Mihevc
 - Sri Nadukudy
 - Dane Pitkin


Discussion:

Questions about versioning, packaging, releasing the Rust ADBC API [1]
 - ADBC drivers are packaged for their native language, and can also be
released via Conda and within Python wheels
 - If a Rust driver were created, it could be released along with a
Python wheel, and could be versioned and released with the Rust
library if desired
 - ADBC libraries and drivers share a version for the convenience of
the release process


Plasma has been removed from Arrow [2]
 - Further discussion about alternatives is welcome in [3]


Arrow 12.0.0 release
- Planned code freeze around April 10
- Plan is for Raúl and Kou to collaborate to manage the release


[1] https://github.com/apache/arrow-adbc/pull/478
[2] https://github.com/apache/arrow/pull/34718
[3] https://lists.apache.org/thread/lk277x3b9gjol42sjg27bst2ggm5s0j2


Arrow community meeting April 12 at 16:00 UTC

2023-04-11 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting April 12 at 16:00 UTC

2023-04-12 Thread Ian Cook

Below is a summary of the notes from today's meeting:

Attendees:

- Ian Cook
- Raúl Cumplido
- Xuwei Fu
- Will Jones
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Ashish Paliwal
- Dane Pitkin
- David Dali Susanibar Arce
- Matthew Topol
- Joris Van den Bossche
- Jacob Wujciak


Discussion:

12.0.0 release

- Code freeze is scheduled for later today, April 12
- There are many nightly failures currently on main; Raúl and Jacob
have opened several blocker issues and we might need to create more
- Discussion of several current issues that might affect the release
   - C# tests not finding Python
   - PyArrow tests slowness on Windows [1]
   - PyArrow wheels on Windows not uploading to Gemfury
- Important items to mention in release changelog, release blog, etc.
  - Drop support for Ubuntu 18.04 [2]
  - Acero refactor (splitting Acero out from core Arrow library) [3]
  - Fixed shape tensor extension type [4]
  - Run-end encoded layout [5]
  - Plasma removal [6] and suggested alternatives [7]
  - Reminder about Jira to GitHub move (which happened just before the
11.0.0 release)
  - Initial Swift implementation [8]
  - nanoarrow (not technically a part of this release, but worth
drawing attention to) [9]
  - Also see ASF board report


Parquet tickets are still tracked in the ASF Jira

- We have to maintain a lot of code in Archery, etc. to automate the
tracking of Parquet C++ issues which are still in Jira, even though
there are only a few Parquet issues in each release (4 for 12.0.0)
  - PARQUET-2201 Add stress test for RecordReader ReadRecords and
SkipRecords. (#14879)
  - PARQUET-2225 Allow reading dense with RecordReader (#17877)
  - PARQUET-2232 Add an api to ColumnChunkMetaData to indicate if the
column chunk uses a bloom filter (#33736)
  - PARQUET-2250 Expose column descriptor through RecordReader (#34318)
- Can we move the Parquet C++ issues from the ASF Jira to GitHub?
- Joris believes we can go ahead and do this; the Parquet Rust
implementation did something similar
- There are already some Parquet issues that were reported and
resolved in the Arrow monorepo in this release without ever being
opened as Parquet Jira issues [10]
- Check with Micah Kornfield, Fatemah Panah
- There was a related Parquet mailing list discussion about this in
February [11]


[1] https://github.com/apache/arrow/issues/35078
[2] https://github.com/apache/arrow/issues/33800
[3] https://lists.apache.org/thread/5h5g9k9lvbybzl8fnbg4fppxczm42g6r
[4] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor
[5] https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout
[6] https://github.com/apache/arrow/pull/34718
[7] https://lists.apache.org/thread/lk277x3b9gjol42sjg27bst2ggm5s0j2
[8] https://github.com/apache/arrow/issues/20484
[9] https://arrow.apache.org/blog/2023/03/07/nanoarrow-0.1.0-release/
[10] https://github.com/apache/arrow/issues?q=is%3Aissue+label%3A%22Component%3A+Parquet%22+is%3Aclosed
[11] https://lists.apache.org/thread/jf9wos3t6xxk6xdyx2dof1jlkbpkr56p



Arrow community meeting April 26 at 16:00 UTC

2023-04-25 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-04-26 Thread Ian Cook

+1 to what Weston and Joris suggested regarding the name. "ListView"
seems like the best name to use for this layout in Arrow.

My understanding is that the primary benefit of this ListView layout
over Arrow's existing List layouts [1] is that ListView allows for
buffer alignment [2] without padding, which makes vectorized
processing much more efficient. Is this understanding correct?
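
For readers unfamiliar with the two layouts, here is a minimal
pure-Python sketch of the structural difference, assuming the
offsets-plus-sizes representation discussed in this thread. It ignores
validity bitmaps and buffer alignment entirely, and the function names
are hypothetical rather than part of any Arrow library.

```python
from typing import List, Sequence


def decode_list(offsets: Sequence[int], values: Sequence[int]) -> List[List[int]]:
    """List layout: n + 1 non-decreasing offsets into a contiguous values buffer."""
    return [list(values[offsets[i]:offsets[i + 1]]) for i in range(len(offsets) - 1)]


def decode_list_view(
    offsets: Sequence[int], sizes: Sequence[int], values: Sequence[int]
) -> List[List[int]]:
    """ListView-style layout: one (offset, size) pair per element, in any order."""
    if len(offsets) != len(sizes):
        raise ValueError("offsets and sizes must have the same length")
    return [list(values[o:o + s]) for o, s in zip(offsets, sizes)]


if __name__ == "__main__":
    # Both encodings represent the logical value [[1, 2], [3], [4, 5, 6]].
    print(decode_list([0, 2, 3, 6], [1, 2, 3, 4, 5, 6]))
    # Here the child values were written out of logical order.
    print(decode_list_view([3, 5, 0], [2, 1, 3], [4, 5, 6, 1, 2, 3]))
```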

[1] https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
[2] https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding

Ian

On Wed, Apr 26, 2023 at 5:27 AM Joris Van den Bossche
 wrote:
>
> On Wed, 26 Apr 2023 at 02:37, Weston Pace  wrote:
> >
> > For context, there was some discussion on this back in [1].  At that time
> > this was called "sequence view" but I do not like that name.  However,
> > array-view array is a little confusing.  Given this is similar to list can
> > we go with list-view array?
>
> Yes, given that this is essentially an alternative representation of a
> logical "list" array, I would also prefer that we use the term "list"
> in the name for such a new type. The word "array" has a different
> meaning in context of our columnar specification.

Re: Arrow community meeting April 26 at 16:00 UTC

2023-04-27 Thread Ian Cook

Below is a summary of the notes from yesterday's meeting:

Attendees:

- Ian Cook
- Raúl Cumplido
- Xuwei Fu
- Will Jones
- Bryce Mecum
- Rok Mihevc
- Sri Nadukudy
- Matthew Topol


Discussion:

Arrow 12.0.0 release
- RC0 has been proposed [1]
- There were a lot of CI failures at the time of the code freeze so it
took longer than usual to resolve these and generate RC0; thanks to
everyone who helped
- There is one outstanding question regarding an issue with pandas
2.0.1 [2] and there is a fix that skips the failing test [3]
- It is unclear whether we should create a new RC that skips this
test, or whether it is sufficient to release the current RC since
pandas will fix the issue on their end
- There are a couple of other minor issues that we don’t think are blockers


Support for non-CPU memory in Arrow C data interface [4][5]
- We are seeking input that addresses the questions posed and gives
concrete recommendations


Questions about usage of the new fixed-shape tensor canonical extension type [6]
- Can it be written to a Parquet file and read back in? If so, what
Parquet logical and physical types does it use?
- Is it recommended for use with image data, or should we use byte
arrays instead?
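
As background for these questions, the sketch below shows the storage
model of the fixed-shape tensor type as described in the canonical
extension docs [6]: each tensor is stored as one fixed-size list whose
length is the product of the shape, with elements in row-major order
when no permutation is given. It is a pure-Python illustration only;
it does not answer the Parquet question, and the helper names and the
2x2 RGB example are hypothetical.

```python
import math
from typing import Sequence


def storage_list_size(shape: Sequence[int]) -> int:
    """Length of the fixed-size list that stores one tensor of this shape."""
    return math.prod(shape)


def flat_index(index: Sequence[int], shape: Sequence[int]) -> int:
    """Row-major position of a multi-dimensional index in the flat storage."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of dimensions")
    flat = 0
    for i, dim in zip(index, shape):
        if not 0 <= i < dim:
            raise IndexError("index out of bounds for shape")
        flat = flat * dim + i
    return flat


if __name__ == "__main__":
    shape = (2, 2, 3)  # hypothetical 2x2 image with 3 channels
    print(storage_list_size(shape))      # 12
    print(flat_index((1, 0, 2), shape))  # 8
```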


Status of proposed integration tests for C data interface [7]
- Has not yet been implemented


Suggested topics for next meeting
- Discuss priorities for Arrow 13.0.0 release


[1] https://lists.apache.org/thread/2cnl1nbr8kfcxxq9s9br9b6f4xpmsqz1
[2] https://github.com/pandas-dev/pandas/issues/52899
[3] https://github.com/apache/arrow/pull/35324
[4] https://github.com/apache/arrow/pull/34972
[5] https://lists.apache.org/thread/sntc3pp6msdvb94zhq2lvy70s1p6d1qg
[6] https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list
[7] https://lists.apache.org/thread/nr05xwls713xpsxkobpln2f6wsdntrky



Re: [WEBSITE] [DISCUSS] Arrow-Site blog post

2023-04-28 Thread Ian Cook

Hi Matt,

I reviewed it and left a few very minor comments. Looks great to me.

Do any PMC members wish to chime in? If not, it seems OK to give it 72
hours from the time of your email here and then merge it.

Thanks,
Ian


On Fri, Apr 28, 2023 at 11:41 AM Matt Topol  wrote:
>
> Hey All,
>
> Yevgeny Pats has contributed a blog post to the Arrow Site via PR [1]
> detailing his company's usage of Arrow for their type system. I've reviewed
> it and it looks good to me, but as I'm not a PMC member I didn't want to
> merge it and have it published without input from others first,
> along with potentially coordinating *when* we should merge it to publish.
>
> So I'm hoping I can get a few more eyes on this to give it a look over.
>
> Thanks all!
>
> --Matt
>
> [1]: https://github.com/apache/arrow-site/pull/348

Re: [DISCUSS][Gandiva] changes in bundled double-conversion

2023-05-01 Thread Ian Cook

Looking at PR #9816 which is the PR that introduced downstream changes
to our vendored copy of double-conversion, it appears that the changes
were quite small: two files modified, fewer than 10 lines of added
code, plus some comments [1]. If this is correct, then I think the
easiest path forward for everyone might be to port these small changes
to the updated vendored copy of double-conversion while we await
possible addition of these changes to upstream double-conversion.

Ian

[1] https://github.com/apache/arrow/pull/9816/files#diff-d1cc5b70a5e980626bb70ae604a050d3393ac25a717a5a4c8dc40e8b5caf4b05R97-R105

On Sun, Apr 30, 2023 at 9:27 PM Sutou Kouhei  wrote:
>
> Hi Gandiva developers,
>
> Could you reply to this?
>
> If no Gandiva developers reply to this, I'll remove these
> changes next week.
>
> Thanks,
> --
> kou
>
> In <20230420.171528.668386893930308045@clear-code.com>
>   "[DISCUSS][Gandiva] changes in bundled double-conversion" on Thu, 20 Apr 
> 2023 17:15:28 +0900 (JST),
>   Sutou Kouhei  wrote:
>
> > Hi Gandiva developers,
> >
> > We're updating bundled double-conversion:
> > https://github.com/apache/arrow/pull/34919
> >
> > I noticed that our bundled double-conversion has our changes
> > introduced by https://github.com/apache/arrow/pull/9816 .
> >
> > I want Gandiva developers to upstream these changes instead
> > of maintaining them in apache/arrow, so that they are easier to
> > maintain and the improvements are shared with everyone, as with
> > Apache Arrow itself.
> >
> > If no Gandiva developer joins this discussion, I want to
> > remove these changes.
> >
> > See also:
> > https://github.com/apache/arrow/pull/34919#issuecomment-1501420706
> >
> >
> > Thanks,
> > --
> > kou

Re: [DISCUSS][Gandiva] changes in bundled double-conversion

2023-05-01 Thread Ian Cook

Hi Kou,

Thank you. I think this is a reasonable approach.

I added a comment asking if the PR author can please update the PR by
porting the changes from PR #9816.

After that is done, it should be easier to create a PR to the upstream
double-conversion repo to propose these changes.

Thanks,
Ian

On Mon, May 1, 2023 at 5:24 PM Sutou Kouhei  wrote:
>
> Hi,
>
> > Looking at PR #9816 which is the PR that introduced downstream changes
> > to our vendored copy of double-conversion, it appears that the changes
> > were quite small: two files modified, fewer than 10 lines of added
> > code, plus some comments [1]. If this is correct
> > ...
> > [1] 
> > https://github.com/apache/arrow/pull/9816/files#diff-d1cc5b70a5e980626bb70ae604a050d3393ac25a717a5a4c8dc40e8b5caf4b05R97-R105
>
> Correct.
>
> > then I think the easiest path forward for everyone might
> > be to port these small changes to the updated vendored
> > copy of double-conversion while we await possible addition
> > of these changes to upstream double-conversion.
>
> I'm OK with maintaining our changes ONLY WHILE we're
> discussing our changes with upstream.
>
> Does anyone want to upstream our changes? It seems that our
> changes break compatibility, so I think that we need to
> explain our use case to upstream.
>
>
> Thanks,
> --
> kou

Re: [ANNOUNCE] New Arrow PMC member: Matt Topol

2023-05-03 Thread Ian Cook

Congratulations Matt!!!

On Wed, May 3, 2023 at 9:55 PM Yibo Cai  wrote:
>
> Congrats Matt!
>
> On 5/4/23 07:07, Krisztián Szűcs wrote:
> > Congrats Matt!
> >
> > On Wed, May 3, 2023 at 11:44 PM Rok Mihevc  wrote:
> >>
> >> Congrats Matt. Well deserved!
> >>
> >> Rok
> >>
> >> On Wed, May 3, 2023 at 11:03 PM David Li  wrote:
> >>
> >>> Congrats Matt!
> >>>
> >>> On Wed, May 3, 2023, at 16:06, Neal Richardson wrote:
> >>>> Congratulations!
> >>>>
> >>>> On Wed, May 3, 2023 at 1:58 PM Jacob Wujciak wrote:
> >>>>
> >>>>> Congratulations, well deserved!
> >>>>>
> >>>>> On Wed, May 3, 2023 at 7:48 PM Weston Pace wrote:
> >>>>>
> >>>>>> Congratulations!
> >>>>>>
> >>>>>> On Wed, May 3, 2023 at 10:47 AM Raúl Cumplido wrote:
> >>>>>>
> >>>>>>> Congratulations Matt!
> >>>>>>>
> >>>>>>> On Wed, May 3, 2023, 19:44, vin jake wrote:
> >>>>>>>
> >>>>>>>> Congratulations, Matt!
> >>>>>>>>
> >>>>>>>> On Thu, May 4, 2023 at 01:42, Felipe Oliveira Carvalho wrote:
> >>>>>>>>
> >>>>>>>>> Congratulations, Matt!
> >>>>>>>>>
> >>>>>>>>> On Wed, 3 May 2023 at 14:37 Andrew Lamb wrote:
> >>>>>>>>>
> >>>>>>>>>> The Project Management Committee (PMC) for Apache Arrow has
> >>>>>>>>>> invited Matt Topol (zeroshade) to become a PMC member and we
> >>>>>>>>>> are pleased to announce that Matt has accepted.
> >>>>>>>>>>
> >>>>>>>>>> Congratulations and welcome!

Arrow community meeting May 10 at 16:00 UTC

2023-05-09 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: Arrow community meeting May 10 at 16:00 UTC

2023-05-10 Thread Ian Cook

Below is a summary of the notes from today's meeting.

Attendees:

- Ian Cook
- Raúl Cumplido
- Dewey Dunnington
- Xuwei Fu
- Will Jones
- David Li
- Ashish Paliwal
- Dane Pitkin
- Matthew Topol
- Joris Van den Bossche


Discussion:

Arrow 12.0.0 release

- Release is complete
- Most post-release tasks are complete, except for vcpkg (should be
done soon) and conan (still TBD)
- We need the Parquet C++ issues to be tagged properly in Jira
  - These issues are: PARQUET-2201, PARQUET-2225, PARQUET-2232, PARQUET-2250
  - We need to:
- Tag them with fix version cpp-12.0.0
- Mark the cpp-12.0.0 version as closed
- Create a new cpp-13.0.0 version
- Xuwei will notify Gang Wu and Micah


Arrow 12.0.1 release?

- See issues tagged with the 12.0.1 milestone [1]
- In particular, see the performance regression report [2]
- This is related to [3]


Benchmark checking as part of the release process?

- The regression mentioned above was flagged by a performance
benchmark [4] but we didn’t take any action on it before the release
- Perhaps as a part of the release verification process we should make
it easier for reviewers to see a Conbench page directly comparing the
performance of the current release candidate with the previous release
  - This would be useful not just for detecting performance
regressions but also for seeing big areas of performance improvement
that we can mention in the release notes / blog post / etc.
  - Raúl will take some next steps on this
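
One way to picture the comparison described above is the sketch below,
which flags benchmarks whose runtime changed by more than a threshold
between the previous release and the release candidate. It does not
use the Conbench API; the function name, the 5% threshold, and the
benchmark names and timings in the usage example are all made up for
illustration.

```python
from typing import Dict, List, Tuple


def flag_changes(
    previous: Dict[str, float],
    candidate: Dict[str, float],
    threshold_pct: float = 5.0,
) -> List[Tuple[str, float]]:
    """Return (name, percent change in runtime) for benchmarks in both runs
    whose runtime moved by more than threshold_pct. Positive means slower."""
    flagged = []
    for name in sorted(previous.keys() & candidate.keys()):
        before, after = previous[name], candidate[name]
        if before <= 0:
            continue  # cannot compute a relative change
        pct = (after - before) / before * 100.0
        if abs(pct) > threshold_pct:
            flagged.append((name, pct))
    return flagged


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, not real measurements.
    previous = {"write_parquet": 2.0, "read_csv": 1.0, "hash_join": 4.0}
    candidate = {"write_parquet": 3.0, "read_csv": 1.02, "hash_join": 3.0}
    for name, pct in flag_changes(previous, candidate):
        print(f"{name}: {pct:+.1f}%")
```

Reporting both directions matches the point above: the same list
surfaces regressions to investigate and improvements worth mentioning
in the release notes.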


s390x CI migration from Travis

- ASF has already switched off all Travis jobs (about 3 months ago) so
this job has not been running
- Raúl is still working on this


[1] https://github.com/apache/arrow/issues?q=is%3Aopen+is%3Aissue+milestone%3A12.0.1
[2] https://github.com/apache/arrow/issues/35498
[3] https://github.com/apache/arrow/issues/33313
[4] https://conbench.ursa.dev/benchmark-results/2b587cc1079f4e3a97f542e6f11e883e/


Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-15 Thread Ian Cook

I think it would be easier for us all to weigh the costs and benefits
of adding this proposed ListView layout to the Arrow specification and
implementing it in the various Arrow libraries if we could all see
some benchmarks demonstrating the performance/efficiency benefits
compared to Arrow’s existing List layouts.

Based on the Velox paper [1] and from conversations with the Velox
developers, I would anticipate that these benchmarks will show that
ListView confers substantial performance/efficiency benefits on some
workloads. I suggest conferring with the Velox developers to identify
benchmark workloads that will best demonstrate the performance/efficiency
benefit of the ListView layout while representing common real-world
workloads.

Ian

[1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf

On Sat, May 13, 2023 at 3:09 PM Andrew Lamb  wrote:
>
> I agree that it is hard to see any compelling advantage of adopting
> ListView that would incentivize adding it to DataFusion.
>
> It also seems like the conversion requires changing only indexes (not the
> underlying data), so I would think it would be relatively inexpensive.
>
> On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
>  wrote:
>
> > Hi All,
> >
> > > if we added this, do we think many Arrow and query
> > > engine implementations (for example, DataFusion) will be eager to
> > > add full support for the type, including compute kernels? Or are
> > > they likely to just convert this type to ListArray at import
> > > boundaries?
> > I can't speak for query engines in general, but at least for arrow-rs
> > and by extension DataFusion, and based on my current understanding of
> > the use-cases I would be rather hesitant to add support to the kernels
> > for this array type, definitely instead favouring conversion at the
> > edges. We already have issues with the amount of code generation
> > resulting in binary bloat and long compile times, and I worry this would
> > worsen this situation whilst not really providing compelling advantages
> > for the vast majority of workloads that don't interact with Velox.
> > Whilst I can definitely see that the ListView representation is probably
> > a better way to represent variable length lists than what arrow settled
> > upon, I'm not yet convinced it is sufficiently better to incentivise
> > broad ecosystem adoption.
> >
> > Kind Regards,
> >
> > Raphael Taylor-Davies
> >
> > On 11/05/2023 21:20, Will Jones wrote:
> > > Hi Felipe,
> > >
> > > Thanks for the additional details.
> > >
> > >
> > >> Velox kernels benefit from being able to append data to the array from
> > >> different threads without care for strict ordering. Only the offsets
> > array
> > >> has to be written according to logical order but that is potentially a
> > much
> > >> smaller buffer than the values buffer.
> > >>
> > > It still seems to me like applications are still pretty niche, as I
> > suspect
> > > in most cases the benefits are outweighed by the costs. The benefit here
> > > seems pretty limited: if you are trying to split work between threads,
> > > usually you will have other levels such as array chunks to parallelize.
> > And
> > > if you have an incoming stream of row data, you'll want to append in
> > > predictable order to match the order of the other arrays. Am I missing
> > > something?
> > >
> > > And, IIUC, the cost of using ListView with out-of-order values over
> > > ListArray is you lose memory locality; the values of element 2 are no
> > > longer adjacent to the values of element 1. What do you think about that
> > > tradeoff?
> > >
> > > I don't mean to be difficult about this. I'm excited for both the REE and
> > > StringView arrays, but this one I'm not so sure about yet. I suppose
> > what I
> > > am trying to ask is, if we added this, do we think many Arrow and query
> > > engine implementations (for example, DataFusion) will be eager to add
> > full
> > > support for the type, including compute kernels? Or are they likely to
> > just
> > > convert this type to ListArray at import boundaries?
> > >
> > > Because if it turns out to be the latter, then we might as well ask Velox
> > > to export this type as ListArray and save the rest of the ecosystem some
> > > work.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Thu, May 11, 2023 at 12:32 PM Felipe Oliveira Carvalho <
> > > felipe...@gmail.com> wrote:
> > >
> > >> Initial reason for ListView arrays in Arrow is zero-copy compatibility
> > with
> > >> Velox which uses this format.
> > >>
> > >> Velox kernels benefit from being able to append data to the array from
> > >> different threads without care for strict ordering. Only the offsets
> > array
> > >> has to be written according to logical order but that is potentially a
> > much
> > >> smaller buffer than the values buffer.
> > >>
> > >> Acero kernels could take advantage of that in the future.
> > >>
> > >> In implementing ListViewArray/Type I was able to reuse some C++
> > templates
> > >> used fo

Re: [ANNOUNCE] New Arrow committer: Gang Wu

2023-05-15 Thread Ian Cook

Congratulations Gang!

On Mon, May 15, 2023 at 9:47 AM vin jake  wrote:
>
> Congrats Gang!
>
> On Mon, May 15, 2023 at 9:33 PM Sutou Kouhei  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Gang
> > Wu has accepted an invitation to become a committer on
> > Apache Arrow. Welcome, and thank you for your contributions!
> >
> > Thanks,
> > --
> > kou
> >

Re: [DISCUSS][Gandiva] changes in bundled double-conversion

2023-05-18 Thread Ian Cook

I upstreamed the changes to our vendored double-conversion in [1].
These changes are now released in double-conversion v3.3.0 [2]. We can
remove our patches when we upgrade to v3.3.0 [3].

Ian

[1] https://github.com/google/double-conversion/pull/195
[2] https://github.com/google/double-conversion/releases/tag/v3.3.0
[3] https://github.com/apache/arrow/issues/35669

On Mon, May 1, 2023 at 7:52 PM Sutou Kouhei  wrote:
>
> Hi Ian,
>
> Thanks for your action on the PR!
>
> --
> kou
>
> In 
>   "Re: [DISCUSS][Gandiva] changes in bundled double-conversion" on Mon, 1 May 
> 2023 19:01:31 -0400,
>   Ian Cook  wrote:
>
> > Hi Kou,
> >
> > Thank you. I think this is a reasonable approach.
> >
> > I added a comment asking if the PR author can please update the PR by
> > porting the changes from PR #9816.
> >
> > After that is done, it should be easier to create a PR to upstream
> > double-conversion repo to propose these changes.
> >
> > Thanks,
> > Ian
> >
> > On Mon, May 1, 2023 at 5:24 PM Sutou Kouhei  wrote:
> >>
> >> Hi,
> >>
> >> > Looking at PR #9816 which is the PR that introduced downstream changes
> >> > to our vendored copy of double-conversion, it appears that the changes
> >> > were quite small: two files modified, fewer than 10 lines of added
> >> > code, plus some comments [1]. If this is correct
> >> > ...
> >> > [1] 
> >> > https://github.com/apache/arrow/pull/9816/files#diff-d1cc5b70a5e980626bb70ae604a050d3393ac25a717a5a4c8dc40e8b5caf4b05R97-R105
> >>
> >> Correct.
> >>
> >> > then I think the easiest path forward for everyone might
> >> > be to port these small changes to the updated vendored
> >> > copy of double-conversion while we await possible addition
> >> > of these changes to upstream double-conversion.
> >>
> >> I'm OK with maintaining our changes ONLY WHILE we're
> >> discussing our changes with upstream.
> >>
> >> Does anyone want to upstream our changes? It seems that our
> > >> changes break compatibility. So I think that we need to
> >> explain our use-case to upstream.
> >>
> >>
> >> Thanks,
> >> --
> >> kou
> >>
> >> In 
> >>   "Re: [DISCUSS][Gandiva] changes in bundled double-conversion" on Mon, 1 
> >> May 2023 13:06:27 -0400,
> >>   Ian Cook  wrote:
> >>
> >> > Looking at PR #9816 which is the PR that introduced downstream changes
> >> > to our vendored copy of double-conversion, it appears that the changes
> >> > were quite small: two files modified, fewer than 10 lines of added
> >> > code, plus some comments [1]. If this is correct, then I think the
> >> > easiest path forward for everyone might be to port these small changes
> >> > to the updated vendored copy of double-conversion while we await
> >> > possible addition of these changes to upstream double-conversion.
> >> >
> >> > Ian
> >> >
> >> > [1] 
> >> > https://github.com/apache/arrow/pull/9816/files#diff-d1cc5b70a5e980626bb70ae604a050d3393ac25a717a5a4c8dc40e8b5caf4b05R97-R105
> >> >
> >> > On Sun, Apr 30, 2023 at 9:27 PM Sutou Kouhei  wrote:
> >> >>
> >> >> Hi Gandiva developers,
> >> >>
> > >> >> Could you reply to this?
> > >> >>
> > >> >> If no Gandiva developers reply to this, I'll remove these
> > >> >> changes next week.
> >> >>
> >> >> Thanks,
> >> >> --
> >> >> kou
> >> >>
> >> >> In <20230420.171528.668386893930308045@clear-code.com>
> >> >>   "[DISCUSS][Gandiva] changes in bundled double-conversion" on Thu, 20 
> >> >> Apr 2023 17:15:28 +0900 (JST),
> >> >>   Sutou Kouhei  wrote:
> >> >>
> >> >> > Hi Gandiva developers,
> >> >> >
> >> >> > We're updating bundled double-conversion:
> >> >> > https://github.com/apache/arrow/pull/34919
> >> >> >
> >> >> > I noticed that our bundled double-conversion has our changes
> >> >> > introduced by https://github.com/apache/arrow/pull/9816 .
> >> >> >
> > >> >> > I want Gandiva developers to upstream these changes instead
> > >> >> > of maintaining them in apache/arrow. That would make them easier
> > >> >> > to maintain and would share the improvements with everyone, as
> > >> >> > Apache Arrow does.
> > >> >> >
> > >> >> > If no Gandiva developer joins this discussion, I want to
> > >> >> > remove these changes.
> >> >> >
> >> >> > See also:
> >> >> > https://github.com/apache/arrow/pull/34919#issuecomment-1501420706
> >> >> >
> >> >> >
> >> >> > Thanks,
> >> >> > --
> >> >> > kou

Re: [DISCUSS] Interest in a 12.0.1 patch?

2023-05-18 Thread Ian Cook

There is also a major issue with the 12.0.0 R package. It has now been
fixed in the repo [2], and the package needs to be resubmitted to CRAN soon.
The R package developers are supportive of a 12.0.1 patch release
happening soon so that the resubmission of the R package to CRAN can
also include the fix for the performance regression you mention.

Ian

[2] https://github.com/apache/arrow/pull/35612

On Thu, May 18, 2023 at 1:04 PM Weston Pace  wrote:
>
> Regrettably, 12.0.0 had a significant performance regression (I'll take the
> blame for not thinking through all the use cases), most easily exposed when
> writing datasets from pandas / numpy data, which is being addressed in
> [1].  I believe this to be a fairly common use case and it may warrant a
> 12.0.1 patch.  Are there other issues that would need a patch?  Do we feel
> this issue is significant enough to justify the work?
>
> [1] https://github.com/apache/arrow/pull/35565

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-05-19 Thread Ian Cook

That's great, thanks Brent. If possible could you share a specific
example of the operation you are referring to so that we can better
reason about how the ListView layout would help in this case?

Any additional input from the community providing specifics of
real-world workloads that are expected to benefit from the ListView
layout would be much appreciated.
Ian

On Mon, May 15, 2023 at 5:12 PM Brent Gardner
 wrote:
>
> For what it's worth, my company is building a database using arrow(rs) as
> an in memory storage format, and this feature would be very helpful because
> it would allow us to bitmask out mvcc rows that have been deleted / have
> not yet been committed / have been rolled back, etc.
>
> - Brent
>
> On Mon, May 15, 2023, 06:55 Ian Cook  wrote:
>
> > I think it would be easier for us all to weigh the costs and benefits
> > of adding this proposed ListView layout to the Arrow specification and
> > implementing it in the various Arrow libraries if we could all see
> > some benchmarks demonstrating the performance/efficiency benefits
> > compared to Arrow’s existing List layouts.
> >
> > Based on the Velox paper [1] and from conversations with the Velox
> > developers, I would anticipate that these benchmarks will show that
> > ListView confers substantial performance/efficiency benefits on some
> > workloads. I suggest conferring with the Velox developers to identify
> > benchmark workloads that will best demonstrate the performance/efficiency
> > benefit of the ListView layout while representing common real-world
> > workloads.
> >
> > Ian
> >
> > [1] https://vldb.org/pvldb/vol15/p3372-pedreira.pdf
> >
> > On Sat, May 13, 2023 at 3:09 PM Andrew Lamb  wrote:
> > >
> > > I agree that it is hard to see any compelling advantage of adopting
> > > ListView that would incentivize adding it to DataFusion.
> > >
> > > It also seems like the conversion requires changing only indexes (not the
> > > underlying data) so it would likely be relatively inexpensive I would
> > think
> > >
> > > On Thu, May 11, 2023 at 4:51 PM Raphael Taylor-Davies
> > >  wrote:
> > >
> > > > Hi All,
> > > >
> > > > > if we added this, do we think many Arrow and query
> > > > > engine implementations (for example, DataFusion) will be eager to add
> > > > full
> > > > > support for the type, including compute kernels? Or are they likely
> > to
> > > > just
> > > > > convert this type to ListArray at import boundaries?
> > > > I can't speak for query engines in general, but at least for arrow-rs
> > > > and by extension DataFusion, and based on my current understanding of
> > > > the use-cases I would be rather hesitant to add support to the kernels
> > > > for this array type, definitely instead favouring conversion at the
> > > > edges. We already have issues with the amount of code generation
> > > > resulting in binary bloat and long compile times, and I worry this
> > would
> > > > worsen this situation whilst not really providing compelling advantages
> > > > for the vast majority of workloads that don't interact with Velox.
> > > > Whilst I can definitely see that the ListView representation is
> > probably
> > > > a better way to represent variable length lists than what arrow settled
> > > > upon, I'm not yet convinced it is sufficiently better to incentivise
> > > > broad ecosystem adoption.
> > > >
> > > > Kind Regards,
> > > >
> > > > Raphael Taylor-Davies
> > > >
> > > > On 11/05/2023 21:20, Will Jones wrote:
> > > > > Hi Felipe,
> > > > >
> > > > > Thanks for the additional details.
> > > > >
> > > > >
> > > > >> Velox kernels benefit from being able to append data to the array
> > from
> > > > >> different threads without care for strict ordering. Only the offsets
> > > > array
> > > > >> has to be written according to logical order but that is
> > potentially a
> > > > much
> > > > >> smaller buffer than the values buffer.
> > > > >>
> > > > > It still seems to me like applications are still pretty niche, as I
> > > > suspect
> > > > > in most cases the benefits are outweighed by the costs. The benefit
> > here
> > > > > seems pretty limited: if you are trying

Arrow community meeting May 24 at 16:00 UTC

2023-05-23 Thread Ian Cook

Hi all,

Our biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

The notes for this and future instances of this meeting will be
captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [VOTE][Format] Add experimental ArrowDeviceArray to C-Data API

2023-05-31 Thread Ian Cook

+1 (non-binding).

Thanks very much Matt for all the work you did here to solicit input from
other stakeholder communities.

On Mon, May 22, 2023 at 12:02 PM Matt Topol  wrote:

> Hello,
>
> Now that there's a rough consensus and a toy example POC[1], I would like
> to propose an official enhancement to the Arrow C-Data API specification as
> described in the PR[2]. The new ArrowDeviceArray/ArrowDeviceArrayStream
> structs would be considered "experimental" and the documentation would
> label them as such for the time being.
>
> Please comment, ask questions, and look at the PR and toy example POC as
> needed.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 Add this to the C-Data API
> [ ] +0
> [ ] -1 Do not add this to the C-Data API because...
>
> Thank you very much everyone!
> -- Matt
>
> [1]: https://github.com/zeroshade/arrow-non-cpu
> [2]: https://github.com/apache/arrow/pull/34972
>

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook

To clarify why we cannot simply propose adding ListView as a new
“canonical extension type”: The extension type mechanism in Arrow
depends on the underlying data being organized in an existing Arrow
layout—that way an implementation that does not support the extension
type can still handle the underlying data. But ListView is a wholly
new layout.
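
To make the layout difference concrete, here is a minimal pure-Python
sketch (plain lists stand in for Arrow buffers; this is not Arrow library
code). It assumes the layouts as described in this thread: the existing
List layout uses one offsets buffer with n + 1 entries, while ListView
locates each list with an (offset, size) pair, so values may sit out of
order in the values buffer. All function names are hypothetical.

```python
# Illustrative sketch only: plain Python lists stand in for Arrow buffers.

def list_get(offsets, values, i):
    """Element i of a List layout: offsets has n + 1 entries, and list i
    occupies values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]]


def listview_get(offsets, sizes, values, i):
    """Element i of a ListView layout: list i occupies
    values[offsets[i]:offsets[i] + sizes[i]], so lists may be stored
    out of order in the values buffer."""
    return values[offsets[i]:offsets[i] + sizes[i]]


def listview_to_list(offsets, sizes, values):
    """Convert ListView buffers to List buffers by copying each list's
    values into logical order (hypothetical helper)."""
    new_offsets, new_values = [0], []
    for off, size in zip(offsets, sizes):
        new_values.extend(values[off:off + size])
        new_offsets.append(len(new_values))
    return new_offsets, new_values


# Logical data: [[1, 2], [3], [4, 5, 6]]
list_offsets = [0, 2, 3, 6]
list_values = [1, 2, 3, 4, 5, 6]

# The same logical data as a ListView whose values were appended out of order.
view_offsets = [4, 3, 0]
view_sizes = [2, 1, 3]
view_values = [4, 5, 6, 3, 1, 2]

logical_from_list = [list_get(list_offsets, list_values, i) for i in range(3)]
logical_from_view = [
    listview_get(view_offsets, view_sizes, view_values, i) for i in range(3)
]
converted = listview_to_list(view_offsets, view_sizes, view_values)
```

Because a List layout requires values in logical order, the conversion in
this sketch copies values when the view is out of order. When the values
are already contiguous and in order, only the index buffers would need to
change.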

I strongly agree with Weston’s idea that it is a good time for Arrow
to introduce the notion of “canonical alternative layouts.”

Taken together, I think that canonical extension types and canonical
alternative layouts could serve as an “incubator” for proposed new
representations. For example, if a proposed canonical alternative
layout ends up being broadly adopted, then that will serve as a signal
that we should consider adding it as a primary layout in the core
spec.

It seems to me that most projects that are implementing Arrow today
are not aiming to provide complete coverage of Arrow; rather they are
adopting Arrow because of its role as a standard and they are
implementing only as much of the Arrow standard as they require to
achieve some goal. I believe that such projects are important Arrow
stakeholders, and I believe that this proposed notion of canonical
alternative layouts will serve them well and will create efficiencies
by standardizing implementations around a shared set of alternatives.

However, I think that the documentation for canonical alternative
layouts should strongly encourage implementers to default to using the
primary layouts defined in the core spec and only use alternative
layouts in cases where the primary layouts do not meet their needs.

On Sat, May 27, 2023 at 7:44 PM Micah Kornfield  wrote:
>
> This sounds reasonable to me but my main concern is, I'm not sure there is
> a great mechanism to enforce canonical layouts don't somehow become default
> (or the only implementation).
>
> Even for these new layouts, I think it might be worth rethinking binding a
> layout into the schema versus having a different concept of encoding (and
> changing some of the corresponding data structures).
>
>
> On Mon, May 22, 2023 at 10:37 AM Weston Pace  wrote:
>
> > Trying to settle on one option is a fruitless endeavor.  Each type has pros
> > and cons.  I would also predict that the largest existing usage of Arrow is
> > shuttling data from one system to another.  The newly proposed format
> > doesn't appear to have any significant advantage for that use case (if
> > anything, the existing format is arguably better as it is more compact).
> >
> > I am very biased towards historical precedent and avoiding breaking
> > changes.
> >
> > We have "canonical extension types", perhaps it is time for "canonical
> > alternative layouts".  We could define it as such:
> >
> >  * There are one or more primary layouts
> >    * Existing layouts are automatically considered primary layouts, even
> >      if they wouldn't have been primary layouts initially (e.g. large
> >      list)
> >  * A new layout, if it is semantically equivalent to another, is
> >    considered an alternative layout
> >  * An alternative layout still has the same requirements for adoption
> >    (two implementations and a vote)
> >    * An implementation should not feel pressured to rush and implement
> >      the new layout. It would be good if they contribute in the
> >      discussion and consider the layout and vote if they feel it would
> >      be an acceptable design.
> >  * We can define and vote and approve as many canonical alternative
> >    layouts as we want:
> >    * A canonical alternative layout should, at a minimum, have some
> >      reasonable justification, such as improved performance for
> >      algorithm X
> >  * Arrow implementations MUST support the primary layouts
> >  * An Arrow implementation MAY support a canonical alternative, however:
> >    * An Arrow implementation MUST first support the primary layout
> >    * An Arrow implementation MUST support conversion to/from the primary
> >      and canonical layout
> >    * An Arrow implementation's APIs MUST only provide data in the
> >      alternative layout if it is explicitly asked for (e.g. schema
> >      inference should prefer the primary layout).
> >  * We can still vote for new primary layouts (e.g. promoting a canonical
> >    alternative) but, in these votes we don't only consider the value
> >    (e.g. performance) of the layout but also the interoperability. In
> >    other words, a layout can only become a primary layout if there is
> >    significant evidence that most implementations plan to adopt it.
> >
> > This lets us evolve support for new layouts more naturally.  We can
> > generally assume that users will not, initially, be aware of these
> > alternative layouts.  However, everything will just work.  They may start
> > to see a performance penalty stemming from a lack of support for these
> > layouts.  If this performance penalty becomes significant then they will
> > discover it and become aware of the proble

Re: [DISCUSS][Format] Starting the draft implementation of the ArrayView array format

2023-06-06 Thread Ian Cook

Thanks Weston. That all sounds reasonable to me.

> with the caveat that the primary layout must be emitted if the user does
> not specifically request the alternative layout

This implies that each canonical alternative layout would codify a
primary layout as its "fallback." This seems reasonable but it opens
up some cans of worms, such as how two components communicating
through an Arrow interface would negotiate which layout is supported.
I suppose such details should be discussed in a separate thread, but I
raise this here just to point out that it implies an expansion in the
scope of what Arrow interfaces can do.
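
As a purely hypothetical sketch of what such a negotiation might look
like (nothing here is part of any Arrow interface; the layout names, the
FALLBACK table, and the function are invented for illustration): a
producer emits an alternative layout only when the consumer explicitly
lists it as supported, and otherwise falls back to the primary layout.

```python
# Hypothetical illustration; Arrow defines no such negotiation mechanism today.

# Each canonical alternative layout names the primary layout it falls back to.
FALLBACK = {"list_view": "list"}


def choose_layout(preferred, consumer_accepts):
    """Return the layout a producer should emit.

    preferred: the layout the producer holds internally.
    consumer_accepts: set of layouts the consumer explicitly asked for.
    """
    if preferred in FALLBACK and preferred not in consumer_accepts:
        # Alternative layout not explicitly requested: emit the primary one.
        return FALLBACK[preferred]
    return preferred
```

For example, choose_layout("list_view", set()) would select the primary
"list" layout, while choose_layout("list_view", {"list_view"}) would keep
the alternative. How the consumer's accepted set is communicated is
exactly the open question raised above.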

On Tue, Jun 6, 2023 at 6:17 PM Weston Pace  wrote:
>
> From Micah:
>
> > This sounds reasonable to me but my main concern is, I'm not sure there is
> > a great mechanism to enforce canonical layouts don't somehow become
> > default (or the only implementation).
>
> I'm not sure I understand.  Is the concern that an alternative layout is
> eventually used more and more by implementations until it is used more
> often than the primary layouts?  In that case I think that is ok and we can
> promote the alternative to a primary layout.
>
> Or is the concern that some applications will only support the alternative
> layouts and not the primary layout?  In that case I would argue the
> application is not "arrow compatible".  I don't know that we prevent or
> enforce this today either.  An author can always falsely claim they support
> Arrow even if they are using their own bespoke format.
>
> From Ian:
>
> > It seems to me that most projects that are implementing Arrow today
> > are not aiming to provide complete coverage of Arrow; rather they are
> > adopting Arrow because of its role as a standard and they are
> > implementing only as much of the Arrow standard as they require to
> > achieve some goal. I believe that such projects are important Arrow
> > stakeholders, and I believe that this proposed notion of canonical
> > alternative layouts will serve them well and will create efficiencies
> > by standardizing implementations around a shared set of alternatives.
> >
> > However I think that the documentation for canonical alternative
> > layouts should strongly encourage implementers to default to using the
> > primary layouts defined in the core spec and only use alternative
> > layouts in cases where the primary layouts do not meet their needs.
>
> I'd maybe take a slightly harsher stance.  I don't think an application
> needs to support all types.  For example, an Arrow-native string processing
> library might not worry about the integer types.  That would be fine.  I
> think it would still be fair to call it an "arrow compatible string
> processing library".
>
> However, an application must support primary layouts in addition to
> alternative layouts.  For example, a string processing library that expects
> all strings to be delivered as a single buffer sequence of null-terminated
> strings would not be "an arrow compatible string processing library" unless
> it also fully supported the standard (lengths + data) variable-sized list
> layout for strings defined at [1].
>
> In other words:
>
>  * Only receives and emits alternative layouts - not arrow compatible
>  * Only receives and emits primary layouts - arrow compatible
>  * Receives and emits both primary and alternative layouts - arrow
>    compatible†
>
> † - with the caveat that the primary layout must be emitted if the user
> does not specifically request the alternative layout.
>
> [1] https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
>
> On Tue, Jun 6, 2023 at 2:45 PM Ian Cook  wrote:
>
> > To clarify why we cannot simply propose adding ListView as a new
> > “canonical extension type”: The extension type mechanism in Arrow
> > depends on the underlying data being organized in an existing Arrow
> > layout—that way an implementation that does not support the extension
> > type can still handle the underlying data. But ListView is a wholly
> > new layout.
> >
> > I strongly agree with Weston’s idea that it is a good time for Arrow
> > to introduce the notion of “canonical alternative layouts.”
> >
> > Taken together, I think that canonical extension types and canonical
> > alternative layouts could serve as an “incubator” for proposed new
> > representations. For example, if a proposed canonical alternative
> > layout ends up being broadly adopted, then that will serve as a signal
> > that we should consider adding it as a primary layout in the core
>

Arrow community meeting June 7 at 16:00 UTC

2023-06-07 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [Parquet C++] Plan to bump default write version from 2.4 -> 2.6 (include nanoseconds LogicalType)

2023-06-15 Thread Ian Cook

It will still be possible to write files using Parquet 2.4 by
explicitly specifying the 2.4 version to the Parquet writer, correct?
If yes, that provides a simple workaround for users who encounter
compatibility issues.

However, we should take care to document this as a potentially breaking
change, and document the workaround in release notes, release blog,
etc.
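
To illustrate what is at stake, here is a small pure-Python sketch (not
the Parquet writer's actual code) of what coercing a nanosecond timestamp
to microsecond resolution means. Per the proposal quoted below, this
coercion is what currently happens by default when writing
timestamp("ns") data. The function name is hypothetical.

```python
def coerce_ns_to_us(ts_ns):
    """Split a nanosecond timestamp into whole microseconds plus the
    sub-microsecond remainder that a microsecond-resolution column
    cannot represent."""
    return divmod(ts_ns, 1_000)


us, lost_ns = coerce_ns_to_us(1_686_844_800_123_456_789)
```

In PyArrow, the explicit workaround discussed here would correspond to
passing version="2.4" to pyarrow.parquet.write_table.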

Ian

On Thu, Jun 15, 2023 at 12:25 PM Joris Van den Bossche
 wrote:
>
> Hi all,
>
> Bringing up https://github.com/apache/arrow/issues/35746 to the
> mailing list: this issue proposes to bump the default Parquet version
> we use for writing to Parquet files in the C++ library (and in the
> various bindings including pyarrow and R arrow) from the current
> default of "2.4" to "2.6".
>
> In practice, the only change is that the writer will, by default,
> write the Timestamp LogicalType with NANOS unit
> (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp)
> if your data uses timestamp("ns") (currently, such data gets coerced
> to microsecond resolution when writing to Parquet).
>
> In theory this could cause compatibility issues if the files you are
> writing need to be read by other Parquet implementations which don't
> yet support nanoseconds. But the Parquet format 2.6 was released in
> Sept 2018, and parquet-mr added support for it in 2018 as well.
>
> Unless there is pushback on this, we are currently planning to make
> this change for the upcoming Arrow 13.0.0 release.
>
> Best,
> Joris

Arrow community meeting June 21 at 16:00 UTC

2023-06-21 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [ANNOUNCE] New Arrow PMC member: Dewey Dunnington

2023-06-23 Thread Ian Cook

Congratulations Dewey!

On Fri, Jun 23, 2023 at 10:03 AM Matt Topol  wrote:
>
> Congrats Dewey!!
>
> On Fri, Jun 23, 2023, 9:35 AM Dane Pitkin 
> wrote:
>
> > Congrats Dewey!
> >
> > On Fri, Jun 23, 2023 at 9:15 AM Nic Crane  wrote:
> >
> > > Well-deserved Dewey, congratulations!
> > >
> > > On Fri, 23 Jun 2023 at 11:53, Vibhatha Abeykoon 
> > > wrote:
> > >
> > > > Congratulations Dewey!
> > > >
> > > > On Fri, Jun 23, 2023 at 4:16 PM Alenka Frim wrote:
> > > >
> > > > > Congratulations Dewey!! 🎉
> > > > >
> > > > > On Fri, Jun 23, 2023 at 12:10 PM Raúl Cumplido <raulcumpl...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Congratulations Dewey!
> > > > > >
> > > > > > On Fri, Jun 23, 2023, 11:55, Andrew Lamb wrote:
> > > > > >
> > > > > > > The Project Management Committee (PMC) for Apache Arrow has
> > invited
> > > > > > > Dewey Dunnington (paleolimbot) to become a PMC member and we are
> > > > > pleased
> > > > > > to
> > > > > > > announce
> > > > > > > that Dewey Dunnington has accepted.
> > > > > > >
> > > > > > > Congratulations and welcome!
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Ian Cook

Thanks Will for this proposal!

For anyone familiar with PyArrow, this idea has a clear intuitive
logic to it. It provides an expedient solution to the current lack of
a practical means for interchanging "unmaterialized dataframes"
between different Python libraries.

To elaborate on that: If you look at how people use the Arrow Dataset
API—which is implemented in the Arrow C++ library [1] and has bindings
not just for Python [2] but also for Java [3] and R [4]—you'll see
that Dataset is often used simply as a "virtual" variant of Table. It
is used in cases when the data is larger than memory or when it is
desirable to defer reading (materializing) the data into memory.

So we can think of a Table as a materialized dataframe and a Dataset
as an unmaterialized dataframe. That aspect of Dataset is I think what
makes it most attractive as a protocol for enabling interoperability:
it allows libraries to easily "speak Arrow" in cases where
materializing the full data in memory upfront is impossible or
undesirable.

The trouble is that Dataset was not designed to serve as a
general-purpose unmaterialized dataframe. For example, the PyArrow
Dataset constructor [5] exposes options for specifying a list of
source files and a partitioning scheme, which are irrelevant for many
of the applications that Will anticipates. And some work is needed to
reconcile the methods of the PyArrow Dataset object [6] with the
methods of the Table object. Some methods like filter() are exposed by
both and behave lazily on Datasets and eagerly on Tables, as a user
might expect. But many other Table methods are not implemented for
Dataset though they potentially could be, and it is unclear where we
should draw the line between adding methods to Dataset vs. encouraging
new scanner implementations to expose options controlling what lazy
operations should be performed as they see fit.
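
The lazy-versus-eager distinction described above can be sketched in
plain Python. The two classes below are illustrative stand-ins, not the
PyArrow Table or Dataset classes, and every name in the sketch is
hypothetical. The eager filter() runs at once; the lazy one only records
the predicate until to_table() is called.

```python
# Illustrative stand-ins; these are not the PyArrow Table or Dataset classes.

class EagerTable:
    """Materialized: rows are held in memory and filter() runs at once."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        return EagerTable(row for row in self.rows if predicate(row))


class LazyDataset:
    """Unmaterialized: filter() only records the predicate; rows are
    read from the source when to_table() is called."""

    def __init__(self, read_rows, predicates=()):
        self._read_rows = read_rows      # callable that returns an iterator
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        return LazyDataset(self._read_rows, self._predicates + (predicate,))

    def to_table(self):
        rows = self._read_rows()
        for predicate in self._predicates:
            rows = filter(predicate, rows)
        return EagerTable(rows)


reads = []


def read_rows():
    reads.append("read")   # record each time the source is actually read
    return iter([{"x": 1}, {"x": 5}, {"x": 9}])


lazy = LazyDataset(read_rows).filter(lambda row: row["x"] > 3)
reads_before = len(reads)          # no read has happened yet
table = lazy.to_table()            # the source is read here
```

The same filter() call therefore has a different cost profile on each
class, which is the behavior a user might expect from Table and Dataset.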

Will, I see that you've already addressed this issue to some extent in
your proposal. For example, you mention that we should initially
define this protocol to include only a minimal subset of the Dataset
API. I agree, but I think there are some loose ends we should be
careful to tie up. I strongly agree with the comments made by David,
Weston, and Dewey arguing that we should avoid any use of PyArrow
expressions in this API. Expressions are an implementation detail of
PyArrow, not a part of the Arrow standard. It would be much safer for
the initial version of this protocol to not define *any*
methods/arguments that take expressions. This will allow us to take
some more time to finish up the Substrait expression implementation
work that is underway [7][8], then introduce Substrait-based
expressions in a later version of this protocol. This approach will
better position this protocol to be implemented in other languages
besides Python.

Another concern I have is that we have not fully explained why we want
to use Dataset instead of RecordBatchReader [9] as the basis of this
protocol. I would like to see an explanation of why RecordBatchReader
is not sufficient for this. RecordBatchReader seems like another
possible way to represent "unmaterialized dataframes" and there are
some parallels between RecordBatch/RecordBatchReader and
Fragment/Dataset. We should help developers and users understand why
Arrow needs both of these.
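
One distinction that may be relevant, offered as an assumption for
discussion rather than a conclusion of this thread: a RecordBatchReader
is a single-pass stream of batches, whereas a Dataset describes a source
that can be scanned repeatedly with different column selections. A
pure-Python sketch with stand-in names (not the PyArrow classes):

```python
# Stand-ins for illustration; not the PyArrow RecordBatchReader or Dataset.

def batch_reader(batches):
    """Single-pass stream: once consumed, the batches are gone."""
    yield from batches


class RescannableDataset:
    """Describes a source; each scan() call starts a fresh pass and can
    apply a different column selection."""

    def __init__(self, batches):
        self._batches = batches

    def scan(self, columns):
        for batch in self._batches:
            yield {name: batch[name] for name in columns}


batches = [{"a": [1, 2], "b": [3, 4]}, {"a": [5], "b": [6]}]

reader = batch_reader(batches)
first_pass = list(reader)
second_pass = list(reader)     # the stream is already exhausted

dataset = RescannableDataset(batches)
scan_a = list(dataset.scan(["a"]))
scan_b = list(dataset.scan(["b"]))
```

Whether this difference is enough to justify building the protocol on
Dataset is the question that still needs an explicit answer.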

Thanks Will for your thoughtful prose explanations about this proposed
API. After we arrive at a decision about this, I think we should
reproduce some of these explanations in docs, blog posts, cookbook
recipes, etc. because there is some nuance here that will be
important for integrators of this API to understand.

Ian

[1] https://arrow.apache.org/docs/cpp/api/dataset.html
[2] https://arrow.apache.org/docs/python/dataset.html
[3] https://arrow.apache.org/docs/java/dataset.html
[4] https://arrow.apache.org/docs/r/articles/dataset.html
[5] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
[6] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html
[7] https://github.com/apache/arrow/issues/33985
[8] https://github.com/apache/arrow/issues/34252
[9] https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html

On Wed, Jun 21, 2023 at 2:09 PM Will Jones  wrote:
>
> Hello Arrow devs,
>
> I have drafted a PR defining an experimental protocol which would allow
> third-party libraries to imitate the PyArrow Dataset API [5]. This protocol
> is intended to endorse an integration pattern that is starting to be used
> in the Python ecosystem, where some libraries are providing their own
> scanners with this API, while query engines are accepting these as
> duck-typed objects.
>
> To give some background: back at the end of 2021, we collaborated with
> DuckDB to be able to read datasets (an Arrow C++ concept), supporting
> column selection and filter pushdown. This was accomplished by having
> DuckDB manipulating Python (or R) objects to get a RecordBat

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-27 Thread Ian Cook

k it provides
> > the most obvious migration paths for existing producers and consumers.
> > 2. We keep the overall dataset API, but don't introduce the filter and
> > projection arguments until we have Substrait support. I'm not sure what the
> > migration path looks like for producers and consumers, but I think this
> > just implicitly becomes the same as (1), but with worse documentation.
> > 3. We write a protocol completely from scratch, that doesn't try to
> > describe the existing dataset API. Producers and consumers would then
> > migrate to use the new protocol and deprecate their existing dataset
> > integrations. We could introduce a dunder method in that API (sort of like
> > __arrow_array__) that would make the migration seamless from the end-user
> > perspective.
> >
> > *Which do you all think is the best path forward?*
> >
> > Another concern I have is that we have not fully explained why we want
> > > to use Dataset instead of RecordBatchReader [9] as the basis of this
> > > protocol. I would like to see an explanation of why RecordBatchReader
> > > is not sufficient for this. RecordBatchReader seems like another
> > > possible way to represent "unmaterialized dataframes" and there are
> > > some parallels between RecordBatch/RecordBatchReader and
> > > Fragment/Dataset.
> > >
> >
> > This is a good point. I can add a section describing the differences. The
> > main ones I can think of are that: (1) Datasets are "pruneable": one can
> > select a subset of columns and apply a filter on rows to avoid IO and (2)
> > they are splittable and serializable, so that fragments can be distributed
> > amongst processes / workers.
> >
> > Best,
> >
> > Will Jones
> >
> > On Fri, Jun 23, 2023 at 10:48 AM Ian Cook  wrote:
> >
> > > Thanks Will for this proposal!
> > >
> > > For anyone familiar with PyArrow, this idea has a clear intuitive
> > > logic to it. It provides an expedient solution to the current lack of
> > > a practical means for interchanging "unmaterialized dataframes"
> > > between different Python libraries.
> > >
> > > To elaborate on that: If you look at how people use the Arrow Dataset
> > > API—which is implemented in the Arrow C++ library [1] and has bindings
> > > not just for Python [2] but also for Java [3] and R [4]—you'll see
> > > that Dataset is often used simply as a "virtual" variant of Table. It
> > > is used in cases when the data is larger than memory or when it is
> > > desirable to defer reading (materializing) the data into memory.
> > >
> > > So we can think of a Table as a materialized dataframe and a Dataset
> > > as an unmaterialized dataframe. That aspect of Dataset is I think what
> > > makes it most attractive as a protocol for enabling interoperability:
> > > it allows libraries to easily "speak Arrow" in cases where
> > > materializing the full data in memory upfront is impossible or
> > > undesirable.
> > >
> > > The trouble is that Dataset was not designed to serve as a
> > > general-purpose unmaterialized dataframe. For example, the PyArrow
> > > Dataset constructor [5] exposes options for specifying a list of
> > > source files and a partitioning scheme, which are irrelevant for many
> > > of the applications that Will anticipates. And some work is needed to
> > > reconcile the methods of the PyArrow Dataset object [6] with the
> > > methods of the Table object. Some methods like filter() are exposed by
> > > both and behave lazily on Datasets and eagerly on Tables, as a user
> > > might expect. But many other Table methods are not implemented for
> > > Dataset though they potentially could be, and it is unclear where we
> > > should draw the line between adding methods to Dataset vs. encouraging
> > > new scanner implementations to expose options controlling what lazy
> > > operations should be performed as they see fit.
> > >
> > > Will, I see that you've already addressed this issue to some extent in
> > > your proposal. For example, you mention that we should initially
> > > define this protocol to include only a minimal subset of the Dataset
> > > API. I agree, but I think there are some loose ends we should be
> > > careful to tie up. I strongly agree with the comments made by David,
> > > Weston, and Dewey arguing that we should avoid any use of PyArrow
> > > expressions in this

Arrow community meeting July 5 at 16:00 UTC

2023-07-04 Thread Ian Cook

Our next biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [DISCUSS] Canonical alternative layout proposal

2023-07-13 Thread Ian Cook

Thank you Weston for proposing this solution and Neal for describing
its context and implications. I agree with the other replies here—this
seems like an elegant solution to a growing need that could, if left
unaddressed, increase the fragmentation of the ecosystem and reduce
the centrality of the Arrow format.

Greater diversity of layouts is happening. Whether it happens inside
of Arrow or outside of Arrow is up to us. I think we all would like to
see it happen inside of Arrow. This proposal allows for that, while
striking a balance as Raphael describes.

However, I think there is still some ambiguity about exactly how an
Arrow implementation that is consuming/producing data would negotiate
with an Arrow implementation or other component that is
producing/consuming data to determine whether an alternative layout is
supported. This was discussed briefly in [5] but I am interested to
see how this negotiation would be implemented in practice in the C
data interface, IPC, Flight, etc.
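
As a thought experiment only, one possible shape for that negotiation
is sketched below in plain Python. The function choose_layout, the
constant PRIMARY, and the layout label "alt_layout_x" are all
hypothetical; nothing like this exists today in the C data interface,
IPC, or Flight. The sketch just encodes the rule from the proposal that
an alternative layout is provided only when it is explicitly asked for,
with the primary layout as the fallback.

```python
from typing import Iterable, Set

PRIMARY = "primary"  # hypothetical label for the layout everyone must support


def choose_layout(producer_layouts: Set[str], requested: Iterable[str]) -> str:
    """Pick the layout a producer should emit for one consumer.

    `requested` lists the alternative layouts the consumer explicitly asked
    for, in order of preference. An empty request yields the primary layout.
    """
    for layout in requested:
        if layout in producer_layouts:
            return layout
    # Fall back to the layout that every implementation must support.
    return PRIMARY


producer = {PRIMARY, "alt_layout_x"}
no_request = choose_layout(producer, [])
explicit_request = choose_layout(producer, ["alt_layout_x"])
unsupported_request = choose_layout({PRIMARY}, ["alt_layout_x"])
```

What remains unclear is where the `requested` list would actually travel
in each of those interfaces.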

Ian

[5] https://lists.apache.org/thread/7x2714wookjqgkoykxpq9jtpyrgx2bx2


On Thu, Jul 13, 2023 at 11:00 AM Raphael Taylor-Davies
 wrote:
>
> I like this proposal, I think it strikes a pragmatic balance between
> preserving interoperability whilst still allowing new ideas to be
> incorporated into the standard. Thank you for writing this up.
>
> On 13/07/2023 10:22, Matt Topol wrote:
> > I don't have much to add but I do want to second Jacob's comments. I agree
> > that this is a good way to avoid the fragmentation while keeping Arrow
> > relevant, and likely something we need to do so that we can ensure Arrow
> > remains the way to do this data integration and interoperability.
> >
> > On Wed, Jul 12, 2023 at 9:52 PM Jacob Wujciak-Jens
> >  wrote:
> >
> >> Hello Everyone,
> >>
> >> Thanks for this comprehensive but concise write up Neal! I think this
> >> proposal is a good way to avoid both fragmentation of the arrow ecosystem
> >> as well as its obsolescence. In my opinion of these two problems the
> >> obsolescence is the bigger issue as (as mentioned in the proposal) arrow is
> >> already (close to) being relegated to the sidelines in eco-system defining
> >> projects.
> >>
> >> Jacob
> >>
> >> On Thu, Jul 13, 2023 at 12:03 AM Neal Richardson <
> >> neal.p.richard...@gmail.com> wrote:
> >>
> >>> Hi all,
> >>> As was previously raised in [1] and surfaced again in [2], there is a
> >>> proposal for representing alternative layouts. The intent, as I
> >> understand
> >>> it, is to be able to support memory layouts that some (but perhaps not
> >> all)
> >>> applications of Arrow find valuable, so that these nearly Arrow systems
> >> can
> >>> be fully Arrow-native.
> >>>
> >>> I wanted to start a more focused discussion on it because I think it's
> >>> worth being considered on its own merits, but I also think this gets to
> >> the
> >>> core of what the Arrow project is and should be, and I don't want us to
> >>> lose sight of that.
> >>>
> >>> To restate the proposal from [1]:
> >>>
> >>>   * There are one or more primary layouts
> >>> * Existing layouts are automatically considered primary layouts,
> >>> even if they
> >>> wouldn't have been primary layouts initially (e.g. large list)
> >>>   * A new layout, if it is semantically equivalent to another, is
> >>> considered an
> >>> alternative layout
> >>>   * An alternative layout still has the same requirements for adoption
> >>> (two implementations
> >>> and a vote)
> >>> * An implementation should not feel pressured to rush and implement
> >> the
> >>> new
> >>> layout. It would be good if they contribute in the discussion and
> >> consider
> >>> the layout and vote if they feel it would be an acceptable design.
> >>>   * We can define and vote and approve as many canonical alternative
> >>> layouts as
> >>> we want:
> >>> * A canonical alternative layout should, at a minimum, have some
> >>> reasonable
> >>> justification, such as improved performance for algorithm X
> >>>   * Arrow implementations MUST support the primary layouts
> >>>   * An Arrow implementation MAY support a canonical alternative, however:
> >>> * An Arrow implementation MUST first support the primary layout
> >>> * An Arrow implementation MUST support conversion to/from the primary
> >>> and
> >>> canonical layout
> >>> * An Arrow implementation's APIs MUST only provide data in the
> >>> alternative layout if it is explicitly asked for (e.g. schema inference
> >>> should prefer the primary layout).
> >>>   * We can still vote for new primary layouts (e.g. promoting a
> >>> canonical alternative)
> >>> but, in these votes we don't only consider the value (e.g. performance)
> >> of
> >>> the layout but also the interoperability. In other words, a layout can
> >> only
> >>> become a primary layout if there is significant evidence that most
> >>> implementations
> >>> plan to adopt it.
> >>>
> >>>
> >>> To summarize some of the arguments against the proposal from the previous
>

Arrow community meeting July 19 at 16:00 UTC

2023-07-19 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [QUESTION] Syndication site(s) for Apache Arrow related content?

2023-07-21 Thread Ian Cook

+1. Something like this would be quite valuable, especially if it
highlighted successful real-world applications of Arrow and helped to
disseminate to a broader audience the news of Arrow's increasing
adoption as a standard.


We should name it "This IntervalUnit in Arrow"


On Fri, Jul 21, 2023 at 3:18 PM Bryce Mecum  wrote:
>
> I'm not aware of one but I'd love to see one get started and would be
> happy to contribute.
>
> Related, I'm aware that Nic Crane and Marlene Mhangami have put
> together resources in the "awesome-x" style for R [1] and Python [2],
> respectively.
>
> [1] https://github.com/thisisnic/awesome-arrow-r
> [2] https://github.com/marlenezw/awesome-arrow-python
>
>
> On Fri, Jul 21, 2023 at 7:27 AM Andrew Lamb  wrote:
> >
> > Hi,
> >
> > Does anyone know a location that collects / syndicates Apache Arrow related 
> > content?
> >
> > Some examples of such a thing are [1] for Python and [2] for Rust.
> >
> > Andrew
> >
> > [1]: https://planetpython.org/
> > [2] https://this-week-in-rust.org/

Arrow community meeting August 2 at 16:00 UTC

2023-08-02 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Arrow community meeting August 16 at 16:00 UTC

2023-08-16 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [Vote][Format] C Data Interface Format string for REE

2023-08-16 Thread Ian Cook

+1 (non-binding)

On Wed, Aug 16, 2023 at 10:16 AM Matt Topol
 wrote:
>
> Hey All,
>
> As proposed by Felipe [1] I'm starting a vote on the proposed update to the
> Format Spec of adding "+r" as the format string for passing Run-End Encoded
> arrays through the Arrow C Data Interface.
>
> A PR containing an update to the C++ Arrow implementation to add support
> for this format string along with documentation updates can be found here
> [2].
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 - I'm in favor of this new C Data Format string
> [ ] +0
> [ ] -1 - I'm against adding this new format string because
>
> Thanks everyone!
>
> --Matt
>
> [1]: https://lists.apache.org/thread/smco2mcmw2ob2msoyo84wd4oz8z5f781
> [2]: https://github.com/apache/arrow/pull/37174

Re: Sort a Table In C++?

2023-08-17 Thread Ian Cook

Li,

Here's a standalone C++ example that constructs a Table and executes
an Acero ExecPlan to sort it:
https://gist.github.com/ianmcook/2aa9aa82e61c3ea4405450b93cf80fbc

Ian

On Thu, Aug 17, 2023 at 4:50 PM Li Jin  wrote:
>
> Hi,
>
> I am writing some C++ tests and found myself in need of a C++ function to
> sort an Arrow Table. Before I go around implementing one myself, I wonder
> if there is already a function that does that? (I searched the docs but
> didn’t find one).
>
> There is a function in Acero that can do it, but I didn’t find a super easy
> way to wrap a Table as an Acero source node either.
>
> Appreciate it if someone can give some pointers.
>
> Thanks,
> Li

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-08-29 Thread Ian Cook

eader [9] as the basis of this
> > > protocol. I would like to see an explanation of why RecordBatchReader
> > > is not sufficient for this. RecordBatchReader seems like another
> > > possible way to represent "unmaterialized dataframes" and there are
> > > some parallels between RecordBatch/RecordBatchReader and
> > > Fragment/Dataset. We should help developers and users understand why
> > > Arrow needs both of these.
> >
> >
> > Just to clarify, I think there are different use cases. For example, Lance
> > provides its own readers, but PyIceberg does not have any intent to provide
> > its own Parquet readers. Iceberg will generate the list of files that need
> > to be read, and do the filtering/projection/deletes/etc. This would make
> > the Dataset a better choice than the RecordBatchReader.
> >
> > That wouldn't remove the feature from DuckDB, would it? It would just mean
> > > that we recognize that PyArrow expressions don't have well-defined
> > > semantics that we are committing to at this time. As long as we have
> > > `**kwargs` everywhere, we can in the future introduce a
> > > `substrait_filter_expression` or similar argument, while allowing current
> > > implementors to handle `filter` if possible. (As a compromise, we could
> > > reserve `filter` and existing arguments and note that PyArrow Expression
> > > semantics are subject to change without notice?)
> >
> >
> > I think we can even re-use the existing filter argument. The signature
> > would evolve from pc.Expression to Union[pc.Expression,
> > pas.BoundExpressions]. In the case we get an expression, we'll convert it
> > to substrait.
> >
> > Concluding, I think we can do things in parallel, and I don't think they
> > are conflicting. I'm happy to contribute to the PyArrow side to make this
> > happen.
> >
> > Kind regards,
> > Fokko
> >
> > Op wo 28 jun 2023 om 22:47 schreef Will Jones :
> >
> > > >
> > > > That wouldn't remove the feature from DuckDB, would it? It would just
> > > mean
> > > > that we recognize that PyArrow expressions don't have well-defined
> > > > semantics that we are committing to at this time.
> > > >
> > >
> > > That's a fair point, David. I would be fine excluding it from the
> > protocol
> > > initially, and keep the existing integrations in DuckDB, Polars, and
> > > Datafusion "secret" or "not officially supported" for the time being. At
> > > the very least, documenting the pattern to get an Arrow C stream will be a
> > > step forward.
> > >
> > > Best,
> > >
> > > Will Jones
> > >
> > > On Wed, Jun 28, 2023 at 12:35 PM Jonathan Keane 
> > wrote:
> > >
> > > > > I would understand this objection more if DuckDB hasn't been relying
> > on
> > > > > being able to pass PyArrow expressions for 18 months now [1]. Unless,
> > > do
> > > > we
> > > > > just think this isn't widely used enough that we don't care?
> > > >
> > > > This isn't a pro or a con of specifically adopting the PyArrow
> > expression
> > > > semantics as is / with a warning about changing / not at all, but
> > having
> > > > some kind of standardization in this interface would be very nice. This
> > > > even came up while collaborating with the DuckDB folks that using some
> > of
> > > > the expression bits here (and in the R equivalents) was a little bit
> > odd
> > > > and having something like a proper API for that would have made that
> > > > more natural (and likely that would have been used had it existed 18
> > > months
> > > > ago :))
> > > >
> > > > -Jon
> > > >
> > > >
> > > > On Wed, Jun 28, 2023 at 1:17 PM David Li  wrote:
> > > >
> > > > > That wouldn't remove the feature from DuckDB, would it? It would just
> > > > mean
> > > > > that we recognize that PyArrow expressions don't have well-defined
> > > > > semantics that we are committing to at this time. As long as we have
> > > > > `**kwargs` everywhere, we can in the future introduce a
> > > > > `substrait_filter_expression` or similar argument, while allowing
> > > current
> > > > > implementors to handle `filter` if possible. (As a compromis

Arrow community meeting August 30 at 16:00 UTC

2023-08-30 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Arrow community meeting September 13 at 16:00 UTC

2023-09-12 Thread Ian Cook

Our next biweekly Arrow community meeting is tomorrow at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Arrow community meeting September 27 at 16:00 UTC

2023-09-27 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [VOTE][Format] Add ListView and LargeListView Arrays to Arrow Format

2023-09-29 Thread Ian Cook

+1 (non-binding)

Thanks very much Felipe for your persistence and your commitment to
addressing the numerous questions and comments that have been raised
since the beginning of the discussion on this in April.
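
For readers new to the layout under vote, here is a small sketch in
plain Python of how I understand the difference between the existing
offset-only list layout and the proposed list-view layout. The function
names decode_list and decode_list_view are made up for illustration,
and validity bitmaps (nulls) are left out entirely.

```python
from typing import List, Sequence


def decode_list(offsets: Sequence[int], values: Sequence[int]) -> List[List[int]]:
    """Offset-only layout: n + 1 offsets; element i is values[offsets[i]:offsets[i + 1]]."""
    return [list(values[offsets[i]:offsets[i + 1]]) for i in range(len(offsets) - 1)]


def decode_list_view(offsets: Sequence[int], sizes: Sequence[int],
                     values: Sequence[int]) -> List[List[int]]:
    """List-view layout: n offsets and n sizes; element i is values[offsets[i]:offsets[i] + sizes[i]]."""
    return [list(values[o:o + s]) for o, s in zip(offsets, sizes)]


values = [10, 20, 30, 40, 50]

# The same three lists expressed in both layouts.
as_list = decode_list([0, 2, 2, 5], values)
as_view = decode_list_view([0, 2, 2], [2, 0, 3], values)

# With separate sizes, the offsets no longer need to be in ascending order.
out_of_order = decode_list_view([2, 0], [3, 2], values)
```

Because each element carries its own offset and size, a list-view array
can be reordered without rewriting the child values, which is not
possible with the offset-only layout.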

On Fri, Sep 29, 2023 at 12:34 PM Benjamin Kietzman  wrote:
>
> +1
>
> On Fri, Sep 29, 2023 at 10:51 AM Felipe Oliveira Carvalho <
> felipe...@gmail.com> wrote:
>
> > Yes, ListView is an implementation of Velox's ArrayVector [1] ("vector of
> > arrays"). In Arrow we would naturally refer to them as "array of lists",
> > but `ListArray` is taken by the existing offset-only list formats.
> > Following the pattern adopted by other types in Arrow that use offsets and
> > sizes, we adopt the suffix -View to differentiate list-views from lists.
> >
> > Velox doesn't offer the 64-bit variation, but since Arrow has both List and
> > LargeList, it was natural to pair them with ListView and LargeListView.
> >
> > [2] is a link to the point of a talk by Mark Raasveldt where he describes
> > the DuckDB list representation. Early in the talk, one of the slides [3]
> > mentions how these formats were "co-designed together with Velox team".
> >
> > --
> > Felipe
> >
> > [1]
> > https://facebookincubator.github.io/velox/develop/vectors.html#arrayvector
> > [2] https://youtu.be/bZOvAKGkzpQ?si=wgSwew3Ck8utteOI&t=1569
> > [3] https://15721.courses.cs.cmu.edu/spring2023/slides/22-duckdb.pdf
> >
> > On Fri, Sep 29, 2023 at 9:32 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > Hi Felipe,
> > >
> > > Can I confirm that DuckDB and Velox use the same encoding for these
> > > types, and so we aren't going to run into similar issues as [1]?
> > >
> > > Kind Regards,
> > >
> > > Raphael Taylor-Davies
> > >
> > > [1]: https://lists.apache.org/thread/l8t1vj5x1wdf75mdw3wfjvnxrfy5xomy
> > >
> > > On 29/09/2023 13:09, Felipe Oliveira Carvalho wrote:
> > > > Hello,
> > > >
> > > > I'd like to propose adding ListView and LargeListView arrays to the
> > Arrow
> > > > format.
> > > > Previous discussion in [1][2], columnar format description and
> > > flatbuffers
> > > > changes in [3].
> > > >
> > > > There are implementations available in both C++ [4] and Go [5]. I'm
> > > working
> > > > on the integration tests which I will push to one of the PR branches
> > > before
> > > > they are merged. I've made a graph illustrating how this addition
> > > affects,
> > > > in a backwards compatible way, the type predicates and inheritance
> > chain
> > > on
> > > > the C++ implementation. [6]
> > > >
> > > > The vote will be open for at least 72 hours not counting the weekend.
> > > >
> > > > [ ] +1 add the proposed ListView and LargeListView types to the Apache
> > > > Arrow format
> > > > [ ] -1 do not add the proposed ListView and LargeListView types to the
> > > > Apache Arrow format
> > > > because...
> > > >
> > > > Sincerely,
> > > > Felipe
> > > >
> > > > [1] https://lists.apache.org/thread/r28rw5n39jwtvn08oljl09d4q2c1ysvb
> > > > [2] https://lists.apache.org/thread/dcwdzhz15fftoyj6xp89ool9vdk3rh19
> > > > [3] https://github.com/apache/arrow/pull/37877
> > > > [4] https://github.com/apache/arrow/pull/35345
> > > > [5] https://github.com/apache/arrow/pull/37468
> > > > [6] https://gist.github.com/felipecrv/3c02f3784221d946dec1b031c6d400db
> > > >
> > >
> >

Re: [DISCUSS][C++] Raw pointer string views

2023-09-29 Thread Ian Cook

I strongly agree with Ben's assertion that "the risk of a parallel
ecosystem… is more likely to be provoked by excluding a user's vital
use case [than by implementing support for an unofficial layout
variant]" in the C++ library. But there seems to be a consensus here
that there is a real risk of sowing confusion. Thank you Ben for your
readiness to consider the suggested approaches for reducing this risk.

I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.
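
To illustrate what such prose would need to distinguish, here is a
rough sketch in plain Python of the 16-byte view as I understand the
index/offset form in the format PRs: short strings are stored inline,
and longer strings store a 4-byte prefix plus a buffer index and an
offset. The helper names make_view and read_view are hypothetical. The
raw pointers variant would replace the buffer index and offset with a
single 8-byte pointer, which this sketch does not attempt to model.

```python
import struct

INLINE_MAX = 12  # strings up to 12 bytes fit inside the 16-byte view


def make_view(data: bytes, buffer_index: int, offset: int) -> bytes:
    """Build one 16-byte view in the index/offset form (illustrative only)."""
    if len(data) <= INLINE_MAX:
        # int32 length followed by the bytes themselves, zero-padded to 12.
        return struct.pack("<i12s", len(data), data)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def read_view(view: bytes, buffers: list) -> bytes:
    """Resolve a view back to its bytes using the list of data buffers."""
    (length,) = struct.unpack_from("<i", view)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _, _, buffer_index, offset = struct.unpack("<i4sii", view)
    return buffers[buffer_index][offset:offset + length]


# One data buffer; the long string starts 4 bytes into it.
buffers = [b"....a string longer than twelve bytes"]
short_view = make_view(b"arrow", 0, 0)
long_view = make_view(buffers[0][4:], 0, 4)
```

The index/offset form can be validated against the buffer lengths alone,
whereas a pointer form only makes sense inside the process that owns the
memory; that difference is what the added prose should make clear.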

Ian

On Thu, Sep 28, 2023 at 2:14 PM Felipe Oliveira Carvalho
 wrote:
>
> My take here is that Ben did an excellent job in hiding the fact that C++
> has two variations of the format without leaking the pointer version via
> the interfaces through which Arrow arrays are communicated to other
> implementations.
>
> As things stand right now, there is no zero-copy transfer of pointer-based
> string views. Ben can give the final authoritative answer on this. The idea
> of zero-copy transfers was discussed but decided against to avoid adding a
> format to the spec that can't be implemented by languages that can't cast
> arbitrary memory bytes to objects (the case for many languages that are not
> C or C++).
>
> Having established that the spec is not "polluted" by a format that only
> systems-languages can implement, we can look at the constraint of keeping
> implementations completely faithful to the spec:
>
> Pros:
>  - The reference implementations serve as an alternative to the spec text
> in being a one-to-one translation of the spec
>
> Cons:
> - Performance loss (it's hard to predict how many optimizations can be lost
> by forcing an extra memory indirection when looping)
> - Insensibility to the ergonomics afforded by the language
>
> Variations are bound to happen any time a language doesn't afford good
> usability without conversions every time the data is used. In JavaScript,
> for instance, the use of UTF-16 is much more widespread than the use of
> UTF-8. It would make sense for a JavaScript implementation to keep string
> arrays in UTF-16 at rest.
>
> Sometimes software specs are accompanied by two types of implementations:
> the reference implementation that tries to be simple and didactic; and
> implementations used in practice because they are allowed to deviate
> internally, doing things in a more complicated way than the spec requires,
> to achieve some practical advantage. Are all the implementations in the
> apache/arrow of the first kind?
>
> --
> Felipe
>
> On Thu, Sep 28, 2023 at 1:10 PM Andrew Lamb  wrote:
>
> > > What this PR is creating is an "unofficial" Arrow format, with data
> > types exposed in Arrow C++ that are not part of the Arrow standard, but
> > are exposed as if they were.
> >
> > I agree with Antoine here. It seems a pretty clear-cut case: the C++
> > implementation doesn't follow the spec, and thus we should either
> > 1.  Update the standard to allow raw pointers
> > 2.  fix the C++ implementation to not have them / treat them as though they
> > were
> >
> > If the core usecase is "arrow has the same in memory format used by DuckDB
> > and Velox, and those systems can't/won't change their implementations" it
> > seems like the only path forward for that usecase is to adopt their model
> > (raw pointers) directly. Maybe I am missing something
> >
> >
> > Andrew
> >
> >
> >
> >
> >
> >
> > On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies
> >  wrote:
> >
> > > FWIW Rust wouldn't have issues using raw pointers, I can't speak for
> > other
> > > languages though. They would be more expensive to validate, but
> > validation
> > > is not going to be cheap regardless.
> > >
> > > I could definitely see a world where view types use pointers and IPC
> > > coerces to/from the large non-view types. IPC has to copy the string data
> > > regardless and re-encoding would avoid encoding masked data.
> > >
> > > The notion of supporting both is less of an exciting prospect... I'm also
> > > not sure if it is too late to make changes at this stage.
> > >
> > > On 28 September 2023 15:26:57 BST, Wes McKinney 
> > > wrote:
> > > >hi all,
> > > >
> > > >I'm just catching up on this thread after having taken a look at the
> > > format
> > > >PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
> > > >from having spent a great deal less time on this project than others.
> > > >
> > > >The original motivation I had for bringing up the idea of adding the
> > > >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
> > > >format (though not necessarily the IPC format) was to provide a path for
> > > >zero-copy interoperability in some cases with these systems when dealing
> > > >with strings, and to enhance performance within Arrow-applications
> > > (setting
>

Arrow community meeting October 11 at 16:00 UTC

2023-10-11 Thread Ian Cook

Our next biweekly Arrow community meeting is today at 16:00 UTC / 12:00 EDT.

Zoom meeting URL:
https://zoom.us/j/87649033008?pwd=SitsRHluQStlREM0TjJVYkRibVZsUT09
Meeting ID: 876 4903 3008
Passcode: 958092

Meeting notes will be captured in this Google Doc:
https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/
If you plan to attend this meeting, you are welcome to edit the
document to add the topics that you would like to discuss.

Thanks,
Ian

Re: [ANNOUNCE] New Arrow PMC member: Jonathan Keane

2023-10-14 Thread Ian Cook

Congratulations Jonathan!

On Sat, Oct 14, 2023 at 13:24 Andrew Lamb  wrote:

> The Project Management Committee (PMC) for Apache Arrow has invited
> Jonathan Keane to become a PMC member and we are pleased to announce
> that Jonathan Keane has accepted.
>
> Congratulations and welcome!
>
> Andrew
>

Re: [ANNOUNCE] New Arrow committer: Curt Hagenlocher

2023-10-15 Thread Ian Cook

Congratulations Curt!

On Sun, Oct 15, 2023 at 05:32 Andrew Lamb  wrote:

> On behalf of the Arrow PMC, I'm happy to announce that Curt Hagenlocher
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> Andrew
>

Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-23 Thread Ian Cook

Congratulations Xuwei!

On Mon, Oct 23, 2023 at 12:46 AM Sutou Kouhei  wrote:
>
> On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> has accepted an invitation to become a committer on Apache
> Arrow. Welcome, and thank you for your contributions!
>
> --
> kou
