Re: Question regarding scope of Arrow

Wes McKinney Wed, 04 Oct 2017 20:10:17 -0700

hi Paddy,

Thanks for bringing this up. Some responses inline.

On Wed, Oct 4, 2017 at 10:31 PM, paddy horan <[email protected]> wrote:
> Hi All,
>
> I’m hoping someone on this list can comment on the scope of Arrow.  In the 
> interview with Wes for O’Reilly he spoke about an “operator kernel library”.  
> On the homepage it states that Arrow “enables execution engines to take 
> advantage of the latest SIMD…”.  Is this “operator kernel library” a part of 
> Arrow or will it be a separate “execution engine” library that is built on 
> top of Arrow?  It seems to me that it will be a part of Arrow, is my 
> understanding correct?
>

Yes, this is correct. The idea of these operator kernels is that users
of the Arrow libraries can use them to craft user-facing libraries
with whatever semantics they wish. It wouldn't make sense for there to
be divergent implementations of essential primitive array functions
like:

* Binary arithmetic
* Array manipulations (take, head, tail, repeat)
* Sorting
* Unary mathematics operators (sqrt, exp, log, etc)
* Hash-table based functions (unique, match, isin, dictionary-encode)
* Missing-data aware reductions
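
To make two of the items above concrete, here is a minimal pure-Python
sketch of a missing-data aware reduction and a hash-table based
dictionary-encode. The function names (null_aware_sum,
dictionary_encode) and the list-of-booleans validity mask are
illustrative assumptions, not Arrow APIs; a real kernel would operate
on Arrow's columnar buffers and validity bitmaps.

```python
from typing import Dict, List, Optional, Tuple


def null_aware_sum(values: List[float], valid: List[bool]) -> Optional[float]:
    """Sum only the slots marked valid; return None if every slot is null.

    Hypothetical helper: `valid` stands in for a validity bitmap.
    """
    total = 0.0
    seen_valid = False
    for value, ok in zip(values, valid):
        if ok:
            total += value
            seen_valid = True
    return total if seen_valid else None


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes), where each code indexes into dictionary.

    Hypothetical helper: a dict plays the role of the hash table.
    """
    index: Dict[str, int] = {}
    dictionary: List[str] = []
    codes: List[int] = []
    for value in values:
        if value not in index:
            # First occurrence: assign the next free dictionary slot.
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


# The middle slot is null, so its stored value (99.0) is ignored.
total = null_aware_sum([1.0, 99.0, 2.5], [True, False, True])
dictionary, codes = dictionary_encode(["a", "b", "a", "c"])
```

Having one shared implementation of functions like these is the point:
every library built on top gets the same null handling and the same
encoding behavior.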

Through zero-copy adapter layers we can enable external analytic
kernels (like NumPy ufuncs, for example) to be "plugged in" and layer
on Arrow's missing data after the fact, but in many cases it may be
better to have a native implementation against the Arrow memory
layout.
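
A rough sketch of what "layering on missing data after the fact" could
look like, in pure Python so it stays self-contained. The external
kernel (sqrt_all) knows nothing about nulls; the wrapper
(layer_validity) runs it over the raw values and then reattaches the
validity mask. Both names are hypothetical, and a plain list of
booleans stands in for a validity bitmap.

```python
import math
from typing import Callable, List, Tuple


def sqrt_all(values: List[float]) -> List[float]:
    """Stand-in for an external kernel that has no notion of missing data."""
    return [math.sqrt(v) for v in values]


def layer_validity(
    kernel: Callable[[List[float]], List[float]],
    values: List[float],
    valid: List[bool],
) -> Tuple[List[float], List[bool]]:
    """Run a null-unaware kernel, then carry the validity mask through.

    Assumption: null slots hold a placeholder that is safe for the kernel
    (0.0 here); the result in those slots is meaningless and must be
    ignored by consumers that respect the mask.
    """
    raw = kernel(values)
    return raw, list(valid)


# The middle slot is null; its placeholder is computed on but masked out.
result, mask = layer_validity(sqrt_all, [4.0, 0.0, 9.0], [True, False, True])
```

The cost of this approach is that the kernel still does work on null
slots and the placeholders have to be chosen so that the kernel does not
fail on them, which is one reason a native implementation against the
Arrow layout may be preferable.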

The intent is that these kernels serve as building blocks for general
Arrow-based execution engines. It will probably make sense to have a
canonical implementation of such an "engine" inside the Arrow project
itself.

> If this is the case, what is the scope of such a library?  Taking pandas 2.0 
> as an example, do you plan to have pandas be a wrapper around Arrow?  Arrow 
> being the “libpandas” referred to in the design document for pandas 2.0 maybe?
>

This is probably too nuanced a discussion for a single mailing list
thread; I hope to better articulate the layering of technologies that
will form "pandas2". Arrow shall provide primitive columnar array
analytics and most likely (per comment above) also a single-node graph
dataflow-style execution engine (in the style of TensorFlow and other
frameworks you may be familiar with).

Defining the semantics of how data frames work in Python is quite a
bit of work. For example, how does mutation work? When is data loaded
and materialized into memory? How does spilling to disk or out-of-core
/ streaming algorithms work? There will be many pandas-specific
opinionated design decisions that we will need to make to shape a
Pythonic experience that existing "pandas0" (or pandas1?) users can
pick up easily. We ought not foist these opinionated decisions on the
Arrow project.

On this last point, there was a period of time from 2010 to 2012 where
I was under some amount of criticism from the scientific Python
community for not building parts of pandas as patches into NumPy. I
felt that the changes that would be needed in NumPy to accommodate the
way I wanted pandas to work would be deemed inappropriate by the NumPy
user community. We may face similar challenges here and we'll need to
draw the line so Arrow can stay "pure" and general purpose.

So the TL;DR on this:

- Arrow: memory representation / management, metadata, IO / data
ingest / memory-mapping, efficient + mathematically precise analytics
- pandas/pandas2: user-facing Python semantics, pandas-specific
extensions -- under the hood we can manipulate Arrow memory and hide
internal complexities as needed

> “we have not decided” is a valid response to any/all of the questions above.  
> Apologies if these are basic questions.  I’m excited about the project and 
> where it could go.  I’m an Actuary looking to build an Actuarial modeling 
> library on top of Arrow and I would love to contribute.  However, I feel I 
> have a lot to learn first.  Is there a better forum for basic questions from 
> would-be new contributors?  (I won’t be offended if you tell me that there is 
> no forum for basic questions, I understand that momentum is important and you 
> are all busy moving the project toward 1.0)
>

This is the right place for now, and the project's governance is
conducted on this mailing list and other ASF forums like JIRA and the
project GitHub. I am aware that the current Python-related Arrow
development work is lower-level than many Python programmers are
accustomed to (and largely in C++ and Cython), so don't hesitate to
ask questions.

I appreciate the question. We have a lot of difficult work ahead of us
but it will continue to be very exciting as the pieces fall into
place. The best part of all this is that we can build a bigger and
stronger community of data system developers by expanding beyond the
Python world -- for example, we already have substantial contributions
from the Ruby community involved in this, and I expect we will see the
diversity of users and programming languages (especially those that
have reasonable C/C++ FFI) increase over time. This was the idea I
hoped to get across in my JupyterCon keynote
(https://www.youtube.com/watch?v=wdmf1msbtVs).

best
Wes

> Thanks for your time,
> Paddy
>
