I think it's a good idea to have SIMD support built into the Arrow
libraries. Simple analytic operations like SUM, MIN, MAX, COUNT, AVG, and
FILTER (especially over fixed-width values and dictionary-encoded columns)
can be made substantially faster by providing APIs that internally use SIMD
(probably through Intel compiler intrinsics) to perform these SQL
operations on the columnar data structures.
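As a minimal sketch of why dictionary-encoded columns make predicate scans so amenable to SIMD: the predicate is evaluated once per distinct dictionary value, and the remaining work is a tight loop over small fixed-width integer codes that a vectorized kernel can compare many-at-a-time. The sketch below is plain Python with illustrative names (not Arrow APIs); a real kernel would be C++ using intrinsics.

```python
# Sketch: predicate evaluation on a dictionary-encoded column.
# All names here are illustrative, not Arrow APIs.

def dictionary_encode(values):
    """Map each value to a small integer code (Arrow-style dictionary encoding)."""
    dictionary = []
    code_for = {}
    codes = []
    for v in values:
        if v not in code_for:
            code_for[v] = len(dictionary)
            dictionary.append(v)
        codes.append(code_for[v])
    return dictionary, codes

def filter_indices(dictionary, codes, predicate):
    """Evaluate the predicate once per distinct value, then scan the codes.

    A SIMD implementation would compare many codes per instruction against
    the matching code set, instead of this scalar loop.
    """
    matches = [predicate(v) for v in dictionary]  # one test per distinct value
    return [i for i, c in enumerate(codes) if matches[c]]

cities = ["NYC", "SFO", "NYC", "LAX", "SFO", "NYC"]
dictionary, codes = dictionary_encode(cities)
hits = filter_indices(dictionary, codes, lambda c: c == "NYC")  # [0, 2, 5]
```

The key property is that the per-row cost is independent of the width or complexity of the original values; only the one-time dictionary pass touches them.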
We took inspiration from the HANA paper when implementing SIMD-based scan
operations for predicate evaluation on in-memory columnar data at Oracle:
http://www.vldb.org/pvldb/2/vldb09-327.pdf

On Wed, Oct 4, 2017 at 8:08 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Paddy,
>
> Thanks for bringing this up. Some responses inline.
>
> On Wed, Oct 4, 2017 at 10:31 PM, paddy horan <paddyho...@hotmail.com> wrote:
> > Hi All,
> >
> > I'm hoping someone on this list can comment on the scope of Arrow. In
> > the interview with Wes for O'Reilly he spoke about an "operator kernel
> > library". On the homepage it states that Arrow "enables execution
> > engines to take advantage of the latest SIMD…". Is this "operator
> > kernel library" a part of Arrow, or will it be a separate "execution
> > engine" library built on top of Arrow? It seems to me that it will be
> > a part of Arrow; is my understanding correct?
>
> Yes, this is correct. The idea of these operator kernels is that users
> of the Arrow libraries can use them to craft user-facing libraries
> with whatever semantics they wish. It wouldn't make sense for there to
> be divergent implementations of essential primitive array functions
> like:
>
> * Binary arithmetic
> * Array manipulations (take, head, tail, repeat)
> * Sorting
> * Unary mathematical operators (sqrt, exp, log, etc.)
> * Hash-table based functions (unique, match, isin, dictionary-encode)
> * Missing-data-aware reductions
>
> Through zero-copy adapter layers we can enable external analytic
> kernels (like NumPy ufuncs, for example) to be "plugged in" and layer
> on Arrow's missing data after the fact, but in many cases it may be
> better to have a native implementation against the Arrow memory
> layout.
>
> The intent is that these kernels are building blocks for building
> general Arrow-based execution engines. It will probably make sense to
> have a canonical implementation of such an "engine" inside the Arrow
> project itself.
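The "missing-data-aware reductions" in the kernel list above can be sketched against Arrow's physical layout: a values buffer plus a bit-packed validity bitmap (LSB-first, as in the Arrow format). The function names below are illustrative, not Arrow APIs.

```python
# Sketch: a missing-data-aware SUM over an Arrow-style array layout,
# i.e. a values buffer plus a bit-packed validity bitmap (LSB-first).
# Names are illustrative, not Arrow APIs.

def pack_validity(mask):
    """Pack a list of booleans into a bytes validity bitmap, LSB-first."""
    buf = bytearray((len(mask) + 7) // 8)
    for i, valid in enumerate(mask):
        if valid:
            buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

def masked_sum(values, validity):
    """Sum only the slots whose validity bit is set; also return the count."""
    total, count = 0, 0
    for i, v in enumerate(values):
        if validity[i // 8] & (1 << (i % 8)):
            total += v
            count += 1
    return total, count

values = [1, 2, 999, 4]          # slot 2 is null; its value is undefined
validity = pack_validity([True, True, False, True])
total, count = masked_sum(values, validity)   # total=7, count=3
```

This is also why layering missing-data handling on an external kernel "after the fact" is possible: the validity bitmap is a separate buffer, so a null-oblivious kernel can run over the values and the nulls can be masked out before or after.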
> > If this is the case, what is the scope of such a library? Taking
> > pandas 2.0 as an example, do you plan to have pandas be a wrapper
> > around Arrow? Arrow being the "libpandas" referred to in the design
> > document for pandas 2.0, maybe?
>
> This is probably too nuanced a discussion for a single mailing list
> thread; I hope to better articulate the layering of technologies that
> will form "pandas2". Arrow shall provide primitive columnar array
> analytics and most likely (per the comment above) also a single-node
> graph dataflow-style execution engine (in the style of TensorFlow and
> other frameworks you may be familiar with).
>
> Defining the semantics of how data frames work in Python is quite a
> bit of work. For example, how does mutation work? When is data loaded
> and materialized into memory? How do spilling to disk and out-of-core
> / streaming algorithms work? There will be many pandas-specific
> opinionated design decisions that we will need to make to shape a
> Pythonic experience that existing "pandas0" (or pandas1?) users can
> pick up easily. We ought not foist these opinionated decisions on the
> Arrow project.
>
> On this last point, there was a period of time from 2010 to 2012 when
> I was under some amount of criticism from the scientific Python
> community for not building parts of pandas as patches into NumPy. I
> felt that the changes that would be needed in NumPy to accommodate the
> way I wanted pandas to work would be deemed inappropriate by the NumPy
> user community. We may face similar challenges here, and we'll need to
> draw the line so Arrow can stay "pure" and general purpose.
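To make "kernels as building blocks for a graph dataflow-style execution engine" concrete, here is a deliberately tiny pull-based sketch in plain Python: each node wraps a kernel, and evaluating the sink pulls results through the graph. This is purely illustrative, not an Arrow or pandas2 API.

```python
# Sketch: kernels composed into a tiny dataflow-style "engine".
# Purely illustrative -- not an Arrow or pandas2 API.

class Node:
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self):
        # Pull-based evaluation: compute the inputs, then apply this kernel.
        return self.fn(*(n.evaluate() for n in self.inputs))

def source(data):
    return Node(lambda: data)

def filter_gt(node, threshold):
    return Node(lambda col: [v for v in col if v > threshold], node)

def total(node):
    return Node(lambda col: sum(col), node)

plan = total(filter_gt(source([1, 5, 3, 8]), 2))
result = plan.evaluate()   # 5 + 3 + 8 = 16
```

A real engine would of course add operator scheduling, memory management, and batch-at-a-time execution over Arrow buffers, but the division of labor is the same: generic primitive kernels below, engine and user-facing semantics above.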
> So the TL;DR on this:
>
> - Arrow: memory representation / management, metadata, IO / data
>   ingest / memory-mapping, efficient + mathematically precise analytics
> - pandas/pandas2: user-facing Python semantics, pandas-specific
>   extensions -- under the hood we can manipulate Arrow memory and hide
>   internal complexities as needed
>
> > "we have not decided" is a valid response to any/all of the questions
> > above. Apologies if these are basic questions. I'm excited about the
> > project and where it could go. I'm an actuary looking to build an
> > actuarial modeling library on top of Arrow and I would love to
> > contribute. However, I feel I have a lot to learn first. Is there a
> > better forum for basic questions from would-be new contributors? (I
> > won't be offended if you tell me that there is no forum for basic
> > questions; I understand that momentum is important and you are all
> > busy moving the project toward 1.0.)
>
> This is the right place for now, and the project's governance is
> conducted on this mailing list and other ASF forums like JIRA and the
> project GitHub. I am aware that the Python-related Arrow development
> work going on right now is lower-level than many Python programmers
> are accustomed to (and largely in C++ and Cython), so don't hesitate
> to ask questions.
>
> I appreciate the question. We have a lot of difficult work ahead of us,
> but it will continue to be very exciting as the pieces fall into
> place. The best part of all this is that we can build a bigger and
> stronger community of data system developers by expanding beyond the
> Python world -- for example, we already have substantial contributions
> from the Ruby community, and I expect we will see the diversity of
> users and programming languages (especially those that have reasonable
> C/C++ FFI) increase over time. This was the idea I hoped to get across
> in my JupyterCon keynote
> (https://www.youtube.com/watch?v=wdmf1msbtVs).
> best
> Wes
>
> > Thanks for your time,
> > Paddy