Sharing our experience adopting (py) Arrow in Vaex

Maarten Breddels Thu, 02 Jul 2020 01:33:09 -0700

Hi,

in the process of adding Arrow support in Vaex (natively, not converting to
NumPy as we did before), one of our biggest pain points is (surprisingly)
the name mismatch between NumPy's .tolist() and Arrow's .to_pylist().
Especially in code that deals with both types of arrays, this is a bit of
an annoyance. We actually use tolist() a lot in our unit tests as well. I
wonder if this was done on purpose, or if this is something that
could still be changed/added.


The difference between filter/take and fancy indexing with [] is ok, since
it doesn't come up that often, but I was wondering if fancy indexing will be
added later, or if this stays as it is.

Another difficult thing is testing for string arrays: since there are two
string types (utf8 and large_utf8), testing if something is of string type
is a bit annoying. I don't plan to have a type system in Vaex itself, so we
leak this to users.
A similar issue is array testing: testing if something is an Arrow
array (chunked or plain) is again a test against two types (e.g.
isinstance(ar, (pa.Array, pa.ChunkedArray))).
I could see some helper functions such as pa.is_array and pa.is_string (the
latter is already taken, and I guess only tests for 32-bit offset string
arrays).

Overall, we're quite positive, and as you see, the pain points are not
fundamental issues, but annoyances that might be easy to fix and would make
adoption smoother/faster.

cheers,

Maarten Breddels
Software engineer / consultant / data scientist
Python / C++ / Javascript / Jupyter
www.maartenbreddels.com / vaex.io
maartenbredd...@gmail.com +31 6 2464 0838 <+31+6+24640838>
