Re: PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
Never mind, I realized I can use pyarrow.compute.invert. Thank you again for the super fast answer > On 4 Nov 2020, at 15:13, Niklas B wrote: > Thank you! This looks awesome. Any good way to inverse the ChunkedArray? I know I can cast to Numpy (experimental) and do it there…
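For reference, a minimal sketch of what that combination might look like (assuming a recent pyarrow; the column name and values are made up for illustration):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"uuid": ["a", "b", "c", "d"], "value": [1, 2, 3, 4]})
unwanted = pa.array(["b", "d"])

# is_in marks rows whose uuid appears in the unwanted set;
# invert flips the mask so everything else is kept.
mask = pc.is_in(table["uuid"], value_set=unwanted)
filtered = table.filter(pc.invert(mask))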

Re: PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
…https://issues.apache.org/jira/browse/ARROW-9164, and we should maybe also try to inject the option keywords in the function docstring. Best, Joris > On Wed, 4 Nov 2020 at 14:14, Niklas B wrote: > Hi, I’m trying in Python to (without reading e…

PyArrow: Using is_in compute to filter list of strings in a Table

2020-11-04 Thread Niklas B
Hi, I’m trying in Python to (without reading the entire parquet file into memory) filter out certain rows (based on uuid-strings). My approach is to read each row group, then try to filter it without casting it to pandas (since that’s expensive for data-frames with lots of strings in them). Looking in…
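A rough sketch of that row-group-at-a-time approach (file name, column name and uuid list are hypothetical; assumes a pyarrow version where Table.filter and compute.is_in are available):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

wanted = pa.array(["uuid-1", "uuid-2"])
pf = pq.ParquetFile("data.parquet")

pieces = []
for i in range(pf.num_row_groups):
    rg = pf.read_row_group(i)                      # one row group as a Table
    mask = pc.is_in(rg["uuid"], value_set=wanted)  # boolean mask, no pandas involved
    pieces.append(rg.filter(mask))

result = pa.concat_tables(pieces)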

Arrow on PyPy3 patch

2020-10-22 Thread Niklas B
Hi, I’ve been working (together with the PyPy team) on getting Arrow to build on PyPy3. I’m not looking for full feature capability, but specifically for getting it to work with pandas read_parquet/to_parquet, which it now does. There were a few roadblocks, solved by the awesome Matti Picus on the Py…

Re: Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

2020-10-01 Thread Niklas B
…[quoted example table output truncated] So still more manual work than just specifying a DNF filter, but normally all necessary building blocks are available (the goal is certainly to use those building blocks in a more general query engine that works for both in-memory tables as…
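As an illustration of those building blocks, here is a small sketch that wraps an in-memory Table as a dataset and filters it with an expression (assumes a pyarrow version whose dataset API accepts an in-memory Table; the data is made up):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"a": [1, 2, 4], "b": [1, 1, 1]})

# The same expression machinery that backs DNF filters on files
# can be applied to the wrapped in-memory table.
dataset = ds.dataset(table)
filtered = dataset.to_table(filter=ds.field("a") > 1)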

Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

2020-10-01 Thread Niklas B
Hi, I have an in-memory dataset from Plasma that I need to filter before running `to_pandas()`. It’s a very text-heavy dataset with a lot of rows and columns (only about 30% of which are applicable for any operation). Now I know that you can use DNF filters to filter a parquet file before reading to…
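One possible shape of that pre-filtering step, using compute kernels to build a boolean mask and converting only the surviving rows (column names and the predicate are invented for the example):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"category": ["a", "b", "a"], "text": ["x", "y", "z"]})

# Filter with a compute mask first, so to_pandas() only has to
# convert the relevant subset of the text-heavy table.
mask = pc.equal(table["category"], "a")
small_df = table.filter(mask).to_pandas()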

Re: [DISCUSS] Plasma appears to have been forked, consider deprecating pyarrow.serialization

2020-09-27 Thread Niklas B
We too rely heavily on Plasma (we use Ray as well, but also Plasma independently of Ray). I’ve started a thread on the Ray dev list to see if Ray’s Plasma can be used standalone outside of Ray as well. That would allow those of us who use Plasma to move to a standalone “Ray Plasma” when/if it’s removed from Arrow…

Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

2020-09-21 Thread Niklas B
…https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156 > [3] https://github.com/apache/arrow/blob/a4eb08d54ee0d4c0d0202fa0a2dfa8af7aad7a05/python/pyarrow/memory.pxi#L156 > On Tue, Sep 15, 2020 at 8:46 AM Niklas B wrote: >…
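The linked memory.pxi lines concern Arrow's memory pool; a quick way to watch the allocator while writing incrementally might look like this (a sketch using the default pool):

import pyarrow as pa

pool = pa.default_memory_pool()
# bytes currently held by the Arrow allocator, and the high-water mark
print(pool.bytes_allocated(), pool.max_memory())
print(pa.total_allocated_bytes())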

PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger-than-memory parquet files)

2020-09-15 Thread Niklas B
First of all: thank you so much for all the hard work on Arrow, it’s an awesome project. Hi, I'm trying to write a large parquet file to disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk it still appears to keep the…
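A minimal sketch of the incremental-writer pattern being described (schema, file name and chunk source are hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("uuid", pa.string()), ("value", pa.int64())])

# Each write_table call flushes a row group to disk, so only the
# current chunk has to fit in memory at any one time.
with pq.ParquetWriter("big.parquet", schema) as writer:
    for chunk in range(10):
        batch = pa.table(
            {"uuid": [f"id-{chunk}-{i}" for i in range(1000)],
             "value": list(range(1000))},
            schema=schema,
        )
        writer.write_table(batch)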