Hi Niklas,

In the datasets project there is indeed the notion of an in-memory dataset (from a RecordBatch or Table); however, constructing such a dataset is not currently exposed directly in Python (except for writing one).
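As a minimal sketch of that writing part, assuming a pyarrow version in which pyarrow.dataset.write_dataset accepts an in-memory Table ("my_dataset" below is just a hypothetical output directory):

>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
>>> # write the in-memory table to disk as a parquet dataset
>>> ds.write_dataset(table, "my_dataset", format="parquet")

Once written like that, the data can of course be read back and filtered as a file dataset.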
But for RecordBatch/Table objects you can also filter directly with a boolean mask. A small example with a Table (it works the same with a RecordBatch):

>>> table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
>>> table.filter(np.array([True, False, False, False, True]))
pyarrow.Table
a: int64
b: int64
>>> table.filter(np.array([True, False, False, False, True])).to_pandas()
   a  b
0  0  1
1  4  1

Creating the boolean mask based on the table itself can be done with the pyarrow.compute module:

>>> import pyarrow.compute as pc
>>> mask = pc.equal(table['b'], pa.scalar(1))
>>> table.filter(mask).to_pandas()
   a  b
0  0  1
1  2  1
2  4  1

So it is still more manual work than just specifying a DNF filter, but normally all the necessary building blocks are available; a small combined sketch follows at the end of this message. (The goal is certainly to use those building blocks in a more general query engine that works for both in-memory tables and file datasets, eventually, but that doesn't exist yet.)

Best,
Joris

On Thu, 1 Oct 2020 at 15:01, Niklas B <niklas.biv...@enplore.com> wrote:

> Hi,
>
> I have an in-memory dataset from Plasma that I need to filter before
> running `to_pandas()`. It's a very text-heavy dataset with a lot of rows
> and columns, only about 30% of which are applicable for any operation. I
> know that you can use DNF filters to filter a parquet file before reading
> it into memory, and I'm now trying to do the same for my pa.Table that is
> already in memory. https://issues.apache.org/jira/browse/ARROW-7945
> indicates that it should be possible to unify datasets after construction,
> but my arrow skills aren't quite there yet.
>
> […]
> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
> buffer = pa.BufferReader(data)
> reader = pa.RecordBatchStreamReader(buffer)
> record_batch = pa.Table.from_batches(reader)
>
> I've been reading up on Dataset, UnionDataset and how ParquetDataset
> does it
> (http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset).
> My thinking is that if I can cast my table to a UnionDataset I can use the
> same _filter() code as ParquetDataset does. But I'm not having any luck
> with my (albeit very naive) approach:
>
> >>> pyarrow.dataset.UnionDataset(table.schema, [table])
>
> But that just gives me:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_dataset.pyx", line 429, in
> pyarrow._dataset.UnionDataset.__init__
> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset
>
> I'm guessing that somewhere deep in
> https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py
> there is an example of how to make a Dataset out of a table, but I'm not
> finding it.
>
> Any help would be greatly appreciated :)
>
> Regards,
> Niklas
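Here is the small combined sketch referenced above: it emulates a single-conjunction DNF-style filter, [('b', '=', 1), ('a', '>=', 2)], by hand with the compute building blocks. This assumes pc.and_ and pc.greater_equal are available in your pyarrow version; the table is the toy example from earlier in this message.

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
>>> # build one boolean mask per predicate, then AND them together
>>> mask = pc.and_(pc.equal(table['b'], pa.scalar(1)),
...                pc.greater_equal(table['a'], pa.scalar(2)))
>>> table.filter(mask).to_pandas()
   a  b
0  2  1
1  4  1

Each additional OR-term of a DNF filter would just be another such mask, combined with pc.or_ in the same way.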