Hi,

I have an in-memory dataset from Plasma that I need to filter before running `to_pandas()`. It's a very text-heavy dataset with a lot of rows and columns, only about 30% of which are applicable to any given operation. I know that you can use DNF filters to filter a parquet file before reading it into memory, and I'm now trying to do the same for a pa.Table that is already in memory. https://issues.apache.org/jira/browse/ARROW-7945 indicates that it should be possible to unify datasets after construction, but my Arrow skills aren't quite there yet.
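For reference, this is the kind of DNF filtering I mean on the file side; with a reasonably recent pyarrow it prunes rows and columns while reading (the path and column names below are made up for illustration):

    import pyarrow.parquet as pq

    # The DNF filter prunes rows while reading, so only matching rows
    # (and only the listed columns) ever land in memory.
    table = pq.read_table(
        "data.parquet",
        columns=["category", "value"],
        filters=[("category", "=", "a")],
    )
    df = table.to_pandas()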
This is how the table currently comes out of Plasma:

> […]
> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
> buffer = pa.BufferReader(data)
> reader = pa.RecordBatchStreamReader(buffer)
> record_batch = pa.Table.from_batches(reader)

I've been reading up on Dataset, UnionDataset, and how ParquetDataset does its filtering (http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset). My thinking is that if I can cast my table to a UnionDataset, I can reuse the same _filter() code that ParquetDataset uses. But I'm not having any luck with my (admittedly very naive) approach:

> >>> pyarrow.dataset.UnionDataset(table.schema, [table])

That just gives me:

> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_dataset.pyx", line 429, in pyarrow._dataset.UnionDataset.__init__
> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset

I'm guessing that somewhere deep in https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py there is an example of how to make a Dataset out of a table, but I'm not finding it. Any help would be greatly appreciated :)

Regards,
Niklas
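P.S. In case it helps to make the goal concrete, the end state I'm hoping for is roughly the sketch below. The ds.dataset(table) line is exactly the part I can't find a working spelling for (per the TypeError above), so treat it as wishful thinking; column names are made up and the snippet is untested:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Stand-in for the Table reconstructed from Plasma (made-up columns).
    table = pa.table({"category": ["a", "b", "a"], "value": [1, 2, 3]})

    # Hypothetical step: some way of wrapping the in-memory Table in a
    # Dataset -- this is the call I'm missing.
    dataset = ds.dataset(table)

    # Filtering would then look just like the file-backed case: project
    # the few columns I need and filter the rows before to_pandas().
    filtered = dataset.to_table(
        columns=["value"],
        filter=ds.field("category") == "a",
    )
    df = filtered.to_pandas()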