Hi,

I have an in-memory dataset from Plasma that I need to filter before running 
`to_pandas()`. It's a very text-heavy dataset with a lot of rows and columns 
(only about 30% of which are applicable for any operation). I know that you 
can use DNF filters to filter a Parquet file before reading it into memory, 
and I'm now trying to do the same for my pa.Table that is already in memory. 
https://issues.apache.org/jira/browse/ARROW-7945 indicates that it should be 
possible to unify datasets after construction, but my Arrow skills aren't 
quite there yet.
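
For reference, this is the Parquet-side filtering I mean (a minimal sketch; 
the file path and column name are made up):

    import pyarrow.parquet as pq

    # DNF filter: keep only rows where category == "news",
    # applied while reading, before the table reaches memory.
    table = pq.read_table(
        "data.parquet",
        filters=[("category", "=", "news")],
    )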

> […]
> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
> buffer = pa.BufferReader(data)
> reader = pa.RecordBatchStreamReader(buffer)
> table = pa.Table.from_batches(reader)  # yields a pa.Table, not a RecordBatch
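
I can get part of the way there with a hand-built boolean mask via 
pyarrow.compute (again, the column name is made up), but what I'm after is 
the dataset-style expression filtering:

    import pyarrow.compute as pc

    # Keep only rows where the (hypothetical) "category" column
    # equals "news"; Table.filter() takes a boolean mask array.
    mask = pc.equal(table["category"], "news")
    filtered = table.filter(mask)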


I've been reading up on Dataset, UnionDataset and how ParquetDataset does it 
(http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset). 
My thinking is that if I can cast my table to a UnionDataset I can reuse the 
same _filter() code that ParquetDataset uses. But I'm not having any luck 
with my (admittedly very naive) approach:

> >>> pyarrow.dataset.UnionDataset(table.schema, [table])

But that just gives me:

> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_dataset.pyx", line 429, in pyarrow._dataset.UnionDataset.__init__
> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset
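
From skimming the docs, it looks like newer pyarrow releases have an 
InMemoryDataset that wraps a Table directly; a minimal sketch of what I'm 
hoping for, assuming that API is available in my version (column name made 
up again):

    import pyarrow.dataset as ds

    # Wrap the in-memory Table as a Dataset (InMemoryDataset only
    # exists in newer pyarrow releases), then filter with an
    # expression before materializing to pandas.
    dataset = ds.InMemoryDataset(table)
    filtered = dataset.to_table(filter=ds.field("category") == "news")
    df = filtered.to_pandas()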


I'm guessing that somewhere deep in 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py
there's an example of how to make a Dataset out of a table, but I'm not 
finding it.

Any help would be greatly appreciated :)

Regards,
Niklas
