Hi,

I’m trying to filter out certain rows of a parquet file (based on uuid 
strings) in Python without reading the entire file into memory. My approach 
is to read each row group and then filter it without converting it to pandas 
(since that’s expensive for data frames with lots of strings in them). Looking 
at the compute function list, my hope was to be able to use the `is_in` 
function. How would you actually use it? My naive approach would be:

import pyarrow as pa
import pyarrow.compute as pc
mask = pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
# somehow invert the mask, since it marks the rows that I don't want

The above gives:

>>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: wrapper() takes 1 positional argument but 2 were given

Passing anything else to is_in, such as a pya.array, results in 
segfaults. 

Is the above possible at all with is_in? 

Normally I would use the `filters` argument of 
pyarrow.parquet.ParquetDataset, but I need to apply the filter per row group 
rather than to the entire file, so that I can then write each modified row 
group to disk again in another file.

Regards,
Niklas
