Hi, I’m trying, in Python, to filter out certain rows (based on UUID strings) without reading the entire Parquet file into memory. My approach is to read each row group and then filter it without converting it to pandas, since that conversion is expensive for data frames with lots of strings in them. Looking at the compute function list, I hoped to be able to use the `is_in` function. How would you actually use it? My naive approach would be:

    import pyarrow as pa
    import pyarrow.compute as pc

    mask = pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
    # somehow invert the mask, since it marks the rows that I don't want

The above gives:

    >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: wrapper() takes 1 positional argument but 2 were given

Passing anything else into `is_in`, such as a `pa.array`, results in segfaults. Is the above possible at all with `is_in`?

Normally I would use the `filters` argument of `pyarrow.parquet.ParquetDataset`, but I need to apply the filter per row group, not to the entire file, so that I can then write the modified row group to disk in another file.

Regards,
Niklas