Hi Niklas,

The "is_in" docstring is not directly clear about this, but you need to pass the second argument as a keyword argument, using the "value_set" keyword name. A small example:
In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a", "c"]))
Out[19]:
<pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
[
  true,
  false,
  true,
  false
]

You can find this keyword listed in pc.SetLookupOptions. We know the docstrings are not yet in a good state. This was already improved recently in https://issues.apache.org/jira/browse/ARROW-9164, and we should maybe also try to inject the option keywords into the function docstrings.

Best,
Joris

On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:
> Hi,
>
> I'm trying, in Python, to filter out certain rows (based on uuid strings)
> without reading the entire Parquet file into memory. My approach is to read
> each row group and then filter it without converting it to pandas (since
> that is expensive for data frames with lots of strings in them). Looking at
> the compute function list, my hope was to use the `is_in` operator. How
> would you actually use it? My naive approach would be:
>
> import pyarrow.compute as pc
> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
> # somehow invert the mask since it shows the ones that I don't want
>
> The above gives:
>
> >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: wrapper() takes 1 positional argument but 2 were given
>
> Trying to pass anything else into is_in, like a pa.array, results in
> segfaults.
>
> Is the above at all possible with is_in?
>
> Normally I would use pyarrow.parquet.ParquetDataset().filters, but I need
> to apply it per row group, not for the entire file, so I can then write the
> modified row group to disk again in another file.
>
> Regards,
> Niklas