Never mind, I realized I can use pyarrow.compute.invert. Thank you again for the super-fast answer!
> On 4 Nov 2020, at 15:13, Niklas B <niklas.biv...@enplore.com> wrote:
>
> Thank you! This looks awesome. Any good way to invert the ChunkedArray? I
> know I can cast to NumPy (experimental) and do it there, but would love a
> native Arrow version :)
>
>> On 4 Nov 2020, at 14:45, Joris Van den Bossche
>> <jorisvandenboss...@gmail.com> wrote:
>>
>> Hi Niklas,
>>
>> The "is_in" docstring is not directly clear about it, but you need to pass
>> the second argument as a keyword argument using the "value_set" keyword
>> name. Small example:
>>
>> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a", "c"]))
>> Out[19]:
>> <pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
>> [
>>   true,
>>   false,
>>   true,
>>   false
>> ]
>>
>> You can find this keyword in the keywords of pc.SetLookupOptions.
>> We know the docstrings are not yet in a good state. This was recently
>> improved in https://issues.apache.org/jira/browse/ARROW-9164, and we
>> should maybe also try to inject the option keywords into the function
>> docstrings.
>>
>> Best,
>> Joris
>>
>> On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying, in Python, to filter out certain rows (based on UUID strings)
>>> without reading the entire Parquet file into memory. My approach is to
>>> read each row group, then try to filter it without casting it to pandas
>>> (since that is expensive for data frames with lots of strings in them).
>>> Looking at the compute function list, my hope was to be able to use the
>>> `is_in` operator. How would you actually use it?
>>> My naive approach would be:
>>>
>>> import pyarrow.compute as pc
>>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>>> # somehow invert the mask, since it shows the ones that I don't want
>>>
>>> The above gives:
>>>
>>>>>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> TypeError: wrapper() takes 1 positional argument but 2 were given
>>>
>>> Trying to pass anything else into is_in, like a pa.array, results in
>>> segfaults.
>>>
>>> Is the above at all possible with is_in?
>>>
>>> Normally I would use pyarrow.parquet.ParquetDataset().filters() but I
>>> need to apply it per row group, not for the entire file, so that I can
>>> write the modified row group to disk again in another file.
>>>
>>> Regards,
>>> Niklas
>