Thank you! This looks awesome. Is there a good way to invert the ChunkedArray? I know I can convert to NumPy (experimental) and do it there, but I would love a native Arrow version :)
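For reference, a minimal sketch of the native inversion I have in mind, assuming pyarrow's pc.invert kernel accepts the boolean ChunkedArray that pc.is_in returns (the data below is illustrative):

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative data; "ids" stands in for the real id column.
ids = pa.chunked_array([["uuid1", "uuid2"], ["uuid3"]])

# Mark rows whose id is in the value set...
mask = pc.is_in(ids, value_set=pa.array(["uuid1", "uuid2"]))

# ...then flip the mask element-wise to get the rows to keep.
keep = pc.invert(mask)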
> On 4 Nov 2020, at 14:45, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> Hi Niklas,
>
> The "is_in" docstring is not directly clear about it, but you need to pass
> the second argument as a keyword argument using the "value_set" keyword name.
> Small example:
>
> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a", "c"]))
> Out[19]:
> <pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
> [
>   true,
>   false,
>   true,
>   false
> ]
>
> You can find this keyword among the keywords of pc.SetLookupOptions.
> We know the docstrings are not yet in a good state. This was recently
> improved in https://issues.apache.org/jira/browse/ARROW-9164, and we should
> maybe also try to inject the option keywords into the function docstrings.
>
> Best,
> Joris
>
> On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:
>
>> Hi,
>>
>> I'm trying in Python to filter out certain rows (based on uuid strings)
>> without reading the entire Parquet file into memory. My approach is to
>> read each row group and then filter it without casting it to pandas
>> (since that is expensive for data frames with lots of strings in them).
>> Looking at the compute function list, my hope was to be able to use the
>> `is_in` operator. How would you actually use it? My naive approach would be:
>>
>> import pyarrow.compute as pc
>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>> # somehow invert the mask, since it marks the ones that I don't want
>>
>> The above gives:
>>
>> >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: wrapper() takes 1 positional argument but 2 were given
>>
>> Trying to pass anything else into is_in, such as a pa.array, results in
>> segfaults.
>>
>> Is the above at all possible with is_in?
>>
>> Normally I would use the filters argument of pyarrow.parquet.ParquetDataset,
>> but I need to apply it per row group, not to the entire file, so that I can
>> write the modified row group back to disk in another file.
>>
>> Regards,
>> Niklas
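For the per-row-group workflow quoted above, a rough end-to-end sketch (file names are hypothetical; assumes Table.filter and pc.invert are available in the installed pyarrow version):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

source = pq.ParquetFile("input.parquet")  # hypothetical file name
unwanted = pa.array(["uuid1", "uuid2"])

writer = None
for i in range(source.num_row_groups):
    # Read one row group at a time to keep memory bounded.
    rg = source.read_row_group(i)
    mask = pc.is_in(rg["id"], value_set=unwanted)
    # Keep only the rows whose id is NOT in the unwanted set.
    filtered = rg.filter(pc.invert(mask))
    if writer is None:
        writer = pq.ParquetWriter("output.parquet", filtered.schema)
    writer.write_table(filtered)
if writer is not None:
    writer.close()

Writing each filtered row group through a single ParquetWriter keeps the output as one file while never holding more than one row group in memory at a time.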