Hi Niklas,

The "is_in" docstring is not directly clear about it, but you need to pass
the second argument as a keyword argument using "value_set" keyword name.
Small example:

In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a",
"c"]))
Out[19]:
<pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
[
  true,
  false,
  true,
  false
]
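
If you then also want to invert that mask (as the comment in your snippet
mentions, is_in flags the rows you want to drop), pc.invert combined with
Table.filter should do it. A quick, untested sketch, assuming Table.filter
is available in your pyarrow version:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"id": ["a", "b", "c", "d"], "x": [1, 2, 3, 4]})
# mask is true for the ids we want to drop
mask = pc.is_in(table["id"], value_set=pa.array(["a", "c"]))
# keep only the rows whose id is NOT in the value set
filtered = table.filter(pc.invert(mask))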

You can find this keyword among the keywords of pc.SetLookupOptions.
We know the docstrings are not yet in a good state. This was recently
improved in https://issues.apache.org/jira/browse/ARROW-9164, and we
should probably also try to inject the option keywords into the function
docstrings.
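
In the meantime, you can also construct the options object explicitly and
pass it through the generic call_function entry point, which makes it more
visible where "value_set" comes from. A small, untested sketch, equivalent
to the example above:

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["a", "b", "c", "d"])
# the options class that holds the value_set keyword
opts = pc.SetLookupOptions(value_set=pa.array(["a", "c"]))
# call the "is_in" kernel directly with an options instance
pc.call_function("is_in", [arr], opts)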

Best,
Joris

On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:

> Hi,
>
> I’m trying in Python to (without reading the entire parquet file into
> memory) filter out certain rows (based on uuid-strings). My approach is to
> read each row group, then try to filter it without casting it to pandas
> (since that is expensive for data-frames with lots of strings in them).
> Looking at the compute function list, my hope was to be able to use the
> `is_in` operator. How would you actually use it? My naive approach would be:
>
> import pyarrow.compute as pc
> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
> # somehow invert the mask since it shows the ones that I don’t want
>
> Above gives:
>
> >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: wrapper() takes 1 positional argument but 2 were given
>
> Trying to pass anything else into is_in, like a pa.array, results in
> segfaults.
>
> Is above at all possible with is_in?
>
> Normally I would use pyarrow.parquet.ParquetDataset().filters() but I need
> to apply it per row group, not to the entire file, so that I can then write
> the modified row group to disk again in another file.
>
> Regards,
> Niklas
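
And for the per-row-group part of the question: a rough, untested sketch of
reading, filtering and rewriting each row group could look like the
following (it assumes ParquetFile.schema_arrow and Table.filter are
available in your pyarrow version; the file names are just placeholders):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

ids_to_drop = pa.array(["uuid1", "uuid2"])

source = pq.ParquetFile("input.parquet")
with pq.ParquetWriter("output.parquet", source.schema_arrow) as writer:
    for i in range(source.num_row_groups):
        # read a single row group to avoid loading the whole file
        rg = source.read_row_group(i)
        mask = pc.is_in(rg["id"], value_set=ids_to_drop)
        # keep only rows whose id is NOT in the set, then write back out
        writer.write_table(rg.filter(pc.invert(mask)))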
