Thank you! This looks awesome. Is there a good way to invert the ChunkedArray? I know I can convert to NumPy (experimental) and do it there, but I would love a native Arrow version :)
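For reference, a minimal sketch of the native inversion I have in mind, assuming pyarrow's pc.invert kernel accepts the boolean ChunkedArray that pc.is_in returns (the data below is illustrative):

import pyarrow as pa
import pyarrow.compute as pc

# Illustrative data; "ids" stands in for the real id column.
ids = pa.chunked_array([["uuid1", "uuid2"], ["uuid3"]])

# Mark rows whose id is in the value set...
mask = pc.is_in(ids, value_set=pa.array(["uuid1", "uuid2"]))

# ...then flip the mask element-wise to get the rows to keep.
keep = pc.invert(mask)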
> On 4 Nov 2020, at 14:45, Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
>
> Hi Niklas,
>
> The "is_in" docstring is not directly clear about it, but you need to pass
> the second argument as a keyword argument using the "value_set" keyword name.
> Small example:
>
> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a", "c"]))
> Out[19]:
> <pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
> [
>   true,
>   false,
>   true,
>   false
> ]
>
> You can find this keyword among the keywords of pc.SetLookupOptions.
> We know the docstrings are not yet in a good state. This was recently
> improved in https://issues.apache.org/jira/browse/ARROW-9164, and we should
> maybe also try to inject the option keywords into the function docstrings.
>
> Best,
> Joris
>
> On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:
>
>> Hi,
>>
>> I'm trying in Python to filter out certain rows (based on uuid strings)
>> without reading the entire Parquet file into memory. My approach is to
>> read each row group and then filter it without casting it to pandas
>> (since that is expensive for data frames with lots of strings in them).
>> Looking at the compute function list, my hope was to be able to use the
>> `is_in` operator. How would you actually use it? My naive approach would be:
>>
>> import pyarrow.compute as pc
>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>> # somehow invert the mask, since it marks the ones that I don't want
>>
>> The above gives:
>>
>> >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: wrapper() takes 1 positional argument but 2 were given
>>
>> Trying to pass anything else into is_in, such as a pa.array, results in
>> segfaults.
>>
>> Is the above at all possible with is_in?
>>
>> Normally I would use the filters argument of pyarrow.parquet.ParquetDataset,
>> but I need to apply it per row group, not to the entire file, so that I can
>> write the modified row group back to disk in another file.
>>
>> Regards,
>> Niklas
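For the per-row-group workflow quoted above, a rough end-to-end sketch (file names are hypothetical; assumes Table.filter and pc.invert are available in the installed pyarrow version):

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

source = pq.ParquetFile("input.parquet")  # hypothetical file name
unwanted = pa.array(["uuid1", "uuid2"])

writer = None
for i in range(source.num_row_groups):
    # Read one row group at a time to keep memory bounded.
    rg = source.read_row_group(i)
    mask = pc.is_in(rg["id"], value_set=unwanted)
    # Keep only the rows whose id is NOT in the unwanted set.
    filtered = rg.filter(pc.invert(mask))
    if writer is None:
        writer = pq.ParquetWriter("output.parquet", filtered.schema)
    writer.write_table(filtered)
if writer is not None:
    writer.close()

Writing each filtered row group through a single ParquetWriter keeps the output as one file while never holding more than one row group in memory at a time.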