Re: PyArrow: Using is_in compute to filter list of strings in a Table

Niklas B Wed, 04 Nov 2020 06:24:02 -0800

Never mind, I realized I can use pyarrow.compute.invert. Thank you again
for the super fast answer.


> On 4 Nov 2020, at 15:13, Niklas B <niklas.biv...@enplore.com> wrote:
> 
> Thank you! This looks awesome. Is there a good way to invert the ChunkedArray? I 
> know I can convert to NumPy (experimental) and do it there, but would love a 
> native Arrow version :)
> 
>> On 4 Nov 2020, at 14:45, Joris Van den Bossche 
>> <jorisvandenboss...@gmail.com> wrote:
>> 
>> Hi Niklas,
>> 
>> The "is_in" docstring is not very clear about it, but you need to pass
>> the second argument as a keyword argument, using the "value_set" keyword name.
>> Small example:
>> 
>> In [19]: pc.is_in(pa.array(["a", "b", "c", "d"]), value_set=pa.array(["a",
>> "c"]))
>> Out[19]:
>> <pyarrow.lib.BooleanArray object at 0x7f508af95ac8>
>> [
>> true,
>> false,
>> true,
>> false
>> ]
>> 
>> You can find this keyword among the keywords of pc.SetLookupOptions.
>> We know the docstrings are not yet in a good state. This was already
>> improved recently in https://issues.apache.org/jira/browse/ARROW-9164, and
>> we should maybe also try to inject the option keywords into the function
>> docstring.
>> 
>> Best,
>> Joris
>> 
>> On Wed, 4 Nov 2020 at 14:14, Niklas B <niklas.biv...@enplore.com> wrote:
>> 
>>> Hi,
>>> 
>>> In Python, I’m trying to filter out certain rows (based on UUID strings)
>>> without reading the entire Parquet file into memory. My approach is to read
>>> each row group, then filter it without converting it to pandas (since
>>> that’s expensive for data frames with lots of strings in them). Looking at the
>>> compute function list, my hope was to be able to use the `is_in` function. How
>>> would you actually use it? My naive approach would be:
>>> 
>>> import pyarrow.compute as pc
>>> mask = pc.is_in(table['id'], pa.array(["uuid1", "uuid2"]))
>>> # somehow invert the mask since it shows the ones that I don’t want
>>> 
>>> The above gives:
>>> 
>>> >>> pc.is_in(table["id"], pa.array(["uuid1", "uuid2"]))
>>> Traceback (most recent call last):
>>> File "<stdin>", line 1, in <module>
>>> TypeError: wrapper() takes 1 positional argument but 2 were given
>>> 
>>> Trying to pass anything else into is_in, such as a pa.array, results in
>>> segfaults.
>>> 
>>> Is the above at all possible with is_in?
>>> 
>>> Normally I would use the filters argument of pyarrow.parquet.ParquetDataset,
>>> but I need to apply the filter per row group, not to the entire file, so that
>>> I can then write the modified row group to disk in another file.
>>> 
>>> Regards,
>>> Niklas
>
