[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634264#comment-17634264 ]

Joris Van den Bossche commented on ARROW-15716:
-----------------------------------------------

To OR-combine the expressions for the individual paths, you can do this 
automatically with {{reduce()}} and a list comprehension calling 
{{Partitioning.parse}} on each path (without having to resort to 
{{_get_partition_keys}} and {{filters_to_expression}}). Using your example:

{code:python}
import functools
import operator

import pyarrow as pa
import pyarrow.dataset as ds

paths = ['path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet',
         'path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet',
         'path/to/data/month_id=202107/v1-manual__2022-11-06T22:50:20.parquet']
partitioning = ds.partitioning(pa.schema([('month_id', 'int64')]), flavor="hive")

filt_expr = functools.reduce(operator.or_, [partitioning.parse(file) for file in paths])
# <pyarrow.compute.Expression (((month_id == 202105) or (month_id == 202106)) or (month_id == 202107))>
{code}

I think this is what Weston is suggesting. It doesn't necessarily give the 
most efficient filter expression, but it is a direct translation of the 
subset of paths (if there are many paths, it might be more efficient to use 
{{is_in}} or a greater/less-than comparison kernel).
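
For example, a minimal sketch of the {{is_in}} variant. This reuses {{paths}} 
and the {{month_id}} field from the example above, and extracts the partition 
values with plain string parsing (an assumption for illustration, rather than 
going through {{Partitioning.parse}}):

{code:python}
import pyarrow.dataset as ds

# Pull the distinct month_id values out of the paths (simple string
# parsing, assuming hive-style "month_id=<value>" segments).
month_ids = sorted({int(p.split("month_id=")[1].split("/")[0]) for p in paths})

# A single membership test instead of a chain of OR-ed equality expressions.
filt_expr = ds.field("month_id").isin(month_ids)
{code}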

> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process 
> recently saved data, but I can run into problems with this if I overwrite 
> data with delete_matching in order to consolidate small files, since the 
> paths will no longer exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths), 
> which would return a dataset expression.
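
Until such an API exists, the OR-combining approach from the comment above 
can be wrapped in a small helper; a minimal sketch (the name {{parse_paths}} 
is hypothetical, not an existing pyarrow function):

{code:python}
import functools
import operator

def parse_paths(partitioning, paths):
    """OR-combine the partition expressions parsed from each path."""
    return functools.reduce(operator.or_, (partitioning.parse(p) for p in paths))

# Usage, continuing the example above:
# filt_exp = parse_paths(partitioning, paths)
# table = dataset.to_table(filter=filt_exp)
{code}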


