[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634264#comment-17634264 ]
Joris Van den Bossche commented on ARROW-15716:
-----------------------------------------------

To just OR-combine the different expressions for each of the paths, you can do this automatically with {{reduce()}} and a list comprehension calling {{Partitioning.parse}} on each path (without having to resort to {{_get_partition_keys}} and {{filters_to_expression}}). Using your example:

{code:python}
>>> import functools
>>> import operator
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> paths = ['path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet',
...          'path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet',
...          'path/to/data/month_id=202107/v1-manual__2022-11-06T22:50:20.parquet']
>>> partitioning = ds.partitioning(pa.schema([('month_id', 'int64')]), flavor="hive")
>>> functools.reduce(operator.or_, [partitioning.parse(file) for file in paths])
<pyarrow.compute.Expression (((month_id == 202105) or (month_id == 202106)) or (month_id == 202107))>
{code}

I think this is what Weston is suggesting. It doesn't necessarily give the most efficient filter expression, but it is a direct translation of the subset of paths (if there are many paths, an {{is_in}} filter or a greater/less-than comparison kernel might be more efficient; see the sketch below).

> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path?
> I am passing the .paths from file_visitor to downstream tasks to process data that was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files, since those paths won't exist anymore.
> Here is the output of my current approach, which uses filters instead of reading the paths directly:
> {code:python}
> # Fragments saved during write_dataset
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-2.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-1.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-0.parquet']
>
> # Run partitioning.parse() on each fragment
> [<pyarrow.compute.Expression (date_id == 20210813)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>]
>
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
>
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like {{filt_exp = partitioning.parse(paths)}}, which would return a dataset expression.
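
For illustration, a minimal sketch of the reduce-based approach from the comment wrapped in a hypothetical {{parse_paths}} helper (not part of the pyarrow API), alongside the {{is_in}} alternative mentioned there; the paths and values are assumptions carried over from the example:

{code:python}
# Hypothetical helper (not in pyarrow) that OR-combines the partition
# expressions parsed from a list of fragment paths.
import functools
import operator

import pyarrow as pa
import pyarrow.dataset as ds


def parse_paths(partitioning, paths):
    """Return a single filter expression covering all of the given paths."""
    return functools.reduce(
        operator.or_, (partitioning.parse(path) for path in paths)
    )


partitioning = ds.partitioning(pa.schema([("month_id", "int64")]), flavor="hive")
paths = [
    "path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet",
    "path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet",
]
filt = parse_paths(partitioning, paths)
# -> (month_id == 202105) or (month_id == 202106)

# With many paths, a single is_in over the distinct partition values may be
# more efficient than a long chain of OR-ed equality comparisons:
filt_isin = ds.field("month_id").isin([202105, 202106])

# Either expression can then be used as a filter, e.g.:
# ds.dataset("path/to/data", partitioning=partitioning).to_table(filter=filt)
{code}

Note that the {{is_in}} variant still requires extracting the distinct partition values from the paths first; {{Partitioning.parse}} returns equality expressions rather than the raw values.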