Thanks for the reply. That helps.

On Thu, Mar 27, 2025, 7:29 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> The file source in Spark has not been migrated to DS v2 yet and uses
> dedicated catalyst rules to do runtime filtering, e.g. PartitionPruning
> and PlanDynamicPruningFilters
>
> On Thu, Mar 27, 2025 at 6:53 PM Asif Shahid <asif.sha...@gmail.com> wrote:
>
>> Hi Experts,
>> Could you please allow me to pick your brain on the following:
>>
>> For Hive tables (managed), the scan operator is FileSourceScanExec.
>> Is there any particular reason why its underlying HadoopFsRelation's
>> FileFormat field does not implement an interface like
>> SupportsRuntimeFiltering?
>> Like the Scan contained in BatchScanExec, FileSourceScanExec may also
>> benefit from pushdown of runtime filters to skip chunks while reading,
>> say, the Parquet format.
>> The reason I ask is that I have been working, personally, on pushing
>> down BroadcastHashJoin's build-side set (converted to a SortedSet) as a
>> runtime filter to the Iceberg Scan/DataSource layer, for filtering at
>> various stages (something akin to DPP, but for non-partitioned columns):
>> https://github.com/apache/spark/pull/49209
>>
>> I am thinking of doing the same for Hive-based relations, using Parquet
>> (for starters). I believe Parquet has min/max data available per chunk,
>> and I want to utilize it for pruning.
>>
>> I know that this works fine for Iceberg-formatted data, and was wondering
>> if you see any issue in doing the same for FileSourceScanExec with
>> Parquet-format data?
>>
>> Regards
>> Asif
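
For reference, the DS v2 hook mentioned above is
org.apache.spark.sql.connector.read.SupportsRuntimeFiltering. A minimal
sketch of a Scan implementing it follows; the class MyParquetScan, the
FileChunk partition type, the "join_key" column, and the min/max pruning
logic in planInputPartitions are illustrative assumptions for this sketch,
not Spark API:

import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory, Scan, SupportsRuntimeFiltering}
import org.apache.spark.sql.sources.{Filter, In}
import org.apache.spark.sql.types.StructType

// Hypothetical unit of work: one file chunk with min/max stats for the
// join key, as a DS v2 InputPartition.
case class FileChunk(path: String, min: Long, max: Long) extends InputPartition

class MyParquetScan(schema: StructType, chunks: Seq[FileChunk],
                    readerFactory: PartitionReaderFactory)
    extends Scan with Batch with SupportsRuntimeFiltering {

  // Filters Spark hands us at runtime (e.g. values gathered from a
  // broadcast build side or a DPP subquery).
  private var runtimeFilters: Array[Filter] = Array.empty

  override def readSchema(): StructType = schema
  override def toBatch: Batch = this

  // Advertise which column(s) runtime filters may target.
  override def filterAttributes(): Array[NamedReference] =
    Array(Expressions.column("join_key"))

  // Called after the build side executes, before input partitions are planned.
  override def filter(filters: Array[Filter]): Unit = { runtimeFilters = filters }

  // Partitions are planned after filter(), so chunks whose min/max
  // cannot contain any of the pushed key values are dropped here.
  override def planInputPartitions(): Array[InputPartition] = {
    val kept = runtimeFilters.collectFirst { case In("join_key", values) =>
      val keys = values.collect { case k: Long => k }
      chunks.filter(c => keys.exists(k => k >= c.min && k <= c.max))
    }.getOrElse(chunks)
    kept.toArray[InputPartition]
  }

  override def createReaderFactory(): PartitionReaderFactory = readerFactory
}

The interface itself (filterAttributes/filter) is real Spark 3.2+ API; the
point of the sketch is only that planInputPartitions can consult the
runtime filters to skip chunks, which is the same shape of pruning the
linked PR does for Iceberg.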
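
As for the per-chunk min/max data the question refers to: Parquet stores
column statistics per row group in the file footer, and they can be read
directly with parquet-mr. A small standalone sketch (the file path and
column name arguments are placeholders) that prints the bounds such
pruning would consult:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._

// Prints per-row-group min/max statistics for one column of a Parquet file.
object RowGroupStats {
  def main(args: Array[String]): Unit = {
    val path = new Path(args(0))  // e.g. /tmp/part-00000.parquet (placeholder)
    val column = args(1)          // e.g. join_key (placeholder)
    val reader = ParquetFileReader.open(
      HadoopInputFile.fromPath(path, new Configuration()))
    try {
      reader.getFooter.getBlocks.asScala.zipWithIndex.foreach { case (block, i) =>
        block.getColumns.asScala
          .find(_.getPath.toDotString == column)
          .foreach { chunk =>
            val stats = chunk.getStatistics
            if (stats != null && stats.hasNonNullValue) {
              // These are the bounds a runtime filter would test before
              // deciding to read or skip the row group.
              println(s"row group $i: rows=${block.getRowCount} " +
                s"min=${stats.genericGetMin} max=${stats.genericGetMax}")
            }
          }
      }
    } finally reader.close()
  }
}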