Thanks for the reply. That helps.

On Thu, Mar 27, 2025, 7:29 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> The file source in Spark has not been migrated to DS v2 yet and uses
> dedicated catalyst rules to do runtime filtering, e.g. PartitionPruning
> and PlanDynamicPruningFilters
>
> On Thu, Mar 27, 2025 at 6:53 PM Asif Shahid <asif.sha...@gmail.com> wrote:
>
>> Hi Experts,
>> Could you please allow me to pick your brain on the following:
>>
>> For Hive tables (managed), the scan operator is FileSourceScanExec.
>> Is there any particular reason why its underlying HadoopFsRelation's
>> FileFormat field does not implement an interface like
>> SupportsRuntimeFiltering?
>> Like the Scan contained in BatchScanExec, FileSourceScanExec may also
>> benefit from pushdown of runtime filters to skip chunks while reading,
>> say, the Parquet format.
>> The reason I ask is that I have been working, personally, on pushing
>> down BroadcastHashJoin's build-side set (converted to a SortedSet) as a
>> runtime filter to the Iceberg Scan/DataSource layer, for filtering at
>> various stages (something akin to DPP, but for non-partitioned columns):
>> https://github.com/apache/spark/pull/49209
>>
>> I am thinking of doing the same for Hive-based relations, using Parquet
>> (for starters). I believe Parquet has min/max data available per chunk,
>> and I want to utilize it for pruning.
>>
>> I know that this works fine for Iceberg-formatted data, and was wondering
>> if you see any issue in doing the same for FileSourceScanExec with
>> Parquet-format data?
>>
>> Regards
>> Asif
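
For reference, the DS v2 hook mentioned above is
org.apache.spark.sql.connector.read.SupportsRuntimeFiltering. A minimal
sketch of a Scan implementing it follows; the class MyParquetScan, the
FileChunk partition type, the "join_key" column, and the min/max pruning
logic in planInputPartitions are illustrative assumptions for this sketch,
not Spark API:

import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory, Scan, SupportsRuntimeFiltering}
import org.apache.spark.sql.sources.{Filter, In}
import org.apache.spark.sql.types.StructType

// Hypothetical unit of work: one file chunk with min/max stats for the
// join key, as a DS v2 InputPartition.
case class FileChunk(path: String, min: Long, max: Long) extends InputPartition

class MyParquetScan(schema: StructType, chunks: Seq[FileChunk],
                    readerFactory: PartitionReaderFactory)
    extends Scan with Batch with SupportsRuntimeFiltering {

  // Filters Spark hands us at runtime (e.g. values gathered from a
  // broadcast build side or a DPP subquery).
  private var runtimeFilters: Array[Filter] = Array.empty

  override def readSchema(): StructType = schema
  override def toBatch: Batch = this

  // Advertise which column(s) runtime filters may target.
  override def filterAttributes(): Array[NamedReference] =
    Array(Expressions.column("join_key"))

  // Called after the build side executes, before input partitions are planned.
  override def filter(filters: Array[Filter]): Unit = { runtimeFilters = filters }

  // Partitions are planned after filter(), so chunks whose min/max
  // cannot contain any of the pushed key values are dropped here.
  override def planInputPartitions(): Array[InputPartition] = {
    val kept = runtimeFilters.collectFirst { case In("join_key", values) =>
      val keys = values.collect { case k: Long => k }
      chunks.filter(c => keys.exists(k => k >= c.min && k <= c.max))
    }.getOrElse(chunks)
    kept.toArray[InputPartition]
  }

  override def createReaderFactory(): PartitionReaderFactory = readerFactory
}

The interface itself (filterAttributes/filter) is real Spark 3.2+ API; the
point of the sketch is only that planInputPartitions can consult the
runtime filters to skip chunks, which is the same shape of pruning the
linked PR does for Iceberg.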
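
As for the per-chunk min/max data the question refers to: Parquet stores
column statistics per row group in the file footer, and they can be read
directly with parquet-mr. A small standalone sketch (the file path and
column name arguments are placeholders) that prints the bounds such
pruning would consult:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.jdk.CollectionConverters._

// Prints per-row-group min/max statistics for one column of a Parquet file.
object RowGroupStats {
  def main(args: Array[String]): Unit = {
    val path = new Path(args(0))  // e.g. /tmp/part-00000.parquet (placeholder)
    val column = args(1)          // e.g. join_key (placeholder)
    val reader = ParquetFileReader.open(
      HadoopInputFile.fromPath(path, new Configuration()))
    try {
      reader.getFooter.getBlocks.asScala.zipWithIndex.foreach { case (block, i) =>
        block.getColumns.asScala
          .find(_.getPath.toDotString == column)
          .foreach { chunk =>
            val stats = chunk.getStatistics
            if (stats != null && stats.hasNonNullValue) {
              // These are the bounds a runtime filter would test before
              // deciding to read or skip the row group.
              println(s"row group $i: rows=${block.getRowCount} " +
                s"min=${stats.genericGetMin} max=${stats.genericGetMax}")
            }
          }
      }
    } finally reader.close()
  }
}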