adriangb opened a new issue, #17954:
URL: https://github.com/apache/datafusion/issues/17954

   Consider the scenario of:
   
   ```sql
   SELECT *
   FROM large_table
   JOIN small_table ON large_table.id = small_table.id
   WHERE small_table.name = 'Adrian';
   ```
   
   As described in [our recent blog 
post](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/), we will 
first scan `small_table`, find the `id` for `'Adrian'`, and then scan 
`large_table` with that information available as a dynamic filter. But what if 
we had an external, table-level point lookup index on `large_table.id`? We 
would have no way to consult it during the scan.
   
   One option is to add hooks to the Parquet readers that get called before 
each file is scanned, something like:
   
   
   ```rust
   trait ScanPlanUpdater {
       /// Called once per file before it is scanned; may return an updated plan.
       async fn rescan(
           &self,
           file: PartitionedFile,
           plan: FileScanPlan,
       ) -> Result<FileScanPlan>;
   }
   ```
   
   We would then call this before doing any more work on the file, giving the 
implementation a chance to check the point lookup index. The main issue with 
this option is that it could result in *many more* lookups into the point 
lookup index than if the lookup were done once at the table level. Maybe 
implementations of `ScanPlanUpdater` could keep some sort of cache? I don't see 
a way to do it at the table level: the concept of a table is long gone by this 
point, and I can't think of a low-friction way to apply a filter to an entire 
`DataSourceExec`.
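
   To make the cache idea a bit more concrete, here is a minimal sketch in 
plain std Rust (no DataFusion APIs). `PointLookupIndex`, `CachedIndex`, and 
`CountingIndex` are made-up names, files are identified by plain strings, and 
the lookup is synchronous for brevity; a real hook would presumably be async 
and work with `PartitionedFile`. The wrapper memoizes the index answer per key, 
so per-file calls for the same key reach the external index only once:

   ```rust
   use std::collections::{HashMap, HashSet};
   use std::sync::atomic::{AtomicUsize, Ordering};
   use std::sync::Mutex;

   /// Hypothetical stand-in for an external, table-level point lookup index:
   /// given a key, return the names of the files that may contain it.
   trait PointLookupIndex {
       fn files_for_key(&self, key: i64) -> HashSet<String>;
   }

   /// Wraps an index and remembers its answers per key.
   struct CachedIndex<I> {
       inner: I,
       cache: Mutex<HashMap<i64, HashSet<String>>>,
   }

   impl<I: PointLookupIndex> CachedIndex<I> {
       fn new(inner: I) -> Self {
           Self { inner, cache: Mutex::new(HashMap::new()) }
       }

       /// Returns true if `file` may contain `key` and so still needs scanning.
       /// The lock is held across the lookup, which is fine for this
       /// synchronous sketch but would need rethinking for an async index.
       fn should_scan(&self, file: &str, key: i64) -> bool {
           let mut cache = self.cache.lock().unwrap();
           let files = cache
               .entry(key)
               .or_insert_with(|| self.inner.files_for_key(key));
           files.contains(file)
       }
   }

   /// Toy in-memory index that counts how often it is actually queried.
   struct CountingIndex {
       entries: HashMap<i64, HashSet<String>>,
       lookups: AtomicUsize,
   }

   impl PointLookupIndex for CountingIndex {
       fn files_for_key(&self, key: i64) -> HashSet<String> {
           self.lookups.fetch_add(1, Ordering::Relaxed);
           self.entries.get(&key).cloned().unwrap_or_default()
       }
   }
   ```

   With this, asking `should_scan` about several files for the same key 
performs a single lookup against the underlying index. It does not solve the 
table-level problem, it only bounds the cost of doing the check per file.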


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

