phillipleblanc opened a new issue, #15173:
URL: https://github.com/apache/datafusion/issues/15173

   ### Is your feature request related to a problem or challenge?
   
   The `ListingTable` in DataFusion is a `TableProvider` implementation that 
organizes a collection of (potentially hive partitioned) files in an object 
store into a single table.
   
   Hive partition columns are injected into the listing table schema even 
though they don't exist in the physical parquet files. Similarly, I'd like to 
be able to ask the `ListingTable` to inject metadata columns that get their 
data from the `ObjectMeta` provided by the `object_store` crate. Then I can 
query for and filter on the columns `location`, `size` and `last_modified`.
   
   I'd also like queries that filter on the metadata columns to be able to 
prune out files, similar to partition pruning. For example, if I run `SELECT * 
FROM my_listing_table WHERE last_modified > '2025-03-10'`, then only files that 
were modified after `'2025-03-10'` should be passed to the `FileScanConfig` to 
be read.
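
A minimal sketch of the pruning idea described above, assuming a simplified 
stand-in for `ObjectMeta`. The `FileMeta` struct and `prune_by_last_modified` 
function are hypothetical names used only for illustration; they are not part 
of DataFusion or the `object_store` crate, and timestamps are plain Unix 
seconds to keep the example dependency-free.

```rust
/// Hypothetical, simplified stand-in for `object_store::ObjectMeta`.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    /// Full path to the object.
    pub location: String,
    /// Last modified time as Unix seconds (assumption made for this sketch;
    /// the real field is a proper timestamp type).
    pub last_modified: i64,
    /// Size of the object in bytes.
    pub size: u64,
}

/// Keep only files modified strictly after `cutoff`, mirroring a
/// `WHERE last_modified > cutoff` predicate. Files that fail the
/// predicate are dropped before any scan would be planned.
pub fn prune_by_last_modified(files: Vec<FileMeta>, cutoff: i64) -> Vec<FileMeta> {
    files
        .into_iter()
        .filter(|f| f.last_modified > cutoff)
        .collect()
}
```

The point of the sketch is that the filter runs over the file listing only, so 
files that don't match are never opened.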
   
   My scenario is that I'd like to efficiently ingest files that I haven't 
seen before from an object store bucket, and filtering on `last_modified` 
seems like a good solution.
   
   This could potentially fold into the ongoing work in #13975 / #14057 / 
#14362 to mark these columns as proper system/metadata columns, but it 
fundamentally isn't blocked on that work. Since this would be an opt-in from 
the consumer, automatically hiding these columns from a `SELECT *` doesn't 
seem required.
   
   ### Describe the solution you'd like
   
   A new API on the `ListingOptions` struct that is passed to a 
`ListingTableConfig` which is passed to `ListingTable::try_new`.
   
   ```rust
       /// Set metadata columns on [`ListingOptions`] and returns self.
       ///
       /// "metadata columns" are columns that are computed from the
       /// `ObjectMeta` of the files from object store.
       ///
       /// Available metadata columns:
       /// - `location`: The full path to the object
       /// - `last_modified`: The last modified time
       /// - `size`: The size in bytes of the object
       ///
       /// For example, given the following files in object store:
       ///
       /// ```text
       /// /mnt/nyctaxi/tripdata01.parquet
       /// /mnt/nyctaxi/tripdata02.parquet
       /// /mnt/nyctaxi/tripdata03.parquet
       /// ```
       ///
       /// If the `last_modified` field in the `ObjectMeta` for
       /// `tripdata01.parquet` is `2024-01-01 12:00:00`,
       /// then the table schema will include a column named `last_modified`
       /// with the value `2024-01-01 12:00:00`
       /// for all rows read from `tripdata01.parquet`.
       ///
       /// | <other columns> | last_modified         |
       /// |-----------------|-----------------------|
       /// | ...             | 2024-01-01 12:00:00   |
       /// | ...             | 2024-01-02 15:30:00   |
       /// | ...             | 2024-01-03 09:15:00   |
       ///
       /// # Example
       /// ```
       /// # use std::sync::Arc;
       /// # use datafusion::datasource::{listing::ListingOptions, file_format::parquet::ParquetFormat};
       ///
       /// let listing_options = ListingOptions::new(Arc::new(
       ///     ParquetFormat::default()
       ///   ))
       ///   .with_metadata_cols(vec![MetadataColumn::LastModified]);
       ///
       /// assert_eq!(listing_options.metadata_cols, vec![MetadataColumn::LastModified]);
       /// ```
       pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
           self.metadata_cols = metadata_cols;
           self
       }
   ```
   
   The definition for `MetadataColumn` is a simple enum:
   
   ```rust
   /// A metadata column that can be used to filter files
   #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
   pub enum MetadataColumn {
       /// The location of the file in object store
       Location,
       /// The last modified timestamp of the file
       LastModified,
       /// The size of the file in bytes
       Size,
   }
   ```
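
To illustrate how each variant could map to a per-file value that is repeated 
for every row read from that file, here is a rough sketch. `FileMeta`, 
`MetadataValue`, `metadata_value` and `metadata_column_for_rows` are 
hypothetical names invented for this example; the actual implementation would 
work with `ObjectMeta` and Arrow types instead.

```rust
/// Mirrors the `MetadataColumn` enum proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

/// Hypothetical, simplified stand-in for `object_store::ObjectMeta`.
pub struct FileMeta {
    pub location: String,
    /// Unix seconds (assumption made to avoid external dependencies).
    pub last_modified: i64,
    pub size: u64,
}

/// Hypothetical value type; a real implementation would produce Arrow scalars.
#[derive(Debug, Clone, PartialEq)]
pub enum MetadataValue {
    Text(String),
    Timestamp(i64),
    UInt(u64),
}

/// Look up the value of one metadata column for one file.
pub fn metadata_value(col: MetadataColumn, meta: &FileMeta) -> MetadataValue {
    match col {
        MetadataColumn::Location => MetadataValue::Text(meta.location.clone()),
        MetadataColumn::LastModified => MetadataValue::Timestamp(meta.last_modified),
        MetadataColumn::Size => MetadataValue::UInt(meta.size),
    }
}

/// Every row read from the same file gets the same metadata value,
/// so the column is that single value repeated `num_rows` times.
pub fn metadata_column_for_rows(
    col: MetadataColumn,
    meta: &FileMeta,
    num_rows: usize,
) -> Vec<MetadataValue> {
    vec![metadata_value(col, meta); num_rows]
}
```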
   
   The order of the `MetadataColumn` values passed into `with_metadata_cols` 
determines the order in which they appear in the table schema. Metadata 
columns will be added after partition columns.
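
The ordering rule can be sketched as follows. This is only an illustration 
under stated assumptions: `table_column_names` and `MetadataColumn::name` are 
hypothetical helpers, column names are taken from the list in the proposal, 
and the sketch assumes partition columns follow the file's own columns.

```rust
/// Mirrors the `MetadataColumn` enum proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

impl MetadataColumn {
    /// Column name as it would appear in the table schema
    /// (assumption: names match those listed in the proposal).
    pub fn name(&self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::LastModified => "last_modified",
            MetadataColumn::Size => "size",
        }
    }
}

/// Hypothetical helper: file columns first, then partition columns,
/// then metadata columns in the order the caller supplied them.
pub fn table_column_names(
    file_cols: &[&str],
    partition_cols: &[&str],
    metadata_cols: &[MetadataColumn],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols.iter())
        .map(|c| c.to_string())
        .chain(metadata_cols.iter().map(|m| m.name().to_string()))
        .collect()
}
```

Passing `[Size, Location]` would therefore place `size` before `location` at 
the end of the schema, after any partition columns.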
   
   ### Describe alternatives you've considered
   
   I considered what it might look like to make `ListingTable` more 
extensible so that these changes could be implemented without a core 
DataFusion change. I wasn't able to come up with anything simpler than the 
above though.
   
   Another option might be to make a lot of the internals of `ListingTable` 
public so that it is easier for people to maintain their own customized 
versions of it.
   
   ### Additional context
   
   I've already implemented this in my project, and I will be upstreaming my 
change and linking it to this issue. To view what this looks like already 
implemented, see: https://github.com/spiceai/datafusion/pull/74
   
   And to see the changes needed to integrate with it from a consuming project, 
see: https://github.com/spiceai/spiceai/pull/4970 (It is quite contained, which 
I'm happy with)
   
   This change will have no visible effect on existing consumers - they need 
to explicitly opt in to see the metadata columns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
