adriangb commented on code in PR #15057:
URL: https://github.com/apache/datafusion/pull/15057#discussion_r1992124608


##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -83,10 +87,23 @@ pub(super) struct ParquetOpener {
     pub enable_bloom_filter: bool,
     /// Schema adapter factory
     pub schema_adapter_factory: Arc<dyn SchemaAdapterFactory>,
+    /// Filter expression rewriter factory
+    pub filter_expression_rewriter: Option<Arc<dyn FileExpressionRewriter>>,
 }
 
 impl FileOpener for ParquetOpener {
     fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture> {
+        // Note about schemas: we are actually dealing with _4_ different schemas here:
+        // - The table schema as defined by the TableProvider. This is what the user sees, what they get when they `SELECT * FROM table`, etc.
+        // - The "virtual" file schema: this is the table schema minus any hive partition columns. This is what the file schema is coerced to.
+        // - The physical file schema: this is the schema as defined by the parquet file. This is what the parquet file actually contains.
+        // - The filter schema: a hybrid of the virtual file schema and the physical file schema.
+        //   If a filter is rewritten to reference columns that are in the physical file schema but not the virtual file schema, we need to add those columns to the filter schema so that the filter can be evaluated.
+        //   This schema is generated by taking any columns from the virtual file schema that are referenced by the filter and adding any columns from the physical file schema that are referenced by the filter but not in the virtual file schema.
+        //   Columns from the virtual file schema are added in the order they appear in the virtual file schema.
+        //   The columns from the physical file schema are always added to the end of the schema, in the order they appear in the physical file schema.
+        //
+        // I think it might be wise to do some renaming of parameters where possible, e.g. rename `file_schema` to `table_schema_without_partition_columns` and `physical_file_schema` or something like that.

Review Comment:
   This is an interesting bit to ponder.
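
   To make the ordering rule for the filter schema concrete, here is a
   minimal sketch of how I read the code comment. It is not the PR's
   implementation: `build_filter_schema` is a hypothetical helper, and plain
   column names stand in for Arrow fields so the example has no dependencies.
   It assumes the set of columns referenced by the (rewritten) filter has
   already been collected.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of DataFusion): builds the "filter schema"
/// column order described in the code comment. Column names stand in for
/// full Arrow fields to keep the sketch dependency-free.
pub fn build_filter_schema<'a>(
    virtual_file_schema: &[&'a str],
    physical_file_schema: &[&'a str],
    referenced_by_filter: &HashSet<&'a str>,
) -> Vec<String> {
    // Names present in the virtual file schema, used to detect columns
    // that exist only in the physical file schema.
    let virtual_names: HashSet<&str> = virtual_file_schema.iter().copied().collect();

    // 1. Columns referenced by the filter that are in the virtual file
    //    schema, kept in virtual-schema order.
    let from_virtual = virtual_file_schema
        .iter()
        .copied()
        .filter(|name| referenced_by_filter.contains(name));

    // 2. Columns referenced by the filter that are only in the physical
    //    file schema, appended at the end in physical-schema order.
    let from_physical = physical_file_schema
        .iter()
        .copied()
        .filter(|name| referenced_by_filter.contains(name) && !virtual_names.contains(name));

    from_virtual.chain(from_physical).map(str::to_string).collect()
}
```

   For example, with a virtual file schema `[a, b, c]`, a physical file
   schema `[c, x, a, y]`, and a filter referencing `c`, `x` and `y`, this
   yields `[c, x, y]`: `c` comes from the virtual schema, and `x` and `y`
   are appended in physical order.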



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

