[I] Support default values for columns in SchemaAdapter [datafusion]

via GitHub Thu, 13 Mar 2025 22:08:49 -0700


adriangb opened a new issue, #15220:
URL: https://github.com/apache/datafusion/issues/15220


   ### Is your feature request related to a problem or challenge?
   
   From a conversation with Andrew a couple of days ago, he mentioned this was 
an open feature request; however, I could not find an issue. @alamb do you 
remember who else was asking for this?
   
   We have an implementation of this internally. It is actually more generic 
because we use it to generate columns from other columns, but it covers the use 
case of default values, and it would be easy to simplify the API for that case.
   
   Essentially we declare:
   
   ```rust
   // Assumed import paths from the `arrow` crate (not part of the original snippet).
   use std::fmt::Debug;
   use std::sync::Arc;
   
   use arrow::array::ArrayRef;
   use arrow::datatypes::{Field, Schema};
   use arrow::record_batch::RecordBatch;
   
   pub trait MissingColumnGeneratorFactory: Debug + Send + Sync {
       /// Create a [`MissingColumnGenerator`] for the given `field` and `file_schema`.
       /// Returns `None` if the column cannot be generated by this generator.
       /// Otherwise, returns a [`MissingColumnGenerator`] that can generate the missing column.
       fn create(
           &self,
           field: &Field,
           file_schema: &Schema,
       ) -> Option<Arc<dyn MissingColumnGenerator + Send + Sync>>;
   }
   
   pub trait MissingColumnGenerator: Debug + Send + Sync {
       /// Generate a missing column for the given `field` from the provided `batch`.
       /// When this method is called, `batch` will contain all of the columns declared as dependencies in `dependencies`.
       /// If the column cannot be generated, this method should return an error.
       /// Otherwise, it should return the generated column as an `ArrayRef`.
       /// No casting or post-processing is done by this method, so the generated column should match the data type
       /// of the `field` it is being generated for; otherwise an `Err` will be returned upstream.
       /// There is no guarantee about the order of the columns in the provided `RecordBatch`.
       fn generate(&self, batch: RecordBatch) -> datafusion_common::Result<ArrayRef>;
   
       /// Returns a list of column names that this generator depends on to generate the missing column.
       /// This is used when creating the `RecordBatch` to ensure that all dependencies are present before calling `generate`.
       /// The dependencies do not need to be declared in any particular order.
       fn dependencies(&self) -> Vec<String>;
   }
   ```
   
   And then you pass one or more `MissingColumnGeneratorFactory` instances into 
`SchemaAdapterFactory`.
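   
   As a rough illustration of how the default-value case falls out of the more 
generic design, here is a std-only sketch. `Column`, `Batch`, `ColumnGenerator`, 
`DefaultValue`, and `Uppercase` are hypothetical stand-ins (for `ArrayRef`, 
`RecordBatch`, and `MissingColumnGenerator`), not DataFusion or Arrow APIs, and 
this is not the internal implementation mentioned above.
   
   ```rust
   use std::collections::HashMap;

   /// Hypothetical stand-in for an Arrow array: a nullable string column.
   type Column = Vec<Option<String>>;

   /// Hypothetical stand-in for `RecordBatch`: named columns plus a row count.
   struct Batch {
       num_rows: usize,
       columns: HashMap<String, Column>,
   }

   /// Simplified mirror of the `MissingColumnGenerator` trait from the issue.
   trait ColumnGenerator {
       fn generate(&self, batch: &Batch) -> Result<Column, String>;
       fn dependencies(&self) -> Vec<String>;
   }

   /// Default-value case: no dependencies, one constant repeated for every row.
   struct DefaultValue(Option<String>);

   impl ColumnGenerator for DefaultValue {
       fn generate(&self, batch: &Batch) -> Result<Column, String> {
           Ok(vec![self.0.clone(); batch.num_rows])
       }

       fn dependencies(&self) -> Vec<String> {
           Vec::new()
       }
   }

   /// Derived case: build the missing column by upper-casing another column.
   struct Uppercase {
       source: String,
   }

   impl ColumnGenerator for Uppercase {
       fn generate(&self, batch: &Batch) -> Result<Column, String> {
           let src = batch
               .columns
               .get(&self.source)
               .ok_or_else(|| format!("missing dependency column `{}`", self.source))?;
           Ok(src
               .iter()
               .map(|v| v.as_ref().map(|s| s.to_uppercase()))
               .collect())
       }

       fn dependencies(&self) -> Vec<String> {
           vec![self.source.clone()]
       }
   }
   ```
   
   In this sketch a default value is just a generator with an empty dependency 
list, which is why a simpler defaults-only API could be layered on top of the 
generic traits.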
   
   There was _a lot_ of pain figuring out how to properly adjust projections to 
take into account the injected dependency columns, but we've done that work 
already on our end.
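   
   To make the projection problem concrete, the following std-only sketch shows 
one way the file projection could be widened to include dependency columns. 
The function `plan_file_projection` and its inputs are hypothetical and only 
illustrate the idea; the actual adjustment inside DataFusion's projection 
handling is more involved.
   
   ```rust
   use std::collections::HashMap;

   /// Append `name` to `list` unless it is already present (keeps first-seen order).
   fn push_unique(list: &mut Vec<String>, name: &str) {
       if !list.iter().any(|c| c.as_str() == name) {
           list.push(name.to_string());
       }
   }

   /// Hypothetical planner. Given the columns the query requested, the columns
   /// present in the file, and the dependencies of each available generator,
   /// returns `(columns to read from the file, columns to generate)`.
   fn plan_file_projection(
       requested: &[&str],
       file_columns: &[&str],
       generator_deps: &HashMap<&str, Vec<&str>>,
   ) -> Result<(Vec<String>, Vec<String>), String> {
       let mut to_read = Vec::new();
       let mut to_generate = Vec::new();
       for &col in requested {
           if file_columns.contains(&col) {
               // Present in the file: read it directly.
               push_unique(&mut to_read, col);
               continue;
           }
           // Missing from the file: a generator must exist for it.
           let deps = generator_deps
               .get(col)
               .ok_or_else(|| format!("no generator for missing column `{}`", col))?;
           for &dep in deps {
               if !file_columns.contains(&dep) {
                   return Err(format!("dependency `{}` of `{}` is not in the file", dep, col));
               }
               // Inject the dependency into the file projection if not already read.
               push_unique(&mut to_read, dep);
           }
           to_generate.push(col.to_string());
       }
       Ok((to_read, to_generate))
   }
   ```
   
   Under this scheme, any dependency column that was injected but not requested 
by the query would presumably have to be dropped again after generation so the 
output still matches the requested projection.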
   
   The other thing to note is that adjustments are needed in filter pushdown, 
specifically here: 
https://github.com/apache/datafusion/blob/8061485be3b197d40bb35be09f9cf0a282c99bcd/datafusion/datasource-parquet/src/row_filter.rs#L355-L384
   
   This last bit applies whether simple defaults or more complex derived 
columns are being generated.
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

