adriangb commented on issue #13270:
URL: https://github.com/apache/datafusion/issues/13270#issuecomment-2651549536

   I think the fundamental issue is that the partition columns are specified on 
a per-exec basis via `FileScanConfig`. The only solutions I can think of are:
   - Change the APIs to allow multiple `FileScanConfig`s to be supplied. This 
raises the problem of making sure the output schemas all match so they can be 
unioned, etc.
   - Move partition column generation into `SchemaAdapter`. The issue with this 
is that `SchemaAdapter` exists at a lower level than the concept of partition 
columns and it might be inappropriate to put that logic in there directly. But 
at the same time `FileScanConfig` and `ParquetExec` (recently folded into 
`ParquetDataSource`?) exist above the level of a single file, and partitioning 
can be as granular as a single file. I think the solution here would be to add 
hooks into `SchemaAdapter` to be able to handle missing columns so that the 
exec can inject information on how to generate partition columns from file 
paths. It could do that very dynamically on a per-file basis with no config or 
we could say that you have to pass the union of all columns that might be 
partition columns along with their field types and then if a file has only a 
subset of those that's okay, but we error or fill in nulls if we encounter a 
missing column that was not declared as a partition column.
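
   To make the second option concrete, here is a rough sketch of what a 
"missing column" hook could decide for each file. It is plain Rust with no 
DataFusion types. The names `Resolved`, `partition_values` and 
`resolve_missing_column` are hypothetical and are not part of `SchemaAdapter` 
or any existing API. It assumes Hive-style `key=value` path segments, treats 
all values as strings (field types are left out), and picks one of the possible 
policies: error on undeclared columns, null for declared columns that a given 
file's path does not carry.

```rust
use std::collections::HashMap;

/// Outcome of resolving a column that is absent from a file's own schema.
#[derive(Debug, PartialEq)]
enum Resolved {
    /// Value parsed from a `key=value` segment of the file path.
    FromPath(String),
    /// Declared as a partition column, but this file's path does not carry it.
    Null,
}

/// Collect every `key=value` segment of a `/`-separated path.
/// Segments without `=` (such as a plain file name) are skipped.
fn partition_values(path: &str) -> HashMap<&str, &str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Decide how to fill `column` for the file at `path`.
/// `declared` is the union of all columns that may be partition columns.
fn resolve_missing_column(
    column: &str,
    declared: &[&str],
    path: &str,
) -> Result<Resolved, String> {
    // Policy assumed here: an undeclared missing column is an error.
    if !declared.contains(&column) {
        return Err(format!(
            "column `{column}` is missing from `{path}` and was not declared as a partition column"
        ));
    }
    // A declared column is read from the path when present, else filled with null.
    Ok(match partition_values(path).get(column) {
        Some(value) => Resolved::FromPath(value.to_string()),
        None => Resolved::Null,
    })
}
```

   With `declared = ["year", "month"]` and a file at 
`data/year=2024/part-0.parquet`, `year` resolves from the path, `month` 
resolves to null, and any other missing column is an error. The fully dynamic 
variant would drop `declared` and accept whatever keys the path provides.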


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
