adriangb commented on issue #13270: URL: https://github.com/apache/datafusion/issues/13270#issuecomment-2651549536
I think the fundamental issue is that the partition columns are specified on a per-exec basis via `FileScanConfig`. The only solutions I can think of are:

- Change the APIs to allow multiple `FileScanConfig`s to be supplied. This raises the issue of making sure the output schemas all match so they can be unioned, etc.
- Move partition column generation into `SchemaAdapter`. The issue with this is that `SchemaAdapter` exists at a lower level than the concept of partition columns, and it might be inappropriate to put that logic in there directly. But at the same time, `FileScanConfig` and `ParquetExec` (recently folded into `ParquetDataSource`?) exist above the level of a single file, and partitioning can be as granular as a single file.

I think the solution here would be to add hooks into `SchemaAdapter` for handling missing columns, so that the exec can inject information on how to generate partition columns from file paths. It could do that fully dynamically on a per-file basis with no config. Alternatively, we could require passing the union of all columns that might be partition columns, along with their field types; a file having only a subset of those would be okay, but we would error or fill in nulls if we encounter a missing column that was not declared as a partition column.
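To make the "declared union of partition columns" variant concrete, here is a rough sketch in plain Rust with no DataFusion types. Everything in it is hypothetical: `MissingColumn`, `partition_values`, and `resolve_missing_column` are illustrative names, not existing `SchemaAdapter` APIs. It assumes Hive-style `key=value` directory segments, treats all values as strings (field types are left out), fills a null for a declared column the path does not carry, and errors on an undeclared one.

```rust
use std::collections::HashMap;

/// Hypothetical result of asking the hook about a column absent from the file.
#[derive(Debug, PartialEq)]
enum MissingColumn {
    /// Value parsed from a `key=value` directory segment of the file path.
    FromPath(String),
    /// Declared partition column that this particular file's path does not carry.
    Null,
}

/// Parse Hive-style `key=value` directory segments; the file name itself is ignored.
fn partition_values(path: &str) -> HashMap<&str, &str> {
    let dirs = path.rsplit_once('/').map_or("", |(dirs, _file)| dirs);
    dirs.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Hypothetical hook: decide what to do with `column`, which the file lacks.
/// `declared` is the union of all columns that may be partition columns.
fn resolve_missing_column(
    column: &str,
    file_path: &str,
    declared: &[&str],
) -> Result<MissingColumn, String> {
    if !declared.contains(&column) {
        // Assumption: undeclared missing columns are an error (nulls would be the alternative).
        return Err(format!(
            "column `{column}` is missing from `{file_path}` and was not declared as a partition column"
        ));
    }
    let values = partition_values(file_path);
    Ok(match values.get(column) {
        Some(value) => MissingColumn::FromPath((*value).to_string()),
        None => MissingColumn::Null,
    })
}
```

With `declared = ["year", "month"]` and the path `data/year=2024/part-0.parquet`, asking for `year` would yield a value from the path, `month` would be filled with a null, and any other missing column would be rejected.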