kosiew commented on code in PR #23189:
URL: https://github.com/apache/datafusion/pull/23189#discussion_r3510905547
##########
datafusion/proto/src/physical_plan/from_proto.rs:
##########
Review Comment:
I think this introduces a wire compatibility regression for older serialized
plans.
Before this change, `from_proto` copied `proto.partitioned_by_file_group`
into `FileScanConfig`, and `FileScanConfig::output_partitioning()` converted
that legacy flag into `Partitioning::Hash(...)` over the table partition
columns.
After this change, `to_proto` always writes `partitioned_by_file_group:
None`, while `from_proto` only parses `proto.output_partitioning`. As a result,
an older serialized plan with `partitioned_by_file_group=true` and no
`output_partitioning` now deserializes without any declared output partitioning
and falls back to `UnknownPartitioning(file_groups.len())`.
That silently drops the declared partitioning, which can re-enable avoidable
repartitioning and change planning behavior.
Could we keep `output_partitioning` as the canonical internal
representation, but add a legacy read fallback here? When `output_partitioning`
is absent, we can reconstruct `Partitioning::Hash` from `table_partition_cols`
and `file_groups.len()`.
##########
datafusion/datasource/src/file_stream/mod.rs:
##########
Review Comment:
Small naming suggestion: a few test-only helpers or members still use
`partitioned_by_file_group`. It might be worth renaming them to something like
`declared_output_partitioning` so the terminology stays consistent with the new
model.
Not blocking since the behavior already looks correct.
##########
datafusion/proto/tests/cases/roundtrip_physical_plan.rs:
##########
@@ -4219,24 +4219,6 @@ fn roundtrip_file_scan_config(scan_config:
FileScanConfig) -> Result<FileScanCon
Ok(file_scan_config.clone())
}
-#[test]
-fn roundtrip_parquet_exec_partitioned_by_file_group() -> Result<()> {
- let file_schema =
- Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
- let file_source = Arc::new(ParquetSource::new(Arc::clone(&file_schema)));
- let scan_config =
- FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(),
file_source)
- .with_file_groups(vec![FileGroup::new(vec![PartitionedFile::new(
- "/path/to/file.parquet".to_string(),
- 1024,
- )])])
- .with_partitioned_by_file_group(true)
- .build();
-
-
assert!(roundtrip_file_scan_config(scan_config)?.partitioned_by_file_group);
- Ok(())
-}
-
#[test]
Review Comment:
It would be great to add a focused backward compatibility test for this
case. The test could construct a `protobuf::FileScanExecConf` with only
`partitioned_by_file_group=true`, leave `output_partitioning` unset, and verify
that the parsed `FileScanConfig` still reports hash output partitioning.
That would help prevent this regression from coming back while keeping
`output_partitioning` as the single source of truth internally.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]