peter-toth commented on PR #54330: URL: https://github.com/apache/spark/pull/54330#issuecomment-3999793601
> Also can we add more tests: > > * Empty grouped partitions: Plan that yields groupedPartitions.isEmpty (ie, partitioned table but no partition values inserted) I added an empty partitioned table test in https://github.com/apache/spark/pull/54330/commits/4a904ad98a3870e99136cf88a797903feb6496b8, but it seems we prevent returing `KeyedPartitioning` without partitions. This is not new behaviour, it worked the same way with `KeyGroupedPartitioning` before this PR. If we removed that `inputPartitions.nonEmpty` guard from `BatchScanExec` and then the 2 shuffles would disappear, but no `GroupPartitionsExec` is added as those are not needed. Maybe the only way to get `GroupPartitionsExec` with empty `groupedPartitions` is to enable `spark.sql.sources.v2.bucketing.partition.filter.enabled` and use disjoint set of keys in join legs to get empty `expectedPartitionKeys`. > * Runtime filtering + KeyedPartitioning: Keyed scan with runtime filtering and Option[InputPartition] (including None for filtered-out keys); check result correctness and that no partition is read for filtered keys. I've extended the 2 existing "dynamic partition filtering" tests in https://github.com/apache/spark/pull/54330/commits/5a4ecd1815b2f5af5700d3a1d65107d2ac67b928. > * OrderedDistribution: Require ordering on a keyed scan and assert that a single GroupPartitionsExec is inserted with the expected keys. I already added `GroupPartitionsExec` check to the existing `SPARK-48655: order by on partition keys should not introduce additional shuffle` test. It is an extensive tests with 5 different orderings, which seems to render expected key checks pointless. > * Multi table joins (looks like there's already tests, but maybe we can check the plan shape for the GroupPartitionExec) The 2 new "Multi table join" join tests do that, but I can add more or extend them if you miss something. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
