andygrove commented on PR #1777:
URL: https://github.com/apache/datafusion-ballista/pull/1777#issuecomment-4545706630
> Good call. Pushed `ballista/client/tests/multi_file_scan.rs` (commit [b68f73e](https://github.com/apache/datafusion-ballista/commit/b68f73e3515face5eb03cf2038ca4e305240ab25)) with two standalone-cluster regression tests that exercise multi-file parquet scans:
>
> * `multi_file_parquet_scan_counts_every_row_exactly_once`: writes 6 parquet files, 7 rows each (42 rows total), and asserts `SELECT COUNT(*), SUM(value)` returns 42 / `sum(0..42)`.
> * `multi_file_parquet_group_by_returns_each_value_once`: runs `GROUP BY value` after the scan and asserts every key shows up exactly once.
>
> Both tests fail under this branch and I left them `#[ignore]`d so they document the failure mode without blocking CI. The first one returns 252 rows instead of 42 (= 6 tasks × 42 rows). The metrics on the leaf stage make the cause explicit: `files_opened=36, files_processed=36` for 6 input files.
>
> Tracing it back to DataFusion 54: `FileScanConfig::create_sibling_state` now hands out a `SharedWorkSource` populated with every file in the scan, and `FileScanConfig::open_with_args` wires that shared queue into the partition's `FileStreamBuilder`. In a single-process DataFusion run that's safe: every partition of the same DataSourceExec instance shares one queue and they drain it cooperatively. Under Ballista each task deserialises its _own_ copy of the plan, owns its own shared queue containing every file, and executes a single partition that drains the whole queue locally. So this isn't quite the same shape as the bug datafusion-distributed hit (`PartitionIsolatorExec` using `task_index` at execution), but the root cause (DF 54 no longer assuming pre-baked partition→files at execution) bites Ballista too.
>
> Likely fix is the same shape as datafusion-distributed PR #467: pre-split `FileScanConfig.file_groups` per task before serialising the plan, so each task ships with a single-partition config and the shared queue only contains that partition's files.
> I'd prefer to land that as a follow-up PR rather than expand this one; happy to file an issue and link the test if that works for you.

Sorry, Claude posted this without permission. Will attempt to fix in this PR.
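
For anyone following along, the arithmetic in the quoted analysis (252 rows, `files_opened=36`) can be reproduced with a toy model. This is only a sketch under stated assumptions: it uses plain `std` Rust, and `FileRows`, `ScanTotals`, `make_files`, `scan_with_queue_copy_per_task` and `scan_with_presplit_groups` are hypothetical names invented for illustration. None of them are DataFusion or Ballista APIs, and the real `SharedWorkSource` is not modelled beyond "a queue of files that a partition drains".

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one parquet file: just the values it holds.
#[derive(Clone, Debug)]
pub struct FileRows {
    pub values: Vec<u64>,
}

/// Totals reported by a scan: rows read, sum of values, files opened.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ScanTotals {
    pub rows: u64,
    pub sum: u64,
    pub files_opened: u64,
}

/// Build `num_files` files of `rows_per_file` rows each, holding the
/// consecutive values 0..num_files * rows_per_file.
pub fn make_files(num_files: u64, rows_per_file: u64) -> Vec<FileRows> {
    (0..num_files)
        .map(|f| FileRows {
            values: (f * rows_per_file..(f + 1) * rows_per_file).collect(),
        })
        .collect()
}

/// Drain a work queue to completion, the way a single partition would.
fn drain(mut queue: VecDeque<FileRows>, totals: &mut ScanTotals) {
    while let Some(file) = queue.pop_front() {
        totals.files_opened += 1;
        totals.rows += file.values.len() as u64;
        totals.sum += file.values.iter().sum::<u64>();
    }
}

/// Failure mode (assumed model): each task holds its own copy of the plan,
/// so each task's queue contains every file and is drained in full.
pub fn scan_with_queue_copy_per_task(files: &[FileRows], num_tasks: usize) -> ScanTotals {
    let mut totals = ScanTotals::default();
    for _task in 0..num_tasks {
        let queue: VecDeque<FileRows> = files.iter().cloned().collect();
        drain(queue, &mut totals);
    }
    totals
}

/// Shape of the proposed fix (assumed model): file groups are split per task
/// up front, so each task's queue contains only its own group's files.
pub fn scan_with_presplit_groups(file_groups: &[Vec<FileRows>]) -> ScanTotals {
    let mut totals = ScanTotals::default();
    for group in file_groups {
        let queue: VecDeque<FileRows> = group.iter().cloned().collect();
        drain(queue, &mut totals);
    }
    totals
}
```

With `make_files(6, 7)` and 6 tasks, the per-task-copy variant reports 252 rows and 36 files opened, matching the numbers in the quoted comment. Splitting the same files into one group per task reports 42 rows, 6 files opened, and a sum equal to `sum(0..42)`.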
