zhuqi-lucas commented on issue #16836: URL: https://github.com/apache/datafusion/issues/16836#issuecomment-3113933948
> We do not have a reproducer yet. However, I would like to share some more data points with you. We suspect the issue is related to partitioning for two reasons:
>
> 1. If we set the target partitions to 1 with `let config = SessionConfig::new().with_target_partitions(1);`, the issue does not happen.
> 2. In the following call to `execute_input_stream()` the partition of the input stream is hard coded to 0. To my beginner's eyes it looks like that only partition 0 of the input stream is read and inserted into the sink table. Could that be?
>
> [datafusion/datafusion/datasource/src/sink.rs](https://github.com/apache/datafusion/blob/dbc03fa4f6d47c8f3b97f3a3d979945b2b7ccce7/datafusion/datasource/src/sink.rs#L227), line 227 in commit dbc03fa:
>
> `let data = execute_input_stream(`
>
> A useful info might also be the workaround we found. Instead of using `provider.insert_into()` we use `df.clone().write_table(sink_table_name, ...)`.
>
> I hope this is helpful!

Thank you. I think hard-coding partition 0 is correct: `DataSinkExec` is designed as the final, global “write” step for DML operations (INSERT, COPY, EXPORT, etc.), and it intentionally runs on only one partition. It declares `required_input_distribution = SinglePartition`, so its input is expected to be merged into a single partition before it reaches the sink.

So it is possible that, in your case, the data is not merged into one partition before the sink. That would also explain why the issue does not happen when you set the target partitions to 1: everything is already in one partition, so writing is safe.

Alternatively, you can call the higher-level `df.write_table(...)`, which I think includes that coalesce step for you, so the data is automatically merged into one partition before the sink.
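To make the reasoning concrete, here is a toy model in plain Rust. It is not DataFusion code: `Partitions`, `sink_partition_zero`, and `coalesce` are hypothetical names made up for this sketch. It only shows why a sink that reads partition 0 writes every row when its input has been merged first, and drops rows when it has not.

```rust
// Toy model only: these types and functions are hypothetical and are
// not part of DataFusion's API.

/// A "plan output" split into partitions, each holding some rows.
type Partitions = Vec<Vec<i64>>;

/// Mimics a sink that executes only partition 0 of its input.
fn sink_partition_zero(input: &Partitions) -> Vec<i64> {
    input.first().cloned().unwrap_or_default()
}

/// Mimics a coalesce step: merge every partition into a single one.
fn coalesce(input: &Partitions) -> Partitions {
    vec![input.iter().flatten().copied().collect()]
}

fn main() {
    let input: Partitions = vec![vec![1, 2], vec![3], vec![4, 5]];

    // Without merging, the sink sees only the rows of partition 0.
    let written_directly = sink_partition_zero(&input);

    // With merging first, the single remaining partition holds every row.
    let written_after_coalesce = sink_partition_zero(&coalesce(&input));

    println!("direct: {} rows", written_directly.len());
    println!("coalesced: {} rows", written_after_coalesce.len());
}
```

In this model, an input that already has a single partition behaves the same with or without the merge, which matches the observation that the problem disappears with `with_target_partitions(1)`.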
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org