zhuqi-lucas commented on issue #16836:
URL: https://github.com/apache/datafusion/issues/16836#issuecomment-3113933948

   > We do not have a reproducer yet. However, I would like to share some more 
data points with you. We suspect the issue is related to partitioning for two 
reasons:
   > 
   > 1. If we set the target partitions to 1 with `let config = 
SessionConfig::new().with_target_partitions(1);`, the issue does not happen.
   > 2. In the following call to `execute_input_stream()` the partition of the 
input stream is hard coded to 0. To my beginner's eyes it looks like that only 
partition 0 of the input stream is read and inserted into the sink table. Could 
that be?
   > 
   > [datafusion/datafusion/datasource/src/sink.rs, line 227 in dbc03fa](https://github.com/apache/datafusion/blob/dbc03fa4f6d47c8f3b97f3a3d979945b2b7ccce7/datafusion/datasource/src/sink.rs#L227):
   > 
   >     let data = execute_input_stream(
   > 
   > A useful info might also be the workaround we found. Instead of using 
`provider.insert_into()` we use `df.clone().write_table(sink_table_name, ...)`.
   > 
   > I hope this is helpful!
   
   Thank you. I think the hard-coded partition 0 is correct: `DataSinkExec` is 
designed as the final, global “write” step for DML operations (INSERT, COPY, 
EXPORT, etc.), and it intentionally runs on only one partition.
   
   So it's possible that the data is not merged into a single partition before 
it reaches the sink, even though the sink requires 
`required_input_distribution = SinglePartition`. That would also explain why 
the issue does not happen when you set the target partitions to 1: the input is 
then already a single partition, so reading only partition 0 is safe.
   
   Alternatively, call the higher-level `df.write_table(...)`, which I think 
includes that coalesce step for you, so the data is automatically merged into 
one partition before the sink.
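
   To illustrate the suspected failure mode, here is a minimal toy sketch in 
plain Rust. It is not DataFusion code: `sink_partition_zero` and `coalesce` are 
hypothetical names, and a partitioned input is modeled simply as a list of row 
vectors. It only shows why a sink that reads partition 0 needs its input merged 
into one partition first.

```rust
// Toy model only: a "partitioned input" is a slice of partitions, each a Vec of rows.
// None of these names are DataFusion APIs; they are assumptions for illustration.

/// Hypothetical sink that, like the call discussed above, reads only partition 0.
fn sink_partition_zero(input: &[Vec<i32>]) -> Vec<i32> {
    // Rows in partitions 1..n are never looked at.
    input.first().cloned().unwrap_or_default()
}

/// Hypothetical coalesce step: merge every partition into a single partition.
fn coalesce(input: &[Vec<i32>]) -> Vec<Vec<i32>> {
    vec![input.iter().flatten().copied().collect()]
}
```

   With three partitions `[[1, 2], [3], [4, 5]]`, calling `sink_partition_zero` 
directly sees only the rows of the first partition, while calling it on the 
result of `coalesce` sees all five rows.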


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

