konjac commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2007565173
##########
src/dataframe.rs:
##########
@@ -771,3 +871,82 @@ fn record_batch_into_schema(
     RecordBatch::try_new(schema, data_arrays)
 }
+
+/// This is a helper function to return the first non-empty record batch from executing a DataFrame.
+/// It additionally returns a bool, which indicates if there are more record batches available.
+/// We do this so we can determine if we should indicate to the user that the data has been
+/// truncated. This collects until we have achieved both of these conditions:
+///
+/// - We have collected our minimum number of rows
+/// - We have reached our limit, either data size or maximum number of rows
+///
+/// Otherwise it will return when the stream is exhausted. If you want a specific number of
+/// rows, set min_rows == max_rows.
+async fn collect_record_batches_to_display(
+    df: DataFrame,
+    min_rows: usize,
+    max_rows: usize,
+) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
+    let mut stream = df.execute_stream().await?;

Review Comment:
   In my proposed PR #1015 I use `execute_stream_partitioned` instead. `execute_stream` appends a `CoalescePartitionsExec` to the plan to merge all partitions into a single partition (see https://github.com/apache/datafusion/blob/74aeb91fd94109d05178555d83e812e6e0712573/datafusion/physical-plan/src/execution_plan.rs#L887C1-L889C1), which loads partitions that are not needed for display.
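   To illustrate the idea, here is a minimal sketch (not the code from PR #1015; the function name and the simplified truncation logic are assumptions) of a display helper that drains partition streams one at a time via `execute_stream_partitioned`, so no `CoalescePartitionsExec` is inserted into the plan:

```rust
// Hypothetical sketch: collect up to `max_rows` rows for display by draining
// partition streams one at a time instead of merging them with execute_stream().
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::DataFusionError;
use futures::StreamExt;

async fn collect_for_display_partitioned(
    df: DataFrame,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    // One stream per output partition; no CoalescePartitionsExec is added to the plan.
    let partitions = df.execute_stream_partitioned().await?;

    let mut batches = Vec::new();
    let mut rows = 0usize;
    for mut stream in partitions {
        while let Some(batch) = stream.next().await {
            let batch = batch?;
            if batch.num_rows() == 0 {
                continue;
            }
            rows += batch.num_rows();
            batches.push(batch);
            if rows >= max_rows {
                // Stop early; remaining partitions are never polled. The `true`
                // flag conservatively signals that the output may be truncated.
                return Ok((batches, true));
            }
        }
    }
    Ok((batches, false))
}
```

   If the rows needed for display all come from the first partition, the later partition streams are never polled, which is the saving the comment is pointing at.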