konjac commented on code in PR #1036: URL: https://github.com/apache/datafusion-python/pull/1036#discussion_r2007565173
##########
src/dataframe.rs:
##########
@@ -771,3 +871,82 @@ fn record_batch_into_schema(
     RecordBatch::try_new(schema, data_arrays)
 }
+
+/// This is a helper function to return the first non-empty record batch from executing a DataFrame.
+/// It additionally returns a bool, which indicates if there are more record batches available.
+/// We do this so we can determine if we should indicate to the user that the data has been
+/// truncated. This collects until we have achieved both of these conditions:
+///
+/// - We have collected our minimum number of rows
+/// - We have reached our limit, either data size or maximum number of rows
+///
+/// Otherwise it will return when the stream is exhausted. If you want a specific number of
+/// rows, set min_rows == max_rows.
+async fn collect_record_batches_to_display(
+    df: DataFrame,
+    min_rows: usize,
+    max_rows: usize,
+) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
+    let mut stream = df.execute_stream().await?;

Review Comment:
   In my proposed PR #1015 I use `execute_stream_partitioned` instead. `execute_stream` appends a `CoalescePartitionsExec` to the plan to merge all partitions into a single partition (see https://github.com/apache/datafusion/blob/74aeb91fd94109d05178555d83e812e6e0712573/datafusion/physical-plan/src/execution_plan.rs#L887C1-L889C1), which loads partitions that are not needed for display.
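   To illustrate the idea, here is a minimal sketch (not the code from PR #1015; the function name and the simplified truncation logic are assumptions) of a display helper that drains partition streams one at a time via `execute_stream_partitioned`, so no `CoalescePartitionsExec` is inserted into the plan:

```rust
// Hypothetical sketch: collect up to `max_rows` rows for display by draining
// partition streams one at a time instead of merging them with execute_stream().
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::dataframe::DataFrame;
use datafusion::error::DataFusionError;
use futures::StreamExt;

async fn collect_for_display_partitioned(
    df: DataFrame,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    // One stream per output partition; no CoalescePartitionsExec is added to the plan.
    let partitions = df.execute_stream_partitioned().await?;

    let mut batches = Vec::new();
    let mut rows = 0usize;
    for mut stream in partitions {
        while let Some(batch) = stream.next().await {
            let batch = batch?;
            if batch.num_rows() == 0 {
                continue;
            }
            rows += batch.num_rows();
            batches.push(batch);
            if rows >= max_rows {
                // Stop early; remaining partitions are never polled. The `true`
                // flag conservatively signals that the output may be truncated.
                return Ok((batches, true));
            }
        }
    }
    Ok((batches, false))
}
```

   If the rows needed for display all come from the first partition, the later partition streams are never polled, which is the saving the comment is pointing at.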