suibianwanwank commented on issue #15406: URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162
Hi, @2010YOUY01. I have read and tried to understand both SortMergeJoinStream and GroupedHashAggregateStream (though I still have some uncertainties). I have some initial thoughts on organizing the structure, but I’m not sure about the right level of granularity. One idea is to introduce a SortMergeState, which would roughly comprise: ```rust struct SortMergeState { /// State of streamed pub streamed_state: StreamedState, /// State of buffered pub buffered_state: BufferedState, /// Staging output size, including output batches and staging joined results. /// Increased when we put rows into buffer and decreased after we actually output batches. /// Used to trigger output when sufficient rows are ready pub output_size: usize, /// Target output batch size pub batch_size: usize, /// Current processing record batch of streamed pub streamed_batch: StreamedBatch, /// Current buffered data pub buffered_data: BufferedData, /// (used in outer join) Is current streamed row joined at least once? pub streamed_joined: bool, /// (used in outer join) Is current buffered batches joined at least once? pub buffered_joined: bool, /// A unique number for each batch pub streamed_batch_counter: AtomicUsize, /// Staging output array builders pub staging_output_record_batches: JoinedRecordBatches, /// Output buffer. Currently used by filtering as it requires double buffering /// to avoid small/empty batches. Non-filtered join outputs directly from `staging_output_record_batches.batches` pub output: RecordBatch, } ``` These contain the parts of the stream, buffer, and output that change during the merge process. Another idea is to organize the contents of the buffer, stream and output separately, which would roughly comprise: ```rust struct StreamContext { /// State of streamed pub streamed_state: StreamedState, /// Current processing record batch of streamed pub streamed_batch: StreamedBatch, /// (used in outer join) Is current streamed row joined at least once? pub streamed_joined: bool, /// Join key columns of streamed pub on_streamed: Vec<PhysicalExprRef>, /// Streamed data stream pub streamed: SendableRecordBatchStream, /// Input schema of streamed pub streamed_schema: SchemaRef, } struct BufferContext { //... } struct OutputState { //... } ``` Do you have any suggestions on which approach would be clearer and more readable? :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org