Re: [I] Organize fields inside `SortMergeJoinStream` [datafusion]

via GitHub Fri, 28 Mar 2025 13:09:25 -0700


suibianwanwank commented on issue #15406:
URL: https://github.com/apache/datafusion/issues/15406#issuecomment-2762027162


   Hi, @2010YOUY01. I have read and tried to understand both 
SortMergeJoinStream and GroupedHashAggregateStream (though I still have some 
uncertainties). I have some initial thoughts on organizing the structure, but 
I’m not sure about the right level of granularity.
   
   One idea is to introduce a SortMergeState, which would roughly comprise:
   ```rust
   struct SortMergeState {
       /// State of streamed
       pub streamed_state: StreamedState,
       /// State of buffered
       pub buffered_state: BufferedState,
       /// Staging output size, including output batches and staging joined results.
       /// Increased when we put rows into buffer and decreased after we actually output batches.
       /// Used to trigger output when sufficient rows are ready
       pub output_size: usize,
       /// Target output batch size
       pub batch_size: usize,
       /// Current processing record batch of streamed
       pub streamed_batch: StreamedBatch,
       /// Current buffered data
       pub buffered_data: BufferedData,
       /// (used in outer join) Is current streamed row joined at least once?
       pub streamed_joined: bool,
       /// (used in outer join) Is current buffered batches joined at least once?
       pub buffered_joined: bool,
       /// A unique number for each batch
       pub streamed_batch_counter: AtomicUsize,
       /// Staging output array builders
       pub staging_output_record_batches: JoinedRecordBatches,
       /// Output buffer. Currently used by filtering as it requires double buffering
       /// to avoid small/empty batches. Non-filtered join outputs directly from `staging_output_record_batches.batches`
       pub output: RecordBatch,
   }
   ```
   This struct would hold the parts of the stream, buffer, and output that 
change during the merge process; everything else would stay on 
`SortMergeJoinStream` itself.
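   
   To illustrate how the `output_size` / `batch_size` pair could be used once 
it lives in its own struct, here is a minimal sketch. It is a trimmed-down 
stand-in, not the real type: only the two counters are kept, and the helper 
names `stage_rows`, `should_emit` and `mark_emitted` are hypothetical, not 
existing DataFusion methods.
   ```rust
   /// Trimmed-down stand-in for the proposed `SortMergeState`.
   /// Only the two counters are kept; all other fields are omitted.
   struct SortMergeState {
       /// Rows currently staged for output.
       output_size: usize,
       /// Target output batch size.
       batch_size: usize,
   }

   impl SortMergeState {
       /// Hypothetical helper: record that `rows` joined rows were staged.
       fn stage_rows(&mut self, rows: usize) {
           self.output_size += rows;
       }

       /// Hypothetical helper: true once enough rows are staged to emit a batch.
       fn should_emit(&self) -> bool {
           self.output_size >= self.batch_size
       }

       /// Hypothetical helper: record that `rows` rows were actually output.
       /// Saturates at zero so an over-count cannot underflow.
       fn mark_emitted(&mut self, rows: usize) {
           self.output_size = self.output_size.saturating_sub(rows);
       }
   }
   ```
   With something like this, the stream's poll loop would only need to ask 
the state whether output is due, instead of comparing the two fields inline.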
   
   Another idea is to organize the contents of the buffer, stream and output 
separately, which would roughly comprise:
   ```rust
   struct StreamContext {
       /// State of streamed
       pub streamed_state: StreamedState,
       /// Current processing record batch of streamed
       pub streamed_batch: StreamedBatch,
       /// (used in outer join) Is current streamed row joined at least once?
       pub streamed_joined: bool,
       /// Join key columns of streamed
       pub on_streamed: Vec<PhysicalExprRef>,
       /// Streamed data stream
       pub streamed: SendableRecordBatchStream,
       /// Input schema of streamed
       pub streamed_schema: SchemaRef,
   }
   
   struct BufferContext {
   //...
   }
   
   struct OutputState {
   //...
   }
   ```
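   
   With this second approach, `SortMergeJoinStream` would then just compose 
the three parts. A rough sketch of the shape, using placeholder types: the 
fields shown inside `BufferContext` and `OutputState` are only my guess at 
where things from the first listing might land, and `record_join` is a 
hypothetical method, not existing code.
   ```rust
   // Placeholder types; the real ones would carry the DataFusion fields.
   struct StreamContext {
       streamed_joined: bool,
   }

   // Assumption: `buffered_joined` would move here.
   struct BufferContext {
       buffered_joined: bool,
   }

   // Assumption: the output counters would move here.
   struct OutputState {
       output_size: usize,
       batch_size: usize,
   }

   struct SortMergeJoinStream {
       stream: StreamContext,
       buffer: BufferContext,
       output: OutputState,
   }

   impl SortMergeJoinStream {
       /// Hypothetical: record one joined (streamed, buffered) row pair and
       /// report whether enough rows are staged to emit a batch.
       fn record_join(&mut self) -> bool {
           self.stream.streamed_joined = true;
           self.buffer.buffered_joined = true;
           self.output.output_size += 1;
           self.output.output_size >= self.output.batch_size
       }
   }
   ```
   The point is that each method touches clearly named sub-structs, so it is 
easier to see which side of the join a field belongs to.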
   
   Do you have any suggestions on which approach would be clearer and more 
readable?  :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

