jayzhan211 commented on PR #13540:
URL: https://github.com/apache/datafusion/pull/13540#issuecomment-2504130372
> Thank you for the comments @jayzhan211 , I have updated.
>
> Now I think the concurrent generator might not be very straightforward, so
> I'll first leave some rationale here. If we can come to an agreement I'll add
> more docs for it (otherwise I'll revert to the `Box` version for simplicity).
>
> ### Rationale for `StreamingMemoryStream` design
> Suppose we want to run `select ... from generate_series(1,100)` in two
> partitions, and the underlying batch generator is wrapped in a `Mutex`:
>
> ```
> pub struct StreamingMemoryStream {
> ...
> generator: Arc<Mutex<dyn StreamingBatchGenerator>>,
> }
> ```
>
> It's possible to implement the UDTF in 3 ways:
>
> 1.
>
> ```
> [generator_1_50] --- [StreamingMemoryStream stream1] --> xxStream1
> [generator_50_100] --- [StreamingMemoryStream stream2] --> xxStream2
> ```
>
> 2.
>
> ```
> [generator_1_100] --- [StreamingMemoryStream stream1] --> Repartition --> xxStream1
>                                                                       |-> xxStream2
> ```
>
> 3.
>
> ```
> [generator_1_100] --- [StreamingMemoryStream stream1] --> xxStream1
>                   |-- [StreamingMemoryStream stream2] --> xxStream2
> ```
>
> 1 and 2 are the common patterns DataFusion scanning operators use for
> plan-time parallelism. There the `generator` won't be accessed by multiple
> threads, so the `Mutex` is redundant. 3 allows the `StreamingBatchGenerator`
> to be accessed concurrently by multiple streams.
>
> The `Mutex` is added to make case 3 possible (so the interface can be more
> general-purpose for future use cases).

From a quick look, I'm not sure whether we need the 3rd case. It seems the
1st case also executes in parallel. I would need to think about the
advantage of the 3rd case over the 1st case.

For the 3rd case, if we need it, we might use `Arc<RwLock<dyn T>>` if
`generate_next_batch` takes `&self`. It might be more efficient.
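
To make the `RwLock` idea concrete, here is a rough sketch of the 3rd case using only the standard library. It is not the PR's code: `BatchGenerator`, `SeriesGenerator`, and `run_two_streams` are hypothetical names, plain threads stand in for the two streams, and a `Vec<u64>` stands in for a record batch. It assumes `generate_next_batch` can take `&self` by keeping its cursor in an atomic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for the PR's `StreamingBatchGenerator`, with
/// `generate_next_batch` taking `&self` instead of `&mut self`.
trait BatchGenerator: Send + Sync {
    /// Returns the next batch of values, or `None` once exhausted.
    fn generate_next_batch(&self) -> Option<Vec<u64>>;
}

/// Produces the integers `1..=end` in batches of `batch_size`.
struct SeriesGenerator {
    /// Next value to hand out; atomic so `&self` can advance it.
    next: AtomicU64,
    end: u64,
    batch_size: u64,
}

impl BatchGenerator for SeriesGenerator {
    fn generate_next_batch(&self) -> Option<Vec<u64>> {
        // Atomically claim the range [start, start + batch_size).
        let start = self.next.fetch_add(self.batch_size, Ordering::Relaxed);
        if start > self.end {
            return None;
        }
        let stop = (start + self.batch_size - 1).min(self.end);
        Some((start..=stop).collect())
    }
}

/// Case 3: two consumers pull batches from one shared generator.
/// Returns every value produced, sorted for easy inspection.
fn run_two_streams(end: u64, batch_size: u64) -> Vec<u64> {
    let generator: Arc<RwLock<dyn BatchGenerator>> =
        Arc::new(RwLock::new(SeriesGenerator {
            next: AtomicU64::new(1),
            end,
            batch_size,
        }));

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let generator = Arc::clone(&generator);
            thread::spawn(move || {
                let mut out = Vec::new();
                loop {
                    // Only a read lock is taken, so both consumers can be
                    // inside `generate_next_batch` at the same time.
                    let batch = generator.read().unwrap().generate_next_batch();
                    match batch {
                        Some(values) => out.extend(values),
                        None => break,
                    }
                }
                out
            })
        })
        .collect();

    let mut all: Vec<u64> = handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect();
    all.sort_unstable();
    all
}
```

Calling `run_two_streams(100, 8)` should yield each of the values 1 to 100 exactly once, split between the two consumers in whatever order they happen to run. If the generator really is `&self`-only like this, a plain `Arc<dyn T>` might be enough and the lock could go away entirely; the `RwLock` only earns its place if some other path still needs exclusive access.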
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.