zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922200568
> > For example, FileStream could count how many batches it sends back-to-back without yielding, and after a certain threshold, it yields. WDYT?
>
> Perhaps DataSourceExec could wrap the stream returned by the DataSource with the YieldStream introduced in this PR so?

> Thanks @zhuqi-lucas. The problem is clearly visible here, and the solution makes sense. It doesn't sacrifice performance as seen in the benchmarks, and not introduce any complexity.
>
> However, I'm wondering if this issue could arise in other places as well. For example, in Sort streams, one-side collecting joins, large window frames, etc. In short, many streams could suffer from the same problem. Rather than wrapping each of these individually and spreading this workaround like a virus across all pipeline-breaking streams, I think we should address it at the source level. If sources yield control periodically, regardless of the pipeline, we could solve this issue with a single, centralized fix. For example, FileStream could count how many batches it sends back-to-back without yielding, and after a certain threshold, it yields. WDYT?

> I'm not sure but repartition yield can also be removed maybe if we do such

Thank you @berkaysynnada, @pepijnve. I agree that adding YieldStream at the DataSource level is the better idea, and I will try to address this suggestion. If we do it that way, the yield in repartition could perhaps be removed as well, which looks like an additional benefit; I will investigate that too. It may be a follow-up, or I can add it in this PR.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
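
For readers following the thread, the "count batches and yield after a threshold" idea quoted above can be sketched roughly as follows. This is a std-only illustration under stated assumptions, not the YieldStream from this PR: `BatchSource`, `AlwaysReady`, `YieldEvery`, `NoopWake`, and `drive` are hypothetical names, and a real implementation would wrap an actual async record batch stream rather than this stand-in trait.

```rust
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a poll-based batch stream (std-only, not a real DataFusion trait).
trait BatchSource {
    type Item;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// A source that is always ready, like a fast scan that never waits on I/O.
struct AlwaysReady {
    remaining: usize,
}

impl BatchSource for AlwaysReady {
    type Item = usize;
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<Option<usize>> {
        if self.remaining == 0 {
            return Poll::Ready(None);
        }
        self.remaining -= 1;
        Poll::Ready(Some(self.remaining))
    }
}

/// Hypothetical wrapper: after `threshold` consecutive ready batches,
/// return `Pending` once so the scheduler can run other tasks.
struct YieldEvery<S> {
    inner: S,
    threshold: usize,
    consecutive: usize,
}

impl<S: BatchSource> BatchSource for YieldEvery<S> {
    type Item = S::Item;
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<S::Item>> {
        if self.consecutive >= self.threshold {
            self.consecutive = 0;
            // Ask to be polled again right away; we only give up our turn.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        match self.inner.poll_next(cx) {
            Poll::Ready(Some(item)) => {
                self.consecutive += 1;
                Poll::Ready(Some(item))
            }
            // The inner source yielded on its own or finished: reset the counter.
            other => {
                self.consecutive = 0;
                other
            }
        }
    }
}

/// Waker that does nothing; enough for driving the sketch by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a source to completion and returns (batches seen, yields seen).
fn drive<S: BatchSource>(mut source: S) -> (usize, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let (mut batches, mut yields) = (0, 0);
    loop {
        match source.poll_next(&mut cx) {
            Poll::Ready(Some(_)) => batches += 1,
            Poll::Ready(None) => return (batches, yields),
            Poll::Pending => yields += 1,
        }
    }
}
```

Driving `YieldEvery { inner: AlwaysReady { remaining: 10 }, threshold: 4, consecutive: 0 }` through `drive` delivers all 10 batches with 2 yields in between, while the unwrapped source never yields. Placing such a wrapper at the source means downstream pipeline-breaking operators would not each need their own yield logic, which is the point raised in the quoted comments.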
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
---------------------------------------------------------------------
For additional commands, e-mail: github-h...@datafusion.apache.org