ameyc opened a new issue, #11404: URL: https://github.com/apache/datafusion/issues/11404
### Is your feature request related to a problem or challenge? Hi DataFusion community, Last month on June 24th, 2024 we demonstrated our POC Stream Processing system on top of DataFusion at the inaugural DataFusion meetup which generated some interest from the attendees as well as Data Infra community at large. Streaming processing is an ask that has come time and up again, and @alamb we re-open the disucssions that were first initiated by [metesynnada](https://github.com/metesynnada) in #4285. Folks at Synnada put together a [detailed proposal here](https://synnada.notion.site/EPIC-Long-running-stateful-execution-support-for-unbounded-data-with-mini-batches-a416b29ae9a5438492663723dbeca805). Remarkably, our POC ended up following similar general principles outlined in the doc. The POC currently supports Streaming Windowed aggregations with watermark tracking and check pointing. Currently state of the play for streaming in DataFusion per our understanding as it stands is as follows - 1. You can write sources that operate in `ExecutionMode::Unbounded` 2. Some complex operators such as `SymmetricHashJoinExec` & `AggregationExec' already operate in this mode (unwatermarked) 3. You can await results of the continuous computations in batches using `df.execute_stream().await.unwrap()` I know Synnada folks have done a lot of work already, so feel free to chime in. We wanted to gauge interest in the community for reviving #4285 . Streaming Processing may appear a somewhat _niche_ use case, much of work within DataFusion is highly relevant as well as can be enabled with minimal changes to the DF upstream which may benefit developers of other use cases as well. Stream Processing workloads are continuous in nature and as such require some operators to be stateful as to keep track of things such as _watermarks_ as well as _checkpoint state_ for failure recovery. A pluggable state backend support would be ideal, this also may be useful for operators that spill to disk. To that end we opened some tickets - 1. [StateBackend in DataFusion's RuntimeEnv](https://github.com/apache/datafusion/issues/11365) 2. [Make Accumulators and ScalarValue serializable](https://github.com/apache/datafusion/issues/11369) 3. [Deterministic IDs for ExecutionPlan](https://github.com/apache/datafusion/issues/11364) 4. [Add StreamingWindowExec to DataFusion physical plan to support aggregations over unbounded data](https://github.com/apache/datafusion/issues/11364) Additionally, for places where changes are too streaming specific, would love to get input on how to best overlay our streaming project on top of DataFusion, @alamb had some tips from InfluxDB but a _how to_ would be very nice. Happy to take a stab at putting together a gentle introduction for future developers from our learnings. Thanks, @ameyc & @emgeee ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
