ameyc opened a new issue, #11404:
URL: https://github.com/apache/datafusion/issues/11404

   ### Is your feature request related to a problem or challenge?
   
   Hi DataFusion community,
   
   Last month, on June 24th, 2024, we demonstrated our POC stream processing 
system built on top of DataFusion at the inaugural DataFusion meetup, which 
generated some interest from the attendees as well as the data infra community 
at large. Stream processing is an ask that has come up time and again, so, 
@alamb, we would like to re-open the discussions that were first initiated by 
[metesynnada](https://github.com/metesynnada) in #4285. Folks at Synnada put 
together a [detailed proposal 
here](https://synnada.notion.site/EPIC-Long-running-stateful-execution-support-for-unbounded-data-with-mini-batches-a416b29ae9a5438492663723dbeca805).
 Remarkably, our POC ended up following general principles similar to those 
outlined in the doc. The POC currently supports streaming windowed aggregations 
with watermark tracking and checkpointing.
   
   Our understanding of the current state of play for streaming in DataFusion 
is as follows:
   
   1. You can write sources that operate in `ExecutionMode::Unbounded`
   2. Some complex operators such as `SymmetricHashJoinExec` and 
`AggregateExec` already operate in this mode (unwatermarked)
   3. You can await results of the continuous computations in batches using 
`df.execute_stream().await.unwrap()`
   
   I know the Synnada folks have done a lot of work already, so feel free to 
chime in. We wanted to gauge interest in the community for reviving #4285. 
While stream processing may appear to be a somewhat _niche_ use case, much of 
the work within DataFusion is highly relevant to it, and it can be enabled with 
minimal changes to DF upstream that may benefit developers of other use cases 
as well.
   
   Stream processing workloads are continuous in nature and as such require 
some operators to be stateful, so as to keep track of things such as 
_watermarks_ as well as _checkpoint state_ for failure recovery. Pluggable 
state backend support would be ideal; this may also be useful for operators 
that spill to disk.
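
To make the idea of a stateful operator concrete, here is a minimal, 
std-only sketch of a tumbling-window sum that tracks a watermark and 
checkpoints its state through a pluggable backend. None of these names 
(`StateBackend`, `InMemoryBackend`, `WindowedSum`) are DataFusion APIs or 
taken from our POC; they are hypothetical, and the byte layout used for the 
checkpoint is an assumption made purely for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical pluggable state backend: a minimal key-value interface.
pub trait StateBackend {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory backend, for illustration only; a real one might be on disk.
#[derive(Default)]
pub struct InMemoryBackend {
    entries: HashMap<String, Vec<u8>>,
}

impl StateBackend for InMemoryBackend {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
}

/// Tumbling-window sum whose windows close as the watermark advances.
pub struct WindowedSum {
    window_ms: u64,
    watermark_ms: u64,
    open: BTreeMap<u64, i64>, // window start -> running sum
}

impl WindowedSum {
    pub fn new(window_ms: u64) -> Self {
        WindowedSum { window_ms, watermark_ms: 0, open: BTreeMap::new() }
    }

    /// Fold one event into its window; events for closed windows are dropped.
    pub fn push(&mut self, ts_ms: u64, value: i64) {
        let start = ts_ms - ts_ms % self.window_ms;
        if start + self.window_ms <= self.watermark_ms {
            return; // late event: its window has already been emitted
        }
        *self.open.entry(start).or_insert(0) += value;
    }

    /// Advance the watermark and emit `(window_start, sum)` for closed windows.
    pub fn advance_watermark(&mut self, watermark_ms: u64) -> Vec<(u64, i64)> {
        self.watermark_ms = self.watermark_ms.max(watermark_ms);
        let (window, wm) = (self.window_ms, self.watermark_ms);
        let closed: Vec<u64> = self
            .open
            .keys()
            .copied()
            .take_while(|start| start + window <= wm)
            .collect();
        closed
            .into_iter()
            .filter_map(|start| self.open.remove(&start).map(|sum| (start, sum)))
            .collect()
    }

    /// Write the watermark and all open windows as little-endian 8-byte words.
    pub fn checkpoint(&self, backend: &mut dyn StateBackend, key: &str) {
        let mut bytes = self.watermark_ms.to_le_bytes().to_vec();
        for (start, sum) in &self.open {
            bytes.extend_from_slice(&start.to_le_bytes());
            bytes.extend_from_slice(&sum.to_le_bytes());
        }
        backend.put(key, bytes);
    }

    /// Rebuild the operator from the last checkpoint, if one exists.
    pub fn restore(window_ms: u64, backend: &dyn StateBackend, key: &str) -> Option<Self> {
        let bytes = backend.get(key)?;
        // Read the i-th 8-byte word, or None once past the end of the buffer.
        let word = |i: usize| -> Option<[u8; 8]> {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(bytes.get(i * 8..i * 8 + 8)?);
            Some(buf)
        };
        let mut op = Self::new(window_ms);
        op.watermark_ms = u64::from_le_bytes(word(0)?);
        let mut i = 1;
        while let (Some(start), Some(sum)) = (word(i), word(i + 1)) {
            op.open.insert(u64::from_le_bytes(start), i64::from_le_bytes(sum));
            i += 2;
        }
        Some(op)
    }
}
```

The operator only ever talks to the `StateBackend` trait, so swapping the 
in-memory map for a disk-backed store would not change the operator itself. 
The hand-rolled byte encoding is also why serializable accumulators matter: a 
real implementation would need a stable serialized form for arbitrary 
aggregate state, not just a sum.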
   
   To that end, we opened some tickets:
   
   1. [StateBackend in DataFusion's 
RuntimeEnv](https://github.com/apache/datafusion/issues/11365) 
   2. [Make Accumulators and ScalarValue 
serializable](https://github.com/apache/datafusion/issues/11369)
   3. [Deterministic IDs for 
ExecutionPlan](https://github.com/apache/datafusion/issues/11364)
   4. [Add StreamingWindowExec to DataFusion physical plan to support 
aggregations over unbounded 
data](https://github.com/apache/datafusion/issues/11364)
   
   Additionally, for places where changes are too streaming-specific, we would 
love to get input on how best to overlay our streaming project on top of 
DataFusion. @alamb had some tips from InfluxDB, but a _how-to_ would be very 
nice. We are happy to take a stab at putting together a gentle introduction for 
future developers from our learnings.
   
   Thanks,
   
   @ameyc & @emgeee 
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

