[I] [DISCUSSION] Support for Streaming in DataFusion [datafusion]

via GitHub Wed, 10 Jul 2024 18:12:45 -0700


ameyc opened a new issue, #11404:
URL: https://github.com/apache/datafusion/issues/11404

### Is your feature request related to a problem or challenge?

Hi DataFusion community,

Last month on June 24th, 2024 we demonstrated our POC Stream Processing
system on top of DataFusion at the inaugural DataFusion meetup which generated
some interest from the attendees as well as Data Infra community at large.
Streaming processing is an ask that has come time and up again, and @alamb we
re-open the disucssions that were first initiated by
[metesynnada](https://github.com/metesynnada) in #4285. Folks at Synnada put
together a [detailed proposal
here](https://synnada.notion.site/EPIC-Long-running-stateful-execution-support-for-unbounded-data-with-mini-batches-a416b29ae9a5438492663723dbeca805).
Remarkably, our POC ended up following similar general principles outlined in
the doc. The POC currently supports Streaming Windowed aggregations with
watermark tracking and check pointing.

Currently state of the play for streaming in DataFusion per our
understanding as it stands is as follows -

1. You can write sources that operate in `ExecutionMode::Unbounded`
2. Some complex operators such as `SymmetricHashJoinExec` &
`AggregationExec' already operate in this mode (unwatermarked)
3. You can await results of the continuous computations in batches using
`df.execute_stream().await.unwrap()`

I know Synnada folks have done a lot of work already, so feel free to chime
in. We wanted to gauge interest in the community for reviving #4285 . Streaming
Processing may appear a somewhat _niche_ use case, much of work within
DataFusion is highly relevant as well as can be enabled with minimal changes to
the DF upstream which may benefit developers of other use cases as well.

Stream Processing workloads are continuous in nature and as such require
some operators to be stateful as to keep track of things such as _watermarks_
as well as _checkpoint state_ for failure recovery. A pluggable state backend
support would be ideal, this also may be useful for operators that spill to
disk.

To that end we opened some tickets -

1. [StateBackend in DataFusion's
RuntimeEnv](https://github.com/apache/datafusion/issues/11365)
2. [Make Accumulators and ScalarValue
serializable](https://github.com/apache/datafusion/issues/11369)
3. [Deterministic IDs for
ExecutionPlan](https://github.com/apache/datafusion/issues/11364)
4. [Add StreamingWindowExec to DataFusion physical plan to support
aggregations over unbounded
data](https://github.com/apache/datafusion/issues/11364)

Additionally, for places where changes are too streaming specific, would
love to get input on how to best overlay our streaming project on top of
DataFusion, @alamb had some tips from InfluxDB but a _how to_ would be very
nice. Happy to take a stab at putting together a gentle introduction for future
developers from our learnings.

Thanks,

@ameyc & @emgeee

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] [DISCUSSION] Support for Streaming in DataFusion [datafusion]

Reply via email to