Apache Flink and serious streaming stateful processing

Krzysztof Zarzycki Tue, 30 Jun 2015 05:08:53 -0700

Greetings!
I'm extremely interested in Apache Flink, I think you're doing really a
great job! But please allow me to share two things that I would require
from Apache Flink to consider it as groundbreaking (it is what I need for
Streaming framework):


1. Stream backpressure. When stream processing part does not keep up,
please pause receiving new data. This is a serious problem in other
frameworks, like Spark Streaming. Please see the ticket in Spark about it:
https://issues.apache.org/jira/browse/SPARK-7398
2. Support for (serious) stateful processing. What I mean by that is to be
able to keep state of the application in key-value stores, out-of-core, in
embedded mode. I want to be able to keep, let's say history of events from
last two months, grouped & accessible by user_id, and don't want to use
external database for that (e.g. Cassandra). Communicating with external
database would kill my performance especially when *reprocessing*
historical data. And I definitely don't want to escape to batch processing
(like in Lambda Architecture).

These two are the most important (IMHO) lacks in Spark Streaming and are
the reasons I'm not using it. These two are supported by Samza, which in
code and API is not excellent, but at least allows serious stream
processing, that does not require repeating the processing pipeline in
batch (Hadoop).

I'm looking forward to seeing features like these in Flink. Or they are
already there and I'm just missing something?

Thanks!
Krzysztof Zarzycki

Apache Flink and serious streaming stateful processing

Reply via email to