Hi Krzysztof,

Thanks for the kind words! I think Flink is to a good extent already set up to provide what you are looking for. The remaining gaps are WIP.
Let me elaborate a bit on Gyula's answers:

1) Backpressure is very much there; it has always worked well, better than in Storm, as far as I can tell. Flink builds streams by shipping buffers through logical channels, which are multiplexed over back-pressured network channels. The buffers (on both the sender and the receiver side) come from managed, bounded buffer pools. As soon as receivers slow down, a bounded amount of data queues up in the buffer pools on both sides, and then the producing operator blocks until space in the buffer pool becomes available. The back pressure propagates all the way back to the sources, and eventually a source will stop grabbing more data and leave it where it is (for example, in Kafka).

2) This part is currently evolving, API-wise. It would be good to get your input to make sure we validate the design against real-world use cases.

Let me make sure we understand correctly what you want to do: you want to do stateful computation and use a key/value state abstraction, but the state should not go into an external key/value store. It should be maintained in Flink, in an out-of-core-enabled fashion.

You can do much (but not all) of that right now. You can keep a hash map in your application and make state changes on it; the hash map can be backed up by the fault-tolerance system. This will, however, only work up to a certain size. The benefit (compared to Samza) is that state restoring is quite fast. We are working on making incrementally backed-up key/value state a first-class citizen in Flink, but it is still WIP.

For now, concerning out-of-core state, you can experiment with embedding an out-of-core key/value database in your operators, something like http://www.mapdb.org. Because operators are long-lived (unlike in mini batches), this database will keep existing as well. You can even write a method that lets Flink back it up periodically into HDFS.
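To make the hash-map approach concrete, here is a minimal plain-Java sketch of the pattern (this is not the actual Flink OperatorState API; the class and method names are illustrative): the operator keeps a hash map keyed by user id, and exposes snapshot/restore hooks of the kind a checkpointing mechanism would call.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not Flink API: map-based operator state with
// snapshot/restore hooks for a periodic checkpointing mechanism.
public class MapStateSketch {
    private final Map<String, Long> counts = new HashMap<>();

    // Process one event: update the per-key state in place.
    public void processEvent(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    // Snapshot for checkpointing: copy the state so a checkpoint can be
    // written out (e.g. to HDFS) while processing continues on the live map.
    public Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    // Restore from the latest completed checkpoint after a failure.
    public void restore(Map<String, Long> checkpoint) {
        counts.clear();
        counts.putAll(checkpoint);
    }

    public long get(String key) {
        return counts.getOrDefault(key, 0L);
    }
}
```

For the out-of-core variant, the `HashMap` would simply be replaced by a MapDB file-backed map, and `snapshot()` would copy the database file to HDFS instead of copying the in-memory map.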
It should work as long as the checkpoint interval is not too high.

Let us know how far that gets you. We will also keep you posted on advances in the state abstraction.

Greetings,
Stephan

On Tue, Jun 30, 2015 at 2:23 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Krzysztof,
>
> Thank you for your questions, we are happy to help you get started.
>
> Regarding your questions:
>
> 1. There is backpressure for the streams, so if the downstream operators
> cannot keep up, the sources will slow down.
>
> 2. We have support for stateful processing in Flink in many of the ways
> you have described in your question. Unfortunately the docs are down
> currently, but you should check out the 'Stateful processing' section in
> the 0.10 docs (once they are back online). We practically support an
> OperatorState interface which lets you keep partitioned state by some key
> and access it from runtime operators. The states declared using these
> interfaces are checkpointed and will be restored on failure. Currently
> all the states are stored in-memory, but we are planning to extend this
> to allow writing state updates to external systems.
>
> I will send you some pointers once the docs are up again.
>
> Cheers,
> Gyula
>
> Krzysztof Zarzycki <k.zarzy...@gmail.com> wrote (on Tue, Jun 30, 2015 at
> 14:07):
>
>> Greetings!
>> I'm extremely interested in Apache Flink; I think you're doing a really
>> great job! But please allow me to share two things that I would require
>> from Apache Flink to consider it groundbreaking (this is what I need
>> from a streaming framework):
>>
>> 1. Stream backpressure. When the stream-processing part does not keep
>> up, please pause receiving new data. This is a serious problem in other
>> frameworks, like Spark Streaming. Please see the Spark ticket about it:
>> https://issues.apache.org/jira/browse/SPARK-7398
>> 2. Support for (serious) stateful processing.
>> What I mean by that is being able to keep the state of the application
>> in key/value stores, out-of-core, in embedded mode. I want to be able to
>> keep, let's say, a history of events from the last two months, grouped &
>> accessible by user_id, and I don't want to use an external database for
>> that (e.g. Cassandra). Communicating with an external database would
>> kill my performance, especially when *reprocessing* historical data. And
>> I definitely don't want to escape to batch processing (like in the
>> Lambda Architecture).
>>
>> These two are the most important (IMHO) gaps in Spark Streaming and are
>> the reasons I'm not using it. These two are supported by Samza, which in
>> code and API is not excellent, but at least allows serious stream
>> processing that does not require repeating the processing pipeline in
>> batch (Hadoop).
>>
>> I'm looking forward to seeing features like these in Flink. Or are they
>> already there and I'm just missing something?
>>
>> Thanks!
>> Krzysztof Zarzycki
>>