Re: 20 times higher throughput with Window function vs fold function, intended?

Timo Walther Wed, 29 Mar 2017 03:22:14 -0700

Hi Kamil,

The performance implications might be the result of which state the underlying functions are using internally. WindowFunctions use ListState or ReducingState, fold() uses FoldingState. It also depends on the size of your state and the state backend you are using. I recommend the following documentation page: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/state.html#using-managed-keyed-state
The FoldingState might be deprecated soon, once a better alternative is available.
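
As a rough illustration of why the state type can matter, here is a toy cost model in plain Java. It is not Flink code, and the class and method names are hypothetical. It assumes a state backend that writes the whole accumulator back on every update, which may or may not match your setup; with a backend that keeps objects on the heap the picture would look different.

```java
public class StateCostSketch {

    // Fold-style cost model (assumption): the accumulator holds every record
    // seen so far, and the whole accumulator is written back after each record.
    public static long foldSerializedBytes(int records, int recordSize) {
        long total = 0;
        long accumulatorSize = 0;
        for (int i = 0; i < records; i++) {
            accumulatorSize += recordSize; // accumulator grows by one record
            total += accumulatorSize;      // whole accumulator is rewritten
        }
        return total;
    }

    // List-style cost model (assumption): each record is appended once
    // and never rewritten.
    public static long listAppendSerializedBytes(int records, int recordSize) {
        long total = 0;
        for (int i = 0; i < records; i++) {
            total += recordSize;
        }
        return total;
    }

    public static void main(String[] args) {
        int records = 1000;    // hypothetical records per key and window
        int recordSize = 100;  // hypothetical serialized size in bytes
        System.out.println("fold-style bytes written: "
                + foldSerializedBytes(records, recordSize));
        System.out.println("list-style bytes written: "
                + listAppendSerializedBytes(records, recordSize));
    }
}
```

Under these assumptions the fold-style cost grows quadratically with the number of records per key and window, while the list-style cost grows linearly. Whether this explains the numbers you measured depends on your state backend and state size.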


I hope that helps.

Regards,
Timo

On 29/03/17 at 11:27, Kamil Dziublinski wrote:

Hi guys,

I’m using Flink in production at Mapp. We recently switched from Storm.
Before putting this live I ran performance tests and found something that “feels” a bit off. I have a simple streaming job reading from Kafka, windowing for 3 seconds, and then storing into HBase.
Initially we had this second step written with a fold function, since I thought it was a better idea performance- and resource-wise. But I couldn’t reach more than 120k writes per second to HBase, and I thought the HBase sink was the bottleneck. Then I tried doing the same with a window function and my performance jumped to 2 million writes per second. Just wow :) Compared to Storm, where I had a max of 320k per second, it is amazing.
Both the fold and the window function were doing the same thing: gathering all the records for the same tenant and user (key by is used for that) and putting them in one batched object with ArrayLists for the mutations on the user profile, then passing this object to the sink. I can post the code if it’s needed.
In the case of fold I was just adding each profile mutation to the list, and in the case of the window function I was iterating over all of them and returning the batched entity in one go.
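
A minimal plain-Java sketch of the two styles, leaving out the Flink API entirely. This is not the actual job code; the class, field, and method names are hypothetical and only meant to show that both variants end up with the same batched object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchingSketch {

    // Hypothetical batched object: one per tenant and user.
    public static final class Batch {
        public final String tenant;
        public final String user;
        public final List<String> mutations = new ArrayList<>();

        public Batch(String tenant, String user) {
            this.tenant = tenant;
            this.user = user;
        }
    }

    // Fold style: called once per record, adds one mutation to the accumulator.
    public static Batch foldStep(Batch accumulator, String mutation) {
        accumulator.mutations.add(mutation);
        return accumulator;
    }

    // Window-function style: called once per window with all buffered records.
    public static Batch applyWindow(String tenant, String user, Iterable<String> buffered) {
        Batch batch = new Batch(tenant, user);
        for (String mutation : buffered) {
            batch.mutations.add(mutation);
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("set-name", "set-email", "add-tag");

        Batch folded = new Batch("tenant-1", "user-1");
        for (String record : records) {
            folded = foldStep(folded, record);
        }
        Batch windowed = applyWindow("tenant-1", "user-1", records);

        // Both styles collect the same mutations in the same order.
        System.out.println(folded.mutations.equals(windowed.mutations)); // prints true
    }
}
```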
I’m wondering whether 20 times slower performance is expected just from using the fold function. I would like to know what is so costly about it, as intuitively I would expect the fold function to be the better choice here, since I assume that the window function uses more memory for buffering.
Also, when my colleagues were doing a PoC for the Flink evaluation, they were seeing results very similar to what I am seeing now. But they were still using the fold function. That was on Flink version 1.0.3 and now I am using 1.2.0. So perhaps there is some regression?
Please let me know what you think.

Cheers,
Kamil.
