Re: 20 times higher throughput with Window function vs fold function, intended?

Ted Yu Thu, 30 Mar 2017 02:52:08 -0700

Kamil:
The upcoming HBase 2.0 release contains more write-path optimizations, 
which should boost write performance further.


FYI 

> On Mar 30, 2017, at 1:07 AM, Kamil Dziublinski <kamil.dziublin...@gmail.com> 
> wrote:
> 
> Hey guys,
> 
> Sorry for the confusion. It turned out that I had a bug in my code: I was 
> not clearing the list in my batch object on each apply call. I forgot that 
> this has to be added, since it's different from fold.
> That is what led to such high throughput. When I fixed this I was back to 
> 160k per sec. I'm still investigating how I can speed it up.
> 
> As a side note, it's quite interesting that HBase was able to do 2 million 
> puts per second. But most of them had already been stored by a previous 
> call, so perhaps internally it is able to tell in memory whether a put was 
> already stored or not. Not sure.
> 
> Anyway my claim about window vs fold performance difference was wrong. So 
> forget about it ;)
> 
>> On Wed, Mar 29, 2017 at 12:21 PM, Timo Walther <twal...@apache.org> wrote:
>> Hi Kamil,
>> 
>> the performance difference might result from which state the underlying 
>> functions use internally. WindowFunctions use ListState or ReducingState, 
>> fold() uses FoldingState. It also depends on the size of your state and the 
>> state backend you are using. The FoldingState might be deprecated soon, 
>> once a better alternative is available. I recommend the following 
>> documentation page: 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/state.html#using-managed-keyed-state
>> 
>> I hope that helps.
>> 
>> Regards,
>> Timo
>> 
>>> On 29/03/17 at 11:27, Kamil Dziublinski wrote:
>>> Hi guys,
>>> 
>>> I’m using Flink in production at Mapp. We recently switched from Storm.
>>> Before putting this live I ran performance tests and found something 
>>> that “feels” a bit off.
>>> I have a simple streaming job that reads from Kafka, windows for 3 
>>> seconds and then stores into HBase.
>>> 
>>> Initially we had this second step written with a fold function, since I 
>>> thought it was the better idea performance- and resource-wise. 
>>> But I couldn’t reach more than 120k writes per second to HBase, and I 
>>> thought the HBase sink was the bottleneck here. But then I tried doing the 
>>> same with a window function and my performance jumped to 2 million writes 
>>> per second. Just wow :) Compared to Storm, where I had a max of 320k per 
>>> second, it is amazing.
>>> 
>>> Both the fold and window functions were doing the same thing: gathering 
>>> all the records for the same tenant and user (keyBy is used for that) and 
>>> putting them in one batched object with ArrayLists for the mutations on 
>>> the user profile, then passing this object to the sink. I can post the 
>>> code if it's needed. 
>>> 
>>> In the fold case I was just adding each profile mutation to the list, and 
>>> in the window function case I was iterating over all of them and returning 
>>> the batched entity in one go.
>>> 
>>> I’m wondering if 20 times slower performance is expected just from using 
>>> a fold function. I would like to know what is so costly about it, as 
>>> intuitively I would expect the fold function to be the better choice here, 
>>> since I assume the window function uses more memory for buffering.
>>> 
>>> Also, when my colleagues did a PoC to evaluate Flink, they saw results 
>>> very similar to what I am seeing now, even though they were using the 
>>> fold function. That was on Flink version 1.0.3, and now I am using 
>>> 1.2.0. So perhaps there is some regression?
>>> 
>>> Please let me know what you think.
>>> 
>>> Cheers,
>>> Kamil.
>
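
A minimal sketch of the bug Kamil describes in his follow-up, in plain Java with no Flink dependency. The class names (Batch, BuggyWindowFunction, FixedWindowFunction) and the put-counting helpers are hypothetical and are not taken from his job. The sketch only assumes that one batch object was reused across apply calls without its list being cleared, so each emitted batch also carried the records of earlier windows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchReuseSketch {

    // Hypothetical stand-in for the batched object handed to the sink.
    static class Batch {
        final List<String> mutations = new ArrayList<>();
    }

    // Assumed shape of the bug: the same Batch is reused on every apply()
    // call and its list is never cleared, so it keeps growing.
    static class BuggyWindowFunction {
        private final Batch batch = new Batch();

        Batch apply(List<String> windowContents) {
            batch.mutations.addAll(windowContents);
            return batch;
        }
    }

    // Fixed variant: the list is cleared at the start of each apply() call.
    static class FixedWindowFunction {
        private final Batch batch = new Batch();

        Batch apply(List<String> windowContents) {
            batch.mutations.clear();
            batch.mutations.addAll(windowContents);
            return batch;
        }
    }

    // Counts how many puts a sink would issue if it wrote every mutation
    // of every batch it receives from the buggy function.
    static int putsWithBug(List<List<String>> windows) {
        BuggyWindowFunction fn = new BuggyWindowFunction();
        int puts = 0;
        for (List<String> window : windows) {
            puts += fn.apply(window).mutations.size();
        }
        return puts;
    }

    // Same count for the fixed function.
    static int putsWithFix(List<List<String>> windows) {
        FixedWindowFunction fn = new FixedWindowFunction();
        int puts = 0;
        for (List<String> window : windows) {
            puts += fn.apply(window).mutations.size();
        }
        return puts;
    }

    public static void main(String[] args) {
        List<List<String>> windows = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"),
                Arrays.asList("e", "f"));
        System.out.println("puts with bug: " + putsWithBug(windows)); // 2 + 4 + 6 = 12
        System.out.println("puts with fix: " + putsWithFix(windows)); // 2 + 2 + 2 = 6
    }
}
```

With three windows of two records each, the uncleared version issues 12 puts instead of 6, and the extra ones are repeats of records already written. That is consistent with a throughput figure inflated by re-sent puts, as reported in the thread.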
