What is the parallelism of the sink (or the operator that writes to the sinks) in the first case? HBase puts are constrained by the following:

1. How your regions are distributed. Are you pre-splitting the regions for the table? Do you know how many regions your HBase tables are split into?

2. Are all the sinks writing to all the regions? That is, does each sink operator receive records that could potentially go to any region? This can become a big bottleneck if you have 40 sinks talking to all regions. I pre-split my regions based on key salting and use custom partitioning to ensure each sink operator writes to only a few regions, and my performance shot up by several orders of magnitude (see the routing sketch after this list).

3. You are probably doing this already, but adding puts in batches helps in general; having each batch contain puts for too many regions hurts, though (see the batching sketch below).
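To make point 2 concrete, here is a minimal sketch (not my actual code) of key salting plus a custom partitioner in the Flink DataStream API, so that each sink subtask only ever targets a small set of regions. It assumes the HBase table is pre-split on the salt prefix (split points "1|", "2|", ...), that records are (rowKey, value) tuples, and that NUM_SALTS matches your region count:

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class SaltedRouting {

    // One salt bucket per pre-split region; pick this to match your region count (assumption).
    static final int NUM_SALTS = 40;

    // Prefix the row key with a stable salt so writes spread evenly across regions.
    static String saltedRowKey(String key) {
        int salt = Math.abs(key.hashCode()) % NUM_SALTS;
        return salt + "|" + key;
    }

    // Route each record to the sink subtask that owns its salt bucket, so a given
    // subtask writes to only a few regions instead of all of them.
    static DataStream<Tuple2<String, String>> routeBySalt(DataStream<Tuple2<String, String>> records) {
        return records.partitionCustom(
                new Partitioner<Integer>() {
                    @Override
                    public int partition(Integer salt, int numPartitions) {
                        return salt % numPartitions;
                    }
                },
                new KeySelector<Tuple2<String, String>, Integer>() {
                    @Override
                    public Integer getKey(Tuple2<String, String> record) {
                        return Math.abs(record.f0.hashCode()) % NUM_SALTS;
                    }
                });
    }
}

With this routing in place and a sink parallelism close to the number of salt buckets, each subtask ends up writing to roughly one region rather than all of them.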
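And for point 3, a rough sketch of a sink that batches puts through the HBase client's BufferedMutator (assuming an HBase 1.x client, a table named "events" with column family "d", and row keys already salted as above; table name, family and buffer size are placeholders):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchingHBaseSink extends RichSinkFunction<Tuple2<String, String>> {

    private transient Connection connection;
    private transient BufferedMutator mutator;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("events"))
                .writeBufferSize(4 * 1024 * 1024);   // flush roughly every 4 MB of puts (assumption)
        mutator = connection.getBufferedMutator(params);
    }

    @Override
    public void invoke(Tuple2<String, String> record) throws Exception {
        Put put = new Put(Bytes.toBytes(record.f0));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(record.f1));
        mutator.mutate(put);   // buffered client-side; batches are flushed per region server
    }

    @Override
    public void close() throws Exception {
        if (mutator != null) { mutator.flush(); mutator.close(); }
        if (connection != null) { connection.close(); }
    }
}

Because the routing above keeps each subtask's keys within a few salt buckets, every flushed batch only touches a handful of regions.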
If the source parallelism is the same as the parallelism of the other operators, the 40 sinks communicating with all regions might be the problem. When you go down to 20 sinks you might actually be getting better performance due to less resource contention on HBase. (A minimal sketch of pinning the source parallelism to the partition count follows the quoted message below.)

> On Aug 3, 2016, at 4:14 AM, neo21 zerro <neo21_ze...@yahoo.com> wrote:
>
> Hello everybody,
>
> I'm using the Flink Kafka consumer 0.8.x with Kafka 0.8.2 and Flink 1.0.3 on YARN.
> In Kafka I have a topic which has 20 partitions, and my Flink topology reads
> from Kafka (source) and writes to HBase (sink).
>
> When:
> 1. the Flink source has parallelism set to 40 (20 of the tasks are idle), I
> see 10.000 requests/sec on HBase
> 2. the Flink source has parallelism set to 20 (the exact number of partitions), I
> see 100.000 requests/sec on HBase (so a 10x improvement)
>
> It's clear that HBase is not the limiting factor in my topology.
> Assumption: Flink's backpressure mechanism kicks in more aggressively in the first
> case and limits the ingestion of tuples into the topology.
>
> The question: In the first case, why are those 20 sources which are sitting
> idle contributing so much to the backpressure?
>
> Thanks guys!
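For completeness, a minimal sketch of setting the Kafka source parallelism to exactly the partition count (20) while leaving the rest of the job at its own parallelism; the topic name, broker/ZooKeeper addresses, and group id are placeholders, not the poster's actual settings:

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceParallelism {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("zookeeper.connect", "zk:2181");      // placeholder
        props.setProperty("bootstrap.servers", "broker:9092");  // placeholder
        props.setProperty("group.id", "flink-consumer");        // placeholder

        DataStream<String> source = env
                .addSource(new FlinkKafkaConsumer08<>("events", new SimpleStringSchema(), props))
                .setParallelism(20);   // exactly one source subtask per Kafka partition, none idle

        // ... transform and write to the HBase sink at whatever parallelism fits ...
        source.print();

        env.execute("kafka-to-hbase");
    }
}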