Hi! Are you running on ProcessingTime or on EventTime?
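(By that I mean the time characteristic set on the StreamExecutionEnvironment. A minimal sketch of where that setting lives, assuming the usual Flink 1.0.x Java API, just to make sure we are talking about the same switch:)

    // Fragment from the job setup, not a complete program:
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // EventTime: windows are driven by watermarks emitted by the sources.
    // ProcessingTime: windows are driven by each operator's wall clock.
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);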
Thanks, Stephan

On Wed, Aug 3, 2016 at 11:57 AM, neo21 zerro <neo21_ze...@yahoo.com> wrote:

> Hi guys,
>
> Thanks for getting back to me.
>
> So to clarify, topology-wise:
> Flink Kafka source (does Avro deserialization and a small map) ->
> window operator which does batching for 3 seconds -> HBase sink
>
> Experiments:
>
> 1. Flink source: parallelism 40 (20 idle tasks) -> window operator:
>    parallelism 160 -> HBase sink: parallelism 160
>    - roughly 10.000 requests/sec on HBase
> 2. Flink source: parallelism 20 -> window operator: parallelism 160 ->
>    HBase sink: parallelism 160
>    - roughly 100.000 requests/sec on HBase (10x more)
>
> @Stephan: as described below, the parallelism of the sink was kept the
> same. I agree with you that there is nothing to backpressure on the
> source ;) However, my understanding right now is that backpressure is
> the only explanation for this situation. Since we only change the
> source parallelism, other things like the HBase parallelism stay the
> same.
>
> @Sameer: all of those are valid points. We make sure that we reduce
> row locking by partitioning the data on the HBase sinks. We are just
> after why this limitation is happening, and since the same setup is
> used and only the source parallelism is changed, I don't expect this
> to be an HBase issue.
>
> Thanks guys!
>
>
> On Wednesday, August 3, 2016 11:38 AM, Sameer Wadkar
> <sam...@axiomine.com> wrote:
>
> What is the parallelism of the sink, or of the operator which writes
> to the sinks, in the first case? HBase puts are constrained by the
> following:
>
> 1. How your regions are distributed. Are you pre-splitting the regions
>    for the table? Do you know how many regions your HBase tables are
>    split into?
> 2. Are all the sinks writing to all the regions? That is, are you
>    getting records in the sink operator which could potentially go to
>    any region? This can become a big bottleneck if you have 40 sinks
>    talking to all regions. I pre-split my regions based on key salting
>    and use custom partitioning to ensure each sink operator writes to
>    only a few regions, and my performance went up by several orders of
>    magnitude.
> 3. You are probably doing this already, but adding puts in batches
>    helps in general; however, having each batch contain puts for too
>    many regions hurts.
>
> If the source parallelism is the same as the parallelism of the other
> operators, the 40 sinks communicating with all regions might be a
> problem. When you go down to 20 sinks you might actually be getting
> better performance due to less resource contention on HBase.
>
> Sent from my iPhone
>
>
> > On Aug 3, 2016, at 4:14 AM, neo21 zerro <neo21_ze...@yahoo.com> wrote:
> >
> > Hello everybody,
> >
> > I'm using the Flink Kafka consumer 0.8.x with Kafka 0.8.2 and Flink
> > 1.0.3 on YARN.
> > In Kafka I have a topic which has 20 partitions, and my Flink
> > topology reads from Kafka (source) and writes to HBase (sink).
> >
> > When:
> > 1. the Flink source has parallelism set to 40 (20 of the tasks are
> >    idle), I see 10.000 requests/sec on HBase
> > 2. the Flink source has parallelism set to 20 (the exact number of
> >    partitions), I see 100.000 requests/sec on HBase (so a 10x
> >    improvement)
> >
> > It's clear that HBase is not the limiting factor in my topology.
> > Assumption: Flink's backpressure mechanism kicks in more aggressively
> > in the first case and limits the ingestion of tuples into the
> > topology.
> >
> > The question: in the first case, why are those 20 sources which are
> > sitting idle contributing so much to the backpressure?
> >
> > Thanks guys!
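For anyone following along, a rough sketch of the topology as described above, using the Flink 1.0.x Java DataStream API. The helper classes (Event, MyAvroDeserializationSchema, SmallMap, RowKeySelector, BatchWindowFunction, HBaseBatchSink) and the Kafka properties are placeholders for illustration, not code from the thread:

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;

    public class KafkaToHBaseJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("zookeeper.connect", "zk:2181");       // assumption
            props.setProperty("bootstrap.servers", "broker:9092");   // assumption
            props.setProperty("group.id", "kafka-to-hbase");         // assumption

            // Source: Kafka 0.8 consumer doing the Avro deserialization,
            // followed by the "small map". Parallelism 20 matches the
            // number of partitions in the topic (the second experiment).
            DataStream<Event> events = env
                .addSource(new FlinkKafkaConsumer08<>(
                        "my-topic", new MyAvroDeserializationSchema(), props))
                .setParallelism(20)
                .map(new SmallMap())
                .setParallelism(20);

            // 3-second windows that collect records into batches, then the
            // HBase sink, both at parallelism 160.
            events
                .keyBy(new RowKeySelector())
                .timeWindow(Time.seconds(3))
                .apply(new BatchWindowFunction())
                .setParallelism(160)
                .addSink(new HBaseBatchSink())
                .setParallelism(160);

            env.execute("kafka-to-hbase");
        }
    }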
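On Sameer's point 2, a sketch of what key salting plus custom partitioning can look like, so that each sink subtask only talks to the regions that own its salt buckets. The bucket count, RowRecord and getRowKey() are made up for illustration; the HBase table would be pre-split on the salt prefixes:

    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.streaming.api.datastream.DataStream;

    public class Salting {

        // One salt bucket per sink subtask; pre-split the table on the
        // prefixes "000" .. "159" so buckets map onto regions.
        static final int NUM_BUCKETS = 160;

        static int bucketOf(String rowKey) {
            return (rowKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        }

        static String saltedRowKey(String rowKey) {
            return String.format("%03d-%s", bucketOf(rowKey), rowKey);
        }

        // Route each record to the sink subtask responsible for its bucket,
        // so a given subtask writes to only a few regions.
        static DataStream<RowRecord> routeByBucket(DataStream<RowRecord> in) {
            return in.partitionCustom(
                new Partitioner<Integer>() {
                    @Override
                    public int partition(Integer bucket, int numPartitions) {
                        return bucket % numPartitions;
                    }
                },
                new KeySelector<RowRecord, Integer>() {
                    @Override
                    public Integer getKey(RowRecord record) {
                        return bucketOf(record.getRowKey());
                    }
                });
        }
    }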
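And on point 3, a rough sketch of batching the Puts in the sink instead of writing them one at a time. This assumes the HBase 1.x client API; the table name, column family, flush size and RowRecord accessors are placeholders, and checkpoint/error handling is left out:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseBatchSink extends RichSinkFunction<RowRecord> {

        private static final int FLUSH_SIZE = 500;   // assumption: tune per workload

        private transient Connection connection;
        private transient Table table;
        private transient List<Put> buffer;

        @Override
        public void open(Configuration parameters) throws Exception {
            connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
            table = connection.getTable(TableName.valueOf("my_table"));   // placeholder
            buffer = new ArrayList<>();
        }

        @Override
        public void invoke(RowRecord record) throws Exception {
            Put put = new Put(Bytes.toBytes(record.getSaltedRowKey()));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), record.getValueBytes());
            buffer.add(put);
            if (buffer.size() >= FLUSH_SIZE) {
                flush();
            }
        }

        private void flush() throws Exception {
            table.put(buffer);   // one batched round trip instead of many single Puts
            buffer.clear();
        }

        @Override
        public void close() throws Exception {
            if (buffer != null && !buffer.isEmpty()) {
                flush();
            }
            if (table != null) { table.close(); }
            if (connection != null) { connection.close(); }
        }
    }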