Hi Dan,
The cleanest fix is to split the Kafka topic into more partitions. If you cannot repartition the topic, here is a workaround: add a rebalance shuffle between the source and the downstream operators, and set a higher parallelism for the downstream operators. However, this introduces extra CPU cost for serialization/deserialization and extra network cost for shuffling the data, so I'm not sure the benefit would offset the additional cost.
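In the DataStream API this is `stream.rebalance()` before the downstream operator, combined with `setParallelism(n)` on that operator. As a rough illustration (plain Python, not Flink code — the record values and parallelism are made up), `rebalance()` hands records from each source subtask round-robin to all downstream subtasks, so one hot Kafka partition gets processed by n parallel workers:

```python
# Sketch of the round-robin redistribution that rebalance() performs.
downstream_parallelism = 3          # assumed downstream operator parallelism
records = list(range(10))           # records from the single hot partition

# Each downstream subtask receives every n-th record.
subtasks = [[] for _ in range(downstream_parallelism)]
for i, record in enumerate(records):
    subtasks[i % downstream_parallelism].append(record)

for idx, work in enumerate(subtasks):
    print(f"subtask {idx}: {work}")
# subtask 0: [0, 3, 6, 9]
# subtask 1: [1, 4, 7]
# subtask 2: [2, 5, 8]
```

Note that this is why the serialization and network costs appear: every record now crosses an operator boundary instead of staying chained in the source task.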
Best,
JING ZHANG

Dan Hill <quietgol...@gmail.com> wrote on Thursday, June 17, 2021 at 1:49 PM:

> Thanks, JING ZHANG!
>
> I have one subtask for one Kafka source that is getting backpressure. Is
> there an easy way to split a single Kafka partition into multiple
> subtasks? Or do I need to split the Kafka partition?
>
> On Wed, Jun 16, 2021 at 10:29 PM JING ZHANG <beyond1...@gmail.com> wrote:
>
>> Hi Dan,
>> Would you please describe the problem with your job? High latency
>> or low throughput?
>> Please first check the job's throughput and latency.
>> If the throughput matches the speed at which the sources produce data and
>> the latency metric is good, the job may be working well without bottlenecks.
>> If you find abnormal throughput or latency, please check the following:
>> 1. Check the backpressure.
>> 2. Check whether the checkpoint duration is long and whether the
>> checkpoint size is as expected.
>>
>> Please share the details in this email for deeper analysis if you find
>> anything abnormal about the job.
>>
>> Best,
>> JING ZHANG
>>
>> Dan Hill <quietgol...@gmail.com> wrote on Thursday, June 17, 2021 at 12:44 PM:
>>
>>> We have a job that has been running, but none of the AWS resource metrics
>>> for EKS, EC2, MSK, and EBS show any bottlenecks. I have multiple 8-core
>>> nodes allocated, but only ~2 cores are used. Most of the memory is not
>>> consumed. MSK does not show much use. EBS metrics look mostly idle. I
>>> assumed I'd be able to see whichever resource is the bottleneck.
>>>
>>> Is there a good way to diagnose where the bottleneck is for a Flink job?
>>>
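For the backpressure check mentioned in the steps above, besides the web UI you can sample it through the JobManager's REST API. A sketch assuming the default REST port 8081 on localhost; `<job-id>` and `<vertex-id>` are placeholders you would fill in from the earlier responses:

```shell
# List running jobs to find the job ID.
curl http://localhost:8081/jobs

# Per-vertex detail for the job (vertex IDs, parallelism, metrics).
curl http://localhost:8081/jobs/<job-id>

# Sample backpressure for one operator (vertex) of the job.
curl http://localhost:8081/jobs/<job-id>/vertices/<vertex-id>/backpressure
```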