Re: Flink job performance

Zhanghao Chen Mon, 15 Apr 2024 05:11:27 -0700

The exception basically says that the remote TM is unreachable, probably 
because it was terminated for some other reason. This may not be the root 
cause. Are there any other exceptions in the log? Also, since overall resource 
usage is almost at capacity, could you try allocating more CPUs and see 
whether the instability persists?

Best,
Zhanghao Chen
________________________________
From: Oscar Perez <[email protected]>
Sent: Monday, April 15, 2024 19:24
To: Zhanghao Chen <[email protected]>
Cc: Oscar Perez via user <[email protected]>
Subject: Re: Flink job performance

Hi, OK, that is weird. Let me resend them.

Regards,
Oscar

On Mon, 15 Apr 2024 at 14:00, Zhanghao Chen <[email protected]> wrote:
Hi, there seems to be something wrong with the two images attached to the 
latest email. I cannot open them.

Best,
Zhanghao Chen
________________________________
From: Oscar Perez via user <[email protected]>
Sent: Monday, April 15, 2024 15:57
To: Oscar Perez via user <[email protected]>; pi-team <[email protected]>; 
Hermes Team <[email protected]>
Subject: Flink job performance

Hi community!

We have an interesting problem with Flink after increasing parallelism in a 
certain way. Here is the summary:

1)  We identified that the bottleneck of our job was a set of co-keyed process 
operators, which were causing backpressure on the upstream operators.
2)  We increased the parallelism of all operators from 6 to 12, except for the 
operators that read from Kafka, which we kept at 6. The main reason is that 
all our topics have 6 partitions, so increasing the source parallelism would 
not yield better performance.

See the attached job layout before and after the changes:
What happened is that some operations that used to be chained together, such 
as read - filter - map - filter, are now connected by a rebalance, and the 
overall performance of the job is suffering (it keeps throwing exceptions now 
and then).
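
To illustrate the effect described above, here is a minimal, self-contained 
sketch in plain Java (no Flink dependency). It assumes a simplified chaining 
rule (equal parallelism is one of the conditions Flink requires for chaining) 
and a hypothetical placement in which subtask i runs in slot i and slots are 
packed onto TaskManagers in order. The class name, method names, and the 
value of 4 slots per TaskManager are made up for illustration and are not 
taken from our actual setup.

```java
public class RebalanceSketch {

    // Simplified rule: two operators can only be chained if, among other
    // conditions, they have the same parallelism.
    static boolean canChain(int upstreamParallelism, int downstreamParallelism) {
        return upstreamParallelism == downstreamParallelism;
    }

    // Hypothetical placement (assumption): subtask i runs in slot i, and
    // slots are assigned to TaskManagers in order.
    static int taskManagerOf(int subtaskIndex, int slotsPerTaskManager) {
        return subtaskIndex / slotsPerTaskManager;
    }

    // A rebalance edge connects every upstream subtask to every downstream
    // subtask. Count the channels whose two ends share a TaskManager.
    static int countLocalChannels(int upstreamParallelism,
                                  int downstreamParallelism,
                                  int slotsPerTaskManager) {
        int local = 0;
        for (int u = 0; u < upstreamParallelism; u++) {
            for (int d = 0; d < downstreamParallelism; d++) {
                if (taskManagerOf(u, slotsPerTaskManager)
                        == taskManagerOf(d, slotsPerTaskManager)) {
                    local++;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        int sourceParallelism = 6;      // Kafka sources kept at 6
        int downstreamParallelism = 12; // remaining operators raised to 12
        int slotsPerTaskManager = 4;    // assumed value, for illustration only

        int total = sourceParallelism * downstreamParallelism;
        int local = countLocalChannels(
                sourceParallelism, downstreamParallelism, slotsPerTaskManager);

        System.out.println("chainable: "
                + canChain(sourceParallelism, downstreamParallelism));
        System.out.println("local channels: " + local + " of " + total);
    }
}
```

Under these assumptions, the source and the first filter can no longer be 
chained, and only part of the rebalance channels stay within one TaskManager; 
the rest cross the network.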

Does the rebalance operation go over the network, or does it happen within 
the same node? How can we effectively improve the performance of this job 
with the given resources?

Thanks for the input!
Regards,
Oscar
