Hi Mohammad, please share the logs in text format.
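
In the meantime, regarding point 5 of my earlier mail: below is a rough, untested sketch of how the batch/fetch knobs can be passed to the Flink 1.17 KafkaSource/KafkaSink builders. The broker addresses, topic names, and the concrete byte/millisecond values are placeholders for illustration only (on the consumer side the actual Kafka property names are fetch.min.bytes / max.partition.fetch.bytes). Please treat it as a starting point to tune against your 102 KB messages, not a recommendation.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;

// Consumer side: let each fetch carry more of the 102 KB records.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")                  // placeholder
        .setTopics("input-topic")                            // placeholder
        .setGroupId("perf-test")                             // placeholder
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .setProperty("fetch.min.bytes", "1048576")           // wait for ~1 MB per fetch (example value)
        .setProperty("max.partition.fetch.bytes", "5242880") // example value
        .build();

// Producer side: batch several records per request and compress them.
KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers("secured-broker:9093")          // placeholder
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("output-topic")                    // placeholder
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .setProperty("batch.size", "524288")                 // roughly 5 records of 102 KB per batch (example)
        .setProperty("linger.ms", "20")                      // small delay to let batches fill (example)
        .setProperty("compression.type", "snappy")
        .build();
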
On Mon, Dec 30, 2024 at 1:06 PM Mohammad Aamir Iqubal <aamir.y...@gmail.com> wrote:

> Hi Samrat,
>
> I think the issue is the amount of data collected while checkpointing.
> With 29 KB messages the application is tuned properly, but with 102 KB
> we are not able to tune it. PFA screenshot from logs.
>
> Regards,
> Md Aamir Iqubal
> +918409561989
>
> On Fri, 27 Dec 2024 at 9:01 AM, Samrat Deb <decordea...@gmail.com> wrote:
>
>> Hi Md Aamir,
>>
>> Apologies for the delayed response due to the festive season. Thank you
>> for providing the additional details.
>>
>> The information shared is helpful but still too sparse to pinpoint the
>> root cause of the backpressure issue. To proceed further, it would be
>> great if you could share the logs from the job (ensuring that no
>> sensitive information is included). Logs can often reveal valuable
>> insights into bottlenecks or errors occurring during execution.
>> Additionally, it would be helpful to know more about the type of state
>> maintained in the ProcessFunction. Is it keyed state, or are you using
>> other forms of state management? Also, is the ProcessFunction
>> parallelizable, or does it involve operations that could inherently
>> limit parallelism?
>>
>> Here are some general guidelines for debugging a ProcessFunction in
>> such scenarios:
>>
>> 1. Use the Flink Web UI to inspect the ProcessFunction operator's
>> subtask performance. Check for uneven load distribution across
>> subtasks, which can indicate issues like data skew.
>> 2. If your function uses state, verify that the state usage is
>> optimized. Large or inefficient state can increase checkpointing time
>> and memory pressure.
>> 3. Track the latency and throughput of the process operator. Sudden
>> spikes or sustained high latency often correlate with bottlenecks.
>> 4. Review the logic inside the ProcessFunction. Look for computationally
>> expensive operations, excessive object creation, or blocking calls that
>> could limit processing speed.
>> 5. Consider increasing fetch.size and batch.size in your Kafka consumer,
>> and tweak the producer's batch configurations like linger.ms or
>> compression.type. These can reduce the frequency of network I/O and
>> improve throughput.
>> 6. Re-evaluate the parallelism of the process operator. It may need to
>> be adjusted independently of the source and sink.
>>
>> If possible, also include metrics or screenshots from the Flink Web UI
>> showing operator backpressure or task-level statistics.
>>
>> Bests,
>> Samrat
>>
>> On Tue, Dec 24, 2024 at 11:13 AM Mohammad Aamir Iqubal <aamir.y...@gmail.com> wrote:
>>
>> > Hi Samrat,
>> >
>> > PFB details:
>> >
>> > 1) Flink version 1.17.
>> > 2) We are just modifying existing JSON fields, so that modification is
>> > the only state we manage.
>> > 3) Compression type snappy is used; the default fetch and batch sizes
>> > are used.
>> > 4) The process operator is the bottleneck, hence the source is getting
>> > backpressured.
>> > 5) Yes.
>> > 6) No.
>> > 7) Yes.
>> >
>> > Please let me know if you need any more details.
>> >
>> > Regards,
>> > Md Aamir Iqubal
>> > +918409561989
>> >
>> > On Thu, 19 Dec 2024 at 11:56 AM, Samrat Deb <decordea...@gmail.com> wrote:
>> >
>> > > Hi Md Aamir,
>> > >
>> > > Thank you for providing the details of your streaming application
>> > > setup.
>> > >
>> > > To assist you better in identifying and resolving the backpressure
>> > > issue, I have a few follow-up questions:
>> > >
>> > > 1. Which version of Apache Flink are you using?
>> > > 2. Is the ProcessFunction stateful? If yes, what kind of state is
>> > > managed?
>> > > 3. What are the configurations for the Kafka producer and consumer
>> > > (e.g., fetch.size, batch.size, linger.ms, acks, compression.type,
>> > > etc.)?
>> > > 4. Have you examined the backpressure monitoring graphs in the Flink
>> > > Web UI? Which operator (source, process, or sink) is experiencing
>> > > the bottleneck?
>> > > 5. Have you tried explicitly increasing the parallelism of the sink
>> > > operator to match or exceed the source parallelism?
>> > > 6. Although I'm not deeply familiar with Kerberos integration, are
>> > > there any errors or warnings in the logs related to Kerberos
>> > > authentication?
>> > > 7. Have you verified if Kerberos ticket renewal contributes to any
>> > > performance overhead for the sink operator?
>> > >
>> > > Bests,
>> > > Samrat
>> > >
>> > > On Wed, Dec 18, 2024 at 3:25 PM Mohammad Aamir Iqubal <aamir.y...@gmail.com> wrote:
>> > >
>> > > > Hi Team,
>> > > >
>> > > > I am running a streaming application in a performance environment.
>> > > > The source is Kafka and the sink is also Kafka, but the sink Kafka
>> > > > is secured by Kerberos. Message size is 102 KB; source parallelism
>> > > > is 16, process parallelism is 80, and sink parallelism is 12.
>> > > > I am using a ProcessFunction to remove a few fields from the JSON
>> > > > and mask some fields. When we send the data at around 100 TPS the
>> > > > application goes into backpressure. Please let me know how to fix
>> > > > the issue.
>> > > >
>> > > > Regards,
>> > > > Md Aamir Iqubal
>> > > > +918409561989
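
P.S. For reference, here is a minimal, untested sketch of the kind of field-removal/masking ProcessFunction described above, assuming the records are JSON strings and Jackson is on the classpath; the field names are made-up placeholders. Reusing one ObjectMapper per subtask and avoiding per-record object churn is the kind of cheap win point 4 of the earlier guidelines hints at.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class MaskAndDropFieldsFunction extends ProcessFunction<String, String> {

    // One mapper per parallel subtask, created in open() to avoid per-record allocation.
    private transient ObjectMapper mapper;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        ObjectNode json = (ObjectNode) mapper.readTree(value);
        json.remove("internalField");         // hypothetical field to drop
        if (json.has("accountNumber")) {      // hypothetical field to mask
            json.put("accountNumber", "****");
        }
        out.collect(mapper.writeValueAsString(json));
    }
}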