Hi Md Aamir,

Apologies for the delayed response due to the festive season. Thank you for
providing the additional details.

The information shared is helpful, but it is still too sparse to pinpoint
the root cause of the backpressure issue. To proceed further, it would be
great if you could share the logs from the job (ensuring that no sensitive
information is included). Logs can often reveal valuable insights into
bottlenecks or errors occurring during execution.

Additionally, it would be helpful to know more about the type of state
maintained in the ProcessFunction. Is it keyed state, or are you using
other forms of state management? Also, is the ProcessFunction
parallelizable, or does it involve operations that could inherently limit
parallelism?

Here are some general guidelines for debugging a ProcessFunction in such
scenarios:

1. Use the Flink Web UI to inspect the ProcessFunction operator's subtask
   performance. Check for uneven load distribution across subtasks, which
   can indicate issues like data skew.

2. If your function uses state, verify that the state usage is optimized.
   Large or inefficient state can increase checkpointing time and memory
   pressure.

3. Track the latency and throughput of the process operator. Sudden spikes
   or sustained high latency often correlate with bottlenecks.

4. Review the logic inside the ProcessFunction. Look for computationally
   expensive operations, excessive object creation, or blocking calls that
   could limit processing speed (a sketch of how I would structure such a
   function follows this list).

5. Consider increasing consumer fetch settings such as fetch.min.bytes and
   max.partition.fetch.bytes, and tweak the producer's batching
   configuration such as batch.size, linger.ms, or compression.type. These
   can reduce the frequency of network I/O and improve throughput (a config
   sketch also follows this list).

6. Re-evaluate the parallelism of the process operator. It may need to be
   adjusted independently of the source and sink.
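
For points 2-4, here is a minimal sketch of how I would structure such a
JSON-masking ProcessFunction so that expensive objects are created once per
subtask rather than once per record, and so that throughput is exposed as a
custom meter. This is only an illustration, not your actual code: the use
of Jackson, the field names, and the metric name are assumptions.

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class MaskingFunction extends ProcessFunction<String, String> {

    // Created once per subtask in open(), never per record.
    private transient ObjectMapper mapper;
    private transient Meter maskedPerSecond;

    @Override
    public void open(Configuration parameters) {
        mapper = new ObjectMapper();
        // Custom throughput meter, visible in the Web UI and metric reporters.
        maskedPerSecond = getRuntimeContext()
                .getMetricGroup()
                .meter("maskedRecordsPerSecond", new MeterView(60));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out)
            throws Exception {
        ObjectNode json = (ObjectNode) mapper.readTree(value);
        json.remove("internalField");   // hypothetical field to drop
        json.put("cardNumber", "****"); // hypothetical field to mask
        // Keep this hot path free of blocking I/O or external lookups.
        out.collect(mapper.writeValueAsString(json));
        maskedPerSecond.markEvent();
    }
}

If your function is stateless like this, large state and slow checkpoints
are unlikely culprits. Note that 102 KB messages at roughly 100 TPS is only
about 10 MB/s of input, so if the process operator still backpressures at
parallelism 80, the per-record parse/serialize cost (or a hidden blocking
call) is the first thing I would profile.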
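
For points 5 and 6, a sketch of where those knobs live when wiring up the
job with the Flink Kafka connectors, using the parallelism values you
mentioned (16 / 80 / 12). The broker addresses, topic names, and the tuning
values are placeholders, not recommendations:

import java.util.Properties;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MaskingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("source-broker:9092")   // placeholder
                .setTopics("input-topic")                    // placeholder
                .setGroupId("masking-job")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Consumer-side fetch tuning (illustrative values).
                .setProperty("fetch.min.bytes", "1048576")
                .setProperty("max.partition.fetch.bytes", "5242880")
                .build();

        Properties producerProps = new Properties();
        // Producer-side batching (illustrative values).
        producerProps.setProperty("batch.size", "262144");
        producerProps.setProperty("linger.ms", "50");
        producerProps.setProperty("compression.type", "snappy");
        // Kerberos/SASL settings for the secured sink cluster would also go
        // into producerProps.

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("sink-broker:9092")     // placeholder
                .setKafkaProducerConfig(producerProps)
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")            // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .setParallelism(16)   // source parallelism
                .process(new MaskingFunction())
                .setParallelism(80)   // process parallelism, tune independently
                .sinkTo(sink)
                .setParallelism(12);  // sink parallelism

        env.execute("json-masking-job");
    }
}

Setting the process operator's parallelism independently like this avoids
raising the source parallelism beyond the topic's partition count, where
the extra source subtasks would simply sit idle.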

If possible, also include metrics or screenshots from the Flink Web UI
showing operator backpressure or task-level statistics.

Bests,
Samrat
On Tue, Dec 24, 2024 at 11:13 AM Mohammad Aamir Iqubal <aamir.y...@gmail.com>
wrote:

> Hi Samrat,
>
> PFB details:
>
> 1) Flink version 1.17
> 2) We are just modifying existing JSON fields, so that is the only state
>    we are managing, if we talk about state
> 3) Compression type snappy is used; default fetch and batch sizes are used
> 4) The process operator is the bottleneck, hence the source is getting
>    backpressured
> 5) Yes
> 6) No
> 7) Yes
>
> Please let me know if you need any more details.
>
> Regards,
> Md Aamir Iqubal
> +918409561989
>
> On Thu, 19 Dec 2024 at 11:56 AM, Samrat Deb <decordea...@gmail.com> wrote:
>
> > Hi Md Aamir,
> >
> >
> > Thank you for providing the details of your streaming application setup.
> >
> > To assist you better in identifying and resolving the backpressure
> > issue, I have a few follow-up questions:
> >
> >
> >
> >    1. Which version of Apache Flink are you using?
> >    2. Is the ProcessFunction stateful? If yes, what kind of state is
> >    managed?
> >    3. What are the configurations for the Kafka producer and consumer
> >    (e.g., fetch.size, batch.size, linger.ms, acks, compression.type,
> >    etc.)?
> >    4. Have you examined the backpressure monitoring graphs in the Flink
> >    Web UI? Which operator (source, process, or sink) is experiencing the
> >    bottleneck?
> >    5. Have you tried explicitly increasing the parallelism of the sink
> >    operator to match or exceed the source parallelism?
> >    6. Although I’m not deeply familiar with Kerberos integration, are
> >    there any errors or warnings in the logs related to Kerberos
> >    authentication?
> >    7. Have you verified if Kerberos ticket renewal contributes to any
> >    performance overhead for the sink operator?
> >
> >
> > Bests,
> > Samrat
> >
> > On Wed, Dec 18, 2024 at 3:25 PM Mohammad Aamir Iqubal <
> > aamir.y...@gmail.com>
> > wrote:
> >
> > > Hi Team,
> > >
> > > I am running a streaming application in a performance environment.
> > > The source is Kafka and the sink is also Kafka, but the sink Kafka
> > > cluster is secured by Kerberos. Message size is 102 KB, source
> > > parallelism is 16, process parallelism is 80, and sink parallelism is
> > > 12. I am using a process function to remove a few fields from the
> > > JSON and mask some fields. When we send data at around 100 TPS the
> > > application goes into backpressure. Please let me know how to fix
> > > the issue.
> > >
> > >
> > > Regards,
> > > Md Aamir Iqubal
> > > +918409561989
> > >
> >
>
