I have a Spark DataFrame which I am writing to a Kafka topic using the
script below.

import org.apache.kafka.clients.producer.ProducerConfig

// Producer options are prefixed with "kafka." so that Spark's Kafka
// sink passes them through to the underlying Kafka producer.
// (The val must not be named ProducerConfig, or it shadows the Kafka
// class whose constants are referenced inside the map.)
val producerConfig: Map[String, String] = Map(
  s"kafka.${ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG}" -> Integer.MAX_VALUE.toString,
  s"kafka.${ProducerConfig.RETRIES_CONFIG}" -> Integer.MAX_VALUE.toString,
  s"kafka.${ProducerConfig.MAX_BLOCK_MS_CONFIG}" -> Long.MaxValue.toString,
  s"kafka.${ProducerConfig.ACKS_CONFIG}" -> "all",
  s"kafka.${ProducerConfig.MAX_REQUEST_SIZE_CONFIG}" -> (100 * 1024 * 1024).toString // 100 MB
)
df.write.format("kafka")
  .option("kafka.bootstrap.servers", broker)
  .options(producerConfig)
  .option("topic", topic)
  .save()

My DataFrame is static and has 500k records. The script above dumps all
of them to the topic at once. I want a rate limiter so that at most 20k
messages are written to the topic every 10 seconds, i.e. each batch
write should wait at least 10 seconds after the previous one.

One approach is to divide the DataFrame into chunks and write them in
batches with a 10-second sleep between writes, as sketched below. I am
looking for an alternative that is config-driven, so I do not have to
slice my data. Is there a way to ingest the data in batches with some
waiting time?
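
For reference, a minimal sketch of the chunking workaround I would like
to avoid (it reuses df, broker, topic, and producerConfig from above;
the 20k/10s figures come from my requirement, and randomSplit yields
approximately, not exactly, batchSize-sized pieces):

val batchSize = 20000
val numBatches = math.ceil(df.count().toDouble / batchSize).toInt

// Split the DataFrame into roughly equal-sized chunks.
val chunks = df.randomSplit(Array.fill(numBatches)(1.0))

chunks.foreach { chunk =>
  chunk.write.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .options(producerConfig)
    .option("topic", topic)
    .save()
  Thread.sleep(10000) // wait 10 seconds before the next batch
}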
