Re: Kafka Connector: producer throttling

2025-04-17 Thread daniel williams
@Abhishek Singla kafka.* options are for streaming applications, not batch. This is clearly documented on the structured streaming pages covering the Kafka integration. If you are transmitting data in a batch application, it's best to do so via a foreachPartition operation, as discussed, leveraging broadcast …
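The pattern Daniel describes (and that Rommel confirms later in the thread) can be sketched as follows. This is an illustrative sketch, not code from the thread: it assumes the kafka-python client and PySpark, and the names `send_partition`, `producer_conf`, and the topic are hypothetical. The per-partition function takes a producer factory so its logic can be exercised without a broker:

```python
def send_partition(rows, producer_factory, topic):
    """Create one producer per partition, send every row, then flush.

    `producer_factory` is injected so this logic can be unit-tested
    with a stub producer instead of a real Kafka broker."""
    producer = producer_factory()
    for row in rows:
        producer.send(topic, value=row)
    producer.flush()   # block until buffered records are delivered
    producer.close()


# In the Spark job itself (not runnable without a cluster and broker):
#
# from kafka import KafkaProducer          # kafka-python client
# producer_conf = {
#     "bootstrap_servers": "broker:9092",  # hypothetical address
#     "linger_ms": 500,                    # honored here, on the producer
#     "batch_size": 65536,
# }
# bconf = spark.sparkContext.broadcast(producer_conf)
# df.rdd.foreachPartition(
#     lambda rows: send_partition(
#         rows, lambda: KafkaProducer(**bconf.value), "events"))
```

Note that the *config dict* is broadcast, not the producer itself: producer objects hold sockets and are not serializable, so each partition builds its own from the shared config.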

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread Abhishek Singla
@Ángel Álvarez Pascua Thanks, however I am thinking of another solution which does not involve saving the dataframe result. Will update this thread with details soon. @daniel williams Thanks, I will surely check out spark-testing-base. Regards, Abhishek Singla On Thu, Apr 17, 2025 at 11:59 …

Re: Kafka Connector: producer throttling

2025-04-17 Thread Rommel Yuan
I noticed this problem about a year ago; I just didn't pursue it further due to other issues. The solution is to broadcast the Kafka config and create a producer in each partition. But if the config were honored natively, that would be great. On Thu, Apr 17, 2025, 15:27 Abhishek Singla wrote: > @daniel …

Re: Kafka Connector: producer throttling

2025-04-17 Thread Abhishek Singla
@daniel williams It's batch and not streaming. I still don't understand why "kafka.linger.ms" and "kafka.batch.size" are not being honored by the Kafka connector. @rommelhol...@gmail.com How did you fix it? @Jungtaek Lim Could you help out here in case I am missing something? Regards, Abhishek Singla
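For reference, the options under discussion would presumably be set on the batch DataFrame writer roughly like this (an illustrative config fragment only; the broker address and topic name are hypothetical, and `df` is an existing DataFrame with the Kafka sink's expected `value` column):

```python
# Illustrative: kafka.-prefixed options passed to Spark's Kafka sink.
df.write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("kafka.linger.ms", "500") \
    .option("kafka.batch.size", "65536") \
    .option("topic", "events") \
    .save()
```

The workaround discussed in the thread sidesteps this writer entirely and instead sets `linger.ms` / `batch.size` directly on a producer created inside each partition.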

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
I have not. Most of my work and development on Spark has been on the Scala side of the house, and I've built a suite of tools for Kafka integration with Spark for stream analytics, along with spark-testing-base. On Thu, Apr 17, 2025 at 12:13 PM Ángel Álvarez Pascua …

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread Ángel Álvarez Pascua
Have you used the new equality functions introduced in Spark 3.5? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html On Thu, Apr 17, 2025, 13:18, daniel williams wrote: > Good call out. Yeah, once you take your work out of Spark it's all on …

Re: Checkpointing in foreachPartition in Spark batch

2025-04-17 Thread daniel williams
Good call out. Yeah, once you take your work out of Spark it's all on you. Any partition-level operation (e.g. map, flatMap, foreach) ends up as a lambda in Catalyst. I've found, however, that not using explode and doing things procedurally at that point, with a sufficient amount of unit testing, helps.