Spark 2.2 structured streaming with mapGroupsWithState + window functions

2017-08-28 Thread daniel williams
Hi all, I've been looking heavily into Spark 2.2 to solve a problem I have by specifically using mapGroupsWithState. What I've discovered is that a *groupBy(window(..))* does not work when being used with a subsequent *mapGroupsWithState* and produces an AnalysisException of : *"mapGroupsWithSta

flatMapGroupsWithState not timing out (spark 2.2.1)

2018-01-12 Thread daniel williams
Hi, I’m attempting to leverage flatMapGroupsWithState to handle some arbitrary aggregations and am noticing a couple of things: - *ProcessingTimeTimeout* + *setTimeoutDuration* timeout not being honored - *EventTimeTimeout* + watermark value not being honored. - *EventTimeTimeout* + *

Re: Kafka Connector: producer throttling

2025-03-26 Thread daniel williams
If you're using structured streaming you can pass in options as kafka. into options as documented. If you're using Spark in batch form you'll want to do a foreach on a KafkaProducer via a Broadcast. All KafkaProducer specific options

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
with kafka., now I can see they >> are being set in Producer Config on Spark Startup but still they are not >> being honored. I have set "linger.ms": "1000" and "batch.size": "10". I >> am publishing 10 records and they are flushed to k

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
given the producers built in backoff. This would allow for you to retry n times and then upon final (hopefully not) failure update your dataset for further processing. -dan On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla wrote: > @daniel williams > > > Operate your producer in transa

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
"bootstrap.servers": "localhost:9092", >"acks": "all", >"linger.ms": "1000", >"batch.size": "10", >"key.serializer": > "org.apache.kafka.common.serialization.ByteArraySerializer", >&qu

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
13 PM Ángel Álvarez Pascua < angel.alvarez.pas...@gmail.com> wrote: > Have you used the new equality functions introduced in Spark 3.5? > > > https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.testing.assertDataFrameEqual.html > > El jue, 17 abr 2025,

Re: Kafka Connector: producer throttling

2025-04-17 Thread daniel williams
aging broadcast variables to control your KafkaProducers. On Thu, Apr 17, 2025 at 4:27 PM Abhishek Singla wrote: > @daniel williams > It's batch and not streaming. I still don't understand why " > kafka.linger.ms" and "kafka.batch.size" are not being hono

Re: Checkpointing in foreachPartition in Spark batck

2025-04-16 Thread daniel williams
Yes. Operate your producer in transactional mode. Checkpointing is an abstract concept only applicable to streaming. -dan On Wed, Apr 16, 2025 at 7:02 AM Abhishek Singla wrote: > Hi Team, > > We are using foreachPartition to send dataset row data to third system via > HTTP client. The operatio

Re: Kafka Connector: producer throttling

2025-04-16 Thread daniel williams
If you are building a broadcast to construct a producer with a set of options then the producer is merely going to operate how it’s going to be configured - it has nothing to do with spark save that the foreachPartition is constructing it via the broadcast. A strategy I’ve used in the past is to *

Re: Checkpointing in foreachPartition in Spark batck

2025-04-17 Thread daniel williams
en upon final (hopefully not) failure update your >> dataset for further processing. >> >> -dan >> >> >> On Wed, Apr 16, 2025 at 5:04 PM Abhishek Singla < >> abhisheksingla...@gmail.com> wrote: >> >>> @daniel williams >>>

Re: Kafka Connector: producer throttling

2025-02-24 Thread daniel williams
I think you should be using a foreachPartition and a broadcast to build your producer. From there you will have full control of all options and serialization needed via direct access to the KafkaProducer, as well as all options therein associated (e.g. callbacks, interceptors, etc). -dan On Mon,