As Spark uses micro-batch for streaming, it's unavoidable to adjust the batch size properly to achieve your expectation of throughput vs latency. Especially, Spark uses global watermark which doesn't propagate (change) during micro-batch, you'd want to make the batch relatively small to make watermark move forward faster.
On Wed, Jul 1, 2020 at 2:54 AM Eric Beabes <[email protected]> wrote: > While running my Spark (Stateful) Structured Streaming job I am setting > 'maxOffsetsPerTrigger' value to 10 Million. I've noticed that messages are > processed faster if I use a large value for this property. > > What I am also noticing is that until the batch is completely processed, > no messages are getting written to the output Kafka topic. The 'State > timeout' is set to 10 minutes so I am expecting to see at least some of the > messages after 10 minutes or so BUT messages are not getting written until > processing of the next batch is started. > > Is there any property I can use to kinda 'flush' the messages that are > ready to be written? Please let me know. Thanks. > >
