I use this option in development environments where jobs are not running continuously and the Kafka topic has a retention policy enabled. That means when a streaming job starts, it may find that the last offset it read is no longer available, in which case it falls back to the specified starting position (i.e. earliest or latest). I also always set an explicit starting point for streams (even though it is ignored once a checkpoint exists), so the job either starts from the specified starting position (earliest or latest) or resumes from the checkpoint location.
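The setup described above can be sketched as a set of Kafka source options. This is only an illustration: the broker address, topic name, and the `kafka_opts` variable are placeholders, while the option keys (`startingOffsets`, `failOnDataLoss`, etc.) come from the Spark Kafka integration guide.

```python
# Kafka source options for a Structured Streaming read, matching the
# dev-environment setup described above. Broker and topic are placeholders.
kafka_opts = {
    "kafka.bootstrap.servers": "broker:9092",  # placeholder address
    "subscribe": "my-topic",                   # placeholder topic
    # Always set an explicit starting position. It is only honored on the
    # first run; once a checkpoint exists, the checkpointed offsets win.
    "startingOffsets": "earliest",
    # In dev, don't fail the query when previously-read offsets have been
    # aged out by the topic's retention policy; fall back to the starting
    # position instead of aborting.
    "failOnDataLoss": "false",
}

# These options would be applied to a streaming read roughly like:
#   df = spark.readStream.format("kafka").options(**kafka_opts).load()
```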
Defaults are (from the docs: https://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html ): "latest" for streaming, "earliest" for batch.

On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, <ofek.nim...@gmail.com> wrote:

> Hi everyone,
>
> I'm currently working with Spark Structured Streaming integrated with
> Kafka and had some questions regarding the failOnDataLoss option.
>
> The current documentation states:
>
> *"Whether to fail the query when it's possible that data is lost (e.g.,
> topics are deleted, or offsets are out of range). This may be a false
> alarm. You can disable it when it doesn't work as you expected."*
>
> ChatGPT has some explanation - but I would like to get a more detailed
> and certain answer, and I think that the documentation should have that
> explanation as well.
>
> I'd appreciate some clarification on the following points:
>
> 1. What exactly does "this may be a false alarm" mean in this context?
>    Under what circumstances would that occur? What should I expect when
>    that happens?
>
> 2. What does it mean to "fail the query"? Does this imply that the
>    process will skip the problematic offset and continue, or does it
>    stop entirely? How will the next offset get determined? What will
>    happen upon restart?
>
> 3. If the offset is out of range, how does Spark determine the next
>    offset to use? Would it default to latest, earliest, or something
>    else?
>
> Understanding the expected behavior here would really help us configure
> this option appropriately for our use case.
>
> Thanks in advance for your help!
>
> Best regards,
> Nimrod