Thanks Khalid. Some follow-ups:
1. I'm still unsure what counts as a "false alarm".

2. When there is data loss on some partitions, will that lead to all partitions getting reset?

3. I had an occurrence where I set failOnDataLoss to false and the starting policy to earliest (which at the time was about 24 hours back), and the Kafka cluster was replaced (MirrorMaker mirrored the topic and offsets to a new cluster, the old cluster was shut down, and the new one got the DNS of the old one). I had no duplicates and only minimal data loss of maybe a single batch, although I did see the failOnDataLoss messages in the logs. I'm trying to understand how that happened...

Thanks!
Nimrod

On Thu, 10 Jul 2025, 20:50, Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> I use this option in development environments where jobs are not actively
> running and the Kafka topic has a retention policy on. Meaning, when a
> streaming job runs it may find that the last offset it read is not there
> anymore, and in that case it falls back to the starting position (i.e.
> earliest or latest) specified. I also always set the starting point for
> streams (even if a checkpoint already exists, in which case earliest/latest
> gets ignored), so when the job runs it either starts from the starting
> position specified (i.e. earliest or latest) or from the checkpoint
> location.
>
> Defaults are (from the docs:
> https://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html
> ):
>
> "latest" for streaming, "earliest" for batch
>
>
> On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, <ofek.nim...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm currently working with Spark Structured Streaming integrated with
>> Kafka and had some questions regarding the failOnDataLoss option.
>>
>> The current documentation states:
>>
>> "Whether to fail the query when it's possible that data is lost (e.g.,
>> topics are deleted, or offsets are out of range). This may be a false
>> alarm. You can disable it when it doesn't work as you expected."
>>
>> ChatGPT has some explanation, but I would like a more detailed and
>> definitive answer, and I think the documentation should include that
>> explanation as well.
>>
>> I'd appreciate some clarification on the following points:
>>
>> 1. What exactly does "this may be a false alarm" mean in this context?
>>    Under what circumstances would that occur? What should I expect when
>>    that happens?
>>
>> 2. What does it mean to "fail the query"? Does this imply that the
>>    process will skip the problematic offset and continue, or does it stop
>>    entirely? How will the next offset be determined? What will happen
>>    upon restart?
>>
>> 3. If the offset is out of range, how does Spark determine the next
>>    offset to use? Would it default to latest, earliest, or something
>>    else?
>>
>> Understanding the expected behavior here would really help us configure
>> this option appropriately for our use case.
>>
>> Thanks in advance for your help!
>>
>> Best regards,
>> Nimrod
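
For reference, here is a minimal Scala sketch of the kind of setup Khalid describes. As I understand it, startingOffsets is only honoured on the first run (when no checkpoint exists), and failOnDataLoss=false makes Spark log a warning and continue from the next available offset instead of stopping the query. The broker address, topic name, and checkpoint path below are placeholder assumptions:

    import org.apache.spark.sql.SparkSession

    object KafkaStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("failOnDataLoss-example")
          .getOrCreate()

        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker address
          .option("subscribe", "events")                     // placeholder topic name
          .option("startingOffsets", "earliest")             // used only when no checkpoint exists yet
          .option("failOnDataLoss", "false")                 // warn and continue instead of failing the query
          .load()

        val query = df
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/checkpoints/events") // placeholder checkpoint path
          .start()

        query.awaitTermination()
      }
    }

With a checkpoint already present at checkpointLocation, the stream resumes from the checkpointed offsets and the startingOffsets setting is ignored, which matches the behaviour Khalid describes above.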