1. I think "false alarm" in this context means you are OK to lose data, as in Dev and Test envs.
2. Not sure.
3. Sorry, not sure again, but my guess would be that during your failover the checkpoint got out of sync.
Sorry, that is all I used this feature for. If you think you can smoothly fail over to the other cluster and back without data loss, then I think you don't even need this option.

On Sun, 13 Jul 2025, 14:58 Nimrod Ofek, <ofek.nim...@gmail.com> wrote:

> Thanks Khalid,
>
> Some follow-ups:
>
> 1. I'm still unsure what the "false alarms" would be.
> 2. When there is data loss on some partitions, will that lead to all partitions getting reset?
> 3. I had an occurrence where I set failOnDataLoss to false and the starting policy to earliest (which was about 24 hours back), and the Kafka cluster was replaced (MirrorMaker mirrored the topic and offsets to a new cluster, the old cluster was shut down, and the new one got the DNS of the old one). I had no duplicates and only minimal data loss of maybe a single batch, although I did see the failOnDataLoss messages in the logs. I'm trying to understand how that happened...
>
> Thanks!
> Nimrod
>
> On Thu, 10 Jul 2025, 20:50 Khalid Mammadov, <khalidmammad...@gmail.com> wrote:
>
>> I use this option in development environments where jobs are not actively running and the Kafka topic has a retention policy on. Meaning, when a streaming job runs it may find that the last offset it read is no longer there, and in that case it falls back to the starting position (i.e. earliest or latest) specified. I also always set a starting point for streams (even if a checkpoint already exists, in which case earliest/latest gets ignored), so when the job runs it starts either from the starting position specified (i.e. earliest or latest) or from the checkpoint location.
>>
>> Defaults are (from the docs: https://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html):
>>
>> "latest" for streaming, "earliest" for batch
>>
>> On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, <ofek.nim...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm currently working with Spark Structured Streaming integrated with Kafka and had some questions regarding the failOnDataLoss option.
>>>
>>> The current documentation states:
>>>
>>> "Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected."
>>>
>>> ChatGPT has some explanation, but I would like to get a more detailed and certain answer, and I think the documentation should have that explanation as well.
>>>
>>> I'd appreciate some clarification on the following points:
>>>
>>> 1. What exactly does "this may be a false alarm" mean in this context? Under what circumstances would that occur? What should I expect when that happens?
>>> 2. What does it mean to "fail the query"? Does this imply that the process will skip the problematic offset and continue, or does it stop entirely? How will the next offset be determined? What will happen upon restart?
>>> 3. If the offset is out of range, how does Spark determine the next offset to use? Would it default to latest, earliest, or something else?
>>>
>>> Understanding the expected behavior here would really help us configure this option appropriately for our use case.
>>>
>>> Thanks in advance for your help!
>>>
>>> Best regards,
>>> Nimrod
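
[Editor's note: for illustration, below is a minimal PySpark sketch of the configuration discussed in the thread, i.e. an explicit startingOffsets plus failOnDataLoss=false. It is not from the original messages; the broker address, topic name, and checkpoint path are placeholders and the console sink is only for demonstration.]

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("failOnDataLoss-example").getOrCreate()

# Streaming read from Kafka. startingOffsets is only used on the first run;
# once a checkpoint exists, the stream resumes from the checkpointed offsets.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder
    .option("subscribe", "my-topic")                      # placeholder
    .option("startingOffsets", "earliest")
    # With failOnDataLoss=false the query keeps running when expected offsets
    # are no longer available (e.g. expired by retention); Spark logs the
    # data-loss warnings instead of aborting the query.
    .option("failOnDataLoss", "false")
    .load()
)

query = (
    df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/my-topic")  # placeholder
    .start()
)

query.awaitTermination()
```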