Hi everyone,

I'm currently working with Spark Structured Streaming integrated with Kafka
and have some questions regarding the failOnDataLoss option.

The current documentation states:

*"Whether to fail the query when it's possible that data is lost (e.g.,
topics are deleted, or offsets are out of range). This may be a false
alarm. You can disable it when it doesn't work as you expected."*ChatGPT
has some explanation - but I would like to get a more detailed and certain
answer, and I think that the documentation should have that explanation as
well.

I’d appreciate some clarification on the following points:

   1. What exactly does “this may be a false alarm” mean in this context?
   Under what circumstances would that occur? What should I expect when that
   happens?

   2. What does it mean to “fail the query”? Does this imply that the process
   will skip the problematic offset and continue, or does it stop entirely?
   How will the next offset get determined? What will happen upon restart?

   3. If the offset is out of range, how does Spark determine the next offset
   to use? Would it default to latest, earliest, or something else?

Understanding the expected behavior here would really help us configure
this option appropriately for our use case.
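
For context, here is roughly how we wire up the option today (a minimal
sketch in Scala; the topic name, bootstrap servers, and checkpoint path
are placeholders rather than our actual values):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-stream-example")   // hypothetical app name
  .getOrCreate()

// Kafka source; failOnDataLoss is the option in question.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
  .option("subscribe", "events")                        // placeholder topic
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "true")
  .load()

val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/events")   // placeholder path
  .start()

query.awaitTermination()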

Thanks in advance for your help!

Best regards,
Nimrod
