Thanks Khalid. Some follow-ups:
1. I'm still unsure what counts as a "false alarm".

2. When there is data loss on some partitions, will that lead to all partitions getting reset?

3. I had an occurrence where I set failOnDataLoss to false and the starting policy to earliest (which at the time was about 24 hours back), and the Kafka cluster was replaced (MirrorMaker mirrored the topic and offsets to a new cluster, the old cluster was shut down, and the new one got the DNS of the old one). I had no duplicates and only minimal data loss of maybe a single batch, although I did see the failOnDataLoss messages in the logs. I'm trying to understand how that happened...

Thanks!
Nimrod

On Thu, 10 Jul 2025, 20:50, Khalid Mammadov <khalidmammad...@gmail.com> wrote:

> I use this option in development environments where jobs are not actively
> running and the Kafka topic has a retention policy on. Meaning, when a
> streaming job runs it may find that the last offset it read is not there
> anymore, and in that case it falls back to the starting position (i.e.
> earliest or latest) specified. I also always set the starting point for
> streams (even if a checkpoint already exists, in which case earliest/latest
> gets ignored), so when the job runs it either starts from the starting
> position specified (i.e. earliest or latest) or from the checkpoint
> location.
>
> Defaults are (from the docs:
> https://spark.apache.org/docs/latest/streaming/structured-streaming-kafka-integration.html
> ):
>
> "latest" for streaming, "earliest" for batch
>
>
> On Thu, 10 Jul 2025, 11:04 Nimrod Ofek, <ofek.nim...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I'm currently working with Spark Structured Streaming integrated with
>> Kafka and had some questions regarding the failOnDataLoss option.
>>
>> The current documentation states:
>>
>> "Whether to fail the query when it's possible that data is lost (e.g.,
>> topics are deleted, or offsets are out of range). This may be a false
>> alarm. You can disable it when it doesn't work as you expected."
>>
>> ChatGPT has some explanation, but I would like a more detailed and
>> definitive answer, and I think the documentation should include that
>> explanation as well.
>>
>> I'd appreciate some clarification on the following points:
>>
>> 1. What exactly does "this may be a false alarm" mean in this context?
>>    Under what circumstances would that occur? What should I expect when
>>    that happens?
>>
>> 2. What does it mean to "fail the query"? Does this imply that the
>>    process will skip the problematic offset and continue, or does it stop
>>    entirely? How will the next offset be determined? What will happen
>>    upon restart?
>>
>> 3. If the offset is out of range, how does Spark determine the next
>>    offset to use? Would it default to latest, earliest, or something
>>    else?
>>
>> Understanding the expected behavior here would really help us configure
>> this option appropriately for our use case.
>>
>> Thanks in advance for your help!
>>
>> Best regards,
>> Nimrod
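
For reference, here is a minimal Scala sketch of the kind of setup Khalid describes. As I understand it, startingOffsets is only honoured on the first run (when no checkpoint exists), and failOnDataLoss=false makes Spark log a warning and continue from the next available offset instead of stopping the query. The broker address, topic name, and checkpoint path below are placeholder assumptions:

    import org.apache.spark.sql.SparkSession

    object KafkaStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("failOnDataLoss-example")
          .getOrCreate()

        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker address
          .option("subscribe", "events")                     // placeholder topic name
          .option("startingOffsets", "earliest")             // used only when no checkpoint exists yet
          .option("failOnDataLoss", "false")                 // warn and continue instead of failing the query
          .load()

        val query = df
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream
          .format("console")
          .option("checkpointLocation", "/tmp/checkpoints/events") // placeholder checkpoint path
          .start()

        query.awaitTermination()
      }
    }

With a checkpoint already present at checkpointLocation, the stream resumes from the checkpointed offsets and the startingOffsets setting is ignored, which matches the behaviour Khalid describes above.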