Hi Jerry,

This is exactly the type of help we're seeking, to confirm the FileStreamSink was not utilized in our test runs.
Our team is going to work toward implementing this and re-running our experiments across the versions. If everything comes back with similar results, we will reach back out to share more artifacts with this thread.

Thank you Jerry.

From: Jerry Peng <jerry.boyang.p...@gmail.com>
Date: Thursday, September 14, 2023 at 1:10 PM
To: Craig Alfieri <craig.alfi...@antithesis.com>
Cc: user@spark.apache.org <user@spark.apache.org>
Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

Hi Craig,

Thank you for bringing this to the community's attention! Do you have any example code you can share that we can use to reproduce this issue? By the way, how did you determine duplicates in the output? The FileStreamSink provides exactly-once writes ONLY if you read the output with the FileStreamSource or the FileSource (batch). A log is used to determine which data is committed, and those sources know how to use that log to read the data "exactly-once".

Best,
Jerry

On Thu, Sep 14, 2023 at 9:48 AM Craig Alfieri <craig.alfi...@antithesis.com> wrote:

Hello Spark Community,

As part of a research effort, our team here at Antithesis tests major OSS projects for correctness and fault tolerance. Our team was recently testing Spark's Structured Streaming, and we came across a data duplication bug that we'd like to work with the teams to resolve.

Our intention is to use this as a future case study for our platform, but before doing so we'd like to have a resolution in place so that an announcement isn't alarming to the user base.

Attached is a high-level PDF that reviews the High Availability setup under test. This was also tested across the three latest versions, and the same behavior was observed.

We can reproduce this error readily, since our environment is fully deterministic. We are just not Spark experts and would like to work with someone in the community to resolve this.
Please let us know at your earliest convenience.

Best,

Craig Alfieri
c: 917.841.1652
craig.alfi...@antithesis.com
New York, NY.
Antithesis.com

We can't talk about most of the bugs that we've found for our customers, but some customers like to speak about their work with us: https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis

-----------------------------
This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity for whom they are addressed. If you received this message in error, please notify the sender and remove it from your system.
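
Jerry's point about the commit log can be illustrated with a small toy model. The sketch below does not use Spark and does not reproduce Spark's on-disk format: the `_commit_log` directory, the file names, and the helpers `write_batch`, `read_naive`, and `read_committed` are all hypothetical, chosen only to show why a reader that lists every file in the output directory can see duplicates while a reader that consults the log does not. In Spark itself, the file sink keeps its log in a `_spark_metadata` subdirectory of the output path, and the file sources consult it when reading that path.

```python
import json
import tempfile
from pathlib import Path

# Toy model only: this directory name and log format are hypothetical and far
# simpler than what Spark's file sink actually writes.
LOG_DIR = "_commit_log"


def write_batch(out_dir: Path, batch_id: int, attempt: int, rows: list, commit: bool) -> None:
    """Write one data file for a batch attempt; list it in the log only on commit."""
    data_file = out_dir / f"part-{batch_id}-attempt{attempt}.json"
    data_file.write_text(json.dumps(rows))
    if commit:
        log_dir = out_dir / LOG_DIR
        log_dir.mkdir(exist_ok=True)
        # One log entry per batch, naming the files that belong to it.
        (log_dir / str(batch_id)).write_text(json.dumps([data_file.name]))


def read_naive(out_dir: Path) -> list:
    """Read every data file in the directory, ignoring the log."""
    rows = []
    for data_file in sorted(out_dir.glob("part-*.json")):
        rows.extend(json.loads(data_file.read_text()))
    return rows


def read_committed(out_dir: Path) -> list:
    """Read only the files that the log lists as committed."""
    rows = []
    entries = sorted((out_dir / LOG_DIR).iterdir(), key=lambda p: int(p.name))
    for entry in entries:
        for name in json.loads(entry.read_text()):
            rows.extend(json.loads((out_dir / name).read_text()))
    return rows


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        write_batch(out, 0, attempt=0, rows=[1, 2], commit=True)
        # Simulated failure: the data file lands on disk but the batch never commits.
        write_batch(out, 1, attempt=0, rows=[3, 4], commit=False)
        # The same batch is retried after a restart, and this attempt commits.
        write_batch(out, 1, attempt=1, rows=[3, 4], commit=True)
        return read_naive(out), read_committed(out)


if __name__ == "__main__":
    naive, committed = demo()
    print("naive read:    ", naive)
    print("log-based read:", committed)
```

In this toy run the naive read returns `[1, 2, 3, 4, 3, 4]`, because it picks up the orphaned file from the failed attempt, while the log-based read returns `[1, 2, 3, 4]`. Whether this is what happened in the tests described above depends on how the output was read and how duplicates were counted, which is the question Jerry raises.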