Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

Craig Alfieri Thu, 14 Sep 2023 11:24:09 -0700

Hi Jerry- This is exactly the type of help we're seeking, to confirm the 
FileStreamSink was not utilized on our test runs.


Our team is going to work towards implementing this and re-running our 
experiments across the versions.

If everything comes back with similar results, we will reach back out to share 
more artifacts with this thread.

Thank you Jerry.


From: Jerry Peng <[email protected]>
Date: Thursday, September 14, 2023 at 1:10 PM
To: Craig Alfieri <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 
3.2.4, and 3.3.2

Hi Craig,

Thank you for bringing this to the community's attention! Do you have any 
example code you can share that we can use to reproduce this issue?  By the 
way, how did you determine duplicates in the output?  The FileStreamSink 
provides exactly-once writes ONLY if you read the output with the 
FileStreamSource or the FileSource (batch).  A log is used to determine what 
data is committed or not and those aforementioned sources know how to use that 
log to read the data "exactly-once".
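
To make the role of that log concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the `_commit_log.json` file name and the helpers `write_batch`, `read_all_files`, and `read_committed` are all hypothetical, standing in for the sink's metadata log (which, as far as I know, Spark keeps in a `_spark_metadata` directory under the output path). It shows how a batch that fails before committing and is then retried can leave an orphaned file, so a reader that simply lists the directory sees duplicates while a reader that honors the log does not.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the sink's metadata log; not Spark's real format.
LOG_NAME = "_commit_log.json"


def write_batch(out_dir: Path, attempt_id: str, rows: list, commit: bool) -> None:
    """Write one part file; record it in the log only if the batch commits."""
    part = out_dir / f"part-{attempt_id}.json"
    part.write_text(json.dumps(rows))
    if commit:
        log_path = out_dir / LOG_NAME
        committed = json.loads(log_path.read_text()) if log_path.exists() else []
        committed.append(part.name)
        log_path.write_text(json.dumps(committed))


def read_all_files(out_dir: Path) -> list:
    """Naive reader: lists every part file and ignores the log."""
    rows = []
    for part in sorted(out_dir.glob("part-*.json")):
        rows.extend(json.loads(part.read_text()))
    return rows


def read_committed(out_dir: Path) -> list:
    """Log-aware reader: reads only the files the log marks as committed."""
    log_path = out_dir / LOG_NAME
    committed = json.loads(log_path.read_text()) if log_path.exists() else []
    rows = []
    for name in committed:
        rows.extend(json.loads((out_dir / name).read_text()))
    return rows


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp)
        # First attempt writes its file but fails before committing.
        write_batch(out_dir, "0-attempt0", [1, 2, 3], commit=False)
        # The retry writes the same rows again and commits.
        write_batch(out_dir, "0-attempt1", [1, 2, 3], commit=True)
        return read_all_files(out_dir), read_committed(out_dir)


if __name__ == "__main__":
    naive, exact = demo()
    print("directory listing:", naive)
    print("log-aware read:   ", exact)
```

If the duplicates were found by listing or reading the output files with a tool that does not consult the log, this is the kind of result one would expect even when the sink itself behaved correctly.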

Best,

Jerry

On Thu, Sep 14, 2023 at 9:48 AM Craig Alfieri <[email protected]> wrote:

Hello Spark Community-

As part of a research effort, our team here at Antithesis tests for 
correctness/fault tolerance of major OSS projects.
Our team recently was testing Spark’s Structured Streaming, and we came across 
a data duplication bug we’d like to work with the teams on to resolve.

Our intention is to utilize this as a future case study for our platform, but 
prior to doing so we'd like to have a resolution in place so that an announcement 
isn’t alarming to the user base.

Attached is a high-level PDF that reviews the High Availability set-up put 
under test.
This was also tested across the three latest versions, and the same behavior 
was observed in each.

We can reproduce this error readily, since our environment is fully 
deterministic. We are just not Spark experts and would like to work with 
someone in the community to resolve this.

Please let us know at your earliest convenience.

Best

Craig Alfieri
c: 917.841.1652
[email protected]
New York, NY.
Antithesis.com <http://www.antithesis.com/>

We can't talk about most of the bugs that we've found for our customers,
but some customers like to speak about their work with us:
https://github.com/mongodb/mongo/wiki/Testing-MongoDB-with-Antithesis



-----------------------------
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity for whom they are addressed. If 
you received this message in error, please notify the sender and remove it from 
your system.

