Re: Missing data in spark output

2022-10-25 Thread Steve Loughran
V1 on GCS isn't safe either, as promotion from task attempt to successful
task is a directory rename: fast and atomic on HDFS, O(files) and
non-atomic on GCS.

If I can get that Hadoop 3.3.5 RC out soon, the manifest committer will be
there to test: https://issues.apache.org/jira/browse/MAPREDUCE-7341

Until then, as Chris says, turn off speculative execution.

On Fri, 21 Oct 2022 at 23:39, Chris Nauroth  wrote:

> Some users have observed issues like what you're describing related to the
> job commit algorithm, which is controlled by configuration
> property spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.
> Hadoop's default value for this setting is 2. You can find a description of
> the algorithms in Hadoop's configuration documentation:
>
>
> https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
>
> Algorithm version 2 is faster, because the final task output file renames
> can be issued in parallel by individual tasks. Unfortunately, there have
> been reports of it causing side effects like what you described, especially
> if there are a lot of task attempt retries or speculative execution
> (configuration property spark.speculation set to true instead of the
> default false). You could try switching to algorithm version 1. The
> drawback is that it's slower, because the final output renames are executed
> single-threaded at the end of the job. The performance impact is more
> noticeable for jobs with many tasks, and the effect is amplified when using
> cloud storage as opposed to HDFS running in the same network.
>
> If you are using speculative execution, then you could also potentially
> try turning that off.
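
A minimal sketch of applying both of the settings discussed above when
building a SparkSession (Scala). The application name is hypothetical, and
the same properties can equally be passed to spark-submit with --conf:

    import org.apache.spark.sql.SparkSession

    // Sketch only: switch to the v1 commit algorithm and disable speculative
    // execution while investigating the missing/duplicate rows.
    val spark = SparkSession.builder()
      .appName("gcs-output-debug") // hypothetical application name
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
      .config("spark.speculation", "false")
      .getOrCreate()
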
>
> Chris Nauroth
>
>
> On Wed, Oct 19, 2022 at 8:18 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> Is your Spark job batch or streaming?
>> --
>> *From:* Sandeep Vinayak 
>> *Sent:* Tuesday, October 18, 2022 19:48
>> *To:* dev@spark.apache.org 
>> *Subject:* Missing data in spark output
>>
>>
>>
>> Hello Everyone,
>>
>> We have recently been observing intermittent data loss in Spark jobs that
>> write output to GCS (Google Cloud Storage). When rows are missing, they are
>> accompanied by duplicate rows, and a re-run of the job produces no duplicate
>> or missing rows. Since it's hard to debug, we are first trying to understand
>> the potential root cause of this issue. Could this be a GCS-specific issue,
>> where GCS might not be handling consistency well? Any tips will be super
>> helpful.
>>
>> Thanks,
>>
>>


3.3.1 Release

2022-10-25 Thread Pastrana, Rodrigo (RIS-BCT)
Thanks to all involved with the 3.3.1 release. Is there a target date for the 
official release? Thanks!

[VOTE][RESULT] Release Spark 3.3.1 (RC4)
The vote passes with 11 +1s (6 binding +1s).
Thanks to all who helped with the release!

(* = binding)
+1:
- Sean Owen (*)
- Yang,Jie
- Dongjoon Hyun (*)
- L. C. Hsieh (*)
- Gengliang Wang (*)
- Thomas Graves (*)
- Chao Sun
- Wenchen Fan (*)
- Yikun Jiang
- Cheng Pan
- Yuming Wang

+0: None

-1: None





Re: 3.3.1 Release

2022-10-25 Thread Dongjoon Hyun
It was released today, Pastrana.

https://downloads.apache.org/spark/spark-3.3.1/
https://spark.apache.org/news/spark-3-3-1-released.html
https://spark.apache.org/releases/spark-release-3-3-1.html
https://spark.apache.org/docs/3.3.1/
https://pypi.org/project/pyspark/3.3.1/

I guess the release manager will announce it officially after finalizing the
release by uploading it to Docker Hub.

https://hub.docker.com/r/apache/spark/tags

Dongjoon.


On Tue, Oct 25, 2022 at 1:14 PM Pastrana, Rodrigo (RIS-BCT)
 wrote:

> Thanks to all involved with the 3.3.1 release. Is there a target date for
> the official release? Thanks!


[ANNOUNCE] Apache Spark 3.3.1 released

2022-10-25 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.1!

Spark 3.3.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend that all 3.3 users upgrade to this stable release.

To download Spark 3.3.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.
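
A minimal sketch of picking up the 3.3.1 maintenance release as a build
dependency (sbt/Scala). The module selection below is an example, not part of
the announcement; it assumes a project whose scalaVersion is a 2.12.x release,
matching the default Spark 3.3.1 binaries:

    // build.sbt (example): depend on the Spark 3.3.1 artifacts
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.3.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "3.3.1" % "provided"
    )

PySpark users can likewise pin the pyspark 3.3.1 package from PyPI.
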


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-25 Thread Dongjoon Hyun
It's great. Thank you so much, Yuming!

Dongjoon
