Based on my research so far, when there is an existing io.file.buffer.size in hive-site.xml, the hadoopConf finally gets reset by that value.
In many real-world cases, when interacting with a Hive catalog through Spark SQL, users may simply share the hive-site.xml from their Hive jobs and copy it to SPARK_HOME/conf without modification. When Spark generates Hadoop configurations, it uses spark.buffer.size (default 65536) to override io.file.buffer.size (Hadoop default 4096). But when hive-site.xml is loaded, this behavior is ignored and io.file.buffer.size is reset again according to hive-site.xml.
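For illustration, here is a minimal sketch (the object and app names are made up for the example; this is not the fix itself) showing how to observe which value wins in a local Spark application:

    import org.apache.spark.sql.SparkSession

    object BufferSizeCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("buffer-size-check")
          .config("spark.buffer.size", "65536") // should become io.file.buffer.size
          .getOrCreate()

        // With no hive-site.xml on the classpath this prints 65536 (from
        // spark.buffer.size); with a hive-site.xml that sets io.file.buffer.size,
        // that value (e.g. 4096) shows up here instead.
        println(spark.sparkContext.hadoopConfiguration.get("io.file.buffer.size"))

        spark.stop()
      }
    }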
The PR fixes:
1. The configuration priority for setting Hadoop and Hive configs here is wrong; the order should be spark > spark.hive > spark.hadoop > hive > hadoop (see the sketch after this list).
2. It breaks the spark.buffer.size config's behavior for tuning IO performance with HDFS when there is an existing io.file.buffer.size in hive-site.xml.
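A hedged sketch of the intended priority (a hypothetical helper, not Spark's actual code): apply the sources lowest-priority first, so each later map overwrites the earlier ones.

    import org.apache.hadoop.conf.Configuration

    object ConfPriority {
      // Hypothetical helper: the five sources are assumed to be already parsed
      // into plain key/value maps (e.g. spark.hadoop.* with the prefix stripped).
      // Applying them lowest-priority first yields the intended order:
      // spark > spark.hive > spark.hadoop > hive > hadoop.
      def buildHadoopConf(
          hadoopDefaults: Map[String, String],
          hiveSite: Map[String, String],
          sparkHadoop: Map[String, String],
          sparkHive: Map[String, String],
          sparkOwn: Map[String, String]): Configuration = {
        val conf = new Configuration(false)
        Seq(hadoopDefaults, hiveSite, sparkHadoop, sparkHive, sparkOwn)
          .foreach(_.foreach { case (k, v) => conf.set(k, v) })
        conf
      }
    }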
Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-postgres: a library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: a library that brings excellent and useful functions from various modern database management systems to Apache Spark.
On 02/3/2021 15:36, Maxim Gekk <maxim.g...@databricks.com> wrote:
Hi All,

> Also I am investigating a performance regression in some TPC-DS queries (q88 for instance) that is caused by a recent commit in 3.1 ...

I have found that the perf regression is caused by the Hadoop config:
io.file.buffer.size = 4096
Before the commit https://github.com/apache/spark/commit/278f6f45f46ccafc7a31007d51ab9cb720c9cb14, we had:
io.file.buffer.size = 65536

Maxim Gekk
Software Engineer
Databricks, Inc.
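To make the 4096-vs-65536 difference concrete, here is a rough micro-benchmark sketch (the HDFS path is hypothetical) that reads the same file with each buffer size; the smaller buffer issues many more read calls, which is consistent with the slowdown described above.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BufferSizeBench {
      def timeReadMillis(bufferSize: Int, path: Path, conf: Configuration): Long = {
        val fs = FileSystem.get(path.toUri, conf)
        val buf = new Array[Byte](bufferSize)
        val start = System.nanoTime()
        // open(path, bufferSize) takes an explicit io.file.buffer.size for this stream
        val in = fs.open(path, bufferSize)
        try { while (in.read(buf) != -1) () } finally { in.close() }
        (System.nanoTime() - start) / 1000000
      }

      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val path = new Path("hdfs:///tmp/some-large-file") // hypothetical test file
        println(s"4096 bytes:  ${timeReadMillis(4096, path, conf)} ms")
        println(s"65536 bytes: ${timeReadMillis(65536, path, conf)} ms")
      }
    }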
On Wed, Feb 3, 2021 at 2:37 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

Yeah, agree. I changed it. Thanks for the heads up, Tom.

On Wed, Feb 3, 2021 at 8:31 AM, Tom Graves <tgraves...@yahoo.com> wrote:

OK, thanks for the update. That issue is marked as an improvement; if it's a blocker, can we mark it as such and describe why? I searched JIRAs and didn't see any critical issues or blockers open.

Tom

On Tuesday, February 2, 2021, 05:12:24 PM CST, Hyukjin Kwon <gurwls...@gmail.com> wrote:

There is one here: https://github.com/apache/spark/pull/31440. Several more issues have been identified (to confirm whether this is an issue in OSS too) and are being fixed in parallel.
There have been some unexpected delays here, as several more issues were found. I will try to file and share the relevant JIRAs as soon as I can confirm them.

On Wed, Feb 3, 2021 at 2:36 AM, Tom Graves <tgraves...@yahoo.com> wrote:

Just curious if we have an update on the next RC? Is there a JIRA for the TPC-DS issue?

Thanks,
Tom

On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Just to share the current status: most of the known issues have been resolved. Let me know if there are more.
One remaining item is a performance regression in TPC-DS that is being investigated. Once the cause is identified (and fixed if it should be), I will cut another RC right away.
I roughly expect to cut another RC next Monday.
Thanks guys.

On Wed, Jan 27, 2021 at 5:26 AM, Terry Kim <yumin...@gmail.com> wrote:

Hi,

Please check if the following regression should be included: https://github.com/apache/spark/pull/31352

Thanks,
Terry

On Tue, Jan 26, 2021 at 7:54 AM Holden Karau <hol...@pigscanfly.ca> wrote:

If we're OK waiting for it, I'd like to get https://github.com/apache/spark/pull/31298 in as well (it's not a regression, but it is a bug fix).

On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

It looks like a cool one, but it's a pretty big one and affects the plans considerably ... maybe it's best to avoid adding it to 3.1.1, particularly during the RC period, if it isn't a clear regression that affects many users.

On Tue, Jan 26, 2021 at 11:23 PM, Peter Toth <peter.t...@gmail.com> wrote:

Hey,

Sorry for chiming in a bit late, but I would like to suggest my PR (https://github.com/apache/spark/pull/28885) for review and inclusion in 3.1.1.
Currently, invalid reuse reference nodes appear in many queries, causing performance issues and incorrect explain plans. Now that https://github.com/apache/spark/pull/31243 has been merged, these invalid references can easily be found in many of our golden files on master: https://github.com/apache/spark/pull/28885#issuecomment-767530441.
But the issue isn't specific to master (3.2); it has actually been there since 3.0, when Dynamic Partition Pruning was added.
So it is not a regression from 3.0 to 3.1.1, but in some cases (like TPC-DS q23b) it causes a performance regression from 2.4 to 3.x.

Thanks,
Peter

On Tue, Jan 26, 2021 at 6:30 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

Guys, I plan to make an RC as soon as we have no visible issues. I have merged a few correctness fixes. What remains:
- https://github.com/apache/spark/pull/31319 is waiting for a review (I will get to it soon too).
- https://github.com/apache/spark/pull/31336
- I know Max is investigating the perf regression, which hopefully will be fixed soon.
Are there any more blockers or correctness issues? Please ping me or call them out here.
I would like to avoid making an RC when there are clearly some issues to be fixed.
If you're investigating something suspicious, that's fine too. It's better to make sure we're safe instead of rushing an RC without finishing the investigation.
Thanks all.

On Fri, Jan 22, 2021 at 6:19 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Sure, thanks guys. I'll start another RC after the fixes. Looks like we're almost there.

On Fri, 22 Jan 2021, 17:47, Wenchen Fan <cloud0...@gmail.com> wrote:

BTW, there is a correctness bug being fixed at https://github.com/apache/spark/pull/30788. It's not a regression, but the fix is very simple, and it would be better to start the next RC after merging that fix.

On Fri, Jan 22, 2021 at 3:54 PM Maxim Gekk <maxim.g...@databricks.com> wrote:

Also, I am investigating a performance regression in some TPC-DS queries (q88 for instance) that is caused by a recent commit in 3.1, highly likely in the period from November 19th, 2020 to December 18th, 2020.

Maxim Gekk
Software Engineer
Databricks, Inc.
On Fri, Jan 22, 2021 at 10:45 AM Wenchen Fan <cloud0...@gmail.com> wrote:

-1, as I just found a regression in 3.1. A self-join query works well in 3.0 but fails in 3.1. It's being fixed at https://github.com/apache/spark/pull/31287

On Fri, Jan 22, 2021 at 4:34 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

+1. Built from the tarball, verified the sha, and the regular CI and tests all pass.

Tom

On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 3.1.1.

The vote is open until January 22nd 4PM PST and passes if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.1.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.1.1-rc1 (commit 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
The release files, including signatures, digests, etc. can be found at:
Signatures used for Spark RCs can be found in this file:
The staging repository for this release can be found at:
The documentation corresponding to this release can be found at:
The list of bug fixes going into 3.1.1 can be found at the following URL:

This release is using the release script of the tag v3.1.1-rc1.

FAQ

===================
What happened to 3.1.0?
===================
There was a technical issue during Apache Spark 3.1.0 preparation, and it was discussed and decided to skip 3.1.0.
Please see https://spark.apache.org/news/next-official-release-spark-3.1.1.html for more details.

=========================
How can I help test this release?
=========================

If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.

If you're working in PySpark, you can set up a virtual env and install the current RC via "pip install https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz"
and see if anything important breaks.

In Java/Scala, you can add the staging repository to your project's resolvers and test with the RC (make sure to clean up the artifact cache before/after so you don't end up building with an out-of-date RC going forward).

===========================================
What should happen to JIRA tickets still targeting 3.1.1?
===========================================

The current list of open tickets targeted at 3.1.1 can be found at: https://issues.apache.org/jira/projects/SPARK and search for "Target Version/s" = 3.1.1

Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to an appropriate release.

==================
But my bug isn't fixed?
==================

In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from the previous release. That being said, if there is something which is a regression that has not been correctly targeted, please ping me or a committer to help target the issue.

Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau