Thanks, Dongjoon, for the discussion. I would like to add Gengliang's work:
SPARK-34246: New type coercion syntax rules in ANSI mode
I think it is worth describing in the next release notes, too.
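To make the behavior change concrete, here is an illustrative SQL fragment (not from the original thread) showing the kind of difference ANSI mode makes; the exact coercion rules are the subject of SPARK-34246:

```sql
-- Enable ANSI mode (defaults to false in Spark 3.x):
SET spark.sql.ansi.enabled=true;

-- Under ANSI rules, an invalid string-to-int cast raises a runtime error
-- instead of silently returning NULL as it does in the legacy mode:
SELECT CAST('not_a_number' AS INT);
```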
Bests,
Takeshi

On Sat, Feb 27, 2021 at 11:41 AM Yi Wu <yi...@databricks.com> wrote:

> +1 to continue the incomplete push-based shuffle.
>
> --
> Yi
>
> On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan <mri...@gmail.com>
> wrote:
>
>> Nit: Java 17 -> should be available by Sept 2021 :-)
>> Adoption would also depend on some of our nontrivial dependencies
>> supporting it - it might be a stretch to get it in for Apache Spark 3.2?
>>
>> Features:
>> Push-based shuffle and disaggregated shuffle should also be in 3.2.
>>
>> Regards,
>> Mridul
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in the master branch
>>> since December 2020, March seems to be a good time to share our thoughts
>>> and aspirations for Apache Spark 3.2.
>>>
>>> Judging by the progress of the Apache Spark 3.1 release, Apache Spark
>>> 3.2 seems likely to be the last minor release of this year. Given the
>>> timeframe, we might consider the following. (This is a small set. Please
>>> add your thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected for 3.1 via SPARK-25075 but
>>> slipped. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of the Apache SparkR 3.1.1 CRAN publishing. If
>>> that succeeds, we can keep publishing.
>>> Otherwise, I believe we had better drop it from the release work item
>>> list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile
>>> in Apache Spark 3.1. Currently, the Spark master branch lives on Hadoop
>>> 3.2.2's shaded clients via SPARK-33212. So far, there is one ongoing
>>> report in a YARN environment. We hope it will be fixed soon within the
>>> Spark 3.2 timeframe so that we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
>>> instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
>>> completely via SPARK-32981 and replaced the generated hive-service-rpc
>>> code with the official dependency via SPARK-32981. We are steadily
>>> improving this area and will consume Hive 2.3.9 when available.
>>>
>>> - K8s Client 4.13.2: During the K8s GA activity, Spark 3.1 upgraded the
>>> K8s client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in
>>> order to support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using
>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
>>> Scala 2.12.13, but it was reverted later due to a Scala 2.12.13 issue.
>>> Since KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2
>>> will hopefully go with Kafka Client 2.8.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver a much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the ongoing function catalog
>>> SPIP and upcoming storage partitioned join SPIP can be delivered as a
>>> part of Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, the Apache Spark master branch
>>> supports columnar encryption via Apache ORC 1.6, and it's documented via
>>> SPARK-34036. Also, the upcoming Apache Parquet 1.12 has a similar
>>> capability. Hopefully, Apache Spark 3.2 is going to be the first release
>>> to have this feature officially.
>>> Any feedback is welcome.
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits to
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool
>>> support for all IO operations, 2) SPARK-33978 makes the ORC data source
>>> support ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec
>>> for event log compression, and 4) SPARK-34479 aims to support ZSTD in
>>> the Avro data source. Also, the upcoming Parquet 1.12 supports ZSTD (and
>>> supports the JNI buffer pool), too. I'm expecting more benefits.
>>>
>>> - Structured Streaming with RocksDB backend: According to the latest
>>> update, it looks active enough for merging to the master branch in Spark
>>> 3.2.
>>>
>>> Please share your thoughts and let's build a better Apache Spark 3.2
>>> together.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>

-- 
---
Takeshi Yamamuro
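As a footnote to the event log compression point (SPARK-34503) discussed in the thread, here is an illustrative spark-defaults.conf fragment (not part of the original messages) showing how a user could opt in to ZSTD-compressed event logs explicitly today, rather than waiting for the default to change; the property names below exist in Spark 3.x:

```
# spark-defaults.conf (illustrative): compress event logs with ZSTD.
# SPARK-34503 proposes making zstd the default codec, so setting it
# explicitly would become unnecessary once that lands.
spark.eventLog.enabled             true
spark.eventLog.compress            true
spark.eventLog.compression.codec   zstd
```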