Re: Apache Spark 3.2 Expectation

Dongjoon Hyun Fri, 26 Feb 2021 10:07:00 -0800

Thank you, Mridul and Sean.

1. Yes, `2017` was a typo. Java 17 is scheduled for September 2021. And, of
course, it's a nice-to-have item. :)


2. `Push based shuffle and disaggregated shuffle`. Definitely. Thanks for
sharing.

3. +100 for Apache Spark 3.2.0 in July 2021. Maybe we need a `branch-cut` in
April because we took 3 months for the Spark 3.1 release.
    Let's update the release roadmap on the Apache Spark website.

> I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
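
As a rough sanity check on the timeline in item 3, here is a minimal Python sketch of the date arithmetic. The exact branch-cut day (April 1) and the helper name `add_months` are assumptions for illustration only; the thread itself only says "April" and "3 months".

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift `start` forward by `months` (hypothetical helper).

    The day is clamped to 28 so the result is always a valid calendar date.
    """
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    return date(year, month, min(start.day, 28))


# Assumption: the branch is cut on the first day of April 2021.
branch_cut = date(2021, 4, 1)
# Assumption: about 3 months from branch cut to release, as with Spark 3.1.
expected_release = add_months(branch_cut, 3)
print(expected_release.isoformat())
```

Under these assumptions the estimate lands at the start of July 2021, which matches the July target; a later cut in April would push it later into the month.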

Bests,
Dongjoon.



On Thu, Feb 25, 2021 at 9:33 AM Sean Owen <[email protected]> wrote:

> I'd roughly expect 3.2 in, say, July of this year, given the usual
> cadence. No reason it couldn't be a little sooner or later. There is
> already some good stuff in 3.2 and will be a good minor release in 5-6
> months.
>
> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun <[email protected]>
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in the master branch since
>> December 2020, March seems to be a good time to share our thoughts and
>> aspirations on Apache Spark 3.2.
>>
>> Judging by the progress of the Apache Spark 3.1 release, Apache Spark 3.2
>> seems to be the last minor release of this year. Given the timeframe, we
>> might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected in 3.1 via SPARK-25075 but
>> slipped. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>> Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it has been discontinued so far.
>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN publishing.
>> If that succeeds, we can keep publishing. Otherwise, I believe we had
>> better drop it from the release work item list officially.
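
To make the Python 3.6 deprecation item above a bit more concrete, here is a minimal sketch of what a version-based deprecation warning could look like. The function name `warn_if_deprecated`, the constant `DEPRECATED_UP_TO`, and the message text are all hypothetical; this is not PySpark's actual code.

```python
import sys
import warnings

# Assumption: 3.6 is the newest Python minor version being deprecated.
DEPRECATED_UP_TO = (3, 6)


def warn_if_deprecated(version_info=sys.version_info) -> bool:
    """Warn and return True if the given interpreter version is deprecated.

    Hypothetical helper: compares only (major, minor) against the cutoff.
    """
    major_minor = tuple(version_info[:2])
    if major_minor <= DEPRECATED_UP_TO:
        warnings.warn(
            "Python %d.%d support is deprecated." % major_minor,
            FutureWarning,
        )
        return True
    return False


# Check the running interpreter; no warning is expected on a recent Python.
warn_if_deprecated()
```

A `FutureWarning` is used here because Python shows it to end users by default, whereas `DeprecationWarning` is hidden in most contexts.
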
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop profile in
>> Apache Spark 3.1. Currently, the Spark master branch uses Hadoop 3.2.2's
>> shaded clients via SPARK-33212. So far, there is one ongoing report in the
>> YARN environment. We hope it will be fixed soon, within the Spark 3.2
>> timeframe, so that we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
>> instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
>> completely via SPARK-32981 and replaced the generated hive-service-rpc code
>> with the official dependency via SPARK-32981. We are steadily improving
>> this area and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During the K8s GA activity, Spark 3.1 upgraded the K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring in client fixes, Spark 3.1 is using Kafka
>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>> 2.12.13, but it was reverted later due to a Scala 2.12.13 issue. Since
>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will
>> hopefully go with Kafka Client 2.8.
>>
>> # Some Features
>>
>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>> Iceberg integration. In particular, we hope the ongoing function catalog
>> SPIP and the upcoming storage partitioned join SPIP can be delivered as a
>> part of Spark 3.2 and become an additional foundation.
>>
>> - Columnar Encryption: As of today, the Apache Spark master branch supports
>> columnar encryption via Apache ORC 1.6, and it's documented via SPARK-34036.
>> Also, the upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>> Apache Spark 3.2 is going to be the first release to have this feature
>> officially. Any feedback is welcome.
>>
>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>> for all IO operations, 2) SPARK-33978 makes the ORC data source support
>> ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec for event
>> log compression, 4) SPARK-34479 aims to support ZSTD in the Avro data
>> source. Also, the upcoming Parquet 1.12 supports ZSTD (and the JNI buffer
>> pool), too. I'm expecting more benefits.
>>
>> - Structured Streaming with RocksDB backend: According to the latest
>> update, it looks active enough to be merged into the master branch in
>> Spark 3.2.
>>
>> Please share your thoughts and let's build a better Apache Spark 3.2
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>
