Thanks, Hyukjin.

The expected branch cut date of Spark 3.2 is *July 1st*, per
https://spark.apache.org/versioning-policy.html. However, I notice that
there are still multiple important projects in progress:

[Core]

   - SPIP: Support push-based shuffle to improve shuffle efficiency
   <https://issues.apache.org/jira/browse/SPARK-30602>

[SQL]

   - Support ANSI SQL INTERVAL types
   <https://issues.apache.org/jira/browse/SPARK-27790>
   - Support Timestamp without time zone data type
   <https://issues.apache.org/jira/browse/SPARK-35662>
   - Aggregate (Min/Max/Count) push down for Parquet
   <https://issues.apache.org/jira/browse/SPARK-34952>

[Streaming]

   - EventTime based sessionization (session window)
   <https://issues.apache.org/jira/browse/SPARK-10816>
   - Add RocksDB StateStore as external module
   <https://issues.apache.org/jira/browse/SPARK-34198>


I wonder whether we should postpone the branch cut date.
cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuanjian
Li, Liang-Chi Hsieh, who work on the projects above.

On Tue, Jun 15, 2021 at 4:34 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> +1, thanks.
>
> On Tue, 15 Jun 2021, 16:17 Gengliang Wang, <ltn...@gmail.com> wrote:
>
>> Hi,
>>
>> As the expected release date is close, I would like to volunteer as the
>> release manager for Apache Spark 3.2.0.
>>
>> Thanks,
>> Gengliang
>>
>> On Mon, Apr 12, 2021 at 1:59 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> An update: we made a mistake when picking the Spark 3.2 release date: we
>>> based it on the scheduled release date of 3.1. However, 3.1 was delayed and
>>> released on March 2. In order to allow a full 6 months of development for
>>> 3.2, its target release date should be September 2.
>>>
>>> I'm updating the release dates in
>>> https://github.com/apache/spark-website/pull/331
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Thank you, Xiao, Wenchen and Hyukjin.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Just for an update, I will send a discussion email about my idea late
>>>>> this week or early next week.
>>>>>
>>>>> On Thu, Mar 11, 2021 at 7:00 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> There are many projects going on right now, such as new DS v2 APIs,
>>>>>> ANSI interval types, join improvement, disaggregated shuffle, etc. I
>>>>>> don't think it's realistic to do the branch cut in April.
>>>>>>
>>>>>> I'm +1 to releasing 3.2 around July, but that doesn't mean we have to
>>>>>> cut the branch 3 months earlier. We should make the release process
>>>>>> faster and probably cut the branch around June.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li <gatorsm...@gmail.com> wrote:
>>>>>>
>>>>>>> Below are some nice-to-have features we can work on in Spark 3.2:
>>>>>>> Lateral Join support <https://issues.apache.org/jira/browse/SPARK-28379>,
>>>>>>> the interval data type, timestamp without time zone, un-nesting arbitrary
>>>>>>> queries, the returned metrics of DSv2, and error message standardization.
>>>>>>> I believe Spark 3.2 will be another exciting release!
>>>>>>>
>>>>>>> Go Spark!
>>>>>>>
>>>>>>> Xiao
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 10, 2021 at 12:25 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, Xiao.
>>>>>>>>
>>>>>>>> This thread started 13 days ago. Since you asked the community
>>>>>>>> about major features and timelines at that time, could you share your
>>>>>>>> roadmap or expectations if you have something in mind?
>>>>>>>>
>>>>>>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep
>>>>>>>> it open. It might take 1-2 weeks to collect from the community all the
>>>>>>>> features we plan to build and ship in 3.2 since we just finished the
>>>>>>>> 3.1 voting.
>>>>>>>> > TBH, cutting the branch this April does not look good to me. That
>>>>>>>> means, we only have one month left for feature development of Spark
>>>>>>>> 3.2. Do we have enough features in the current master branch? If not,
>>>>>>>> are we able to finish major features we collected here? Do they have a
>>>>>>>> timeline or project plan?
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun <
>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, John.
>>>>>>>>>
>>>>>>>>> This thread aims to share our expectations and goals (and maybe
>>>>>>>>> work progress) for Apache Spark 3.2, because we are making this
>>>>>>>>> together. :)
>>>>>>>>>
>>>>>>>>> Bests,
>>>>>>>>> Dongjoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge <jzh...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dongjoon,
>>>>>>>>>>
>>>>>>>>>> Is it possible to get ViewCatalog in? The community already had
>>>>>>>>>> fairly detailed discussions.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> John
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
>>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, All.
>>>>>>>>>>>
>>>>>>>>>>> We have been preparing Apache Spark 3.2.0 in the master branch
>>>>>>>>>>> since December 2020, so March seems to be a good time to share our
>>>>>>>>>>> thoughts and aspirations for Apache Spark 3.2.
>>>>>>>>>>>
>>>>>>>>>>> Given the progress of the Apache Spark 3.1 release, Apache
>>>>>>>>>>> Spark 3.2 seems likely to be the last minor release of this year.
>>>>>>>>>>> With that timeframe in mind, we might consider the following. (This
>>>>>>>>>>> is a small set. Please add your thoughts to this limited list.)
>>>>>>>>>>>
>>>>>>>>>>> # Languages
>>>>>>>>>>>
>>>>>>>>>>> - Scala 2.13 Support: This was expected in 3.1 via SPARK-25075
>>>>>>>>>>> but slipped. Currently, we are trying to use Scala 2.13.5 via
>>>>>>>>>>> SPARK-34505 and investigating the publishing issue. Thank you for
>>>>>>>>>>> your contributions and feedback on this.
>>>>>>>>>>>
>>>>>>>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September
>>>>>>>>>>> 2021. Like Java 11, we need lots of support from our dependencies.
>>>>>>>>>>> Let's see.
>>>>>>>>>>>
>>>>>>>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends
>>>>>>>>>>> on 2021-12-23. So the deprecation is not required yet, but we had
>>>>>>>>>>> better prepare for it because we don't have an ETA for Apache Spark
>>>>>>>>>>> 3.3 in 2022.
>>>>>>>>>>>
>>>>>>>>>>> - SparkR CRAN publishing: As we know, it has been discontinued so
>>>>>>>>>>> far. Resuming it depends on the success of the Apache SparkR 3.1.1
>>>>>>>>>>> CRAN publishing. If that succeeds, we can keep publishing.
>>>>>>>>>>> Otherwise, I believe we had better officially drop it from the
>>>>>>>>>>> release work item list.
>>>>>>>>>>>
>>>>>>>>>>> # Dependencies
>>>>>>>>>>>
>>>>>>>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop
>>>>>>>>>>> profile in Apache Spark 3.1. Currently, the Spark master branch
>>>>>>>>>>> lives on Hadoop 3.2.2's shaded clients via SPARK-33212. So far,
>>>>>>>>>>> there is one ongoing report in a YARN environment. We hope it will
>>>>>>>>>>> be fixed within the Spark 3.2 timeframe so we can move toward
>>>>>>>>>>> Hadoop 3.3.2.
>>>>>>>>>>>
>>>>>>>>>>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by
>>>>>>>>>>> default instead of the old Hive 1.2 fork. Spark 3.1 removed the
>>>>>>>>>>> hive-1.2 profile completely via SPARK-32981 and replaced the
>>>>>>>>>>> generated hive-service-rpc code with the official dependency via
>>>>>>>>>>> SPARK-32981. We are steadily improving this area and will consume
>>>>>>>>>>> Hive 2.3.9 if available.
>>>>>>>>>>>
>>>>>>>>>>> - K8s Client 4.13.2: During the K8s GA activity, Spark 3.1
>>>>>>>>>>> upgraded the K8s client dependency to 4.12.0. Spark 3.2 upgrades it
>>>>>>>>>>> to 4.13.2 in order to support the K8s 1.19 model.
>>>>>>>>>>>
>>>>>>>>>>> - Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is
>>>>>>>>>>> using Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to
>>>>>>>>>>> Kafka 2.7 with Scala 2.12.13, but it was later reverted due to a
>>>>>>>>>>> Scala 2.12.13 issue. Since KAFKA-12357 fixed the Scala requirement
>>>>>>>>>>> two days ago, Spark 3.2 will hopefully go with Kafka Client 2.8.
>>>>>>>>>>>
>>>>>>>>>>> # Some Features
>>>>>>>>>>>
>>>>>>>>>>> - Data Source v2: Spark 3.2 will deliver a much richer DSv2 with
>>>>>>>>>>> Apache Iceberg integration. In particular, we hope the ongoing
>>>>>>>>>>> function catalog SPIP and the upcoming storage-partitioned join
>>>>>>>>>>> SPIP can be delivered as part of Spark 3.2 and become an additional
>>>>>>>>>>> foundation.
>>>>>>>>>>>
>>>>>>>>>>> - Columnar Encryption: As of today, the Apache Spark master branch
>>>>>>>>>>> supports columnar encryption via Apache ORC 1.6, and it's
>>>>>>>>>>> documented via SPARK-34036. Also, the upcoming Apache Parquet 1.12
>>>>>>>>>>> has a similar capability. Hopefully, Apache Spark 3.2 is going to
>>>>>>>>>>> be the first release to have this feature officially. Any feedback
>>>>>>>>>>> is welcome.
>>>>>>>>>>>
>>>>>>>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits
>>>>>>>>>>> to ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer
>>>>>>>>>>> pool support for all IO operations, 2) SPARK-33978 makes the ORC
>>>>>>>>>>> datasource support ZSTD compression, 3) SPARK-34503 sets ZSTD as
>>>>>>>>>>> the default codec for event log compression, 4) SPARK-34479 aims to
>>>>>>>>>>> support ZSTD in the Avro data source. Also, the upcoming Parquet
>>>>>>>>>>> 1.12 supports ZSTD (with JNI buffer pool support), too. I'm
>>>>>>>>>>> expecting more benefits.
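>>>>>>>>>>>
>>>>>>>>>>> As a rough sketch, the ZSTD-related settings above could be
>>>>>>>>>>> combined in spark-defaults.conf like this (the property names are
>>>>>>>>>>> based on the current master branch and may change before the
>>>>>>>>>>> release):

```properties
# Compress event logs, with zstd as the codec (SPARK-34503 proposes zstd as the default)
spark.eventLog.compress               true
spark.eventLog.compression.codec      zstd
# ZSTD compression for the ORC datasource (SPARK-33978)
spark.sql.orc.compression.codec       zstd
# ZSTD compression for the Avro datasource (SPARK-34479)
spark.sql.avro.compression.codec      zstandard
```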
>>>>>>>>>>>
>>>>>>>>>>> - Structured Streaming with RocksDB backend: According to the
>>>>>>>>>>> latest update, it looks active enough to be merged to the master
>>>>>>>>>>> branch in Spark 3.2.
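>>>>>>>>>>>
>>>>>>>>>>> If it lands, switching a streaming query to the RocksDB backend
>>>>>>>>>>> would likely be a one-line configuration change (the provider class
>>>>>>>>>>> name below assumes the current SPARK-34198 proposal and may change
>>>>>>>>>>> before merging):

```properties
# Use RocksDB for streaming state instead of the default HDFS-backed state store (SPARK-34198)
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```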
>>>>>>>>>>>
>>>>>>>>>>> Please share your thoughts and let's build better Apache Spark
>>>>>>>>>>> 3.2 together.
>>>>>>>>>>>
>>>>>>>>>>> Bests,
>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> John Zhuge
>>>>>>>>>>
>>>>>>>>>
