Re: Apache Spark 3.2 Expectation

Dongjoon Hyun Thu, 11 Mar 2021 07:17:30 -0800

Thank you, Xiao, Wenchen and Hyukjin.

Bests,
Dongjoon.



On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon <[email protected]> wrote:

> Just for an update, I will send a discussion email about my idea late this
> week or early next week.
>
> On Thu, Mar 11, 2021 at 7:00 PM Wenchen Fan <[email protected]> wrote:
>
>> There are many projects going on right now, such as new DS v2 APIs, ANSI
>> interval types, join improvement, disaggregated shuffle, etc. I don't
>> think it's realistic to do the branch cut in April.
>>
>> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the
>> branch 3 months earlier. We should make the release process faster and cut
>> the branch around June probably.
>>
>>
>>
>> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li <[email protected]> wrote:
>>
>>> Below are some nice-to-have features we can work on in Spark 3.2: Lateral
>>> Join support <https://issues.apache.org/jira/browse/SPARK-28379>,
>>> interval data type, timestamp without time zone, un-nesting arbitrary
>>> queries, the returned metrics of DSV2, and error message standardization.
>>> Spark 3.2 will be another exciting release I believe!
>>>
>>> Go Spark!
>>>
>>> Xiao
>>>
>>>
>>>
>>>
>>> On Wed, Mar 10, 2021 at 12:25 PM Dongjoon Hyun <[email protected]> wrote:
>>>
>>>> Hi, Xiao.
>>>>
>>>> This thread started 13 days ago. Since you asked the community about
>>>> major features or timelines at that time, could you share your roadmap or
>>>> expectations if you have something in your mind?
>>>>
>>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>>> > open. It might take 1-2 weeks to collect from the community all the
>>>> > features we plan to build and ship in 3.2 since we just finished the 3.1
>>>> > voting.
>>>> > TBH, cutting the branch this April does not look good to me. That
>>>> > means, we only have one month left for feature development of Spark 3.2. Do
>>>> > we have enough features in the current master branch? If not, are we able
>>>> > to finish major features we collected here? Do they have a timeline or
>>>> > project plan?
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>>
>>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi, John.
>>>>>
>>>>> This thread aims to share your expectations and goals (and maybe work
>>>>> progress) for Apache Spark 3.2 because we are making this together. :)
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge <[email protected]> wrote:
>>>>>
>>>>>> Hi Dongjoon,
>>>>>>
>>>>>> Is it possible to get ViewCatalog in? The community already had
>>>>>> fairly detailed discussions.
>>>>>>
>>>>>> Thanks,
>>>>>> John
>>>>>>
>>>>>> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Since we have been preparing Apache Spark 3.2.0 in master branch
>>>>>>> since December 2020, March seems to be a good time to share our thoughts
>>>>>>> and aspirations on Apache Spark 3.2.
>>>>>>>
>>>>>>> According to the progress of the Apache Spark 3.1 release, Apache Spark
>>>>>>> 3.2 seems to be the last minor release of this year. Given the
>>>>>>> timeframe, we might consider the following. (This is a small set.
>>>>>>> Please add your thoughts to this limited list.)
>>>>>>>
>>>>>>> # Languages
>>>>>>>
>>>>>>> - Scala 2.13 Support: This was expected in 3.1 via SPARK-25075 but
>>>>>>> slipped out. Currently, we are trying to use Scala 2.13.5 via
>>>>>>> SPARK-34505 and investigating the publishing issue. Thank you for your
>>>>>>> contributions and feedback on this.
>>>>>>>
>>>>>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
>>>>>>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>>>>>>
>>>>>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends on
>>>>>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>>>>>> prepare for it because we don't have an ETA for Apache Spark 3.3 in 2022.
>>>>>>>
>>>>>>> - SparkR CRAN publishing: As we know, it has been discontinued so far.
>>>>>>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN
>>>>>>> publishing. If that succeeds in reviving it, we can keep publishing.
>>>>>>> Otherwise, I believe we had better officially drop it from the release
>>>>>>> work item list.
>>>>>>>
>>>>>>> # Dependencies
>>>>>>>
>>>>>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 became the default Hadoop
>>>>>>> profile in Apache Spark 3.1. Currently, the Spark master branch lives on
>>>>>>> Hadoop 3.2.2's shaded clients via SPARK-33212. So far, there is one
>>>>>>> ongoing report in the YARN environment. We hope it will be fixed soon
>>>>>>> in the Spark 3.2 timeframe so that we can move toward Hadoop 3.3.2.
>>>>>>>
>>>>>>> - Apache Hive 2.3.9: Spark 3.0 started to use Hive 2.3.7 by default
>>>>>>> instead of the old Hive 1.2 fork. Spark 3.1 removed the hive-1.2 profile
>>>>>>> completely via SPARK-32981 and replaced the generated hive-service-rpc
>>>>>>> code with the official dependency via SPARK-32981. We are steadily
>>>>>>> improving this area and will consume Hive 2.3.9 if available.
>>>>>>>
>>>>>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>>>>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>>>>>> support K8s model 1.19.
>>>>>>>
>>>>>>> - Kafka Client 2.8: To bring in the client fixes, Spark 3.1 is using
>>>>>>> Kafka Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with
>>>>>>> Scala 2.12.13, but it was reverted later due to a Scala 2.12.13 issue.
>>>>>>> Since KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2
>>>>>>> will hopefully go with Kafka Client 2.8.
>>>>>>>
>>>>>>> # Some Features
>>>>>>>
>>>>>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with
>>>>>>> Apache Iceberg integration. Especially, we hope the ongoing function
>>>>>>> catalog SPIP and upcoming storage partitioned join SPIP can be
>>>>>>> delivered as a part of Spark 3.2 and become an additional foundation.
>>>>>>>
>>>>>>> - Columnar Encryption: As of today, Apache Spark master branch
>>>>>>> supports columnar encryption via Apache ORC 1.6 and it's documented via
>>>>>>> SPARK-34036. Also, upcoming Apache Parquet 1.12 has a similar
>>>>>>> capability. Hopefully, Apache Spark 3.2 is going to be the first
>>>>>>> release to have this feature officially. Any feedback is welcome.
>>>>>>>
>>>>>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>>>>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool
>>>>>>> support for all IO operations, 2) SPARK-33978 makes ORC datasource
>>>>>>> support ZSTD compression, 3) SPARK-34503 sets ZSTD as the default codec
>>>>>>> for event log compression, 4) SPARK-34479 aims to support ZSTD in the
>>>>>>> Avro data source. Also, the upcoming Parquet 1.12 supports ZSTD (and
>>>>>>> supports JNI buffer pool), too. I'm expecting more benefits.
>>>>>>>
>>>>>>> - Structured Streaming with RocksDB backend: According to the latest
>>>>>>> update, it looks active enough for merging to master branch in Spark
>>>>>>> 3.2.
>>>>>>>
>>>>>>> Please share your thoughts and let's build better Apache Spark 3.2
>>>>>>> together.
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
