Thank you for sharing your opinions, Jacky, Maxim, Holden, Jungtaek, Yi,
Tom, Gabor, Felix.

Following the discussion above, I'd also like to include both `New
Features` and `Improvements` together.

When I checked the item status as of today, it looked like the following.
In short, I explicitly removed K8s GA and DSv2 Stabilization from the
ON-TRACK list in light of the concerns raised. For those items, we can try
to build a consensus for Apache Spark 3.2 (June 2021) or later.

ON-TRACK
1. Support Scala 2.13 (SPARK-25075)
2. Use Apache Hadoop 3.2 by default for better cloud support (SPARK-32058)
3. Stage Level Scheduling (SPARK-27495)
4. Support more filter pushdown (CSV pushdown already shipped in 3.0 via
SPARK-30323)
    - Support filter pushdown to JSON (SPARK-30648 in 3.1)
    - Support filter pushdown to Avro (SPARK-XXX in 3.1)
    - Support pushdown of filters on nested attributes to JSON
5. Support JDBC Kerberos w/ keytab (SPARK-12312)
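To make item 4 concrete for readers less familiar with the term, here is a
toy, pure-Python sketch (not Spark code) of what data-source filter
pushdown means: the reader applies predicates while scanning, so rows that
fail a filter are never materialized for the query engine at all. All names
here are hypothetical and for illustration only.

```python
def scan_with_pushdown(rows, pushed_filters):
    """Yield only rows that satisfy every pushed-down predicate.

    In a real source (CSV/JSON/Avro), evaluating predicates inside the
    scan lets the reader skip decoding and returning filtered-out rows.
    """
    for row in rows:
        if all(pred(row) for pred in pushed_filters):
            yield row


# A stand-in "data source": a list of JSON-like records.
records = [
    {"id": 1, "country": "KR"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "KR"},
]

# A predicate the engine pushes down to the scan instead of applying it
# after a full read.
pushed = [lambda r: r["country"] == "KR"]

result = list(scan_with_pushdown(records, pushed))
print([r["id"] for r in result])  # → [1, 3]
```

The nested-attributes sub-item extends the same idea to predicates on
fields inside nested records (e.g. `r["address"]["city"]`), which the
reader must be able to resolve during the scan.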

NICE TO HAVE OR DEFERRED TO APACHE SPARK 3.2
1. Declaring Kubernetes Scheduler GA
    - Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release? (Holden)
    - I think pluggable storage in shuffle is essential for k8s GA (Felix)
    - Use remote storage for persisting shuffle data (SPARK-25299)
2. DSv2 Stabilization? (the following and more)
    - SPARK-31357 Catalog API for view metadata
    - SPARK-31694 Add SupportsPartitions Catalog APIs on DataSourceV2
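As a rough illustration of the "pluggable storage engines" idea raised for
the shuffle service (cf. SPARK-25299), the sketch below shows what such an
abstraction could look like: shuffle blocks are written through a backend
interface, so local disk could be swapped for remote storage without
changing the writer. None of these class or method names come from Spark
itself; they are hypothetical.

```python
from abc import ABC, abstractmethod


class ShuffleStorageBackend(ABC):
    """Hypothetical interface a pluggable shuffle storage engine would
    implement; the shuffle writer only talks to this interface."""

    @abstractmethod
    def write_block(self, block_id: str, data: bytes) -> None: ...

    @abstractmethod
    def read_block(self, block_id: str) -> bytes: ...


class InMemoryBackend(ShuffleStorageBackend):
    """Stand-in for local disk; a remote-storage backend would expose the
    same interface but persist blocks outside the executor, which is what
    makes executor loss survivable on K8s."""

    def __init__(self):
        self._blocks = {}

    def write_block(self, block_id: str, data: bytes) -> None:
        self._blocks[block_id] = data

    def read_block(self, block_id: str) -> bytes:
        return self._blocks[block_id]


backend: ShuffleStorageBackend = InMemoryBackend()
backend.write_block("shuffle_0_1_0", b"partition-data")
print(backend.read_block("shuffle_0_1_0"))  # → b'partition-data'
```

The point of the refactoring discussed above is exactly this separation:
once writers code against an interface rather than local disk, a remote
backend can be plugged in, which is why it is seen as a prerequisite for
K8s GA.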

As we know, we work willingly and voluntarily. If something lands on the
`master` branch before the feature freeze (November), it will be a part of
Apache Spark 3.1, of course.

Thanks,
Dongjoon.

On Sun, Jul 5, 2020 at 12:21 PM Felix Cheung <felixcheun...@hotmail.com>
wrote:

> I think pluggable storage in shuffle is essential for k8s GA
>
> ------------------------------
> *From:* Holden Karau <hol...@pigscanfly.ca>
> *Sent:* Monday, June 29, 2020 9:33 AM
> *To:* Maxim Gekk
> *Cc:* Dongjoon Hyun; dev
> *Subject:* Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)
>
> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
>
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk <maxim.g...@databricks.com>
> wrote:
>
>> Hi Dongjoon,
>>
>> I would add:
>> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
>> - Filters pushdown to other datasources like Avro
>> - Support nested attributes of filters pushed down to JSON
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> Hi, All.
>>>
>>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>>> community opinion on Apache Spark 3.1 feature expectations.
>>>
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> I'm expecting the following items:
>>>
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
>>>     In my perspective, the last main missing piece was Dynamic
>>> allocation and
>>>     - Dynamic allocation with shuffle tracking is already shipped at 3.0.
>>>     - Dynamic allocation with worker decommission/data migration is
>>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>>
>>> I'm aware of some more features which are on the way currently, but I
>>> love to hear the opinions from the main developers and more over the main
>>> users who need those features.
>>>
>>> Thank you in advance. Welcome for any comments.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>