Thank you, Sean.

I'm also +1 for the following three.

1. Start to ramp down (by the official branch-3.0 cut)
2. Apache Spark 3.0.0-preview in 2019
3. Apache Spark 3.0.0 in early 2020

The JDK11 clean-up should meet this timeline, and `3.0.0-preview` will help
it a lot.
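
For reference, a minimal sketch of the kind of runtime sanity check I mean
(illustrative only, e.g. pasted into a spark-shell started on a JDK 11
build; the assertion message is mine):

    // Confirm the running JVM is actually 11.x before trusting test results.
    val javaVersion = System.getProperty("java.version")
    assert(javaVersion.startsWith("11"), s"expected JDK 11, got $javaVersion")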

After this discussion, can we add a timeline for the `Spark 3.0 Release
Window` to our versioning-policy page?

- https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.


On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <heue...@gmail.com> wrote:

> I would love to see Spark + Hadoop + Parquet + Avro compatibility problems
> resolved, e.g.
>
> https://issues.apache.org/jira/browse/SPARK-25588
> https://issues.apache.org/jira/browse/SPARK-27781
>
> Note that Avro is now at 1.9.1, which is binary-incompatible with 1.8.x. As
> far as I know, Parquet has not yet cut a release based on this new version.
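>
> For anyone blocked on this in the meantime, one possible downstream
> workaround (just a sketch, assuming an sbt build; the pinned version is an
> assumption, not a recommendation) is to force a single Avro version across
> the transitive dependencies:
>
>     // build.sbt: make Spark and Parquet agree on one Avro version.
>     // 1.8.2 is an assumed pin here; pick whatever your stack needs.
>     dependencyOverrides += "org.apache.avro" % "avro" % "1.8.2"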
>
> Also, out of curiosity: are the new Spark Graph APIs targeting 3.0?
>
> https://github.com/apache/spark/pull/24851
> https://github.com/apache/spark/pull/24297
>
>    michael
>
>
> On Sep 11, 2019, at 1:37 PM, Sean Owen <sro...@apache.org> wrote:
>
> I'm curious what current feelings are about ramping down towards a
> Spark 3 release. It feels close to ready. There is no fixed date,
> though in the past we had informally tossed around "back end of 2019".
> For reference, Spark 1 was May 2014 and Spark 2 was July 2016. I'd expect
> Spark 2 to last longer, so to speak, but it feels like Spark 3 is coming
> due.
>
> What are the few major items that must get done for Spark 3, in your
> opinion? Below are all of the open JIRAs targeted at 3.0 (which everyone
> should feel free to update, untargeting anything that isn't really needed
> for Spark 3; I already triaged some).
>
> For me, it's:
> - DSv2?
> - Finishing touches on the Hive and JDK 11 updates
>
> What about considering a preview release earlier, as happened for
> Spark 2, to get feedback much earlier than the RC cycle? Could that
> even happen ... about now?
>
> I'm also wondering what a realistic estimate of Spark 3 release is. My
> guess is quite early 2020, from here.
>
>
>
> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
> uses
> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog API
> SPARK-28588 Build a SQL reference doc
> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> SPARK-28684 Hive module support JDK 11
> SPARK-28548 explain() shows wrong result for persisted DataFrames
> after some operations
> SPARK-28372 Document Spark WEB UI
> SPARK-28476 Support ALTER DATABASE SET LOCATION
> SPARK-28264 Revisiting Python / pandas UDF
> SPARK-28301 fix the behavior of table name resolution with multi-catalog
> SPARK-28155 do not leak SaveMode to file source v2
> SPARK-28103 Cannot infer filters from union table with empty local
> relation table properly
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27936 Support local dependency uploading from --py-files
> SPARK-27884 Deprecate Python 2 support in Spark 3.0
> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
> SPARK-27780 Shuffle server & client should be versioned to enable
> smoother upgrade
> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
> of joined tables > 12
> SPARK-27471 Reorganize public v2 catalog API
> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
> SPARK-24625 put all the backward compatible behavior change configs
> under spark.sql.legacy.*
> SPARK-24640 size(null) returns null
> SPARK-24702 Unable to cast to calendar interval in spark sql.
> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
> SPARK-24941 Add RDDBarrier.coalesce() function
> SPARK-25017 Add test suite for ContextBarrierState
> SPARK-25083 remove the type erasure hack in data source scan
> SPARK-25383 Image data source supports sample pushdown
> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
> default
> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
> efficiency problem
> SPARK-25128 multiple simultaneous job submissions against k8s backend
> cause driver pods to hang
> SPARK-26731 remove EOLed spark jobs from jenkins
> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
> SPARK-21559 Remove Mesos fine-grained mode
> SPARK-24942 Improve cluster resource management with jobs containing
> barrier stage
> SPARK-25914 Separate projection from grouping and aggregate in logical
> Aggregate
> SPARK-26022 PySpark Comparison with Pandas
> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
> SPARK-26221 Improve Spark SQL instrumentation and metrics
> SPARK-26425 Add more constraint checks in file streaming source to
> avoid checkpoint corruption
> SPARK-25843 Redesign rangeBetween API
> SPARK-25841 Redesign window function rangeBetween API
> SPARK-25752 Add trait to easily whitelist logical operators that
> produce named output from CleanupAliases
> SPARK-23210 Introduce the concept of default value to schema
> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
> aggregate
> SPARK-25531 new write APIs for data source v2
> SPARK-25547 Pluggable jdbc connection factory
> SPARK-20845 Support specification of column names in INSERT INTO
> SPARK-24417 Build and Run Spark on JDK11
> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
> SPARK-25074 Implement maxNumConcurrentTasks() in
> MesosFineGrainedSchedulerBackend
> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-25186 Stabilize Data Source V2 API
> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
> execution mode
> SPARK-25390 data source V2 API refactoring
> SPARK-7768 Make user-defined type (UDT) API public
> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
> SPARK-15691 Refactor and improve Hive support
> SPARK-15694 Implement ScriptTransformation in sql/core
> SPARK-16217 Support SELECT INTO statement
> SPARK-16452 basic INFORMATION_SCHEMA support
> SPARK-18134 SQL: MapType in Group BY and Joins not working
> SPARK-18245 Improving support for bucketed table
> SPARK-19842 Informational Referential Integrity Constraints Support in
> Spark
> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
> list of structures
> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
> respect session timezone
> SPARK-22386 Data Source V2 improvements
> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>