Hello all,

We are Martin and Mats from Neo4j and we're working on the Spark Graph SPIP
(https://issues.apache.org/jira/browse/SPARK-25994).
We are also +1 for a Spark 3.0 preview release and setting a timeline for
the actual release.

The SPIP was accepted in the beginning of this year and we've merged the
initial modules and dependency declarations required for Spark Cypher (
https://github.com/apache/spark/pull/24490).

Our current state is that our main API PR has been open for three months (
https://github.com/apache/spark/pull/24851). The last interaction with the
SPIP shepherd was two months ago, and since responding to that review we
have seen no progress.

Since the SPIP was accepted as a Spark 3.0 feature, we would like to invite
more involvement from the Spark community, especially around PR review and
merging. The implementation work is essentially done, as can be seen in our
PoC PR (https://github.com/apache/spark/pull/24297). Contents from that PR
will be iteratively extracted and submitted as separate PRs, which will
require review and merging. However, this process is currently blocked by
the API PR.

There are a number of JIRA issues remaining to complete the work, some of
which we believe could be cut from scope if we need to reduce it to be
ready in time for the 3.0 release. The ones we believe are necessary to
complete are:

- https://issues.apache.org/jira/browse/SPARK-27303 (API PR as mentioned
above)
- https://issues.apache.org/jira/browse/SPARK-27306 (Python API)
- https://issues.apache.org/jira/browse/SPARK-27309 (Implementation)
- https://issues.apache.org/jira/browse/SPARK-27310 (Python adapter)
- https://issues.apache.org/jira/browse/SPARK-27311 (Documentation)

Looking forward to working with you all to deliver Spark Graph for 3.0!

Best regards
Mats, Martin
Neo4j

On Tue, Sep 17, 2019 at 8:35 PM Matt Cheah <mch...@palantir.com> wrote:

> I don’t know if it will be feasible to merge all of SPARK-25299 into Spark
> 3. There are a number of APIs that will be submitted for review, and I
> wouldn’t want to block the release on negotiating these changes, as the
> decisions we make for each API can be pretty involved.
>
>
>
> Our original plan was to mark every API included in SPARK-25299 as private
> until the entirety was merged, sometime between the release of Spark 3 and
> Spark 3.1. Once the entire API is merged into the codebase, we’d promote
> all of them to Experimental status and ship them in Spark 3.1.
>
>
>
> So, I’m -1 on blocking the Spark 3 preview release specifically on
> SPARK-25299.
>
>
>
> -Matt Cheah
>
>
>
> *From: *Xiao Li <lix...@databricks.com>
> *Date: *Tuesday, September 17, 2019 at 12:00 AM
> *To: *Erik Erlandson <eerla...@redhat.com>
> *Cc: *Sean Owen <sro...@apache.org>, dev <dev@spark.apache.org>
> *Subject: *Re: Thoughts on Spark 3 release, or a preview release
>
>
>
> https://issues.apache.org/jira/browse/SPARK-28264 (Revisiting Python /
> pandas UDF) sounds critical for 3.0 preview
>
>
>
> Xiao
>
>
>
> On Mon, Sep 16, 2019 at 12:22 PM Erik Erlandson <eerla...@redhat.com>
> wrote:
>
>
>
> I'm in favor of adding SPARK-25299 - Use remote storage for persisting
> shuffle data:
> https://issues.apache.org/jira/browse/SPARK-25299
>
>
>
> If that is far enough along to get onto the roadmap.
>
>
>
>
>
> On Wed, Sep 11, 2019 at 11:37 AM Sean Owen <sro...@apache.org> wrote:
>
> I'm curious what current feelings are about ramping down towards a
> Spark 3 release. It feels close to ready. There is no fixed date,
> though in the past we had informally tossed around "back end of 2019".
> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
> due.
>
> What are the few major items that must get done for Spark 3, in your
> opinion? Below are all of the open JIRAs for 3.0 (which everyone
> should feel free to update with things that aren't really needed for
> Spark 3; I already triaged some).
>
> For me, it's:
> - DSv2?
> - Finishing touches on the Hive, JDK 11 update
>
> What about considering a preview release earlier, as happened for
> Spark 2, to get feedback much earlier than the RC cycle? Could that
> even happen ... about now?
>
> I'm also wondering what a realistic estimate of Spark 3 release is. My
> guess is quite early 2020, from here.
>
>
>
> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog
> uses
> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
> SPARK-28588 Build a SQL reference doc
> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> SPARK-28684 Hive module support JDK 11
> SPARK-28548 explain() shows wrong result for persisted DataFrames
> after some operations
> SPARK-28372 Document Spark WEB UI
> SPARK-28476 Support ALTER DATABASE SET LOCATION
> SPARK-28264 Revisiting Python / pandas UDF
> SPARK-28301 fix the behavior of table name resolution with multi-catalog
> SPARK-28155 do not leak SaveMode to file source v2
> SPARK-28103 Cannot infer filters from union table with empty local
> relation table properly
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27936 Support local dependency uploading from --py-files
> SPARK-27884 Deprecate Python 2 support in Spark 3.0
> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
> SPARK-27780 Shuffle server & client should be versioned to enable
> smoother upgrade
> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
> of joined tables > 12
> SPARK-27471 Reorganize public v2 catalog API
> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
> SPARK-24625 put all the backward compatible behavior change configs
> under spark.sql.legacy.*
> SPARK-24640 size(null) returns null
> SPARK-24702 Unable to cast to calendar interval in spark sql.
> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
> SPARK-24941 Add RDDBarrier.coalesce() function
> SPARK-25017 Add test suite for ContextBarrierState
> SPARK-25083 remove the type erasure hack in data source scan
> SPARK-25383 Image data source supports sample pushdown
> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
> default
> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
> efficiency problem
> SPARK-25128 multiple simultaneous job submissions against k8s backend
> cause driver pods to hang
> SPARK-26731 remove EOLed spark jobs from jenkins
> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
> SPARK-21559 Remove Mesos fine-grained mode
> SPARK-24942 Improve cluster resource management with jobs containing
> barrier stage
> SPARK-25914 Separate projection from grouping and aggregate in logical
> Aggregate
> SPARK-26022 PySpark Comparison with Pandas
> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
> SPARK-26221 Improve Spark SQL instrumentation and metrics
> SPARK-26425 Add more constraint checks in file streaming source to
> avoid checkpoint corruption
> SPARK-25843 Redesign rangeBetween API
> SPARK-25841 Redesign window function rangeBetween API
> SPARK-25752 Add trait to easily whitelist logical operators that
> produce named output from CleanupAliases
> SPARK-23210 Introduce the concept of default value to schema
> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
> aggregate
> SPARK-25531 new write APIs for data source v2
> SPARK-25547 Pluggable jdbc connection factory
> SPARK-20845 Support specification of column names in INSERT INTO
> SPARK-24417 Build and Run Spark on JDK11
> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
> SPARK-25074 Implement maxNumConcurrentTasks() in
> MesosFineGrainedSchedulerBackend
> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-25186 Stabilize Data Source V2 API
> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
> execution mode
> SPARK-25390 data source V2 API refactoring
> SPARK-7768 Make user-defined type (UDT) API public
> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
> SPARK-15691 Refactor and improve Hive support
> SPARK-15694 Implement ScriptTransformation in sql/core
> SPARK-16217 Support SELECT INTO statement
> SPARK-16452 basic INFORMATION_SCHEMA support
> SPARK-18134 SQL: MapType in Group BY and Joins not working
> SPARK-18245 Improving support for bucketed table
> SPARK-19842 Informational Referential Integrity Constraints Support in
> Spark
> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
> list of structures
> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
> respect session timezone
> SPARK-22386 Data Source V2 improvements
> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
