Re: Thoughts on Spark 3 release, or a preview release

Michael Heuer Mon, 16 Sep 2019 07:57:57 -0700

Thank you, Fokko.

Probably best to discuss further off-list.  I'm almost embarrassed to describe 
our current workaround; it involves, among other things, a custom Shader 
implementation for the Maven Shade plugin.


   michael


> On Sep 13, 2019, at 3:07 AM, Driesprong, Fokko <[email protected]> wrote:
> 
> Michael Heuer, that's an interesting issue.
> 
> 1.8.2 to 1.9.0 is almost binary compatible (94%): 
> http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html
> Most of the changes remove the Jackson and Netty APIs from Avro's public 
> API and deprecate the Joda library. I would strongly advise moving to 1.9.1, 
> since 1.9.0 has some regression issues; for Java the most important is: 
> https://jira.apache.org/jira/browse/AVRO-2400
> 
> I'd love to dive into the issue that you describe and I'm curious if the 
> issue is still there with Avro 1.9.1. I'm a bit busy at the moment but might 
> have some time this weekend to dive into it.
> 
> Cheers, Fokko Driesprong
> 
> 
> On Fri, Sep 13, 2019 at 02:32, Reynold Xin <[email protected]> wrote:
> +1! Long overdue for a preview release.
> 
> 
> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <[email protected]> wrote:
> I like the idea from the PoV of giving folks something to start testing 
> against and exploring so they can raise issues with us earlier in the process 
> and we have more time to make calls around this.
> 
> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <[email protected]> wrote:
> +1  Like the idea as a user and a DSv2 contributor.
> 
> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <[email protected]> wrote:
> +1 (as a contributor) from me to have a preview release of Spark 3, as it 
> would help to test the features. When to cut the preview release is 
> debatable, as major work should ideally be done before that. If we intend to 
> introduce new features before the official release, that should work 
> regardless of this, but if we intend to give people an opportunity to test 
> earlier, the major work should ideally be in before the preview.
> 
> As one of the contributors in the structured streaming area, I'd like to add 
> some items for Spark 3.0, both "must be done" and "better to have". For 
> "better to have", I picked some new-feature items that committers reviewed 
> for a couple of rounds and then dropped without a soft reject (no valid 
> reason to stop). For Spark 2.4 users, the only feature added for structured 
> streaming is the Kafka delegation token (assuming we count the revised Kafka 
> consumer pool as an improvement). I hope we can provide some gifts for 
> structured streaming users in the Spark 3.0 envelope.
> 
> > must be done
> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent output
> It's a correctness issue reported by multiple users, first reported in Nov. 
> 2018. There's a way to reproduce it consistently, and a patch to fix it was 
> submitted in Jan. 2019.
> 
> > better to have
> * SPARK-23539 Add support for Kafka headers in Structured Streaming
> * SPARK-26848 Introduce new option to Kafka source - specify timestamp to 
> start and end offset
> * SPARK-20568 Delete files after processing in structured streaming
> 
> There are some more new feature/improvement items in SS, but given that 
> we're talking about ramping down, the above list might be a realistic one.
> 
> 
> 
> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <[email protected]> wrote:
> As a user/non committer, +1
> 
> I love the idea of an early 3.0.0 so we can test current development against 
> it. I know the final 3.x will probably need another round of testing when it 
> comes out, but less, for sure. I know I could check out and compile, but 
> having a “packaged” pre-release is great if it does not take too much of the 
> team's time.
> 
> jg
> 
> 
> On Sep 11, 2019, at 20:40, Hyukjin Kwon <[email protected]> wrote:
> 
>> +1 from me too but I would like to know what other people think too.
>> 
>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <[email protected]> wrote:
>> Thank you, Sean.
>> 
>> I'm also +1 for the following three.
>> 
>> 1. Start to ramp down (by the official branch-3.0 cut)
>> 2. Apache Spark 3.0.0-preview in 2019
>> 3. Apache Spark 3.0.0 in early 2020
>> 
>> The JDK11 clean-up will meet the timeline, and `3.0.0-preview` helps it a 
>> lot.
>> 
>> After this discussion, can we have some timeline for `Spark 3.0 Release 
>> Window` in our versioning-policy page?
>> 
>> - https://spark.apache.org/versioning-policy.html
>> 
>> Bests,
>> Dongjoon.
>> 
>> 
>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <[email protected]> wrote:
>> I would love to see Spark + Hadoop + Parquet + Avro compatibility problems 
>> resolved, e.g.
>> 
>> https://issues.apache.org/jira/browse/SPARK-25588
>> https://issues.apache.org/jira/browse/SPARK-27781
>> 
>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far as I 
>> know, Parquet has not cut a release based on this new version.
>> 
>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>> 
>> https://github.com/apache/spark/pull/24851
>> https://github.com/apache/spark/pull/24297
>> 
>>    michael
>> 
>> 
>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <[email protected]> wrote:
>>> 
>>> I'm curious what current feelings are about ramping down towards a
>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>> though in the past we had informally tossed around "back end of 2019".
>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>> due.
>>> 
>>> What are the few major items that must get done for Spark 3, in your
>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>> should feel free to update with things that aren't really needed for
>>> Spark 3; I already triaged some).
>>> 
>>> For me, it's:
>>> - DSv2?
>>> - Finishing touches on the Hive, JDK 11 update
>>> 
>>> What about considering a preview release earlier, as happened for
>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>> even happen ... about now?
>>> 
>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>> guess is quite early 2020, from here.
>>> 
>>> 
>>> 
>>> SPARK-29014 DataSourceV2: Clean up current, default, and session catalog 
>>> uses
>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>> SPARK-28588 Build a SQL reference doc
>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>> SPARK-28684 Hive module support JDK 11
>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>> after some operations
>>> SPARK-28372 Document Spark WEB UI
>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>> SPARK-28264 Revisiting Python / pandas UDF
>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>> SPARK-28155 do not leak SaveMode to file source v2
>>> SPARK-28103 Cannot infer filters from union table with empty local
>>> relation table properly
>>> SPARK-28024 Incorrect numeric values when out of range
>>> SPARK-27936 Support local dependency uploading from --py-files
>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>> smoother upgrade
>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>> of joined tables > 12
>>> SPARK-27471 Reorganize public v2 catalog API
>>> SPARK-27520 Introduce a global config system to replace hadoopConfiguration
>>> SPARK-24625 put all the backward compatible behavior change configs
>>> under spark.sql.legacy.*
>>> SPARK-24640 size(null) returns null
>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>> SPARK-25017 Add test suite for ContextBarrierState
>>> SPARK-25083 remove the type erasure hack in data source scan
>>> SPARK-25383 Image data source supports sample pushdown
>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by 
>>> default
>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>> efficiency problem
>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>> cause driver pods to hang
>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>> SPARK-21559 Remove Mesos fine-grained mode
>>> SPARK-24942 Improve cluster resource management with jobs containing
>>> barrier stage
>>> SPARK-25914 Separate projection from grouping and aggregate in logical 
>>> Aggregate
>>> SPARK-26022 PySpark Comparison with Pandas
>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>> SPARK-26425 Add more constraint checks in file streaming source to
>>> avoid checkpoint corruption
>>> SPARK-25843 Redesign rangeBetween API
>>> SPARK-25841 Redesign window function rangeBetween API
>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>> produce named output from CleanupAliases
>>> SPARK-23210 Introduce the concept of default value to schema
>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window 
>>> aggregate
>>> SPARK-25531 new write APIs for data source v2
>>> SPARK-25547 Pluggable jdbc connection factory
>>> SPARK-20845 Support specification of column names in INSERT INTO
>>> SPARK-24417 Build and Run Spark on JDK11
>>> SPARK-24724 Discuss necessary info and access in barrier mode + Kubernetes
>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>> MesosFineGrainedSchedulerBackend
>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> SPARK-25186 Stabilize Data Source V2 API
>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>> execution mode
>>> SPARK-25390 data source V2 API refactoring
>>> SPARK-7768 Make user-defined type (UDT) API public
>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition Spec
>>> SPARK-15691 Refactor and improve Hive support
>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> SPARK-16217 Support SELECT INTO statement
>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> SPARK-18245 Improving support for bucketed table
>>> SPARK-19842 Informational Referential Integrity Constraints Support in Spark
>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>> list of structures
>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>> respect session timezone
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>> 
>>> 
>> 
> 
> 
> -- 
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
> 
> -- 
> John Zhuge
> 
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
