Re: Thoughts on Spark 3 release, or a preview release

Thomas Graves Fri, 13 Sep 2019 09:58:48 -0700

+1, I think having a preview release would be great.

Tom


On Fri, Sep 13, 2019 at 4:55 AM Stavros Kontopoulos <
[email protected]> wrote:

> +1 as a contributor and as a user. Given the amount of testing required
> for all the cool new stuff like Java 11 support, major
> refactorings/deprecations, etc., a preview version would help the
> community a lot, making adoption smoother in the long term. I would also
> add Scala 2.13 support to the list of issues (
> https://issues.apache.org/jira/browse/SPARK-25075), assuming things
> move forward faster over the next few months.
>
> On Fri, Sep 13, 2019 at 11:08 AM Driesprong, Fokko <[email protected]>
> wrote:
>
>> Michael Heuer, that's an interesting issue.
>>
>> 1.8.2 to 1.9.0 is almost binary compatible (94%):
>> http://people.apache.org/~busbey/avro/1.9.0-RC4/1.8.2_to_1.9.0RC4_compat_report.html.
>> Most of the changes remove the Jackson and Netty APIs from Avro's public
>> API and deprecate the Joda library. I would strongly advise moving to
>> 1.9.1 since there are some regression issues; for Java, the most important is:
>> https://jira.apache.org/jira/browse/AVRO-2400
>>
>> I'd love to dive into the issue that you describe and I'm curious if the
>> issue is still there with Avro 1.9.1. I'm a bit busy at the moment but
>> might have some time this weekend to dive into it.
>>
>> Cheers, Fokko Driesprong
>>
>>
>> On Fri, Sep 13, 2019 at 2:32 AM Reynold Xin <[email protected]> wrote:
>>
>>> +1! Long overdue for a preview release.
>>>
>>>
>>> On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <[email protected]>
>>> wrote:
>>>
>>>> I like the idea from the PoV of giving folks something to start testing
>>>> against and exploring so they can raise issues with us earlier in the
>>>> process and we have more time to make calls around this.
>>>>
>>>> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <[email protected]> wrote:
>>>>
>>>> +1  Like the idea as a user and a DSv2 contributor.
>>>>
>>>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <[email protected]> wrote:
>>>>
>>>> +1 (as a contributor) from me for a preview release of Spark 3, as it
>>>> would help to test the features. When to cut the preview release is an
>>>> open question, as major work should ideally be done before that. If we
>>>> intend to introduce new features before the official release, that can
>>>> happen regardless of the preview, but if we intend to give people the
>>>> opportunity to test earlier, major work should ideally land first.
>>>>
>>>> As one of the contributors in the Structured Streaming area, I'd like to
>>>> add some items for Spark 3.0, both "must be done" and "better to have".
>>>> For "better to have", I picked some new-feature items that committers
>>>> reviewed for a couple of rounds and then dropped without a soft reject
>>>> (no valid reason to stop). For Spark 2.4 users, the only added feature
>>>> for Structured Streaming is the Kafka delegation token (assuming we count
>>>> the revised Kafka consumer pool as an improvement). I hope we provide some
>>>> gifts for Structured Streaming users in the Spark 3.0 envelope.
>>>>
>>>> > must be done
>>>> * SPARK-26154 Stream-stream joins - left outer join gives inconsistent
>>>> output
>>>> It's a correctness issue that multiple users have reported, first
>>>> reported in Nov. 2018. There's a way to reproduce it consistently, and a
>>>> patch to fix it was submitted in Jan. 2019.
>>>>
>>>> > better to have
>>>> * SPARK-23539 Add support for Kafka headers in Structured Streaming
>>>> * SPARK-26848 Introduce new option to Kafka source - specify timestamp
>>>> to start and end offset
>>>> * SPARK-20568 Delete files after processing in structured streaming
>>>>
>>>> There are some more new feature/improvement items in SS, but given
>>>> we're talking about ramping down, the above list might be a realistic one.
>>>>
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <[email protected]>
>>>> wrote:
>>>>
>>>> As a user/non committer, +1
>>>>
>>>> I love the idea of an early 3.0.0 so we can test current dev against
>>>> it. I know the final 3.x will probably need another round of testing when
>>>> it comes out, but less for sure... I know I could check out and compile, but
>>>> having a “packaged” pre-version is great if it does not take too much of
>>>> the team's time...
>>>>
>>>> jg
>>>>
>>>>
>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <[email protected]> wrote:
>>>>
>>>> +1 from me too but I would like to know what other people think too.
>>>>
>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun <[email protected]> wrote:
>>>>
>>>> Thank you, Sean.
>>>>
>>>> I'm also +1 for the following three.
>>>>
>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>> 3. Apache Spark 3.0.0 in early 2020
>>>>
>>>> The JDK 11 clean-up will meet the timeline, and `3.0.0-preview` helps
>>>> it a lot.
>>>>
>>>> After this discussion, can we have some timeline for `Spark 3.0 Release
>>>> Window` in our versioning-policy page?
>>>>
>>>> - https://spark.apache.org/versioning-policy.html
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer <[email protected]>
>>>> wrote:
>>>>
>>>> I would love to see Spark + Hadoop + Parquet + Avro compatibility
>>>> problems resolved, e.g.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>>
>>>> Note that Avro is now at 1.9.1, binary-incompatible with 1.8.x.  As far
>>>> as I know, Parquet has not cut a release based on this new version.
>>>>
>>>> Then out of curiosity, are the new Spark Graph APIs targeting 3.0?
>>>>
>>>> https://github.com/apache/spark/pull/24851
>>>> https://github.com/apache/spark/pull/24297
>>>>
>>>>    michael
>>>>
>>>>
>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <[email protected]> wrote:
>>>>
>>>> I'm curious what current feelings are about ramping down towards a
>>>> Spark 3 release. It feels close to ready. There is no fixed date,
>>>> though in the past we had informally tossed around "back end of 2019".
>>>> For reference, Spark 1 was May 2014, Spark 2 was July 2016. I'd expect
>>>> Spark 2 to last longer, so to speak, but feels like Spark 3 is coming
>>>> due.
>>>>
>>>> What are the few major items that must get done for Spark 3, in your
>>>> opinion? Below are all of the open JIRAs for 3.0 (which everyone
>>>> should feel free to update with things that aren't really needed for
>>>> Spark 3; I already triaged some).
>>>>
>>>> For me, it's:
>>>> - DSv2?
>>>> - Finishing touches on the Hive, JDK 11 update
>>>>
>>>> What about considering a preview release earlier, as happened for
>>>> Spark 2, to get feedback much earlier than the RC cycle? Could that
>>>> even happen ... about now?
>>>>
>>>> I'm also wondering what a realistic estimate of Spark 3 release is. My
>>>> guess is quite early 2020, from here.
>>>>
>>>>
>>>>
>>>> SPARK-29014 DataSourceV2: Clean up current, default, and session
>>>> catalog uses
>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
>>>> SPARK-28588 Build a SQL reference doc
>>>> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
>>>> SPARK-28684 Hive module support JDK 11
>>>> SPARK-28548 explain() shows wrong result for persisted DataFrames
>>>> after some operations
>>>> SPARK-28372 Document Spark WEB UI
>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> SPARK-28301 fix the behavior of table name resolution with multi-catalog
>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> SPARK-28103 Cannot infer filters from union table with empty local
>>>> relation table properly
>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> SPARK-27780 Shuffle server & client should be versioned to enable
>>>> smoother upgrade
>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm when the #
>>>> of joined tables > 12
>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> SPARK-27520 Introduce a global config system to replace
>>>> hadoopConfiguration
>>>> SPARK-24625 put all the backward compatible behavior change configs
>>>> under spark.sql.legacy.*
>>>> SPARK-24640 size(null) returns null
>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for more operators
>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> SPARK-25383 Image data source supports sample pushdown
>>>> SPARK-27272 Enable blacklisting of node/executor on fetch failures by
>>>> default
>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a major
>>>> efficiency problem
>>>> SPARK-25128 multiple simultaneous job submissions against k8s backend
>>>> cause driver pods to hang
>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> SPARK-26664 Make DecimalType's minimum adjusted scale configurable
>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> SPARK-24942 Improve cluster resource management with jobs containing
>>>> barrier stage
>>>> SPARK-25914 Separate projection from grouping and aggregate in logical
>>>> Aggregate
>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> SPARK-20964 Make some keywords reserved along with the ANSI/SQL standard
>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> SPARK-26425 Add more constraint checks in file streaming source to
>>>> avoid checkpoint corruption
>>>> SPARK-25843 Redesign rangeBetween API
>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> SPARK-25752 Add trait to easily whitelist logical operators that
>>>> produce named output from CleanupAliases
>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate and window
>>>> aggregate
>>>> SPARK-25531 new write APIs for data source v2
>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> SPARK-20845 Support specification of column names in INSERT INTO
>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> SPARK-24724 Discuss necessary info and access in barrier mode +
>>>> Kubernetes
>>>> SPARK-24725 Discuss necessary info and access in barrier mode + Mesos
>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> MesosFineGrainedSchedulerBackend
>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>>> execution mode
>>>> SPARK-25390 data source V2 API refactoring
>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>>> Spec
>>>> SPARK-15691 Refactor and improve Hive support
>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> SPARK-16217 Support SELECT INTO statement
>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> SPARK-18245 Improving support for bucketed table
>>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>>> Spark
>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>>> list of structures
>>>> SPARK-22632 Fix the behavior of timestamp values for R's DataFrame to
>>>> respect session timezone
>>>> SPARK-22386 Data Source V2 improvements
>>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Name : Jungtaek Lim
>>>> Blog : http://medium.com/@heartsavior
>>>> Twitter : http://twitter.com/heartsavior
>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>
>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>
>
>
