Hi Spark Aficionados-

On Fri, Sep 13, 2019 at 15:08 Ryan Blue <rb...@netflix.com.invalid> wrote:
> +1 for a preview release.
>
> DSv2 is quite close to being ready. I can only think of a couple of
> issues that we need to merge, like getting a fix for stats estimation
> done. I'll have a better idea once I've caught up from being away for
> ApacheCon, and I'll add this to the agenda for our next DSv2 sync on
> Wednesday.

What does 3.0 mean for the DSv2 API? Does the API freeze at that point,
or would it still be allowed to change? I'm writing a DSv2 plug-in
(GitHub.com/spark-root/laurelin), and there are a couple of little API
things I think could be useful; I just haven't had time to write them up
here or open a JIRA about them.

Thanks,
Andrew
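For context, a DSv2 plug-in such as laurelin is wired in through the
ordinary DataFrame reader, so the API surface in question sits behind a
call like the minimal sketch below; the short name "root", the "tree"
option, and the file path are assumptions for illustration only.

    import org.apache.spark.sql.SparkSession

    object ReadViaDsv2Plugin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dsv2-plugin-demo").getOrCreate()

        // A DSv2 plug-in on the classpath is selected by its short name;
        // "root" and the "tree" option are assumed names, for illustration.
        val df = spark.read
          .format("root")
          .option("tree", "Events")
          .load("file:/data/sample.root")

        df.printSchema()
        spark.stop()
      }
    }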
> On Fri, Sep 13, 2019 at 12:26 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> Ur, Sean.
>>
>> I prefer a full release like 2.0.0-preview.
>>
>> https://archive.apache.org/dist/spark/spark-2.0.0-preview/
>>
>> And, thank you, Xingbo!
>> Could you take a look at website generation? It seems to be broken on
>> `master`.
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Sep 13, 2019 at 11:30 AM Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I would like to volunteer to be the release manager of the Spark 3
>>> preview, thanks!
>>>
>>> On Fri, Sep 13, 2019 at 11:21 AM, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Well, great to hear the unanimous support for a Spark 3 preview
>>>> release. Now, I don't know how to make releases myself :) so I would
>>>> first open it up to our revered release managers: would anyone be
>>>> interested in trying to make one? It sounds like it's not too soon to
>>>> get what's in master out for evaluation, as there aren't any major
>>>> deficiencies left, although there are a number of items to consider
>>>> for the final release.
>>>>
>>>> I think we just need one release, targeting Hadoop 3.x / Hive 2.x in
>>>> order to make it possible to test with JDK 11. (We're only on Scala
>>>> 2.12 at this point.)
>>>>
>>>> On Thu, Sep 12, 2019 at 7:32 PM Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>> >
>>>> > +1! Long overdue for a preview release.
>>>> >
>>>> > On Thu, Sep 12, 2019 at 5:26 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> > wrote:
>>>> >>
>>>> >> I like the idea from the PoV of giving folks something to start
>>>> >> testing against and exploring, so they can raise issues with us
>>>> >> earlier in the process and we have more time to make calls around
>>>> >> this.
>>>> >>
>>>> >> On Thu, Sep 12, 2019 at 4:15 PM John Zhuge <jzh...@apache.org>
>>>> >> wrote:
>>>> >>>
>>>> >>> +1 Like the idea as a user and a DSv2 contributor.
>>>> >>>
>>>> >>> On Thu, Sep 12, 2019 at 4:10 PM Jungtaek Lim <kabh...@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> +1 (as a contributor) from me to have a preview release of
>>>> >>>> Spark 3, as it would help with testing the features. When to cut
>>>> >>>> the preview release is questionable, as major work should
>>>> >>>> ideally be done before that - if we intend to introduce new
>>>> >>>> features before the official release, that should work
>>>> >>>> regardless of this, but if we intend to have the opportunity to
>>>> >>>> test earlier, ideally it should.
>>>> >>>>
>>>> >>>> As one of the contributors in the structured streaming area, I'd
>>>> >>>> like to add some items for Spark 3.0, both "must be done" and
>>>> >>>> "better to have". For "better to have", I picked some
>>>> >>>> new-feature items which committers reviewed for a couple of
>>>> >>>> rounds and which were then dropped without a soft reject (no
>>>> >>>> valid reason to stop). For Spark 2.4 users, the only feature
>>>> >>>> added for structured streaming is Kafka delegation tokens (given
>>>> >>>> we count revising the Kafka consumer pool as an improvement). I
>>>> >>>> hope we provide some gifts for structured streaming users in the
>>>> >>>> Spark 3.0 envelope.
>>>> >>>>
>>>> >>>> > must be done
>>>> >>>> * SPARK-26154 Stream-stream joins - left outer join gives
>>>> >>>> inconsistent output
>>>> >>>> It's a correctness issue reported by multiple users, first
>>>> >>>> reported in Nov. 2018. There's a way to reproduce it
>>>> >>>> consistently, and we have a patch, submitted in Jan. 2019, to
>>>> >>>> fix it.
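For reference, SPARK-26154 concerns the left outer flavor of
stream-stream joins. Below is a minimal, self-contained sketch of the
affected pattern, using rate sources and illustrative column names; the
watermark-plus-time-range join condition follows the structured streaming
programming guide.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    object LeftOuterStreamStreamJoin {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ss-left-outer").getOrCreate()

        // Two synthetic streams standing in for impressions and clicks.
        val impressions = spark.readStream.format("rate")
          .option("rowsPerSecond", "5").load()
          .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
          .withWatermark("impressionTime", "10 seconds")

        val clicks = spark.readStream.format("rate")
          .option("rowsPerSecond", "5").load()
          .selectExpr("value AS clickAdId", "timestamp AS clickTime")
          .withWatermark("clickTime", "10 seconds")

        // Left outer stream-stream joins need watermarks on both sides plus
        // an event-time range condition, so unmatched left rows can
        // eventually be emitted null-padded. SPARK-26154 is about those
        // null-padded rows coming out inconsistently.
        val joined = impressions.join(
          clicks,
          expr("""
            clickAdId = impressionAdId AND
            clickTime >= impressionTime AND
            clickTime <= impressionTime + interval 20 seconds"""),
          "leftOuter")

        val query = joined.writeStream.format("console").start()
        query.awaitTermination()
      }
    }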
>>>> >>>>
>>>> >>>> > better to have
>>>> >>>> * SPARK-23539 Add support for Kafka headers in Structured
>>>> >>>> Streaming
>>>> >>>> * SPARK-26848 Introduce new option to Kafka source - specify
>>>> >>>> timestamp to start and end offset
>>>> >>>> * SPARK-20568 Delete files after processing in structured
>>>> >>>> streaming
>>>> >>>>
>>>> >>>> There are some more new-feature/improvement items in SS, but
>>>> >>>> given we're talking about ramping down, the above list might be
>>>> >>>> the realistic one.
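On SPARK-26848, the idea is to let the Kafka source resolve starting
offsets from record timestamps instead of explicit offsets. A sketch of
how that could look; the option name `startingOffsetsByTimestamp` is
taken from the patch under review and may change before release (the
spark-sql-kafka-0-10 artifact is assumed on the classpath). Per the JIRA
title, an ending-offset counterpart for batch queries is also in scope.

    import org.apache.spark.sql.SparkSession

    object KafkaOffsetsByTimestamp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-by-timestamp").getOrCreate()

        // Start each partition of "events" from the first record whose
        // timestamp is at or after the given epoch millis. Option name per
        // the SPARK-26848 patch under review; subject to change.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .option("startingOffsetsByTimestamp",
            """{"events": {"0": 1546300800000, "1": 1546300800000}}""")
          .load()

        val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream.format("console").start()
        query.awaitTermination()
      }
    }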
>>>> >>>>
>>>> >>>> On Thu, Sep 12, 2019 at 9:53 AM Jean Georges Perrin <j...@jgp.net>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> As a user/non-committer, +1
>>>> >>>>>
>>>> >>>>> I love the idea of an early 3.0.0 so we can test current dev
>>>> >>>>> against it. I know the final 3.x will probably need another
>>>> >>>>> round of testing when it gets out, but less, for sure... I know
>>>> >>>>> I could check out and compile, but having a "packaged" preview
>>>> >>>>> version is great if it does not take too much time from the
>>>> >>>>> team...
>>>> >>>>>
>>>> >>>>> jg
>>>> >>>>>
>>>> >>>>> On Sep 11, 2019, at 20:40, Hyukjin Kwon <gurwls...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>
>>>> >>>>> +1 from me too, but I would like to know what other people
>>>> >>>>> think too.
>>>> >>>>>
>>>> >>>>> On Thu, Sep 12, 2019 at 9:07 AM, Dongjoon Hyun
>>>> >>>>> <dongjoon.h...@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>> Thank you, Sean.
>>>> >>>>>>
>>>> >>>>>> I'm also +1 for the following three:
>>>> >>>>>>
>>>> >>>>>> 1. Start to ramp down (by the official branch-3.0 cut)
>>>> >>>>>> 2. Apache Spark 3.0.0-preview in 2019
>>>> >>>>>> 3. Apache Spark 3.0.0 in early 2020
>>>> >>>>>>
>>>> >>>>>> For the JDK 11 clean-up, it will meet the timeline, and
>>>> >>>>>> `3.0.0-preview` helps it a lot.
>>>> >>>>>>
>>>> >>>>>> After this discussion, can we have some timeline for the
>>>> >>>>>> `Spark 3.0 Release Window` in our versioning-policy page?
>>>> >>>>>>
>>>> >>>>>> - https://spark.apache.org/versioning-policy.html
>>>> >>>>>>
>>>> >>>>>> Bests,
>>>> >>>>>> Dongjoon.
>>>> >>>>>>
>>>> >>>>>> On Wed, Sep 11, 2019 at 11:54 AM Michael Heuer
>>>> >>>>>> <heue...@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I would love to see Spark + Hadoop + Parquet + Avro
>>>> >>>>>>> compatibility problems resolved, e.g.
>>>> >>>>>>>
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-25588
>>>> >>>>>>> https://issues.apache.org/jira/browse/SPARK-27781
>>>> >>>>>>>
>>>> >>>>>>> Note that Avro is now at 1.9.1, binary-incompatible with
>>>> >>>>>>> 1.8.x. As far as I know, Parquet has not cut a release based
>>>> >>>>>>> on this new version.
>>>> >>>>>>>
>>>> >>>>>>> Then, out of curiosity, are the new Spark Graph APIs
>>>> >>>>>>> targeting 3.0?
>>>> >>>>>>>
>>>> >>>>>>> https://github.com/apache/spark/pull/24851
>>>> >>>>>>> https://github.com/apache/spark/pull/24297
>>>> >>>>>>>
>>>> >>>>>>> michael
>>>> >>>>>>>
>>>> >>>>>>> On Sep 11, 2019, at 1:37 PM, Sean Owen <sro...@apache.org>
>>>> >>>>>>> wrote:
>>>> >>>>>>>
>>>> >>>>>>> I'm curious what current feelings are about ramping down
>>>> >>>>>>> towards a Spark 3 release. It feels close to ready. There is
>>>> >>>>>>> no fixed date, though in the past we had informally tossed
>>>> >>>>>>> around "back end of 2019". For reference, Spark 1 was May
>>>> >>>>>>> 2014 and Spark 2 was July 2016. I'd expect Spark 2 to last
>>>> >>>>>>> longer, so to speak, but it feels like Spark 3 is coming due.
>>>> >>>>>>>
>>>> >>>>>>> What are the few major items that must get done for Spark 3,
>>>> >>>>>>> in your opinion? Below are all of the open JIRAs for 3.0
>>>> >>>>>>> (which everyone should feel free to update with things that
>>>> >>>>>>> aren't really needed for Spark 3; I already triaged some).
>>>> >>>>>>>
>>>> >>>>>>> For me, it's:
>>>> >>>>>>> - DSv2?
>>>> >>>>>>> - Finishing touches on the Hive and JDK 11 updates
>>>> >>>>>>>
>>>> >>>>>>> What about considering a preview release earlier, as happened
>>>> >>>>>>> for Spark 2, to get feedback much earlier than the RC cycle?
>>>> >>>>>>> Could that even happen ... about now?
>>>> >>>>>>>
>>>> >>>>>>> I'm also wondering what a realistic estimate of the Spark 3
>>>> >>>>>>> release is. My guess is quite early 2020, from here.
>>>> >>>>>>>
>>>> >>>>>>> SPARK-29014 DataSourceV2: Clean up current, default, and
>>>> >>>>>>> session catalog uses
>>>> >>>>>>> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
>>>> >>>>>>> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
>>>> >>>>>>> SPARK-28717 Update SQL ALTER TABLE RENAME to use TableCatalog
>>>> >>>>>>> API
>>>> >>>>>>> SPARK-28588 Build a SQL reference doc
>>>> >>>>>>> SPARK-28629 Capture the missing rules in
>>>> >>>>>>> HiveSessionStateBuilder
>>>> >>>>>>> SPARK-28684 Hive module support JDK 11
>>>> >>>>>>> SPARK-28548 explain() shows wrong result for persisted
>>>> >>>>>>> DataFrames after some operations
>>>> >>>>>>> SPARK-28372 Document Spark WEB UI
>>>> >>>>>>> SPARK-28476 Support ALTER DATABASE SET LOCATION
>>>> >>>>>>> SPARK-28264 Revisiting Python / pandas UDF
>>>> >>>>>>> SPARK-28301 fix the behavior of table name resolution with
>>>> >>>>>>> multi-catalog
>>>> >>>>>>> SPARK-28155 do not leak SaveMode to file source v2
>>>> >>>>>>> SPARK-28103 Cannot infer filters from union table with empty
>>>> >>>>>>> local relation table properly
>>>> >>>>>>> SPARK-28024 Incorrect numeric values when out of range
>>>> >>>>>>> SPARK-27936 Support local dependency uploading from --py-files
>>>> >>>>>>> SPARK-27884 Deprecate Python 2 support in Spark 3.0
>>>> >>>>>>> SPARK-27763 Port test cases from PostgreSQL to Spark SQL
>>>> >>>>>>> SPARK-27780 Shuffle server & client should be versioned to
>>>> >>>>>>> enable smoother upgrade
>>>> >>>>>>> SPARK-27714 Support Join Reorder based on Genetic Algorithm
>>>> >>>>>>> when the # of joined tables > 12
>>>> >>>>>>> SPARK-27471 Reorganize public v2 catalog API
>>>> >>>>>>> SPARK-27520 Introduce a global config system to replace
>>>> >>>>>>> hadoopConfiguration
>>>> >>>>>>> SPARK-24625 put all the backward compatible behavior change
>>>> >>>>>>> configs under spark.sql.legacy.*
>>>> >>>>>>> SPARK-24640 size(null) returns null
>>>> >>>>>>> SPARK-24702 Unable to cast to calendar interval in spark sql.
>>>> >>>>>>> SPARK-24838 Support uncorrelated IN/EXISTS subqueries for
>>>> >>>>>>> more operators
>>>> >>>>>>> SPARK-24941 Add RDDBarrier.coalesce() function
>>>> >>>>>>> SPARK-25017 Add test suite for ContextBarrierState
>>>> >>>>>>> SPARK-25083 remove the type erasure hack in data source scan
>>>> >>>>>>> SPARK-25383 Image data source supports sample pushdown
>>>> >>>>>>> SPARK-27272 Enable blacklisting of node/executor on fetch
>>>> >>>>>>> failures by default
>>>> >>>>>>> SPARK-27296 User Defined Aggregating Functions (UDAFs) have a
>>>> >>>>>>> major efficiency problem
>>>> >>>>>>> SPARK-25128 multiple simultaneous job submissions against k8s
>>>> >>>>>>> backend cause driver pods to hang
>>>> >>>>>>> SPARK-26731 remove EOLed spark jobs from jenkins
>>>> >>>>>>> SPARK-26664 Make DecimalType's minimum adjusted scale
>>>> >>>>>>> configurable
>>>> >>>>>>> SPARK-21559 Remove Mesos fine-grained mode
>>>> >>>>>>> SPARK-24942 Improve cluster resource management with jobs
>>>> >>>>>>> containing barrier stage
>>>> >>>>>>> SPARK-25914 Separate projection from grouping and aggregate
>>>> >>>>>>> in logical Aggregate
>>>> >>>>>>> SPARK-26022 PySpark Comparison with Pandas
>>>> >>>>>>> SPARK-20964 Make some keywords reserved along with the
>>>> >>>>>>> ANSI/SQL standard
>>>> >>>>>>> SPARK-26221 Improve Spark SQL instrumentation and metrics
>>>> >>>>>>> SPARK-26425 Add more constraint checks in file streaming
>>>> >>>>>>> source to avoid checkpoint corruption
>>>> >>>>>>> SPARK-25843 Redesign rangeBetween API
>>>> >>>>>>> SPARK-25841 Redesign window function rangeBetween API
>>>> >>>>>>> SPARK-25752 Add trait to easily whitelist logical operators
>>>> >>>>>>> that produce named output from CleanupAliases
>>>> >>>>>>> SPARK-23210 Introduce the concept of default value to schema
>>>> >>>>>>> SPARK-25640 Clarify/Improve EvalType for grouped aggregate
>>>> >>>>>>> and window aggregate
>>>> >>>>>>> SPARK-25531 new write APIs for data source v2
>>>> >>>>>>> SPARK-25547 Pluggable jdbc connection factory
>>>> >>>>>>> SPARK-20845 Support specification of column names in INSERT
>>>> >>>>>>> INTO
>>>> >>>>>>> SPARK-24417 Build and Run Spark on JDK11
>>>> >>>>>>> SPARK-24724 Discuss necessary info and access in barrier mode
>>>> >>>>>>> + Kubernetes
>>>> >>>>>>> SPARK-24725 Discuss necessary info and access in barrier mode
>>>> >>>>>>> + Mesos
>>>> >>>>>>> SPARK-25074 Implement maxNumConcurrentTasks() in
>>>> >>>>>>> MesosFineGrainedSchedulerBackend
>>>> >>>>>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>> >>>>>>> SPARK-25186 Stabilize Data Source V2 API
>>>> >>>>>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for
>>>> >>>>>>> barrier execution mode
>>>> >>>>>>> SPARK-25390 data source V2 API refactoring
>>>> >>>>>>> SPARK-7768 Make user-defined type (UDT) API public
>>>> >>>>>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based
>>>> >>>>>>> Partition Spec
>>>> >>>>>>> SPARK-15691 Refactor and improve Hive support
>>>> >>>>>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>>> >>>>>>> SPARK-16217 Support SELECT INTO statement
>>>> >>>>>>> SPARK-16452 basic INFORMATION_SCHEMA support
>>>> >>>>>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>>> >>>>>>> SPARK-18245 Improving support for bucketed table
>>>> >>>>>>> SPARK-19842 Informational Referential Integrity Constraints
>>>> >>>>>>> Support in Spark
>>>> >>>>>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in
>>>> >>>>>>> nested list of structures
>>>> >>>>>>> SPARK-22632 Fix the behavior of timestamp values for R's
>>>> >>>>>>> DataFrame to respect session timezone
>>>> >>>>>>> SPARK-22386 Data Source V2 improvements
>>>> >>>>>>> SPARK-24723 Discuss necessary info and access in barrier mode
>>>> >>>>>>> + YARN
>>>> >>>>>>>
>>>> >>>>>>> ---------------------------------------------------------------------
>>>> >>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>> >>>>
>>>> >>>> --
>>>> >>>> Name : Jungtaek Lim
>>>> >>>> Blog : http://medium.com/@heartsavior
>>>> >>>> Twitter : http://twitter.com/heartsavior
>>>> >>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>> >>>
>>>> >>> --
>>>> >>> John Zhuge
>>>> >>
>>>> >> --
>>>> >> Twitter: https://twitter.com/holdenkarau
>>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>>> >> https://amzn.to/2MaRAG9
>>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

> --
> Ryan Blue
> Software Engineer
> Netflix

--
It's dark in this basement.