+1

2015-12-22 12:43 GMT-08:00 Reynold Xin <r...@databricks.com>:
> +1
>
>
> On Tue, Dec 22, 2015 at 12:29 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I'll kick the voting off with a +1.
>>
>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 1.6.0!
>>>
>>> The vote is open until Friday, December 25, 2015 at 18:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v1.6.0-rc4
>>> (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
>>> <https://github.com/apache/spark/tree/v1.6.0-rc4>
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1176/
>>>
>>> The test repository (versioned as v1.6.0-rc4) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1175/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/
>>>
>>> =======================================
>>> == How can I help test this release? ==
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running it on this release candidate, then
>>> reporting any regressions. (A minimal smoke-test sketch follows the
>>> JIRA notes below.)
>>>
>>> ================================================
>>> == What justifies a -1 vote for this release? ==
>>> ================================================
>>> This vote is happening towards the end of the 1.6 QA period, so -1
>>> votes should only occur for significant regressions from 1.5. Bugs
>>> already present in 1.5, minor regressions, or bugs related to new
>>> features will not block this release.
>>>
>>> ===============================================================
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentation will be published separately from the
>>> release.
>>> 2. New features for non-alpha modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
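>>>
>>> For reference, a minimal smoke-test sketch to run in the spark-shell
>>> shipped with this RC; the numbers and column name are made up for
>>> illustration:
>>>
>>>     // Exercises shuffle, aggregation, and the DataFrame API end to
>>>     // end. sc and sqlContext are predefined by the shell, and the
>>>     // shell auto-imports sqlContext.implicits._ (needed for toDF).
>>>     val df = sc.parallelize(1 to 1000000).toDF("n")
>>>     val counts = df.groupBy(df("n") % 10).count()
>>>     counts.show()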
>>>
>>> ==================================================
>>> == Major changes to help you focus your testing ==
>>> ==================================================
>>>
>>> Notable changes since 1.6 RC3
>>>
>>>    - SPARK-12404 - Fix serialization error for Datasets with
>>>    Timestamps/Arrays/Decimal
>>>    - SPARK-12218 - Fix incorrect pushdown of filters to Parquet
>>>    - SPARK-12395 - Fix join columns of outer join for DataFrame using
>>>    - SPARK-12413 - Fix Mesos HA
>>>
>>> Notable changes since 1.6 RC2
>>>
>>>    - SPARK_VERSION has been set correctly
>>>    - SPARK-12199 ML Docs are publishing correctly
>>>    - SPARK-12345 Mesos cluster mode has been fixed
>>>
>>> Notable changes since 1.6 RC1
>>>
>>> Spark Streaming
>>>
>>>    - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>    trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>    - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
>>>    SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix
>>>    bugs in eviction of storage memory by execution.
>>>    - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258>
>>>    Correct passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>>
>>> Spark SQL
>>>
>>>    - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
>>>    Parquet Performance - Improve Parquet scan performance when using
>>>    flat schemas.
>>>    - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>    on shared clusters.
>>>    - SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
>>>    Dataset API - A type-safe API (similar to RDDs) that performs many
>>>    operations on serialized binary data and code generation (i.e.
>>>    Project Tungsten). (A short sketch follows this list.)
>>>    - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
>>>    Unified Memory Management - Shared memory for execution and caching
>>>    instead of exclusive division of the regions.
>>>    - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
>>>    SQL Queries on Files - Concise syntax for running SQL queries over
>>>    files of any supported format without registering a table.
>>>    - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
>>>    Reading non-standard JSON files - Added options to read non-standard
>>>    JSON files (e.g. single quotes, unquoted attributes).
>>>    - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
>>>    Per-operator Metrics for SQL Execution - Display statistics on a
>>>    per-operator basis for memory usage and spilled data size.
>>>    - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
>>>    Star (*) expansion for StructTypes - Makes it easier to nest and
>>>    unnest arbitrary numbers of columns.
>>>    - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
>>>    SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
>>>    In-memory Columnar Cache Performance - Significant (up to 14x)
>>>    speedup when caching data that contains complex types in DataFrames
>>>    or SQL.
>>>    - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
>>>    Fast null-safe joins - Joins using null-safe equality (<=>) will now
>>>    execute using SortMergeJoin instead of computing a cartesian
>>>    product.
>>>    - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
>>>    SQL Execution Using Off-Heap Memory - Support for configuring query
>>>    execution to occur using off-heap memory to avoid GC overhead.
>>>    - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
>>>    Datasource API Avoid Double Filter - When implementing a datasource
>>>    with filter pushdown, developers can now tell Spark SQL to avoid
>>>    double-evaluating a pushed-down filter.
>>>    - SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
>>>    Advanced Layout of Cached Data - Storing partitioning and ordering
>>>    schemes in the in-memory table scan, and adding distributeBy and
>>>    localSort to the DataFrame API.
>>>    - SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
>>>    Adaptive query execution - Initial support for automatically
>>>    selecting the number of reducers for joins and aggregations.
>>>    - SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
>>>    Improved query planner for queries having distinct aggregations -
>>>    Query plans of distinct aggregations are more robust when distinct
>>>    columns have high cardinality.
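>>> A minimal sketch of the new Dataset API (SPARK-9999); the Person class
>>> and people.json path are hypothetical:
>>>
>>>     // In the spark-shell, sqlContext.implicits._ (which supplies the
>>>     // encoders used here) is imported automatically.
>>>     case class Person(name: String, age: Long)
>>>     val ds = sqlContext.read.json("people.json").as[Person]
>>>     // Typed transformations, checked at compile time, running over
>>>     // serialized binary data
>>>     ds.filter(_.age > 21).map(_.name).collect()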
>>>
>>> Spark Streaming
>>>
>>>    - API Updates
>>>       - SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
>>>       New improved state management - mapWithState - a DStream
>>>       transformation for stateful stream processing; supersedes
>>>       updateStateByKey in functionality and performance. (A short
>>>       sketch follows this list.)
>>>       - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198>
>>>       Kinesis record deaggregation - Kinesis streams have been upgraded
>>>       to use KCL 1.4.0 and support transparent deaggregation of
>>>       KPL-aggregated records.
>>>       - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891>
>>>       Kinesis message handler function - Allows an arbitrary function
>>>       to be applied to a Kinesis record in the Kinesis receiver before
>>>       it is stored in memory, to customize what data is kept.
>>>       - SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
>>>       Python Streaming Listener API - Get streaming statistics
>>>       (scheduling delays, batch processing times, etc.) in streaming.
>>>
>>>    - UI Improvements
>>>       - Made failures visible in the streaming tab, in the timelines,
>>>       batch list, and batch details page.
>>>       - Made output operations visible in the streaming tab as progress
>>>       bars.
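>>> A minimal sketch of mapWithState (SPARK-2629) keeping a running word
>>> count; wordDstream is a hypothetical DStream[(String, Int)]:
>>>
>>>     import org.apache.spark.streaming._
>>>
>>>     // Fold each new count into the per-word state and emit the
>>>     // updated (word, total) pair.
>>>     val spec = StateSpec.function(
>>>       (word: String, one: Option[Int], state: State[Int]) => {
>>>         val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
>>>         state.update(sum)
>>>         (word, sum)
>>>       })
>>>     val totals = wordDstream.mapWithState(spec)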
>>>
>>> MLlib
>>>
>>> New algorithms/models
>>>
>>>    - SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
>>>    Survival analysis - Log-linear model for survival analysis
>>>    - SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
>>>    Normal equation for least squares - Normal equation solver,
>>>    providing R-like model summary statistics
>>>    - SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
>>>    Online hypothesis testing - A/B testing in the Spark Streaming
>>>    framework
>>>    - SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930>
>>>    New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
>>>    transformer
>>>    - SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
>>>    Bisecting K-Means clustering - Fast top-down clustering variant of
>>>    K-Means
>>>
>>> API improvements
>>>
>>>    - ML Pipelines
>>>       - SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
>>>       Pipeline persistence - Save/load for ML Pipelines, with partial
>>>       coverage of spark.ml algorithms. (A short sketch follows this
>>>       section.)
>>>       - SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
>>>       LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML
>>>       Pipelines
>>>    - R API
>>>       - SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
>>>       R-like statistics for GLMs - (Partial) R-like stats for ordinary
>>>       least squares via summary(model)
>>>       - SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
>>>       Feature interactions in R formula - Interaction operator ":" in
>>>       R formula
>>>    - Python API - Many improvements to the Python API to approach
>>>    feature parity
>>>
>>> Misc improvements
>>>
>>>    - SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
>>>    SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
>>>    Instance weights for GLMs - Logistic and Linear Regression can take
>>>    instance weights
>>>    - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
>>>    SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
>>>    Univariate and bivariate statistics in DataFrames - Variance,
>>>    stddev, correlations, etc.
>>>    - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
>>>    LIBSVM data source - LIBSVM as a SQL data source
>>>
>>> Documentation improvements
>>>
>>>    - SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
>>>    @since versions - Documentation includes the initial version when
>>>    classes and methods were added
>>>    - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
>>>    Testable example code - Automated testing for code in user guide
>>>    examples
>>>
>>> Deprecations
>>>
>>>    - In spark.mllib.clustering.KMeans, the "runs" parameter has been
>>>    deprecated.
>>>    - In spark.ml.classification.LogisticRegressionModel and
>>>    spark.ml.regression.LinearRegressionModel, the "weights" field has
>>>    been deprecated in favor of the new name "coefficients". This helps
>>>    disambiguate them from instance (row) weights given to algorithms.
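>>> A minimal sketch of pipeline persistence (SPARK-6725); the stages and
>>> save path here are hypothetical:
>>>
>>>     import org.apache.spark.ml.Pipeline
>>>     import org.apache.spark.ml.classification.LogisticRegression
>>>     import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
>>>
>>>     // A small text-classification pipeline
>>>     val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
>>>     val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
>>>     val lr = new LogisticRegression().setMaxIter(10)
>>>     val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>>>
>>>     // Save the (unfitted) pipeline definition and load it back
>>>     pipeline.save("/tmp/spark-1.6-pipeline")
>>>     val restored = Pipeline.load("/tmp/spark-1.6-pipeline")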
>>>
>>> Changes of behavior
>>>
>>>    - spark.mllib.tree.GradientBoostedTrees validationTol has changed
>>>    semantics in 1.6. Previously, it was a threshold for absolute change
>>>    in error. Now, it resembles the behavior of GradientDescent
>>>    convergenceTol: for large errors, it uses relative error (relative
>>>    to the previous error); for small errors (< 0.01), it uses absolute
>>>    error.
>>>    - spark.ml.feature.RegexTokenizer: Previously, it did not convert
>>>    strings to lowercase before tokenizing. Now, it converts to
>>>    lowercase by default, with an option not to. This matches the
>>>    behavior of the simpler Tokenizer transformer.
>>>    - Spark SQL's partition discovery has been changed to only discover
>>>    partition directories that are children of the given path (i.e. if
>>>    path="/my/data/x=1", then x=1 will no longer be considered a
>>>    partition, but only children of x=1 will). This behavior can be
>>>    overridden by manually specifying the basePath that partition
>>>    discovery should start with (SPARK-11678
>>>    <https://issues.apache.org/jira/browse/SPARK-11678>). (A short
>>>    sketch follows this list.)
>>>    - When casting a value of an integral type to timestamp (e.g.
>>>    casting a long value to timestamp), the value is treated as being in
>>>    seconds instead of milliseconds (SPARK-11724
>>>    <https://issues.apache.org/jira/browse/SPARK-11724>).
>>>    - With the improved query planner for queries having distinct
>>>    aggregations (SPARK-9241
>>>    <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
>>>    query having a single distinct aggregation has been changed to a
>>>    more robust version. To switch back to the plan generated by Spark
>>>    1.5's planner, set spark.sql.specializeSingleDistinctAggPlanning to
>>>    true (SPARK-12077
>>>    <https://issues.apache.org/jira/browse/SPARK-12077>).
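>>> A minimal sketch of overriding partition discovery with basePath
>>> (SPARK-11678); the paths are hypothetical:
>>>
>>>     // Reading only /my/data/x=1 discovers no x partition column;
>>>     // pointing basePath at /my/data keeps x=1 as a partition.
>>>     val df = sqlContext.read
>>>       .option("basePath", "/my/data")
>>>       .parquet("/my/data/x=1")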