Please vote on releasing the following candidate as Apache Spark
version 1.6.0!
The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
passes if a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/
The tag to be voted on is v1.6.0-rc2
(23f8dfd45187cb8f2216328ab907ddb5fbdffd0b):
https://github.com/apache/spark/tree/v1.6.0-rc2
The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1169/
The test repository (versioned as v1.6.0-rc2) for this release can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1168/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and
reporting any regressions.
================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1
votes should only occur for significant regressions from 1.5. Bugs
already present in 1.5, minor regressions, or bugs related to new
features will not block this release.
===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go
into branch-1.6, since documentation is published separately from
the release.
2. New features for non-alpha modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
target version.
==================================================
== Major changes to help you focus your testing ==
==================================================
Spark 1.6.0 Preview
Notable changes since 1.6 RC1
Spark Streaming
* SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
|trackStateByKey| has been renamed to |mapWithState|
Spark SQL
* SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
SPARK-12189
<https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in
eviction of storage memory by execution.
* SPARK-12258
<https://issues.apache.org/jira/browse/SPARK-12258> Correctly pass
null values into ScalaUDF
Notable Features Since 1.5
Spark SQL
* SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787>
Parquet Performance - Improve Parquet scan performance when using
flat schemas.
* SPARK-10810
<https://issues.apache.org/jira/browse/SPARK-10810> Session
Management - Isolated default database (i.e. |USE mydb|) even on
shared clusters.
* SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999>
Dataset API - A type-safe API (similar to RDDs) that performs many
operations on serialized binary data and uses code generation (i.e.
Project Tungsten). See the combined sketch after this list.
* SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000>
Unified Memory Management - Shared memory for execution and
caching instead of exclusive division of the regions.
* SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197>
SQL Queries on Files - Concise syntax for running SQL queries over
files of any supported format without registering a table.
* SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745>
Reading non-standard JSON files - Added options to read
non-standard JSON files (e.g. single quotes, unquoted attributes).
* SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
Per-operator Metrics for SQL Execution - Display statistics on a
per-operator basis for memory usage and spilled data size.
* SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329>
Star (*) expansion for StructTypes - Makes it easier to nest and
unnest arbitrary numbers of columns.
* SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149>
In-memory Columnar Cache Performance - Significant (up to 14x)
speed up when caching data that contains complex types in
DataFrames or SQL.
* SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111>
Fast null-safe joins - Joins using null-safe equality (|<=>|) will
now execute using SortMergeJoin instead of computing a Cartesian
product.
* SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389>
SQL Execution Using Off-Heap Memory - Support for configuring
query execution to occur using off-heap memory to avoid GC overhead.
* SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978>
Datasource API Avoid Double Filter - When implementing a data source
with filter pushdown, developers can now tell Spark SQL to avoid
double-evaluating a pushed-down filter.
* SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849>
Advanced Layout of Cached Data - Store partitioning and ordering
schemes in the in-memory table scan, and add distributeBy and
localSort to the DataFrame API.
* SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858>
Adaptive query execution - Initial support for automatically
selecting the number of reducers for joins and aggregations.
* SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>
Improved query planner for queries having distinct aggregations -
Query plans of distinct aggregations are more robust when distinct
columns have high cardinality.
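
A few of the Spark SQL additions above can be exercised together. The
following is a minimal sketch against the 1.6 API; the |Person| case
class and the file paths are placeholders, not part of the release:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Long)

    object SqlFeaturesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // SPARK-9999: the Dataset API -- typed, RDD-like operations
        // over Tungsten's serialized binary representation.
        val ds = Seq(Person("ann", 30), Person("bob", 25)).toDS()
        ds.filter(_.age > 26).show()

        // SPARK-11197: SQL directly over files, no table registration
        // needed (placeholder path).
        sqlContext.sql("SELECT * FROM parquet.`/tmp/people.parquet`").show()

        // SPARK-11745: options for non-standard JSON input
        // (placeholder path).
        sqlContext.read
          .option("allowSingleQuotes", "true")
          .option("allowUnquotedFieldNames", "true")
          .json("/tmp/nonstandard.json")
          .printSchema()

        sc.stop()
      }
    }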
Spark Streaming
* API Updates
o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629>
New improved state management - |mapWithState| - a DStream
transformation for stateful stream processing that supersedes
|updateStateByKey| in functionality and performance (see the
sketch after this list).
o SPARK-11198
<https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
record deaggregation - Kinesis streams have been upgraded to
use KCL 1.4.0 and now support transparent deaggregation of
KPL-aggregated records.
o SPARK-10891
<https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
message handler function - Allows an arbitrary function to be
applied to each Kinesis record in the Kinesis receiver to
customize what data is stored in memory.
o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328>
Python Streaming Listener API - Get streaming statistics
(scheduling delays, batch processing times, etc.) from Python.
* UI Improvements
o Made failures visible in the streaming tab, in the timelines,
batch list, and batch details page.
o Made output operations visible in the streaming tab as
progress bars.
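
As a rough illustration of the new state management API, here is a
minimal |mapWithState| word-count sketch against the 1.6 streaming
API; the socket source, port, and checkpoint directory are
placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object MapWithStateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("state-sketch").setMaster("local[2]"))
        val ssc = new StreamingContext(sc, Seconds(1))
        ssc.checkpoint("/tmp/checkpoint")  // placeholder directory

        // The state holds a running count per word.
        def update(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
          val count = state.getOption.getOrElse(0) + one.getOrElse(0)
          state.update(count)
          (word, count)
        }

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
        val counts = words.map((_, 1)).mapWithState(StateSpec.function(update _))
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Unlike |updateStateByKey|, |mapWithState| only invokes the function for
keys that have new data in a batch, which is where most of the
performance improvement comes from.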
MLlib
New algorithms/models
* SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518>
Survival analysis - Log-linear model for survival analysis
* SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834>
Normal equation for least squares - Normal equation solver,
providing R-like model summary statistics
* SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147>
Online hypothesis testing - A/B testing in the Spark Streaming
framework
* SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New
feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
transformer
* SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517>
Bisecting K-Means clustering - Fast top-down clustering variant of
K-Means (a minimal sketch follows this list)
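
As a quick way to try the new clustering variant, here is a minimal
bisecting k-means sketch against the 1.6 spark.mllib API; the toy
data is made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    object BisectingKMeansSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("bkm-sketch").setMaster("local[*]"))

        // Toy data: two well-separated groups of points.
        val data = sc.parallelize(Seq(
          Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.1),
          Vectors.dense(9.0, 9.1), Vectors.dense(9.2, 9.0)))

        // Top-down (divisive) clustering: recursively split the data
        // until k leaf clusters remain.
        val model = new BisectingKMeans().setK(2).run(data)
        model.clusterCenters.foreach(println)

        sc.stop()
      }
    }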
API improvements
* ML Pipelines
o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725>
Pipeline persistence - Save/load for ML Pipelines, with
partial coverage of spark.ml algorithms (see the sketch after
this list)
o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565>
LDA in ML Pipelines - API for Latent Dirichlet Allocation in
ML Pipelines
* R API
o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836>
R-like statistics for GLMs - (Partial) R-like stats for
ordinary least squares via summary(model)
o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681>
Feature interactions in R formula - Interaction operator ":"
in R formula
* Python API - Many improvements to the Python API to approach
feature parity
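
For pipeline persistence, a minimal save/load round trip might look
like the sketch below (1.6 API; the stages and path are placeholder
choices, and since coverage is partial in this release, the exact
stage list matters):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    object PipelinePersistenceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("pipeline-sketch").setMaster("local[*]"))

        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
        val lr = new LogisticRegression().setMaxIter(10)
        val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

        // New in 1.6: persist the pipeline definition and load it back.
        pipeline.write.overwrite().save("/tmp/lr-pipeline")  // placeholder path
        val restored = Pipeline.load("/tmp/lr-pipeline")
        println(restored.getStages.length)

        sc.stop()
      }
    }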
Misc improvements
* SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>,
SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642>
Instance weights for GLMs - Logistic and Linear Regression can
take instance weights
* SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385>
Univariate and bivariate statistics in DataFrames - Variance,
stddev, correlations, etc.
* SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117>
LIBSVM data source - LIBSVM as a SQL data source (see the sketch
after this list)
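
Two of these improvements combine naturally in a short sketch against
the 1.6 API; the LIBSVM file path is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{stddev, variance}

    object MiscImprovementsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("misc-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // SPARK-10117: LIBSVM as a regular SQL data source, yielding a
        // DataFrame with "label" and "features" columns (placeholder path).
        val data = sqlContext.read.format("libsvm").load("/tmp/sample.libsvm")

        // SPARK-10384/10385: univariate statistics as DataFrame aggregates.
        data.agg(variance("label"), stddev("label")).show()

        sc.stop()
      }
    }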
Documentation improvements
* SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751>
@since versions - Documentation includes initial version when
classes and methods were added
* SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337>
Testable example code - Automated testing for code in user guide
examples
Deprecations
* In spark.mllib.clustering.KMeans, the "runs" parameter has been
deprecated.
* In spark.ml.classification.LogisticRegressionModel and
spark.ml.regression.LinearRegressionModel, the "weights" field has
been deprecated in favor of the new name "coefficients." This
helps disambiguate it from the instance (row) weights given to
algorithms. A brief sketch of the rename follows this list.
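
In code the rename is mechanical; here is a minimal sketch against
the 1.6 spark.ml API (the toy training data is made up for
illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SQLContext

    object CoefficientsRenameSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rename-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy training data for illustration only.
        val training = Seq(
          (1.0, Vectors.dense(0.0, 1.1)),
          (0.0, Vectors.dense(2.0, 1.0)),
          (3.0, Vectors.dense(4.0, 3.5))).toDF("label", "features")

        val model = new LinearRegression().setMaxIter(5).fit(training)
        println(model.coefficients)  // preferred name in 1.6
        // model.weights still compiles but is deprecated.

        sc.stop()
      }
    }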
Changes of behavior
* spark.mllib.tree.GradientBoostedTrees validationTol has changed
semantics in 1.6. Previously, it was a threshold for absolute
change in error. Now, it resembles the behavior of GradientDescent
convergenceTol: For large errors, it uses relative error (relative
to the previous error); for small errors (< 0.01), it uses
absolute error.
* spark.ml.feature.RegexTokenizer: Previously, it did not convert
strings to lowercase before tokenizing. Now, it converts to
lowercase by default, with an option not to. This matches the
behavior of the simpler Tokenizer transformer.
* Spark SQL's partition discovery has been changed to only discover
partition directories that are children of the given path (i.e.,
if |path="/my/data/x=1"|, then |x=1| itself will no longer be
considered a partition; only children of |x=1| will). This
behavior can be overridden by manually specifying the |basePath|
from which partition discovery should start (SPARK-11678
<https://issues.apache.org/jira/browse/SPARK-11678>); see the
sketch after this list.
* When casting a value of an integral type to timestamp (e.g.
casting a long value to timestamp), the value is treated as being
in seconds instead of milliseconds (SPARK-11724
<https://issues.apache.org/jira/browse/SPARK-11724>).
* With the improved query planner for queries having distinct
aggregations (SPARK-9241
<https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a
query having a single distinct aggregation has been changed to a
more robust version. To switch back to the plan generated by Spark
1.5's planner, please set
|spark.sql.specializeSingleDistinctAggPlanning| to
|true| (SPARK-12077
<https://issues.apache.org/jira/browse/SPARK-12077>).
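
The SQL-related behavior changes above can be exercised with a short
sketch against the 1.6 API; the paths are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object BehaviorChangesSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("behavior-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)

        // SPARK-11678: point basePath at the partition root so that
        // x=1 is still discovered as a partition (placeholder paths).
        val df = sqlContext.read
          .option("basePath", "/my/data")
          .parquet("/my/data/x=1")
        df.printSchema()

        // SPARK-11724: integral-to-timestamp casts are now in seconds,
        // so this yields 1970-01-01 00:00:01 rather than 00:00:00.001.
        sqlContext.sql("SELECT CAST(1 AS TIMESTAMP)").show()

        // SPARK-12077: revert to the 1.5 planner for single distinct
        // aggregations.
        sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")

        sc.stop()
      }
    }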