+1 (non-binding). Tested with samples on standalone and YARN.
Regards
JB

On 12/22/2015 09:10 PM, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark version 1.6.0!

The vote is open until Friday, December 25, 2015 at 18:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.6.0-rc4 (4062cda3087ae42c6c3cb24508fc1d3a931accdf) <https://github.com/apache/spark/tree/v1.6.0-rc4>

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1176/

The test repository (versioned as v1.6.0-rc4) for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1175/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/

=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload, running it on this release candidate, and reporting any regressions.

================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes should only occur for significant regressions from 1.5. Bugs already present in 1.5, minor regressions, or bugs related to new features will not block this release.

===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into branch-1.6, since documentation will be published separately from the release.
2. New features for non-alpha modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target version.

==================================================
== Major changes to help you focus your testing ==
==================================================

Notable changes since 1.6 RC3
- SPARK-12404 - Fix serialization error for Datasets with Timestamps/Arrays/Decimal
- SPARK-12218 - Fix incorrect pushdown of filters to Parquet
- SPARK-12395 - Fix join columns of outer join for DataFrame using
- SPARK-12413 - Fix Mesos HA

Notable changes since 1.6 RC2
- SPARK_VERSION has been set correctly
- SPARK-12199 ML docs are publishing correctly
- SPARK-12345 Mesos cluster mode has been fixed

Notable changes since 1.6 RC1

Spark Streaming
* SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> |trackStateByKey| has been renamed to |mapWithState| (see the sketch after this section)

Spark SQL
* SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165> SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs in eviction of storage memory by execution.
* SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> Correct passing null into ScalaUDF
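For anyone testing the streaming rename above, a minimal sketch of the 1.6 API, modeled on the usual streaming word count; |wordDstream| is an assumed DStream[(String, Int)] and is not part of these notes:

    import org.apache.spark.streaming.{State, StateSpec}

    // State update function: receives the key, the new value for this batch
    // (if any), and the running state; returns the mapped output.
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    // 1.5 and earlier: wordDstream.updateStateByKey(...)
    // 1.6: the renamed, more efficient transformation
    val stateDstream = wordDstream.mapWithState(StateSpec.function(mappingFunc))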
Notable Features Since 1.5

Spark SQL
* SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet Performance - Improve Parquet scan performance when using flat schemas.
* SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810> Session Management - Isolated default database (i.e. |USE mydb|) even on shared clusters.
* SPARK-9999 <https://issues.apache.org/jira/browse/SPARK-9999> Dataset API - A type-safe API (similar to RDDs) that performs many operations on serialized binary data and uses code generation (i.e. Project Tungsten); a short sketch follows this list.
* SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
* SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
* SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single quotes, unquoted attributes).
* SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412> Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
* SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns.
* SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>, SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory Columnar Cache Performance - Significant (up to 14x) speedup when caching data that contains complex types in DataFrames or SQL.
* SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast null-safe joins - Joins using null-safe equality (|<=>|) will now execute using SortMergeJoin instead of computing a Cartesian product.
* SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead.
* SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource API Avoid Double Filter - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double-evaluating a pushed-down filter.
* SPARK-4849 <https://issues.apache.org/jira/browse/SPARK-4849> Advanced Layout of Cached Data - Storing partitioning and ordering schemes for in-memory table scans, and adding distributeBy and localSort to the DataFrame API.
* SPARK-9858 <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
* SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241> Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
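Since the Dataset API is the headline feature here, a minimal sketch of what it looks like in 1.6 may help testers; the case class, file path, and the existing |sqlContext| are illustrative assumptions, not part of the release notes:

    import sqlContext.implicits._  // encoders; assumes an existing SQLContext

    // A typed view over data that would otherwise be an untyped DataFrame.
    case class Person(name: String, age: Long)

    val ds = sqlContext.read.json("people.json").as[Person]

    // Operations are checked against Person at compile time and run on
    // serialized binary data (Project Tungsten).
    val names = ds.filter(_.age > 21).map(_.name).collect()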
Spark Streaming
* API Updates
  o SPARK-2629 <https://issues.apache.org/jira/browse/SPARK-2629> New improved state management - |mapWithState| - a DStream transformation for stateful stream processing; supersedes |updateStateByKey| in functionality and performance.
  o SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
  o SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
  o SPARK-6328 <https://issues.apache.org/jira/browse/SPARK-6328> Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming applications.
* UI Improvements
  o Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
  o Made output operations visible in the streaming tab as progress bars.

MLlib

New algorithms/models
* SPARK-8518 <https://issues.apache.org/jira/browse/SPARK-8518> Survival analysis - Log-linear model for survival analysis
* SPARK-9834 <https://issues.apache.org/jira/browse/SPARK-9834> Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
* SPARK-3147 <https://issues.apache.org/jira/browse/SPARK-3147> Online hypothesis testing - A/B testing in the Spark Streaming framework
* SPARK-9930 <https://issues.apache.org/jira/browse/SPARK-9930> New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
* SPARK-6517 <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting K-Means clustering - Fast top-down clustering variant of K-Means

API improvements
* ML Pipelines
  o SPARK-6725 <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms (see the sketch after this section)
  o SPARK-5565 <https://issues.apache.org/jira/browse/SPARK-5565> LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
* R API
  o SPARK-9836 <https://issues.apache.org/jira/browse/SPARK-9836> R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
  o SPARK-9681 <https://issues.apache.org/jira/browse/SPARK-9681> Feature interactions in R formula - Interaction operator ":" in R formula
* Python API - Many improvements to the Python API to approach feature parity

Misc improvements
* SPARK-7685 <https://issues.apache.org/jira/browse/SPARK-7685>, SPARK-9642 <https://issues.apache.org/jira/browse/SPARK-9642> Instance weights for GLMs - Logistic and Linear Regression can take instance weights
* SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>, SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
* SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM data source - LIBSVM as a SQL data source

Documentation improvements
* SPARK-7751 <https://issues.apache.org/jira/browse/SPARK-7751> @since versions - Documentation includes the initial version in which classes and methods were added
* SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable example code - Automated testing for code in user guide examples

Deprecations
* In spark.mllib.clustering.KMeans, the "runs" parameter has been deprecated.
* In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the "weights" field has been deprecated in favor of the new name "coefficients". This helps disambiguate from instance (row) weights given to algorithms.
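As a pointer for testing the new pipeline persistence, a small sketch against the 1.6 spark.ml API; the particular stages, parameters, and /tmp path are illustrative, and coverage of individual algorithms is only partial as noted above:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // A small text-classification pipeline; parameters are placeholders.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // New in 1.6 (SPARK-6725): save the (unfit) pipeline and load it back.
    pipeline.save("/tmp/unfit-lr-pipeline")
    val restored = Pipeline.load("/tmp/unfit-lr-pipeline")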
Changes of behavior
* spark.mllib.tree.GradientBoostedTrees: validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent's convergenceTol: for large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
* spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
* Spark SQL's partition discovery now only discovers partition directories that are children of the given path. (i.e. if |path="/my/data/x=1"|, then |x=1| will no longer be considered a partition; only children of |x=1| will be.) This behavior can be overridden by manually specifying the |basePath| that partition discovery should start with (SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>); see the example after this list.
* When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724 <https://issues.apache.org/jira/browse/SPARK-11724>).
* With the improved query planner for queries having distinct aggregations (SPARK-9241 <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set |spark.sql.specializeSingleDistinctAggPlanning| to |true| (SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
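To illustrate the partition-discovery change above, a minimal example; the paths are made up and |sqlContext| is assumed to be an existing SQLContext:

    // Under 1.6's rules, reading "/my/data/x=1" directly no longer treats
    // x=1 as a partition column. To keep the old behavior, point partition
    // discovery at the partition root explicitly:
    val df = sqlContext.read
      .option("basePath", "/my/data")
      .parquet("/my/data/x=1")
    // df now contains the partition column x again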
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org