(I tried to send this last night, but somehow the ASF mailing list rejected my mail.)
In order to facilitate community testing of the 1.5.0 release, I've built a preview package. This is not a release candidate, so no vote is involved. However, it'd be great if community members could start testing with this preview package. It contains all the commits to branch-1.5 up to commit cedce9bdb72a00cbcbcc81d57f2a550eaf4416e8.

The source files, including signatures, digests, etc., can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-bin/

The artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1133/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/

== How can you help? ==

If you are a Spark user, you can help us test this release by taking a Spark workload, running it on this preview release, and reporting any regressions.

== Major changes to help you focus your testing ==

As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. I've curated a list of important changes for 1.5 below; for the complete list, please refer to the Apache JIRA changelog.

RDD/DataFrame/SQL APIs
- New UDAF interface (sketched below)
- DataFrame hints for broadcast join (sketched below)
- expr function for turning a SQL expression into a DataFrame column (sketched below)
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision reduced to 1us
- 100 new built-in expressions, including date/time, string, and math functions
- Checkpointing to memory and local disk only

DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join, aggregation, shuffle, and sorting with cache-friendly and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DataFrame/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management
- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, connectivity to metastore versions 0.13 through 1.2, internal Hive upgrade to 1.2)
- Support for persisting data in a Hive-compatible format in the metastore
- Support for data partitioning in JSON data sources
- Parquet improvements (upgrade to Parquet 1.7, predicate pushdown, faster metadata discovery and schema merging, support for reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names (sketched below)
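To make a couple of the DataFrame/SQL items above concrete, here is a minimal Scala sketch of the new expr function and the broadcast join hint. The data and column names are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{broadcast, expr}

    val sc = new SparkContext(
      new SparkConf().setAppName("preview-test").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = Seq(("alice", 29, "US"), ("bob", 31, "FR"))
      .toDF("name", "age", "country")
    val countries = Seq(("US", "United States"), ("FR", "France"))
      .toDF("country", "countryName")

    // expr parses a SQL expression string into a Column
    val older = people.select(expr("name"), expr("age + 1").as("ageNextYear"))

    // broadcast() hints that `countries` is small enough to ship to every
    // executor, steering the planner toward a broadcast (map-side) join
    val joined = people.join(broadcast(countries), "country")
    joined.show()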
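Here is a sketch of the new UDAF interface as well, implementing a trivial long-sum aggregate; the class name and columns are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // A trivial UDAF that sums a long column, to show the new interface
    class LongSum extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
      def dataType: DataType = LongType
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)

      def evaluate(buffer: Row): Any = buffer.getLong(0)
    }

    // usage (assuming a DataFrame df with columns "key" and "value"):
    //   df.groupBy("key").agg(new LongSum()(df("value")))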
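And a skeleton of the DataSourceRegister interface; the "myformat" short name and provider class are hypothetical:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

    // Mixing in DataSourceRegister lets users write .format("myformat")
    // instead of the provider's fully qualified class name
    class DefaultSource extends RelationProvider with DataSourceRegister {
      override def shortName(): String = "myformat"

      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        ??? // build and return the BaseRelation backing this source
      }
    }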
SparkR
- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming
- Backpressure for handling bursty input streams (see the configuration sketch at the end of this mail)
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata, such as Kafka offsets, made visible in the batch details UI
- Better load balancing and scheduling of receivers across the cluster
- Streaming storage included in the web UI

Machine Learning and Advanced Analytics
- Feature transformers: CountVectorizer, discrete cosine transform (DCT), MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer (see the pipeline sketch after this list)
- Estimators under the pipeline API: naive Bayes, k-means, and isotonic regression
- New algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, and the one-sample Kolmogorov-Smirnov test
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summaries for linear and logistic regression
- Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass classification evaluator
- Documentation: the release version in which each public API method was added is now documented
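To exercise a couple of the new feature transformers, here is a small pipeline sketch. It assumes the sqlContext and implicits from the earlier sketch, and the input data is made up:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover}

    val docs = Seq(
      (0, Seq("spark", "is", "a", "fast", "engine")),
      (1, Seq("testing", "the", "preview", "release"))
    ).toDF("id", "words")

    // drop common stop words, then build a term-frequency vector over
    // a vocabulary learned from the data
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val vectorizer = new CountVectorizer()
      .setInputCol("filtered")
      .setOutputCol("features")
      .setVocabSize(1000)

    val pipeline = new Pipeline().setStages(Array[PipelineStage](remover, vectorizer))
    val model = pipeline.fit(docs)
    model.transform(docs).show()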
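Backpressure is off by default; here is a minimal sketch for turning it on when testing streaming workloads (the app name and batch interval are arbitrary):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("backpressure-test")
      // let receivers adapt their ingestion rate to the current
      // processing speed instead of using a fixed rate limit
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(1))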