(I tried to send this last night, but the ASF mailing list somehow
rejected my mail.)


In order to facilitate community testing of the 1.5.0 release, I've built a
preview package. This is not a release candidate, so no voting is involved.
However, it would be great if community members could start testing with
this preview package.


This preview package contains all the commits to branch-1.5 up to commit
cedce9bdb72a00cbcbcc81d57f2a550eaf4416e8.

The source files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-bin/

The artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1133/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/


== How can you help? ==

If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this preview release, and reporting
any regressions. A minimal build setup for compiling against the preview
artifacts is sketched below.
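
If you build your application against the staging repository rather than
using the binary package, something like the following build.sbt sketch
should work. The version string "1.5.0" is my assumption about how the
staged artifacts are versioned; check the repository listing and adjust if
needed:

    // build.sbt -- minimal sketch for compiling against the staging repo.
    // Assumption: the staged artifacts use the version string "1.5.0".
    resolvers += "Spark 1.5.0 preview staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1133/"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.0",
      "org.apache.spark" %% "spark-sql"  % "1.5.0"
    )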


== Major changes to help you focus your testing ==

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5; for the
complete list, please refer to the Apache JIRA changelog.


RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join (see the sketch after this list)
- expr function for turning a SQL expression into a DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- Memory and local-disk-only checkpointing
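
To make the broadcast-hint and expr items concrete, here is a minimal,
self-contained Scala sketch; the DataFrames and column names are made up
for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.{broadcast, expr}

    object PreviewApiSmokeTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("preview-api-test").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Hypothetical data, just to exercise the new APIs.
        val large = sc.parallelize(1 to 1000).map(i => (i % 10, i)).toDF("id", "value")
        val small = sc.parallelize(0 until 10).map(i => (i, s"name$i")).toDF("id", "name")

        // Broadcast join hint: ask the planner to broadcast the small side.
        val joined = large.join(broadcast(small), "id")

        // expr(): turn a SQL expression string into a Column.
        joined.withColumn("isBig", expr("value > 100")).show()

        sc.stop()
      }
    }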

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, and sorting with cache-friendly
and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans
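
Since the new code-generated execution path is on by default, regressions
here are particularly valuable to catch. If I recall the config correctly
(please double-check against the 1.5 docs), something like the following
lets you compare the new path against the old one on the same workload:

    // Assumption: spark.sql.tungsten.enabled is the 1.5 flag gating the
    // new Tungsten/code-generation path (default true); verify in the docs.
    sqlContext.setConf("spark.sql.tungsten.enabled", "false") // old path
    // ... run the workload, then flip back to "true" and compare plans,
    // results, and timings.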

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore
connectivity for Hive 0.13 through 1.2, internal Hive upgrade to 1.2)
- Support for persisting data in a Hive-compatible format in the metastore
- Support for data partitioning in JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names (see the sketch after this list)
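
For data source authors, here is a minimal sketch of the new
DataSourceRegister interface. The class and short name are hypothetical;
the trait itself lives in org.apache.spark.sql.sources:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

    // Hypothetical provider registering the short name "mysource", so users
    // can write read.format("mysource") instead of the full class name.
    class DefaultSource extends RelationProvider with DataSourceRegister {

      override def shortName(): String = "mysource"

      override def createRelation(
          sqlContext: SQLContext,
          parameters: Map[String, String]): BaseRelation = {
        // Build and return the BaseRelation for this source here.
        throw new UnsupportedOperationException("sketch only")
      }
    }

The short name is discovered via Java's ServiceLoader, so the jar also
needs a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
file listing the provider class.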

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams (see the sketch after
this list)
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Streaming storage information included in the web UI
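
Backpressure is the headline item here and is enabled through a single
configuration flag; a minimal sketch (app name, master, and batch interval
are arbitrary):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Backpressure is off by default in 1.5; this flag turns it on so the
    // ingestion rate adapts to how fast batches are actually processed.
    val conf = new SparkConf()
      .setAppName("backpressure-test")
      .setMaster("local[2]")
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(1))
    // ... attach your usual input streams, then ssc.start() ...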

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, DCT (discrete cosine transform),
MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer
(see the sketch after this list)
- Estimators under the pipeline API: naive Bayes, k-means, and isotonic
regression
- New algorithms: multilayer perceptron classifier, PrefixSpan for
sequential pattern mining, association rule generation, and the 1-sample
Kolmogorov-Smirnov test
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summaries for linear and logistic regression
- Python API: distributed matrices, streaming k-means and linear models,
LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass
classification evaluator
- Documentation: public API methods now document the release version in
which they were added
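
A small Scala sketch exercising one of the new feature transformers
(NGram), assuming a sqlContext set up as in the earlier example; the input
data is made up:

    import org.apache.spark.ml.feature.NGram

    // Hypothetical input: a DataFrame with a tokenized text column.
    val wordsDF = sqlContext.createDataFrame(Seq(
      (0, Seq("testing", "the", "spark", "preview", "release"))
    )).toDF("id", "words")

    // New in 1.5: NGram turns a token sequence into n-grams (bigrams here).
    val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("bigrams")
    ngram.transform(wordsDF).select("bigrams").show()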
