I'm going to -1 the release myself since the issue @yhuai identified is pretty serious: it basically OOMs the driver when reading any files with a large number of partitions. Looks like the patch for that has already been merged.
I'm going to cut rc3 momentarily.

On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> +1 (non-binding)
> built from source and ran some jobs against YARN
>
> -Sandy
>
> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <vaquar.k...@gmail.com> wrote:
>
>> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>>
>> Regards,
>> Vaquar khan
>
>> +1 (non-binding, of course)
>>
>> 1. Compiled OSX 10.10 (Yosemite) OK, Total time: 42:36 min
>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>> 2. Tested pyspark, mllib
>> 2.1. statistics (min, max, mean, Pearson, Spearman) OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>      Center And Scale OK
>> 2.5. RDD operations OK
>>      State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>> 2.6. Recommendation (MovieLens medium dataset ~1M ratings) OK
>>      Model evaluation/optimization (rank, numIter, lambda) with itertools OK
>> 3. Scala - MLlib
>> 3.1. statistics (min, max, mean, Pearson, Spearman) OK (sketch after this report)
>> 3.2. LinearRegressionWithSGD OK
>> 3.3. Decision Tree OK
>> 3.4. KMeans OK
>> 3.5. Recommendation (MovieLens medium dataset ~1M ratings) OK
>> 3.6. saveAsParquetFile OK
>> 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
>>      registerTempTable, sql OK
>> 3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID, ShipCountry,
>>      UnitPrice, Qty, Discount FROM Orders INNER JOIN OrderDetails ON
>>      Orders.OrderID = OrderDetails.OrderID") OK
>> 4.0. Spark SQL from Python OK
>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'") OK
>> 5.0. Packages
>> 5.1. com.databricks.spark.csv - read/write OK
>>      (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn't work,
>>      but com.databricks:spark-csv_2.11:1.2.0 worked)
>> 6.0. DataFrames
>> 6.1. cast, dtypes OK
>> 6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
>> 6.3. joins, sql, set operations, udf OK
>>
>> Cheers
>> <k/>
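For anyone re-running the Scala MLlib statistics check in 3.1 above, here is a
minimal spark-shell sketch (the input data is invented; `sc` is the shell's
SparkContext):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.Statistics

  // Toy data: each Vector is one observation with three features.
  val observations = sc.parallelize(Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 33.0, 303.0)))

  // Column-wise min/max/mean.
  val summary = Statistics.colStats(observations)
  println(summary.min); println(summary.max); println(summary.mean)

  // Pearson and Spearman correlation matrices.
  println(Statistics.corr(observations, "pearson"))
  println(Statistics.corr(observations, "spearman"))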
>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
>>> if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v1.5.0-rc2:
>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release (published as 1.5.0-rc2) can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>>
>>> The staging repository for this release (published as 1.5.0) can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>>
>>> =======================================
>>> How can I help test this release?
>>> =======================================
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> ================================================
>>> What justifies a -1 vote for this release?
>>> ================================================
>>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>>> should only occur for significant regressions from 1.4. Bugs already
>>> present in 1.4, minor regressions, and bugs related to new features will
>>> not block this release.
>>>
>>> ===============================================================
>>> What should happen to JIRA tickets still targeting 1.5.0?
>>> ===============================================================
>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>> branch-1.5, since documentation will be packaged separately from the
>>> release.
>>> 2. New features for non-alpha modules should target 1.6+.
>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>> target version.
>>>
>>> ==================================================
>>> Major changes to help you focus your testing
>>> ==================================================
>>>
>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>> contributors. I've curated a list of important changes for 1.5. For the
>>> complete list, please refer to the Apache JIRA changelog.
>>>
>>> RDD/DataFrame/SQL APIs
>>>
>>> - New UDAF interface
>>> - DataFrame hints for broadcast join (see the sketch after this list)
>>> - expr function for turning a SQL expression into a DataFrame column
>>> - Improved support for NaN values
>>> - StructType now supports ordering
>>> - TimestampType precision is reduced to 1us
>>> - 100 new built-in expressions, including date/time, string, and math
>>> functions
>>> - Memory- and local-disk-only checkpointing
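A hedged spark-shell sketch of the broadcast-join hint and the new expr
function from the list above (the table and column names are invented for
illustration; `sqlContext` is the shell's SQLContext):

  import org.apache.spark.sql.functions.{broadcast, expr}

  // Hypothetical tables: a large fact table and a small dimension table.
  val orders = sqlContext.table("orders")
  val countries = sqlContext.table("countries")

  // New in 1.5: hint the planner to broadcast the small side of the join.
  val joined = orders.join(broadcast(countries),
    orders("country_id") === countries("id"))

  // New in 1.5: turn a SQL expression string directly into a Column.
  joined.select(expr("unit_price * qty * (1 - discount) AS line_total")).show()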
>>> DataFrame/SQL Backend Execution
>>>
>>> - Code generation on by default
>>> - Improved join, aggregation, shuffle, and sort, with cache-friendly
>>> and external algorithms
>>> - Improved window function performance
>>> - Better metrics instrumentation and reporting for DF/SQL execution plans
>>>
>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>
>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>> Standalone)
>>> - Improved Mesos support (framework authentication, roles, dynamic
>>> allocation, constraints)
>>> - Improved YARN support (dynamic allocation with preferred locations)
>>> - Improved Hive support (metastore partition pruning, metastore
>>> connectivity to 0.13 through 1.2, internal Hive upgrade to 1.2)
>>> - Support for persisting data in a Hive-compatible format in the metastore
>>> - Support for data partitioning for JSON data sources
>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>> metadata discovery and schema merging, support for reading non-standard
>>> legacy Parquet files generated by other libraries)
>>> - Faster and more robust dynamic partition insert
>>> - DataSourceRegister interface for external data sources to specify
>>> short names
>>>
>>> SparkR
>>>
>>> - YARN cluster mode in R
>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>> regularization
>>> - Improved error messages
>>> - Aliases to make DataFrame functions more R-like
>>>
>>> Streaming
>>>
>>> - Backpressure for handling bursty input streams
>>> - Improved Python support for streaming sources (Kafka offsets, Kinesis,
>>> MQTT, Flume)
>>> - Improved Python streaming machine learning algorithms (K-Means, linear
>>> regression, logistic regression)
>>> - Native reliable Kinesis stream support
>>> - Input metadata like Kafka offsets made visible in the batch details UI
>>> - Better load balancing and scheduling of receivers across the cluster
>>> - Streaming storage included in the web UI
>>>
>>> Machine Learning and Advanced Analytics
>>>
>>> - Feature transformers: CountVectorizer, Discrete Cosine Transform,
>>> MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer
>>> - Estimators under the pipeline API: naive Bayes, k-means, and isotonic
>>> regression
>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>> sequential pattern mining, association rule generation, and the 1-sample
>>> Kolmogorov-Smirnov test
>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>> - More efficient Pregel API implementation for GraphX
>>> - Model summaries for linear and logistic regression
>>> - Python API: distributed matrices, streaming k-means and linear models,
>>> LDA, power iteration clustering, etc.
>>> - Tuning and evaluation: train-validation split and a multiclass
>>> classification evaluator
>>> - Documentation: document the release version of public API methods
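To tie several of the new ML pieces above together (CountVectorizer, the
pipeline naive Bayes estimator, the multiclass evaluator, and train-validation
split), here is a minimal spark-shell sketch on made-up data; everything except
the 1.5 class names is invented for illustration:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.NaiveBayes
  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
  import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

  // Made-up labeled text data.
  val data = sqlContext.createDataFrame(Seq(
    (0.0, "spark streaming backpressure"),
    (1.0, "hive metastore partition pruning"),
    (0.0, "spark dataframe sql expr"),
    (1.0, "hive parquet metastore upgrade"),
    (0.0, "spark kinesis receiver scheduling"),
    (1.0, "hive compatible format metastore"))).toDF("label", "text")

  // Pipeline: tokenize -> CountVectorizer (new in 1.5) -> ml NaiveBayes (new in 1.5).
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
  val nb = new NaiveBayes()
  val pipeline = new Pipeline().setStages(Array(tokenizer, vectorizer, nb))

  val grid = new ParamGridBuilder()
    .addGrid(vectorizer.vocabSize, Array(100, 1000))
    .addGrid(nb.smoothing, Array(0.5, 1.0))
    .build()

  // New in 1.5: train-validation split, a cheaper alternative to
  // cross-validation, scored with the new multiclass evaluator.
  val tvs = new TrainValidationSplit()
    .setEstimator(pipeline)
    .setEvaluator(new MulticlassClassificationEvaluator())
    .setEstimatorParamMaps(grid)
    .setTrainRatio(0.75)

  val model = tvs.fit(data)
  model.transform(data).select("text", "prediction").show()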