The upper/lower case thing is known:
https://issues.apache.org/jira/browse/SPARK-9550
I assume it was decided to be OK and it's going to be in the release notes, but Reynold or Josh can probably speak to it more.
Tom 


On Thursday, September 3, 2015 10:21 PM, Krishna Sankar <ksanka...@gmail.com> wrote:

+?
1. Compiled OSX 10.10 (Yosemite) OK. Total time: 26:09 min
   mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib
2.1. statistics (min, max, mean, Pearson, Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
     Center and Scale OK
2.5. RDD operations OK
     State of the Union texts - MapReduce, Filter, sortByKey (word count)
2.6. Recommendation (MovieLens medium dataset, ~1M ratings) OK
     Model evaluation/optimization (rank, numIter, lambda) with itertools OK
3. Scala - MLlib
3.1. statistics (min, max, mean, Pearson, Spearman) OK
3.2. LinearRegressionWithSGD OK
3.3. Decision Tree OK
3.4. KMeans OK
3.5. Recommendation (MovieLens medium dataset, ~1M ratings) OK
3.6. saveAsParquetFile OK
3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile, registerTempTable, sql OK
3.8. result = sqlContext.sql("SELECT OrderDetails.OrderID, ShipCountry, UnitPrice, Qty, Discount FROM Orders INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
4.0. Spark SQL from Python OK
4.1. result = sqlContext.sql("SELECT * FROM people WHERE State = 'WA'") OK
5.0. Packages
5.1. com.databricks.spark.csv - read/write OK
     (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn't work, but com.databricks:spark-csv_2.11:1.2.0 did; invocations sketched after this list)
6.0. DataFrames
6.1. cast, dtypes OK
6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
6.3. All joins, sql, set operations, udf OK
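
(For reference, a minimal sketch of the 5.1 package invocations; the bin/pyspark launcher is illustrative, the coordinates are the ones tested above:)

    # Plain Maven coordinates worked:
    bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
    # The spark-packages style "-s_2.11" suffix did not:
    bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0-s_2.11
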
Two problems:
1. The synthetic column names are now lowercase (i.e. now 'sum(OrderPrice)', previously 'SUM(OrderPrice)'; now 'avg(Total)', previously 'AVG(Total)'). So programs that depend on the case of the synthetic column names would fail. (A case-proof rewrite is sketched below.)
2. orders_3.groupBy("Year","Month").sum('Total').show() fails with the error 'java.io.IOException: Unable to acquire 4194304 bytes of memory'.
   orders_3.groupBy("CustomerID","Year").sum('Total').show() fails with the same error.
   Is this a known bug?
Cheers
<k/>
P.S.: Sorry for the spam, forgot Reply All.
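
(A minimal sketch of the case-proof rewrite mentioned in problem 1; the "TotalSum" alias is an illustrative name:)

    # Aliasing the aggregate avoids depending on the case of the generated
    # column name, which changed from 'SUM(Total)' to 'sum(Total)' in 1.5
    # (SPARK-9550, linked above).
    from pyspark.sql import functions as F
    agg = orders_3.groupBy("Year", "Month").agg(F.sum("Total").alias("TotalSum"))
    agg.show()
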
On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin <r...@databricks.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.5.0. 
The vote is open until Friday, Sep 4, 2015 at 21:00 UTC and passes if a 
majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...
To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc3:
https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
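
(A hedged sketch of checking a signature against that key; the artifact file name is illustrative:)

    # Import the release manager's public key, then verify a downloaded artifact
    gpg --import pwendell.asc
    gpg --verify spark-1.5.0-bin-hadoop2.6.tgz.asc spark-1.5.0-bin-hadoop2.6.tgz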

The staging repository for this release (published as 1.5.0-rc3) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1143/

The staging repository for this release (published as 1.5.0) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1142/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/

=======================================
How can I help test this release?
=======================================
If you are a Spark user, you can help us test this release by taking an existing Spark workload, running it on this release candidate, and reporting any regressions.
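
(For instance, a minimal sketch of re-running an existing job on the RC binaries; the archive name, paths, and master are illustrative:)

    # Unpack the RC build, then point its spark-submit at an existing workload
    tar xzf spark-1.5.0-bin-hadoop2.6.tgz && cd spark-1.5.0-bin-hadoop2.6
    bin/spark-submit --master local[4] /path/to/existing_job.py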

================================================
What justifies a -1 vote for this release?
================================================
This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs related to new features will not block this release.

===============================================================
What should happen to JIRA tickets still targeting 1.5.0?
===============================================================
1. It is OK for documentation patches to target 1.5.0 and still go into branch-1.5, since documentation will be packaged separately from the release.
2. New features for non-alpha modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target version.

==================================================
Major changes to help you focus your testing
==================================================
As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. 
I've curated a list of important changes for 1.5. For the complete list, please 
refer to the Apache JIRA changelog.
RDD/DataFrame/SQL APIs
- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column (see the sketch after this list)
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- Memory and local-disk-only checkpointing
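
(A quick sketch of the new expr function; df, price, and qty are illustrative names:)

    # expr() turns a SQL expression string into a DataFrame Column
    from pyspark.sql.functions import expr
    df.select(expr("price * qty").alias("total")).show()
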
DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join, aggregation, shuffle, and sorting with cache-friendly and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans
Data Sources, Hive, Hadoop, Mesos and Cluster Management
- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore connectivity for 0.13 through 1.2, internal Hive upgrade to 1.2)
- Support for persisting data in a Hive-compatible format in the metastore
- Support for data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata discovery and schema merging, support for reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names
SparkR
- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like
Streaming
- Backpressure for handling bursty input streams (config sketched after this list)
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across the cluster
- Streaming storage included in the web UI
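
(A sketch of enabling the new backpressure mechanism; the SparkConf setup is illustrative:)

    # Rate-limits receivers based on observed batch scheduling delays
    from pyspark import SparkConf
    conf = SparkConf().set("spark.streaming.backpressure.enabled", "true")
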
Machine Learning and Advanced Analytics
- Feature transformers: CountVectorizer, Discrete Cosine Transform, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer (an NGram sketch follows this list)
- Estimators under pipeline APIs: naive Bayes, k-means, and isotonic regression
- Algorithms: multilayer perceptron classifier, PrefixSpan for sequential pattern mining, association rule generation, 1-sample Kolmogorov-Smirnov test
- Improvements to existing algorithms: LDA, trees/ensembles, GMMs
- More efficient Pregel API implementation for GraphX
- Model summaries for linear and logistic regression
- Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
- Tuning and evaluation: train-validation split and multiclass classification evaluator
- Documentation: document the release version of public API methods
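
(As one example, a sketch of the NGram transformer; words_df is an illustrative DataFrame with an array-of-strings "words" column:)

    # NGram converts a tokenized text column into a column of n-grams
    from pyspark.ml.feature import NGram
    ngram = NGram(n=2, inputCol="words", outputCol="bigrams")
    bigrams_df = ngram.transform(words_df)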





  
