Hi Krishna,

Thanks for providing the notebook! I tried it and found that the problem is with PySpark's zip. I created a JIRA to track the issue: https://issues.apache.org/jira/browse/SPARK-4841
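For readers following the thread, the error quoted further down ("Can only zip RDDs with same number of elements in each partition") can be illustrated without Spark. The sketch below is only a toy model, not Spark's implementation and not the fix tracked in the JIRA: an "RDD" is modeled as a plain list of partitions, and `zip_partitioned`, `map_with_prediction`, and `toy_predict` are hypothetical names introduced for illustration. The numbers are toy values. It shows why two datasets with the same total size can still fail to zip, and why deriving the label inside a map over the original data (roughly the workaround described in the quoted message) sidesteps the alignment question.

```python
def zip_partitioned(left, right):
    """Zip two toy 'RDDs', each modeled as a list of partitions (lists).

    Pairing happens partition by partition, so both the number of
    partitions and the element count inside each partition must match.
    """
    if len(left) != len(right):
        raise ValueError("Can only zip RDDs with the same number of partitions")
    zipped = []
    for part_left, part_right in zip(left, right):
        if len(part_left) != len(part_right):
            raise ValueError(
                "Can only zip RDDs with same number of elements in each partition"
            )
        zipped.append(list(zip(part_left, part_right)))
    return zipped


def map_with_prediction(rdd, predict):
    """Pair every element with its prediction inside its own partition.

    No second dataset is involved, so there is nothing to misalign.
    """
    return [[(x, predict(x)) for x in part] for part in rdd]


# Toy data: three points spread over two partitions.
points = [[(28143, 174), (19244, 215)], [(97752, 43300)]]
labels_ok = [[1, 1], [0]]    # same layout as `points`
labels_bad = [[1], [1, 0]]   # same total count, different per-partition layout

pairs = zip_partitioned(points, labels_ok)

try:
    zip_partitioned(points, labels_bad)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

# Hypothetical stand-in for a trained model's predict function.
toy_predict = lambda p: 0 if p[0] > 50000 else 1
predicted = map_with_prediction(points, toy_predict)
```

Here `labels_bad` holds the same three labels as `labels_ok`, yet the partition-wise zip rejects it because the first partition has one element instead of two.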
-Xiangrui

On Thu, Dec 11, 2014 at 1:55 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> K-Means IPython notebook & data attached.
> It is the zip that gives the error; while one of the RDDs is from the
> prediction, most probably there is no problem with the K-Means.
> Lines 34, 35 & 36 are essentially the same, but only 36 works with 1.2.0.
> Interestingly, lines 34, 35 & 36 all work with 1.1.1 (checked just now).
>
> The plot thickens!
> In 1.1.1, freq_cluster_map.take(5) prints normally for 34 & 35, but in
> exponential form for 36. So there is some difference even in 1.1.1.
>
> #34,#35
> [(array([28143, 0, 174, 1, 0, 0, 7000]), 1),
>  (array([19244, 0, 215, 2, 0, 0, 6968]), 1),
>  (array([41354, 0, 4123, 4, 0, 0, 7034]), 1),
>  (array([14776, 0, 500, 1, 0, 0, 6952]), 1),
>  (array([97752, 0, 43300, 26, 2077, 4, 6935]), 0)]
>
> #36
> [(array([ 2.81430000e+04, 0.00000000e+00, 1.74000000e+02,
>           1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
>           7.00000000e+03]), 1),
>  (array([ 1.92440000e+04, 0.00000000e+00, 2.15000000e+02,
>           2.00000000e+00, 0.00000000e+00, 0.00000000e+00,
>           6.96800000e+03]), 1),
>  (array([ 4.13540000e+04, 0.00000000e+00, 4.12300000e+03,
>           4.00000000e+00, 0.00000000e+00, 0.00000000e+00,
>           7.03400000e+03]), 1),
>  (array([ 1.47760000e+04, 0.00000000e+00, 5.00000000e+02,
>           1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
>           6.95200000e+03]), 1),
>  (array([ 9.77520000e+04, 0.00000000e+00, 4.33000000e+04,
>           2.60000000e+01, 2.07700000e+03, 4.00000000e+00,
>           6.93500000e+03]), 0)]
>
> I had overwritten the naive Bayes example. Will chase the older versions
> down.
>
> Cheers
> <k/>
>
> On Wed, Dec 3, 2014 at 4:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Krishna, could you send me some code snippets for the issues you saw
>> in naive Bayes and k-means? -Xiangrui
>>
>> On Sun, Nov 30, 2014 at 6:49 AM, Krishna Sankar <ksanka...@gmail.com>
>> wrote:
>> > +1
>> > 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
>> >    -Dhadoop.version=2.4.0 -DskipTests clean package 16:46 min (slightly
>> >    slower connection)
>> > 2. Tested pyspark, MLlib - running as well as comparing results with 1.1.x
>> >    2.1. statistics OK
>> >    2.2. Linear/Ridge/Lasso Regression OK
>> >         Slight difference in the print method (vs. 1.1.x) of the model
>> >         object - with a label & more details. This is good.
>> >    2.3. Decision Tree, Naive Bayes OK
>> >         Changes in print(model) - now print(model.toDebugString()) - OK
>> >         Some changes in NaiveBayes. Different from my 1.1.x code - had to
>> >         flatten list structures, zip required same number in partitions.
>> >         After code changes ran fine.
>> >    2.4. KMeans OK
>> >         zip occasionally fails with error "localhost):
>> >         org.apache.spark.SparkException: Can only zip RDDs with same number of
>> >         elements in each partition"
>> >         Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
>> >         Made it work by doing a different transformation, i.e. reusing an
>> >         original rdd.
>> >    2.5. rdd operations OK
>> >         State of the Union Texts - MapReduce, Filter, sortByKey (word
>> >         count)
>> >    2.6. recommendation OK
>> >    2.7. Good work! In 1.x.x, had a map distinct over the movielens medium
>> >         dataset which never worked. Works fine in 1.2.0!
>> > 3. Scala MLlib - subset of examples as in #2 above, with Scala
>> >    3.1. statistics OK
>> >    3.2. Linear Regression OK
>> >    3.3. Decision Tree OK
>> >    3.4. KMeans OK
>> > Cheers
>> > <k/>
>> > P.S: Plan to add RF and .ml mechanics to this bank
>> >
>> > On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell <pwend...@gmail.com>
>> > wrote:
>> >
>> >> Please vote on releasing the following candidate as Apache Spark
>> >> version 1.2.0!
>> >>
>> >> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> >>
>> >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>> >>
>> >> Release artifacts are signed with the following key:
>> >> https://people.apache.org/keys/committer/pwendell.asc
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1048/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>> >>
>> >> Please vote on releasing this package as Apache Spark 1.2.0!
>> >>
>> >> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> >> if a majority of at least 3 +1 PMC votes are cast.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 1.2.0
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see
>> >> http://spark.apache.org/
>> >>
>> >> == What justifies a -1 vote for this release? ==
>> >> This vote is happening very late into the QA period compared with
>> >> previous votes, so -1 votes should only occur for significant
>> >> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> >> regressions, or bugs related to new features will not block this
>> >> release.
>> >>
>> >> == What default changes should I be aware of? ==
>> >> 1. The default value of "spark.shuffle.blockTransferService" has been
>> >> changed to "netty"
>> >> --> Old behavior can be restored by switching to "nio"
>> >>
>> >> 2. The default value of "spark.shuffle.manager" has been changed to
>> >> "sort".
>> >> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>> >> "hash".
>> >>
>> >> == Other notes ==
>> >> Because this vote is occurring over a weekend, I will likely extend
>> >> the vote if this RC survives until the end of the vote period.
>> >>
>> >> - Patrick
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org