Hi Krishna,

Thanks for providing the notebook! I tried it and found that the problem
is with PySpark's zip. I created a JIRA to track the issue:
https://issues.apache.org/jira/browse/SPARK-4841
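
For readers following the thread, the constraint behind the error message
quoted below ("Can only zip RDDs with same number of elements in each
partition") can be modeled in plain Python. This is only a rough sketch and
not Spark's implementation: the function zip_partitions and the sample
lists are hypothetical, with nested lists standing in for RDD partitions.
It shows that pairing happens partition by partition, so two datasets with
the same total number of elements can still fail to zip.

```python
def zip_partitions(left, right):
    """Pair two partitioned datasets partition by partition.

    `left` and `right` are lists of lists; each inner list is a stand-in
    for one RDD partition (an assumption made for illustration only).
    """
    if len(left) != len(right):
        raise ValueError("Can only zip datasets with the same number of partitions")
    zipped = []
    for part_a, part_b in zip(left, right):
        # Elements are matched by position inside each partition, so the
        # per-partition lengths must agree, not just the overall totals.
        if len(part_a) != len(part_b):
            raise ValueError(
                "Can only zip datasets with same number of elements in each partition"
            )
        zipped.append(list(zip(part_a, part_b)))
    return zipped


# Five elements on each side, split 2 + 3 in both datasets: zip succeeds.
features = [[10, 20], [30, 40, 50]]
labels_ok = [[1, 1], [0, 1, 0]]
print(zip_partitions(features, labels_ok))

# Still five elements, but split 3 + 2: the partition-wise check fails.
labels_bad = [[1, 1, 0], [1, 0]]
try:
    zip_partitions(features, labels_bad)
except ValueError as err:
    print("zip failed:", err)
```

In this model, two datasets derived from the same source can only be zipped
if every intermediate step keeps the per-partition element counts aligned.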

-Xiangrui

On Thu, Dec 11, 2014 at 1:55 PM, Krishna Sankar <ksanka...@gmail.com> wrote:
> K-Means iPython notebook & data attached.
> It is the zip that gives the error; while one of the RDDs is from the
> prediction, most probably there is no problem with the K-Means.
> Lines 34, 35 & 36 are essentially the same, but only 36 works with 1.2.0.
> Interestingly, lines 34, 35 & 36 all work with 1.1.1 (checked just now).
>
> The plot thickens!
> In 1.1.1, freq_cluster_map.take(5) prints normally for 34 & 35, but in
> exponential form for 36. So there is some difference even in 1.1.1.
> #34,#35
> [(array([28143,     0,   174,     1,     0,     0,  7000]), 1),
>  (array([19244,     0,   215,     2,     0,     0,  6968]), 1),
>  (array([41354,     0,  4123,     4,     0,     0,  7034]), 1),
>  (array([14776,     0,   500,     1,     0,     0,  6952]), 1),
>  (array([97752,     0, 43300,    26,  2077,     4,  6935]), 0)]
>
> #36
> [(array([  2.81430000e+04,   0.00000000e+00,   1.74000000e+02,
>            1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
>            7.00000000e+03]), 1),
>  (array([  1.92440000e+04,   0.00000000e+00,   2.15000000e+02,
>            2.00000000e+00,   0.00000000e+00,   0.00000000e+00,
>            6.96800000e+03]), 1),
>  (array([  4.13540000e+04,   0.00000000e+00,   4.12300000e+03,
>            4.00000000e+00,   0.00000000e+00,   0.00000000e+00,
>            7.03400000e+03]), 1),
>  (array([  1.47760000e+04,   0.00000000e+00,   5.00000000e+02,
>            1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
>            6.95200000e+03]), 1),
>  (array([  9.77520000e+04,   0.00000000e+00,   4.33000000e+04,
>            2.60000000e+01,   2.07700000e+03,   4.00000000e+00,
>            6.93500000e+03]), 0)]
>
> I had overwritten the naive Bayes example. Will chase down the older
> versions.
>
> Cheers
> <k/>
>
> On Wed, Dec 3, 2014 at 4:19 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Krishna, could you send me some code snippets for the issues you saw
>> in naive Bayes and k-means? -Xiangrui
>>
>> On Sun, Nov 30, 2014 at 6:49 AM, Krishna Sankar <ksanka...@gmail.com>
>> wrote:
>> > +1
>> > 1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
>> > -Dhadoop.version=2.4.0 -DskipTests clean package 16:46 min (slightly
>> > slower connection)
>> > 2. Tested pyspark, MLlib - running as well as comparing results with 1.1.x
>> > 2.1. statistics OK
>> > 2.2. Linear/Ridge/Lasso Regression OK
>> >        Slight difference in the print method (vs. 1.1.x) of the model
>> > object - with a label & more details. This is good.
>> > 2.3. Decision Tree, Naive Bayes OK
>> >        Changes in print(model) - now print(model.toDebugString()) - OK
>> >        Some changes in NaiveBayes. Different from my 1.1.x code - had to
>> > flatten list structures; zip required the same number of elements in
>> > each partition.
>> >        After the code changes, it ran fine.
>> > 2.4. KMeans OK
>> >        zip occasionally fails with error "localhost):
>> > org.apache.spark.SparkException: Can only zip RDDs with same number of
>> > elements in each partition"
>> > Has https://issues.apache.org/jira/browse/SPARK-2251 reappeared?
>> > Made it work by doing a different transformation, i.e. reusing an
>> > original RDD.
>> > 2.5. rdd operations OK
>> >        State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>> > 2.6. recommendation OK
>> > 2.7. Good work! In 1.x.x, had a map distinct over the movielens medium
>> > dataset which never worked. Works fine in 1.2.0!
>> > 3. Scala MLlib - subset of examples as in #2 above, with Scala
>> > 3.1. statistics OK
>> > 3.2. Linear Regression OK
>> > 3.3. Decision Tree OK
>> > 3.4. KMeans OK
>> > Cheers
>> > <k/>
>> > P.S: Plan to add RF and .ml mechanics to this bank
>> >
>> > On Fri, Nov 28, 2014 at 9:16 PM, Patrick Wendell <pwend...@gmail.com>
>> > wrote:
>> >
>> >> Please vote on releasing the following candidate as Apache Spark
>> >> version 1.2.0!
>> >>
>> >> The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
>> >>
>> >>
>> >> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb
>> >>
>> >> The release files, including signatures, digests, etc. can be found at:
>> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1/
>> >>
>> >> Release artifacts are signed with the following key:
>> >> https://people.apache.org/keys/committer/pwendell.asc
>> >>
>> >> The staging repository for this release can be found at:
>> >> https://repository.apache.org/content/repositories/orgapachespark-1048/
>> >>
>> >> The documentation corresponding to this release can be found at:
>> >> http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/
>> >>
>> >> Please vote on releasing this package as Apache Spark 1.2.0!
>> >>
>> >> The vote is open until Tuesday, December 02, at 05:15 UTC and passes
>> >> if a majority of at least 3 +1 PMC votes are cast.
>> >>
>> >> [ ] +1 Release this package as Apache Spark 1.2.0
>> >> [ ] -1 Do not release this package because ...
>> >>
>> >> To learn more about Apache Spark, please see
>> >> http://spark.apache.org/
>> >>
>> >> == What justifies a -1 vote for this release? ==
>> >> This vote is happening very late into the QA period compared with
>> >> previous votes, so -1 votes should only occur for significant
>> >> regressions from 1.0.2. Bugs already present in 1.1.X, minor
>> >> regressions, or bugs related to new features will not block this
>> >> release.
>> >>
>> >> == What default changes should I be aware of? ==
>> >> 1. The default value of "spark.shuffle.blockTransferService" has been
>> >> changed to "netty"
>> >> --> Old behavior can be restored by switching to "nio"
>> >>
>> >> 2. The default value of "spark.shuffle.manager" has been changed to
>> >> "sort".
>> >> --> Old behavior can be restored by setting "spark.shuffle.manager" to
>> >> "hash".
>> >>
>> >> == Other notes ==
>> >> Because this vote is occurring over a weekend, I will likely extend
>> >> the vote if this RC survives until the end of the vote period.
>> >>
>> >> - Patrick
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >>
>
>

