Re: GraphX implementation of ALS?

2015-05-26 Thread Debasish Das
In general, for implicit feedback in ALS you have to do a blocked Gram matrix calculation, which might not fit into the GraphX flow, and a lot of blocked operations can be used... but if your loss is a likelihood or KL divergence, or just simple SGD update rules rather than least squares, then the GraphX idea makes sense.
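To make the blocked Gram matrix point concrete, here is a minimal sketch (the rank k, block shapes, and names are assumed for illustration, not taken from the thread): in implicit-feedback ALS the small k x k matrix Y^T Y is shared by every update, and because Y is row-partitioned, Y^T Y can be accumulated block by block, with each partition contributing only a tiny k x k partial product.

    import numpy as np

    k = 10  # factor rank (assumed)
    # Row-partitioned blocks of the factor matrix Y, standing in for RDD partitions.
    blocks = [np.random.rand(1000, k) for _ in range(4)]

    # Y^T Y decomposes as the sum of per-block Gram matrices Y_b^T Y_b,
    # so each block ships only a k x k partial result to the driver.
    YtY = sum(b.T.dot(b) for b in blocks)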

Re: Power iteration clustering

2015-05-26 Thread Debasish Das
OK, I thought we tried that and found the GraphX-based flow was faster due to some inherent problem structure (GraphX can compute K eigenvectors at the same time). I will report some stats from row-similarity experiments comparing a vector multiply over a blocked IndexedRowMatrix against the current PIC flow... On May 26, 2015

Re: GraphX implementation of ALS?

2015-05-26 Thread Ben Mabey
On 5/26/15 5:45 PM, Ankur Dave wrote: This is the latest GraphX-based ALS implementation that I'm aware of: https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala When I benchmarked it last year, it was about twice as slow as MLlib's ALS,

Re: Power iteration clustering

2015-05-26 Thread Joseph Bradley
That's a good question; I could imagine it being much more efficient if kept in a BlockMatrix and using BLAS2 ops. On Sat, May 23, 2015 at 8:09 PM, Debasish Das wrote: > Hi, > > What was the motivation to write power iteration clustering using GraphX > and not a vector-matrix multiplication over
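For reference, power iteration clustering boils down to repeated matrix-vector products against the normalized affinity matrix, which is exactly the BLAS2 (gemv) operation a BlockMatrix could exploit. A minimal NumPy sketch, with the L1 normalization assumed from Lin & Cohen's PIC paper:

    import numpy as np

    def power_iteration(W, iters=20, seed=0):
        # W: row-normalized affinity matrix; v converges to the PIC embedding.
        rng = np.random.RandomState(seed)
        v = rng.rand(W.shape[0])
        for _ in range(iters):
            v = W.dot(v)          # the BLAS2 step a BlockMatrix could batch
            v /= np.abs(v).sum()  # L1 normalization, per the PIC paper
        return v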

Re: GraphX implementation of ALS?

2015-05-26 Thread Ankur Dave
This is the latest GraphX-based ALS implementation that I'm aware of: https://github.com/ankurdave/spark/blob/GraphXALS/graphx/src/main/scala/org/apache/spark/graphx/lib/ALS.scala When I benchmarked it last year, it was about twice as slow as MLlib's ALS, and I think the latter has gotten faster s

GraphX implementation of ALS?

2015-05-26 Thread Ben Mabey
Hi all, I've heard in a number of presentations that Spark's ALS implementation was going to be moved over to a GraphX version. For example, this presentation on GraphX (slide #23) at the Spark Summit mentioned a 40

Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Andrew Or
-1 Found a new blocker SPARK-7864 that is being resolved by https://github.com/apache/spark/pull/6419. 2015-05-26 11:32 GMT-07:00 Shivaram Venkataraman : > +1 > > Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN > Hadoop 2.

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
I think relative imports cannot help in this case. When you run scripts in pyspark/sql, Python doesn't know anything about pyspark.sql; it just sees types.py as a standalone module. On Tue, May 26, 2015 at 12:44 PM, Punyashloka Biswal wrote: > Davies: Can we use relative imports (import .types) in the

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
When you run the tests in python/pyspark/sql/ via bin/spark-submit python/pyspark/sql/dataframe.py, the current directory is the first item in sys.path, so sql/types.py takes priority over python3.4/types.py and the tests fail. On Tue, May 26, 2015 at 12:08 PM, Justin Uang wrote: > T
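A minimal reproduction of the shadowing Davies describes (file and directory names assumed): any script launched from a directory containing a types.py resolves "import types" to the local file, because the script's own directory is prepended to sys.path.

    # Run this as a script from a directory that also contains a file named types.py.
    import sys
    print(sys.path[0])      # the script's own directory is searched first
    import types
    print(types.__file__)   # resolves to the local types.py, not the stdlib module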

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Punyashloka Biswal
Davies: Can we use relative imports (import .types) in the unit tests in order to disambiguate between the global and local module? Punya On Tue, May 26, 2015 at 3:09 PM Justin Uang wrote: > Thanks for clarifying! I don't understand Python package and module names > that well, but I thought th

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
Thanks for clarifying! I don't understand Python package and module names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu wrote: > There is a module called 'types' in Python 3: >

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
There is a module called 'types' in Python 3:

davies@localhost:~/work/spark$ python3
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import types
>>> types W

Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
In commit 04e44b37 (the migration to Python 3), pyspark/sql/types.py was renamed to pyspark/sql/_types.py, and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up? Th
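The "magic" being described is presumably module aliasing via sys.modules; here is a hedged sketch of what pyspark/sql/__init__.py might do (the actual code in commit 04e44b37 may differ):

    # Load the renamed module, then re-register it under the old public name
    # so that "import pyspark.sql.types" keeps working for callers.
    import sys
    from pyspark.sql import _types as types
    types.__name__ = 'pyspark.sql.types'
    sys.modules[types.__name__] = types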

Re: SparkR and RDDs

2015-05-26 Thread Reynold Xin
You definitely don't want to implement k-means in R, since it would be very slow. Just providing R wrappers for the MLlib implementation is the way to go. I believe one of the major upcoming items for SparkR is the MLlib wrappers. On Tue, May 26, 2015 at 7:46 AM, Andrew Psaltis wrote: > Hi Alek, > T

Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Shivaram Venkataraman
+1 Tested the SparkR binaries using a Standalone Hadoop 1 cluster, a YARN Hadoop 2.4 cluster, and on my Mac. A minor thing I noticed is that on the Amazon Linux AMI, the R version is 3.1.1 while the binaries seem to have been built with R 3.1.3. This leads to a warning when we load the package but does

Re: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-26 Thread Iulian Dragoș
I tried the 1.4.0-rc2 binaries on a 3-node Mesos cluster; everything seemed to work fine, both spark-shell and spark-submit. Cluster-mode deployment also worked. +1 (non-binding) iulian On Tue, May 26, 2015 at 4:44 AM, jameszhouyi wrote: > Compiled: > git clone https://github.com/apache/spark.git

Re: SparkR and RDDs

2015-05-26 Thread Andrew Psaltis
Hi Alek, Thanks for the info. You are correct that using the three colons does work. Admittedly I am an R novice, but since the three colons are used to access hidden methods, it seems pretty dirty. Can someone shed light on the design direction being taken with SparkR? Should I really be accessing

Re: SparkR and RDDs

2015-05-26 Thread Eskilson,Aleksander
From the changes to the namespace file, that appears to be correct; all methods of the RDD API have been made private, which in R means that you may still access them by using the namespace prefix SparkR with three colons, e.g. SparkR:::func(foo, bar). So a starting place for porting old Sp