Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is creating one Row per item, which means that columnSimilarities will compute similarities between users. If you transpose the matrix (or construct it as the transpose), then columnSimilarities should do what you want, and it will return meaningful indices. Joseph On Fri, Apr
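A minimal sketch of that layout, assuming one Row per user and one column per item (the ratings, item count, and the spark-shell SparkContext `sc` are all illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One Row per user, one column per item, so columnSimilarities() compares items.
val rows = sc.parallelize(Seq(
  Vectors.dense(4.0, 0.0, 5.0),   // user 1's ratings for items 0, 1, 2
  Vectors.dense(3.0, 2.0, 0.0),   // user 2
  Vectors.dense(0.0, 4.0, 4.0)    // user 3
))
val mat = new RowMatrix(rows)

// columnSimilarities() returns a CoordinateMatrix whose entries carry the
// (itemI, itemJ) column indices, i.e. the item pair being compared.
mat.columnSimilarities().entries.collect().foreach { e =>
  println(s"items ${e.i} and ${e.j}: cosine similarity ${e.value}")
}
```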

Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
Yes, the count() should be the first task, and the sampling + collecting should be the second task. The first one is probably slow because the RDD being sampled is not yet cached/materialized. K-Means creates some RDDs internally while learning, and since they aren't needed after learning, they a
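As a hedged illustration of the caching point (the file path, feature format, k, and iteration counts are placeholders, and `sc` is assumed to come from spark-shell):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Cache (and optionally materialize) the training data yourself so that the
// first sampling job does not have to recompute the whole input lineage.
val points = sc.textFile("hdfs:///path/to/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
points.count()   // forces materialization before training starts

val model = KMeans.train(points, 10, 20, 1, KMeans.RANDOM)
```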

Reply: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Even after grouping by only name, the issue (ClassCastException) is still there. - Original Message - From: ayan guha To: doovs...@sina.com Cc: user Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown Date: 2015-04-25 22:33 Sorry if I am looking at the wrong issue, but your query is wron

Re: directory loader in windows

2015-04-25 Thread ayan guha
This code is in Python. Also, I tried with a forward slash at the end, with the same result. On 26 Apr 2015 01:36, "Jeetendra Gangele" wrote: > also if this code is in scala why not val in newsY? is this define above? > loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds" > newsY = sc.textFile(loc) > print news

Re: DAG

2015-04-25 Thread Corey Nolet
Giovanni, The DAG can be walked by calling the "dependencies()" function on any RDD. It returns a Seq of Dependency objects, each of which references a parent RDD. If you start at the leaves and walk through the parents until dependencies() returns an empty Seq, you ultimately have your DAG. On Sat, Apr 25, 2015 at 1:28 PM, Akhi
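A small sketch of that walk; each Dependency exposes its parent via `.rdd` (the example lineage and the spark-shell `sc` are illustrative):

```scala
import org.apache.spark.rdd.RDD

// Recursively print an RDD's lineage by following each Dependency's parent RDD.
def walk(rdd: RDD[_], depth: Int = 0): Unit = {
  println(("  " * depth) + rdd)
  rdd.dependencies.foreach(dep => walk(dep.rdd, depth + 1))
}

val leaf = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
walk(leaf)   // walks from the leaf back to the root ParallelCollectionRDD
```

For a quick look without writing any code, `leaf.toDebugString` prints a similar picture of the lineage.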

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Yang Lei
I have no problem running the socket text stream sample in the same environment. Thanks Yang Sent from my iPhone > On Apr 25, 2015, at 1:30 PM, Akhil Das wrote: > > Make sure you have >= 2 cores for your streaming application. > > Thanks > Best Regards > >> On Sat, Apr 25, 2015 at 3:0

Re: Convert DStream[Long] to Long

2015-04-25 Thread Sergio Jiménez Barrio
It is solved. Thank you! It is more efficient: messages.foreachRDD(rdd => { if(!rdd.isEmpty) //Do whatever you want. }) 2015-04-25 19:21 GMT+02:00 Akhil Das : > Like this? > > messages.foreachRDD(rdd => { > > if(rdd.count() > 0) //Do whatever you want. > > }) > > > Thanks > Best Regards > > On Fri,
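For reference, a sketch of the final version in context (the DStream `messages` is assumed to already exist, e.g. from a socket or Kafka source; RDD.isEmpty is available from Spark 1.3 onwards):

```scala
messages.foreachRDD { rdd =>
  // isEmpty can return as soon as it finds a single element, whereas count()
  // scans the whole RDD on every batch just to compare the result against 0.
  if (!rdd.isEmpty()) {
    // ... do whatever you want with the non-empty batch ...
    println(s"processing ${rdd.count()} records")
  }
}
```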

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Akhil Das
Make sure you have >= 2 cores for your streaming application. Thanks Best Regards On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei wrote: > I hit the same issue "as if the directory has no files at all" when running > the sample "examples/src/main/python/streaming/hdfs_wordcount.py" with a > local
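A hedged sketch of what this looks like for a local run (the app name, batch interval, and socket source are placeholders); with only one core the receiver occupies it and no batches ever get processed:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// "local[2]" gives one core to the receiver and at least one to batch processing;
// on a cluster the same requirement applies to the total executor cores.
val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```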

Re: DAG

2015-04-25 Thread Akhil Das
Maybe this will give you a good start: https://github.com/apache/spark/pull/2077 Thanks Best Regards On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco wrote: > Hi, > I would like to know if it is possible to build the DAG before actually > executing the application. My guess is that in

Re: Convert DStream[Long] to Long

2015-04-25 Thread Akhil Das
Like this? messages.foreachRDD(rdd => { if(rdd.count() > 0) //Do whatever you want. }) Thanks Best Regards On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio < drarse.a...@gmail.com> wrote: > Hi, > > I need to compare the count of received messages to check whether it is 0 or not, but > messages.count() ret

Re: spark1.3.1 using mysql error!

2015-04-25 Thread Anand Mohan
Yes, you would need to add the MySQL driver jar to the Spark driver & executor classpath. Either using the deprecated SPARK_CLASSPATH environment variable (which the latest docs still recommend anyway, although it's deprecated), like so >export SPARK_CLASSPATH=/usr/share/java/mysql-connector.jar >spar
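A sketch of a non-deprecated alternative (the jar path, database coordinates, and table name are illustrative; an existing SQLContext `sqlContext` is assumed, and the driver-side classpath has to be supplied at launch time rather than set in code):

```scala
// Launch-time equivalent of SPARK_CLASSPATH:
//
//   spark-submit \
//     --driver-class-path /usr/share/java/mysql-connector-java.jar \
//     --conf spark.executor.extraClassPath=/usr/share/java/mysql-connector-java.jar \
//     ...
//
// Once the driver jar is on both classpaths, the Spark SQL JDBC data source can use it.
val people = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
  "dbtable" -> "people"))
people.printSchema()
```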

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread Sujeevan
If your use case is more about querying the RDBMS and then bringing the results into Spark for some analysis, then the Spark SQL JDBC data source API is the best option. If your use case is to bring the entire data to spa
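A hedged sketch of the data source approach for pulling a whole table over in parallel (the connection URL, table, partition column, and bounds are placeholders; assumes a SQLContext `sqlContext` and a JDBC driver already on the classpath):

```scala
val orders = sqlContext.load("jdbc", Map(
  "url"             -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
  "dbtable"         -> "orders",
  // These options split the read into parallel partitions over a numeric column.
  "partitionColumn" -> "id",
  "lowerBound"      -> "1",
  "upperBound"      -> "1000000",
  "numPartitions"   -> "8"))

orders.registerTempTable("orders")
sqlContext.sql("select count(*) from orders").show()
```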

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Also, if this code is in Scala, why is newsY not a val? Is it defined above? loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds" newsY = sc.textFile(loc) print newsY.count() On 25 April 2015 at 20:08, ayan guha wrote: > Hi > > I am facing this weird issue. > > I am on Windows, and I am trying to

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Extra forward slash at the end; sometimes I have seen this kind of issue. On 25 April 2015 at 20:50, Jeetendra Gangele wrote: > loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds\\" > > On 25 April 2015 at 20:49, Jeetendra Gangele wrote: > >> Hi Ayan can you try below line >> >> loc = "D:\\Project

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
Hi Ayan can you try below line loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds" On 25 April 2015 at 20:08, ayan guha wrote: > Hi > > I am facing this weird issue. > > I am on Windows, and I am trying to load all files within a folder. Here > is my code - > > loc = "D:\\Project\\Spark\\code

Re: directory loader in windows

2015-04-25 Thread Jeetendra Gangele
loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds\\" On 25 April 2015 at 20:49, Jeetendra Gangele wrote: > Hi Ayan can you try below line > > loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds" > > On 25 April 2015 at 20:08, ayan guha wrote: > >> Hi >> >> I am facing this weird issue. >> >> I

directory loader in windows

2015-04-25 Thread ayan guha
Hi I am facing this weird issue. I am on Windows, and I am trying to load all files within a folder. Here is my code - loc = "D:\\Project\\Spark\\code\\news\\jsonfeeds" newsY = sc.textFile(loc) print newsY.count() Even this simple code fails. I have tried giving exact file names, every

Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread ayan guha
Sorry if I am looking at the wrong issue, but your query is wrong. You should group by only on name. On Sat, Apr 25, 2015 at 11:59 PM, wrote: > Hi all, > When I query Postgresql based on Spark SQL like this: > dataFrame.registerTempTable("Employees") > val emps = sqlContext.sq

Reply: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Yeah, same issue. I noticed this issue has not been solved yet. - Original Message - From: Ted Yu To: doovs...@sina.com Cc: user Subject: Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown Date: 2015-04-25 22:04 Looks like this is related: https://issues.apache.org/jira/browse/SPARK-5456 On Sat, Apr 25

Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread Ted Yu
Looks like this is related: https://issues.apache.org/jira/browse/SPARK-5456 On Sat, Apr 25, 2015 at 6:59 AM, wrote: > Hi all, > When I query Postgresql based on Spark SQL like this: > dataFrame.registerTempTable("Employees") > val emps = sqlContext.sql("select name, sum(salary) from

Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread doovsaid
Hi all, When I query Postgresql based on Spark SQL like this: dataFrame.registerTempTable("Employees") val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary") monitor { emps.take(10) .map(row => (row.getString(0), row.getDecima

KMeans takeSample jobs and RDD cached

2015-04-25 Thread podioss
Hi, I am running the k-means algorithm with initialization mode set to random and various dataset sizes and numbers of clusters, and I have a question regarding the takeSample job of the algorithm. More specifically, I notice that in every application there are two sampling jobs. The first one is consuming

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-25 Thread ayan guha
Actually, Spark SQL provides a data source. Here it is from the documentation - JDBC To Other Databases: Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD