Re: IP to geo information in spark streaming application

2014-12-02 Thread qinwei
1) I think using a library-based solution is a better idea; we used that, and it works. 2) We used a broadcast variable, and it works. qinwei  From: Noam Kfir  Date: 2014-12-02 23:14  To: user@spark.apache.org  Subject: IP to geo information in spark streaming application  Hi I'm n
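A minimal sketch of the broadcast approach described above. GeoIpLookup is a hypothetical stand-in for whatever library-based geo-IP resolver is used, and ipStream is assumed to be a DStream[String] of client IPs on a StreamingContext ssc:

    // Hypothetical stand-in for a real geo-IP library; must be serializable
    // so it can be broadcast to the executors.
    class GeoIpLookup(dbPath: String) extends Serializable {
      def lookup(ip: String): String = "unknown" // e.g. a country code
    }

    val geoBc = ssc.sparkContext.broadcast(new GeoIpLookup("/path/to/geoip.dat"))
    val enriched = ipStream.map(ip => (ip, geoBc.value.lookup(ip))) // local lookup, no per-record I/O

The broadcast ships the database to each executor once, instead of serializing it into every task closure.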

Re: Joined RDD

2014-11-12 Thread qinwei
According to the log, it is stored. qinwei  From: ajay garg  Date: 2014-11-13 14:56  To: user  Subject: Joined RDD  Hi, I have two RDDs A and B which are created by reading files from HDFS. I have a third RDD C which is created by taking the join of A and B. All three RDDs (A, B and C) are not cached. Now if I per
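For context, a sketch of the setup described (paths and keying are placeholders): with none of the three RDDs cached, every action on C re-reads both files from HDFS and re-runs the join, so c.cache() before the first action avoids the repeated work.

    val a = sc.textFile("hdfs:///fileA").map(line => (line.split(",")(0), line))
    val b = sc.textFile("hdfs:///fileB").map(line => (line.split(",")(0), line))
    val c = a.join(b)   // RDD[(String, (String, String))]
    c.count()           // reads A and B, shuffles, joins
    c.count()           // repeats all of the above unless c was cached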

why flatmap has shuffle

2014-11-12 Thread qinwei
=> arg.asInstanceOf[(String, String, Long, String, String)]).filter(arg => arg._3 != 0)
val flatMappedRDD = rdd.flatMap(arg => List((arg._1, (arg._2, arg._3, 1)), (arg._2, (arg._1, arg._3, 0))))
Thanks for your help! qinwei
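flatMap itself is a narrow transformation and does not shuffle; what usually shows up is the map-side shuffle write of the wide operation that follows, which the UI attributes to the stage containing the flatMap. A sketch with illustrative data (reduceByKey stands in for whatever keyed operation comes next):

    val edges = sc.parallelize(Seq(("u1", "u2", 3L), ("u2", "u3", 5L)))
    val flatMapped = edges.flatMap { case (src, dst, w) =>
      List((src, (dst, w, 1)), (dst, (src, w, 0))) // emit both directions; still no shuffle
    }
    val reduced = flatMapped.reduceByKey((x, y) => x) // the shuffle happens at this stage boundary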

Re: Pass RDD to functions

2014-11-12 Thread qinwei
I think it's OK; feel free to treat an RDD like a common object. qinwei  From: Deep Pradhan  Date: 2014-11-12 18:24  To: user@spark.apache.org  Subject: Pass RDD to functions  Hi, Can we pass an RDD to functions? Like, can we do the following? def func(temp: RDD[String]): RDD[String] = { // body of the
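A minimal sketch of the pattern being asked about: an RDD is an ordinary immutable object on the driver, so passing it to a function and returning a transformed RDD works as expected.

    import org.apache.spark.rdd.RDD

    def func(temp: RDD[String]): RDD[String] =
      temp.filter(_.nonEmpty).map(_.toUpperCase)

    val out = func(sc.parallelize(Seq("a", "", "b")))
    out.collect() // Array("A", "B")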

Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei
Thanks for your reply! As you mentioned, the insert clause is not executed because the results of args.map are never used anywhere; after I modified the code, it works. qinwei  From: Tobias Pfeiffer  Date: 2014-11-07 18:04  To: qinwei  CC: user  Subject: Re: about write mongodb in
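A sketch of the pitfall being described, with a hypothetical insertIntoMongo helper standing in for the real client call: Iterator.map is lazy, so if the mapped iterator is never consumed, the inserts never run.

    val rdd = sc.parallelize(Seq("r1", "r2"))
    def insertIntoMongo(record: String): Unit = () // hypothetical stand-in for the MongoDB write

    // Broken: Iterator.map is lazy and the mapped iterator is discarded,
    // so insertIntoMongo never executes.
    val broken = rdd.mapPartitions { args =>
      args.map { a => insertIntoMongo(a); a } // never consumed
      args                                    // the original iterator is returned instead
    }

    // Fixed: return the mapped iterator so downstream consumption runs the inserts
    // (or use foreachPartition when no result RDD is needed).
    val fixed = rdd.mapPartitions { args =>
      args.map { a => insertIntoMongo(a); a }
    }
    fixed.count() // consuming the RDD triggers the writes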

Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei
Thanks for your reply! According to your hint, the code should be like this:
// i want to save data in rdd to mongodb and hdfs
rdd.saveAsNewAPIHadoopFile()
rdd.saveAsTextFile()
but will the application read HDFS twice? qinwei  From: Akhil Das  Date: 2014-11-07
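Each action re-runs the lineage, so without persisting, the input is read from HDFS once per save. A sketch of the fix (paths are placeholders; the real first save would use saveAsNewAPIHadoopFile with the MongoDB Hadoop OutputFormat, elided here):

    val rdd = sc.textFile("hdfs:///input/path")
    rdd.cache()                             // materialized during the first action
    rdd.saveAsTextFile("hdfs:///out/copy1") // reads HDFS and populates the cache
    rdd.saveAsTextFile("hdfs:///out/copy2") // served from the cache, no second read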

about write mongodb in mapPartitions

2014-11-07 Thread qinwei
is there something wrong? I know that collecting the newRDD to the driver and then saving it to mongodb will succeed, but will the following saveAsTextFile read the filesystem once again? Thanks  qinwei

about aggregateByKey and standard deviation

2014-10-31 Thread qinwei
At the moment, I have done that by using groupByKey and map. I notice that groupByKey is very expensive, but I cannot figure out how to do it using aggregateByKey, so I wonder: is there any better way to do this? Thanks! qinwei
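One single-pass approach, sketched under the assumption of an RDD[(String, Double)]: accumulate (count, sum, sum of squares) per key with aggregateByKey, then derive the standard deviation, avoiding groupByKey's shuffle of every raw value.

    val data = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
    val stats = data.aggregateByKey((0L, 0.0, 0.0))(
      { case ((n, s, q), x) => (n + 1, s + x, q + x * x) },                  // fold one value in
      { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }   // merge partitions
    )
    val stddev = stats.mapValues { case (n, s, q) =>
      val mean = s / n
      math.sqrt(q / n - mean * mean) // population standard deviation
    }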

Re: Re: problem with partitioning

2014-09-28 Thread qinwei
Thank you for your reply; your tips on code refactoring are helpful. After a second look at the code, the casts and null checks are really unnecessary. qinwei  From: Sean Owen  Date: 2014-09-28 15:03  To: qinwei  CC: user  Subject: Re: problem with partitioning  (Most of this code is not relevant

Re: RE: problem with data locality api

2014-09-27 Thread qinwei
Thanks for your reply! qinwei  From: Shao, Saisai  Sent: 2014-09-28 14:42  To: qinwei  Cc: user  Subject: RE: problem with data locality api  Hi, The first conf is used for Hadoop to determine the locality distribution of the HDFS file. The second conf is used for Spark; though with the same name, actually they are two

problem with partitioning

2014-09-27 Thread qinwei
and path3, I want to change the number of partitions with the code below:
val rdd1 = sc.textFile(path1, 1920)
val rdd2 = sc.textFile(path2, 1920)
val rdd3 = sc.textFile(path3, 1920)
By doing this, I expect there to be 1920 tasks in total, but I found the number of tasks becomes 8920. Any idea what's going on here? Thanks! qinwei
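Worth noting as the likely explanation: the second argument to textFile is minPartitions, a lower bound rather than an exact count, so Hadoop's input splits can yield more partitions for large files. A sketch of pinning the count exactly:

    val exact = sc.textFile(path1).repartition(1920) // exactly 1920, at the cost of a shuffle
    val bound = sc.textFile(path2, 1920)             // at least 1920, possibly more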

problem with data locality api

2014-09-27 Thread qinwei
")))
val sc = new SparkContext(conf, locData)
but I found the two confs above are of different types: conf in the first line is of type org.apache.hadoop.conf.Configuration, and conf in the second line is of type SparkConf. Can anyone explain that to me or give me some example code? qinwei
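A short sketch of the distinction for anyone hitting the same confusion: the two "conf" values are unrelated classes that happen to share a name.

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkConf

    val hadoopConf = new Configuration() // Hadoop settings: HDFS access, input formats, split/locality info
    val sparkConf = new SparkConf()      // Spark settings: master URL, app name, executor options
      .setAppName("locality-example")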

Re: How to use multi thread in RDD map function ?

2014-09-27 Thread qinwei
Among the options of spark-submit, there are two which may be helpful to your problem: "--total-executor-cores NUM" (standalone and Mesos only) and "--executor-cores" (YARN only). qinwei  From: myasuka  Date: 2014-09-28 11:44  To: user  Subject: How to use mult
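The same limits can also be set programmatically; a sketch with illustrative values (spark.cores.max is the config behind --total-executor-cores, spark.executor.cores the one behind --executor-cores):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("multi-thread-map")
      .set("spark.cores.max", "16")     // standalone/Mesos: cap on total cores across executors
      .set("spark.executor.cores", "4") // cores per executor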

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread qinwei
Thanks a lot for your reply, it gave me much inspiration. qinwei  From: Sean Owen  Date: 2014-04-25 14:10  To: user  Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark  So you are computing all-pairs similarity over 20M users? This is going to take

Re: Re: what is the best way to do cartesian

2014-04-27 Thread qinwei
Thanks a lot for your reply, but I have tried the built-in RDD.cartesian() method before; it didn't make it faster. qinwei  From: Alex Boisvert  Date: 2014-04-26 00:32  To: user  Subject: Re: what is the best way to do cartesian  You might want to try the built-in RDD.cartesian() method.
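For reference, minimal usage of the built-in method mentioned (illustrative data; the result has |xs| × |ys| elements, so it is inherently expensive on large inputs):

    val xs = sc.parallelize(Seq(1, 2, 3))
    val ys = sc.parallelize(Seq("a", "b"))
    val pairs = xs.cartesian(ys) // RDD[(Int, String)] with 6 elements
    pairs.collect().foreach(println)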

Re: Re: how to set spark.executor.memory and heap size

2014-04-23 Thread qinwei
try the complete path qinwei  From: wxhsdp  Date: 2014-04-24 14:21  To: user  Subject: Re: how to set spark.executor.memory and heap size  thank you, I added setJars, but nothing changes:
val conf = new SparkConf()
  .setMaster("spark://127.0.0.1:7077")
  .setAppName("
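A sketch of the configuration under discussion, with a placeholder jar path and memory value; setJars needs the complete path to the application jar so the executors can load the user classes:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://127.0.0.1:7077")
      .setAppName("heap-size-example")
      .set("spark.executor.memory", "2g")                    // per-executor heap
      .setJars(Seq("/full/path/to/target/app-assembly.jar")) // complete path, as suggested above
    val sc = new SparkContext(conf)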