Re: problem with data locality api

2014-09-27 Thread qinwei
Thank you for your reply. I understand your explanation, but I wonder what the correct usage of the API is: new SparkContext(config: SparkConf, preferredNodeLocationData: Map[String, Set[SplitInfo]]). How should the second parameter, preferredNodeLocationData, be constructed? hope for yo…
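
A minimal sketch of how the two parameters could fit together, assembled from the example quoted later in this digest. Exact signatures vary across Spark versions (some releases take the path as a plain String rather than a Path), so treat it as illustrative rather than canonical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.TextInputFormat
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.InputFormatInfo

    val hadoopConf = new Configuration()                            // Hadoop conf: HDFS block lookup
    val sparkConf  = new SparkConf().setAppName("locality-example") // Spark conf: app settings

    // host -> splits on that host, i.e. the Map[String, Set[SplitInfo]] the constructor wants
    val locData = InputFormatInfo.computePreferredLocations(
      Seq(new InputFormatInfo(hadoopConf, classOf[TextInputFormat], new Path("myfile.txt"))))

    val sc = new SparkContext(sparkConf, locData)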

Re: How to use multithreading in an RDD map function?

2014-09-27 Thread Sean Owen
If increasing executors really isn't enough, then you can consider using mapPartitions to process whole partitions at a time. Within that you can multithread your processing of the elements in the partition. (And you should probably use more like one worker per machine then.) The question is how…
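
A sketch of that suggestion, assuming a hypothetical per-element process() function and an input rdd: run one task per partition and fan its elements out to a local thread pool.

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    val result = rdd.mapPartitions { iter =>
      val pool = Executors.newFixedThreadPool(8)    // threads per task; tune to the machine
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      val futures = iter.map(elem => Future(process(elem))).toList // submit every element
      val out = futures.map(f => Await.result(f, Duration.Inf))    // collect results in order
      pool.shutdown()
      out.iterator
    }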

RE: problem with data locality api

2014-09-27 Thread Shao, Saisai
Hi, the first conf is used by Hadoop to determine the locality distribution of the HDFS file. The second conf is used by Spark; though the two share the same name, they are actually two different classes. Thanks, Jerry

Re: How to use multithreading in an RDD map function?

2014-09-27 Thread myasuka
Thank you for your reply. Actually, we have already used this parameter. Our cluster is a standalone cluster with 16 nodes, and every node has 16 cores. We have 256 pairs of matrices along with 256 tasks; when we set --total-executor-cores to 64, each node can launch 4 tasks simultaneously, and each task…

problem with partitioning

2014-09-27 Thread qinwei
Hi, everyone. I have come across a problem with changing the partition number of an RDD. My code is as below:

    val rdd1 = sc.textFile(path1)
    val rdd2 = sc.textFile(path2)
    val rdd3 = sc.textFile(path3)
    val imeiList = parseParam(job.jobParams)
    val broadcastVar = sc.broadc…
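
The preview is cut off, so as a generic sketch only (not qinwei's actual code), these are the two standard knobs for changing an RDD's partition count:

    val rdd1 = sc.textFile(path1)        // partition count initially follows the input splits
    println(rdd1.partitions.length)

    val widened  = rdd1.repartition(64)  // shuffles the data into exactly 64 partitions
    val narrowed = rdd1.coalesce(4)      // merges partitions, avoiding a full shuffle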

problem with data locality api

2014-09-27 Thread qinwei
Hi, everyone. I have come across a problem with data locality. I found this example code in "Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf":

    val locData = InputFormatInfo.computePreferredLocations(
      Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
…

Re: How to use multithreading in an RDD map function?

2014-09-27 Thread qinwei
In the options of spark-submit there are two options that may be helpful for your problem: "--total-executor-cores NUM" (standalone and Mesos only) and "--executor-cores" (YARN only). qinwei
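
For illustration, hypothetical invocations showing where each flag applies (the jar, class name, and master URL are placeholders):

    # Standalone/Mesos: cap the total cores the application may take across the cluster
    ./bin/spark-submit --master spark://master:7077 \
      --total-executor-cores 64 --class example.Main myapp.jar

    # YARN: cores are requested per executor instead
    ./bin/spark-submit --master yarn-client \
      --num-executors 16 --executor-cores 4 --class example.Main myapp.jar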

How to use multithreading in an RDD map function?

2014-09-27 Thread myasuka
Hi, everyone. I have come across a problem with increasing concurrency. In the program, after the shuffle write, each node should fetch 16 pairs of matrices to do matrix multiplication, such as:

    import breeze.linalg.{DenseMatrix => BDM}
    pairs.map(t => {
      val b1 = t._2._1.asInstanceOf[BDM[Double]]…
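
A guess at the shape of that computation, since the preview is truncated; pairs is assumed here to be an RDD of (key, (matrix, matrix)) tuples:

    import breeze.linalg.{DenseMatrix => BDM}

    val products = pairs.map { t =>
      val b1 = t._2._1.asInstanceOf[BDM[Double]]
      val b2 = t._2._2.asInstanceOf[BDM[Double]]
      (t._1, b1 * b2)   // breeze dense matrix multiply, single-threaded per element
    }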

PageRank execution imbalance, might hurt performance by 6x

2014-09-27 Thread Larry Xiao
Hi all! I'm running PageRank on GraphX, and I find that some tasks on one machine can take 5 to 6 times longer than the others, while the rest are perfectly balanced (around 1 second to finish). And since the time for a stage (iteration) is determined by the slowest task, the performance is undesirable. I…

Re: Build Spark with IntelliJ IDEA 13

2014-09-27 Thread maddenpj
I actually got this same exact issue compiling an unrelated project (not using Spark). Maybe it's a protobuf issue?

Re: New user question on Spark SQL: can I really use Spark SQL like a normal DB?

2014-09-27 Thread jamborta
Hi, yes, I have been using Spark SQL extensively that way. I have just tried it, and saveAsTable() works OK on 1.1.0. Alternatively, you can write the data from the SchemaRDD to HDFS using saveAsTextFile and create an external table on top of it. Thanks,
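
A rough sketch of both approaches with made-up table, column, and path names, using the Spark 1.1-era API in which SQL results are SchemaRDDs:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val people = hiveContext.sql("SELECT name, age FROM people")
    people.saveAsTable("people_copy")   // option 1: a managed table via saveAsTable()

    // option 2: write to HDFS, then overlay an external table on the files
    people.map(_.mkString("\t")).saveAsTextFile("/data/people_tsv")
    hiveContext.sql(
      "CREATE EXTERNAL TABLE people_ext (name STRING, age INT) " +
      "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/data/people_tsv'")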

Re: How to run Hive scripts programmatically in Spark 1.1.0?

2014-09-27 Thread jamborta
Hi, you can create a Spark context in your Python or Scala environment and use it to run your Hive queries, pretty much the same way as you'd do it in the shell. Thanks,
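
A minimal sketch of that pattern in Scala, assuming a Hive-enabled Spark build; the table and query are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-scripts"))
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hiveContext.sql("SELECT key, count(*) FROM src GROUP BY key")
      .collect()
      .foreach(println)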

yarn does not accept job in cluster mode

2014-09-27 Thread jamborta
Hi all, I have a job that works OK in yarn-client mode, but when I try yarn-cluster mode it returns the following: "WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory". The cluster has plenty…
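
One sanity check, since the warning points at resources: request them explicitly and compare against what YARN reports as available (all values below are placeholders):

    ./bin/spark-submit --master yarn-cluster --class example.Main \
      --num-executors 4 --executor-memory 2g --executor-cores 2 myapp.jar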

Re: IPython notebook ec2 cluster matplotlib not found?

2014-09-27 Thread Nicholas Chammas
Can you first confirm that the regular PySpark shell works on your cluster, without upgrading to 2.7? That is, you log on to your master using spark-ec2 login and run bin/pyspark successfully without any special flags. And as far as I can tell, you should be able to use IPython on Python 2.6, so I'd next…
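
Roughly, the check suggested above (the key pair and cluster name are placeholders):

    ./spark-ec2 -k mykeypair -i ~/mykeypair.pem login my-cluster   # log on to the master
    spark/bin/pyspark                                              # plain shell, no IPython flags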

IPython notebook ec2 cluster matplotlib not found?

2014-09-27 Thread Andy Davidson
Hi, I am having a heck of a time trying to get Python to work correctly on my cluster created using the spark-ec2 script. The following link was really helpful: https://issues.apache.org/jira/browse/SPARK-922. I am still running into problems with matplotlib (it works fine on my Mac). I cannot fig…

Re: Retrieve dataset of Big Data Benchmark

2014-09-27 Thread Tom
Hi, I was able to download the dataset this way (and just reconfirmed it by doing so again):

    // Before starting Spark:
    export AWS_ACCESS_KEY_ID=*key_id*
    export AWS_SECRET_ACCESS_KEY=*access_key*
    // Start Spark:
    ./spark-shell
    // In the Spark shell:
    val dataset = sc.textFile("s3n://big-data-ben…

MLlib 1.2 New & Interesting Features

2014-09-27 Thread Krishna Sankar
Guys,
- Need help in terms of the interesting features coming up in MLlib 1.2.
- I have a two-part, ~3 hr hands-on tutorial at the Big Data Tech Con, "The Hitchhiker's Guide to Machine Learning with Python & Apache Spark" [2].
- At minimum, it would be good to take the last 30 mi…

Re: Is it possible to use Parquet with Dremel encoding

2014-09-27 Thread Michael Armbrust
Based on your first example, it looks like what you want is actually run-length encoding (which Parquet does support). Repetition and definition levels are used to reconstruct nested or repeated (array) data that has been shredded…
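
A toy run-length encoder to illustrate the idea Michael is pointing at (this has nothing to do with Parquet's actual internals or API):

    // collapse runs of identical values into (value, count) pairs
    def rle[A](xs: Seq[A]): List[(A, Int)] =
      xs.foldRight(List.empty[(A, Int)]) {
        case (x, (y, n) :: rest) if x == y => (y, n + 1) :: rest
        case (x, acc)                      => (x, 1) :: acc
      }

    rle(Seq("A", "A", "A", "B", "B"))   // List((A,3), (B,2))

Repetition and definition levels solve a different problem: for two records whose repeated field holds [a, b] and [c], the values a, b, c are stored flat alongside repetition levels 0, 1, 0 that mark where each new record begins.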

RDD logic and control

2014-09-27 Thread pop1998
Hello, I'm examining Spark RDDs and trying to understand how the RDD flow works. Can anyone please tell me how an RDD decides to (and where I can find the relevant code): 1. re-split into a new RDD? 2. move to a new PC? 3. perform PC selection? 4. perform a union of multiple RDDs? 5. how…

Re: Log hdfs blocks sending

2014-09-27 Thread Andrew Ash
Hi Alexey, you're looking in the right place in the first log, from the driver. Specifically, the locality is on the TaskSetManager INFO log level and looks like this: "14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 10, 10.54.255.191, ANY, 1341 bytes)". The ANY there means…
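
For interpreting that column: the locality levels run PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY, from best to worst, and ANY means the data is pulled across the network. A sketch of the knob that controls how long the scheduler holds out for a better level (the value is illustrative, in milliseconds for this era's configs):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.locality.wait", "3000")  // wait up to 3s before relaxing the locality level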

Re: flume spark streaming receiver host random

2014-09-27 Thread Sean Owen
I don't think you control which host the receiver runs on, right? That is so that Spark can handle the failure of that node and reassign the receiver. On Sep 27, 2014 2:43 AM, "centerqi hu" wrote: > the receiver is not running on the machine I expect > 2014-09-26 14:09 GMT+08:00 Sean Owen: > I th…