Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically, for a HadoopRDD, who determines which worker gets which task?
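
Not part of the original question, but a hedged sketch of how to inspect this from the user side: a HadoopRDD partition's preferred locations come from its input split's block locations, and the scheduler uses them as placement hints. The HDFS path and app name below are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))
    // textFile is backed by a HadoopRDD; each partition wraps one input split
    val rdd = sc.textFile("hdfs:///data/input.txt") // hypothetical path
    // preferredLocations lists the hosts holding each partition's blocks;
    // the task scheduler tries to assign each task to one of these hosts
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> " + rdd.preferredLocations(p).mkString(", "))
    }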

efficient joining

2014-04-19 Thread Joe L
What is an efficient way to join two RDDs? The join is taking too long to perform.
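
A common remedy, sketched here under assumptions (the names, key types, and partition count are illustrative, not from the thread): co-partition both RDDs with the same partitioner and cache them, so the join itself does not have to reshuffle either side.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
    import org.apache.spark.rdd.RDD

    def efficientJoin(left: RDD[(String, Int)],
                      right: RDD[(String, String)]): RDD[(String, (Int, String))] = {
      val p = new HashPartitioner(200) // tune to cluster size
      // Shuffle each side once into the same layout and keep it in memory;
      // the join then combines co-located partitions without a new shuffle.
      val l = left.partitionBy(p).cache()
      val r = right.partitionBy(p).cache()
      l.join(r)
    }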

questions about toArray and ClassTag

2014-04-19 Thread wxhsdp
Hi all, I'm quite new to Scala. I did some tests in the Spark shell:

    val b = a.mapPartitions { D =>
      val p = D.toArray
      // ...
      p.toIterator
    }

When a is an RDD of type RDD[Int], b.collect() works, but when I change a to RDD[MyOwnType], b.collect() returns an error: 14/04/20 10:14:46 ERROR OneForOneSt
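
The error preview is cut off above, so this is only a guess at the cause, but a frequent one is the custom element type not being serializable (or lacking a ClassTag), both of which Int gets for free. A minimal sketch that works in the shell, assuming MyOwnType can be a case class:

    // A case class is Serializable and a ClassTag is derived automatically.
    case class MyOwnType(value: Int)

    val a = sc.parallelize(1 to 10).map(MyOwnType(_))
    val b = a.mapPartitions { iter =>
      val p = iter.toArray   // materialize the whole partition
      // ... operate on the array here ...
      p.toIterator           // hand the elements back as an iterator
    }
    b.collect()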

Help with error initializing SparkR.

2014-04-19 Thread tongzzz
I can't initialize the SparkContext after a successful install on the Cloudera QuickStart VM. This is the error message:

    > library(SparkR)
    Loading required package: rJava
    [SparkR] Initializing with classpath /usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar
    > sc <- sparkR.init("local")
    Error in .jcall("R

Re: Anyone using value classes in RDDs?

2014-04-19 Thread kamatsuoka
No, you can wrap other types in value classes as well. You can try it in the REPL:

    scala> case class ID(val id: String) extends AnyVal
    defined class ID
    scala> val i = ID("foo")
    i: ID = ID(foo)

On Fri, Apr 18, 2014 at 4:14 PM, Koert Kuipers [via Apache Spark User List] wrote: > isn't valueclas

Re: extremely slow k-means version

2014-04-19 Thread ticup
Thanks a lot for the explanation, Matei. As a matter of fact, I was just reading up on the paper on narrow and wide dependencies and saw that groupByKey is indeed a wide dependency, which, as you explained, is the problem. Maybe it wouldn't be a bad thing to have a section in the docs on the wi

Re: extremely slow k-means version

2014-04-19 Thread Matei Zaharia
The problem is that groupByKey means “bring all the points with this same key to the same JVM”. Your input is a Seq[Point], so you have to have all the points there. This means that a) all points will be sent across the network in a cluster, which is slow (and Spark goes through this sending cod
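
To make this concrete, here is a hedged sketch (not code from the thread; names and types are illustrative) of the usual fix: compute per-cluster sums and counts with reduceByKey, which pre-aggregates inside each partition, instead of shipping every point to one JVM with groupByKey.

    import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
    import org.apache.spark.rdd.RDD

    // assigned: (clusterId, point) pairs; points are assumed equal-length.
    def add(x: Array[Double], y: Array[Double]): Array[Double] =
      x.zip(y).map { case (a, b) => a + b }

    def newCentroids(assigned: RDD[(Int, Array[Double])]): Map[Int, Array[Double]] =
      assigned
        .mapValues(p => (p, 1L))                       // each point -> (sum, count)
        .reduceByKey { case ((s1, c1), (s2, c2)) =>    // partial sums are merged per
          (add(s1, s2), c1 + c2)                       // partition first, so only small
        }                                              // pairs cross the network
        .mapValues { case (sum, n) => sum.map(_ / n) } // centroid = sum / count
        .collectAsMap()
        .toMap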

Re: ui broken in latest 1.0.0

2014-04-19 Thread Andrew Or
The reason why it worked before was because the UI would directly access sc.getStorageStatus, instead of getting it through Task and Stage events. This is not necessarily the best design, however, because the SparkContext and the SparkUI are closely coupled, and there is no way to create a SparkUI
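
For context, the event-based route described here goes through Spark's public listener API; a minimal sketch, assuming 1.0-era signatures:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Observes Task and Stage events instead of polling SparkContext state
    // directly; this indirection is what lets a UI exist apart from sc.
    class StageLogger extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"task finished in stage ${taskEnd.stageId}")

      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
        println(s"stage ${stageCompleted.stageInfo.stageId} completed")
    }

    // registration (illustrative): sc.addSparkListener(new StageLogger)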

Re: ui broken in latest 1.0.0

2014-04-19 Thread Koert Kuipers
Got it, makes sense. I am surprised it worked before... On Apr 18, 2014 9:12 PM, "Andrew Or" wrote: > Hi Koert, > > I've tracked down what the bug is. The caveat is that each StageInfo only > keeps around the RDDInfo of the last RDD associated with the Stage. More > concretely, if you have someth
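
To illustrate the quoted caveat with a sketch (not from the thread): a single stage can chain several RDDs, and only the last one's RDDInfo would be retained.

    // All three RDDs live in one stage (no shuffle separates them), but the
    // StageInfo would only keep the RDDInfo of `c`, the last RDD in the chain.
    val a = sc.parallelize(1 to 1000)
    val b = a.map(_ * 2)
    val c = b.filter(_ % 3 == 0).cache() // the cached RDD a UI would want to show
    c.count()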

extremely slow k-means version

2014-04-19 Thread ticup
Hi, I was playing around with other k-means implementations in Scala/Spark in order to test performance and get a better grasp of Spark. Now, I made one similar to the one from the examples (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkKMeans
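
For contrast with Matei's reply earlier in this digest, this is roughly the shape of the slow update step under discussion (a sketch, not the poster's actual code):

    import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
    import org.apache.spark.rdd.RDD

    // Slow: groupByKey ships every point assigned to a cluster to a single
    // JVM before any averaging happens.
    def slowCentroids(assigned: RDD[(Int, Array[Double])]): Map[Int, Array[Double]] =
      assigned
        .groupByKey()
        .mapValues { points =>
          val n = points.size
          points
            .reduce((x, y) => x.zip(y).map { case (a, b) => a + b })
            .map(_ / n)
        }
        .collectAsMap()
        .toMap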