During a Spark stage, how are tasks split among the workers? Specifically
for a HadoopRDD, who determines which worker has to get which task?
What is an efficient way to join two RDDs? The join is taking too long to
perform.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/efficient-joining-tp4497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi all,
I'm quite new to Scala. I ran some tests in the Spark shell:
val b = a.mapPartitions { D =>
  val p = D.toArray
  .
  p.toIterator
}
When a is an RDD[Int], b.collect() works, but when I change a to
RDD[MyOwnType], b.collect() returns an error:
14/04/20 10:14:46 ERROR OneForOneSt
I can't initialize the SparkContext (sc) after a successful install on the
Cloudera quickstart VM.
This is the error message:
> library(SparkR)
Loading required package: rJava
[SparkR] Initializing with classpath
/usr/lib64/R/library/SparkR/sparkr-assembly-0.1.jar
> sc <- sparkR.init("local")
Error in .jcall("R
No, you can wrap other types in value classes as well. You can try it in
the REPL:
scala> case class ID(val id: String) extends AnyVal
defined class ID
scala> val i = ID("foo")
i: ID = ID(foo)
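As a further sketch of the same point, here is a value class wrapping an Int and another wrapping a Double. The names Count, Meters and ValueClassSketch are made up for illustration and are not from this thread. In compiled code, value classes need to be top-level or members of a statically accessible object.

```scala
// Hypothetical names for illustration only.
// A value class wraps exactly one val and extends AnyVal.
final case class Count(value: Int) extends AnyVal {
  // Methods are allowed; additional fields are not.
  def +(other: Count): Count = Count(value + other.value)
}

// The wrapped type is not limited to Int or String.
final case class Meters(value: Double) extends AnyVal

object ValueClassSketch {
  // The distinct wrapper types keep the two arguments from being swapped.
  def describe(c: Count, m: Meters): String =
    s"${c.value} items over ${m.value} m"
}
```

The wrappers give distinct types at compile time, so a Count cannot be passed where a Meters is expected.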
On Fri, Apr 18, 2014 at 4:14 PM, Koert Kuipers [via Apache Spark User List]
wrote:
> isn't valueclas
Thanks a lot for the explanation Matei.
As a matter of fact, I was just reading the paper's discussion of narrow and
wide dependencies and saw that groupByKey is indeed a wide dependency, which,
as you explained, is the problem.
Maybe it wouldn't be a bad thing to have a section in the docs on the
wi
The problem is that groupByKey means “bring all the points with this same key
to the same JVM”. Your input is a Seq[Point], so you have to have all the
points there. This means that a) all points will be sent across the network in
a cluster, which is slow (and Spark goes through this sending cod
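To illustrate the cost of moving every record, here is a rough sketch that uses plain Scala collections rather than Spark. Each inner Seq stands in for one partition, and the second element of each result counts how many records would have to be moved. The object and method names are hypothetical. In Spark, the combining variant corresponds roughly to what reduceByKey does, as opposed to groupByKey.

```scala
object CombineBeforeShuffle {
  // Assumption: one inner Seq models the (key, value) records of one partition.
  type Partition = Seq[(String, Int)]

  // Sum the values for each key in a flat list of records.
  private def sumByKey(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Grouping approach: every record is moved to its key's destination first.
  // Returns the per-key sums and the number of records moved.
  def groupThenSum(partitions: Seq[Partition]): (Map[String, Int], Int) = {
    val moved = partitions.flatten
    (sumByKey(moved), moved.size)
  }

  // Combining approach: sum inside each partition, then move only the
  // partial sums (at most one record per key per partition).
  def combineThenSum(partitions: Seq[Partition]): (Map[String, Int], Int) = {
    val partials = partitions.map(sumByKey)
    val moved = partials.flatMap(_.toSeq)
    (sumByKey(moved), moved.size)
  }
}
```

Both methods return the same sums. The difference is only in how many records are moved, which is the cost the message above describes for groupByKey.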
The reason why it worked before was because the UI would directly access
sc.getStorageStatus, instead of getting it through Task and Stage events.
This is not necessarily the best design, however, because the SparkContext
and the SparkUI are closely coupled, and there is no way to create a
SparkUI
Got it, makes sense. I am surprised it worked before...
On Apr 18, 2014 9:12 PM, "Andrew Or" wrote:
> Hi Koert,
>
> I've tracked down what the bug is. The caveat is that each StageInfo only
> keeps around the RDDInfo of the last RDD associated with the Stage. More
> concretely, if you have someth
Hi,
I was playing around with other k-means implementations in Scala/Spark in
order to test performance and get a better grasp of Spark.
Now, I made one similar to the one from the examples
(https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkKMeans