Hello everybody,
I have two questions in one. I upgraded from Spark 1.1 to 1.3, and some
parts of my code using groupBy became really slow.
*1/* Why is groupBy on an RDD really slow compared to groupBy on a
DataFrame?
// DataFrame: runs in a few seconds
val result = table.groupBy("co
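For comparison, here is a minimal sketch of the two variants, assuming a
hypothetical column "customer_id" as the grouping key (the real column name
is cut off above). The DataFrame version shuffles only one partial count per
key, while the RDD version shuffles every row and buffers all values of a
key in memory before counting:

// DataFrame: Catalyst plans a partial aggregation before the shuffle
val dfCounts = table.groupBy("customer_id").count()

// RDD: groupBy ships and materializes every value of a key, then counts
// (assumes the key is the first column of the table)
val rddCounts = table.rdd
  .groupBy(row => row.getString(0))
  .mapValues(_.size)

// reduceByKey brings the RDD version much closer to the DataFrame plan
val betterCounts = table.rdd
  .map(row => (row.getString(0), 1L))
  .reduceByKey(_ + _)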
I would like to monitor cpu/memory usage.
I read the Metrics section at
http://spark.apache.org/docs/1.3.1/monitoring.html.
Here is my $SPARK_HOME/conf/metrics.properties:
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=minutes
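The template in $SPARK_HOME/conf/metrics.properties.template also sets an
output directory for the CSV files; something along these lines (the path
is just a placeholder):

# Directory where the CSV files are written
*.sink.csv.directory=/tmp/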
Hello patelmiteshn,
This could do the trick:
rdd1 = rdd.sortBy(lambda x: x[1], ascending=False)
# zipWithIndex pairs each element with its rank; keep the first 100
rdd2 = rdd1.zipWithIndex().filter(lambda t: t[1] < 100)
rdd2.saveAsTextFile("/path/to/output")  # the output path is required
# rdd.top(100, key=lambda x: x[1]) would also work and avoids the full sort
I built Spark with Ganglia support:
$SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive
-Phive-thriftserver assembly
...
[info] Including from cache: metrics-ganglia-3.1.0.jar
...
In the master log:
ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink
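For what it's worth, this error usually shows up when the sink is declared
in metrics.properties but the assembly jar the master actually runs does not
contain GangliaSink, or when required sink options are missing. A sketch of
the Ganglia block, with host and port as placeholders for your gmond setup
(property names taken from the Spark monitoring docs):

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=gmond-host
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds
*.sink.ganglia.ttl=1
*.sink.ganglia.mode=multicast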
Hello,
I am wondering how "/join/" works in Spark SQL. Does it co-partition the two
tables, or does it join them through a wide dependency?
I have two big tables to join; the query creates more than 150 GB of
temporary data, so it stops because I have no space left on my disk.
I guess I could use a HashPartitioner
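If the two tables can be handled as pair RDDs keyed by the join column, here
is a minimal sketch of the HashPartitioner idea (the names table1/table2,
the key column index, and the partition count are assumptions).
Pre-partitioning both sides with the same partitioner makes the join a
narrow dependency, so the join itself does not re-shuffle either side:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(200)

// key both tables by the join column and co-partition them identically
val left  = table1.rdd.keyBy(row => row.getString(0)).partitionBy(part).cache()
val right = table2.rdd.keyBy(row => row.getString(0)).partitionBy(part).cache()

// same partitioner on both sides => narrow dependency, no extra shuffle here
val joined = left.join(right)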
Hello everybody,
I am trying to do a simple groupBy:
*Code:*
val df = hiveContext.sql("SELECT * FROM table1")
df.printSchema()
df.groupBy("customer_id").count().show(5)
*Stacktrace:*
root
|-- customer_id: string (nullable = true)
|-- rank: string (nullable = true)
|-- reco_material_id:
I read on the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) :
"For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on"
And this on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html:
"To minimize global bandwidth consumption and read latency, HDFS tries to
satisfy a read request from a replica that is closest to the reader."
Hello everybody,
I am wondering how Spark handles its RDDs on top of HDFS: what happens if,
during a map phase, I need data that is not present locally?
What I am working on:
I am working on a recommendation algorithm: Matrix Factorization (MF) using
stochastic gradient descent as the optimizer. For now my algorithm wo
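In case it helps with the MF use case: a common trick to cut network traffic
during the gradient phase is to broadcast the smaller factor matrix, so every
task reads it from a local copy instead of fetching it over the network. A
minimal sketch; all the names and shapes here (ratings as (user, item,
rating) triples, itemFactors small enough to fit in memory) are assumptions:

// hypothetical inputs
val ratings = sc.parallelize(Seq((1, 10, 4.0), (2, 10, 3.5))) // (user, item, rating)
val itemFactors: Map[Int, Array[Double]] = Map(10 -> Array(0.1, 0.2))

// one local copy per executor instead of one transfer per task
val bc = sc.broadcast(itemFactors)

val updates = ratings.map { case (user, item, rating) =>
  val f = bc.value(item)
  // stand-in for the real stochastic gradient computation
  (user, rating - f.sum)
}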
Yes, that helps me understand better how Spark works. But it is also what I
was afraid of: I think the network communications will take too much time
for my job.
I will keep looking for a trick to avoid the network communications.
I saw on the Hadoop website that: "To minimize global bandwidth consumption
and read latency, HDFS tries to satisfy a read request from a replica that
is closest to the reader."