DataFrame groupBy vs RDD groupBy

2015-05-22 Thread gtanguy
Hello everybody, I have two questions in one. I upgraded from Spark 1.1 to 1.3 and some parts of my code using groupBy became really slow. 1/ Why is the groupBy of an RDD really slow in comparison to the groupBy of a DataFrame? // DataFrame : running in a few seconds val result = table.groupBy("co
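A likely explanation (not stated in the truncated post): a DataFrame `groupBy(...).count()` pre-aggregates within each partition before the shuffle, much like `reduceByKey`, while an RDD `groupBy` ships every record across the network. A minimal pure-Python sketch of that difference, with hypothetical two-partition data (no Spark required):

```python
# Sketch: why RDD-style groupBy shuffles more data than a DataFrame
# groupBy().count(). The RDD path sends every record to the reducer;
# the DataFrame path combines counts per partition first and sends
# only one (key, partial_count) pair per key per partition.
from collections import Counter

partitions = [
    ["a", "b", "a", "a"],   # partition 0 (example data)
    ["b", "b", "a", "c"],   # partition 1
]

# RDD-style groupBy: every record crosses the shuffle boundary.
rdd_shuffled = [(k, 1) for part in partitions for k in part]

# DataFrame-style count: aggregate locally, shuffle only partials.
df_shuffled = [(k, c) for part in partitions for k, c in Counter(part).items()]

def merge(pairs):
    total = Counter()
    for k, c in pairs:
        total[k] += c
    return dict(total)

print(len(rdd_shuffled))                          # 8 records shuffled
print(len(df_shuffled))                           # 5 partial counts shuffled
print(merge(rdd_shuffled) == merge(df_shuffled))  # True: same final counts
```

On real data with few distinct keys and many rows, the gap between "all records" and "one partial per key per partition" is what makes the DataFrame version finish in seconds.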

Spark metrics cpu/memory

2015-10-05 Thread gtanguy
I would like to monitor cpu/memory usage. I read the Metrics section at: http://spark.apache.org/docs/1.3.1/monitoring.html. Here is my $SPARK_HOME/conf/metrics.properties # Enable CsvSink for all instances *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink # Polling period for CsvSink *.sink
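The snippet is cut off after the polling-period comment. A complete CsvSink section, following the shape of Spark's own `metrics.properties.template` (the directory path below is an example, not from the post):

```properties
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
# Directory where CSV files are written (example path)
*.sink.csv.directory=/tmp/spark-metrics
# Enable the JVM source to get per-instance memory metrics
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```

Note that Spark's metrics cover JVM memory but not OS-level CPU; CPU usage is usually monitored with an external sink such as Ganglia (see the thread below).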

Re: extracting the top 100 values from an rdd and save it as text file

2015-10-06 Thread gtanguy
Hello patelmiteshn, This could do the trick: rdd1 = rdd.sortBy(lambda x: x[1], ascending=False) rdd2 = rdd1.zipWithIndex().filter(lambda t: t[1] < 100) rdd2.saveAsTextFile() -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/extracting-the-top-100-val
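The sort-then-filter recipe above works, but it sorts the whole RDD; PySpark's `rdd.takeOrdered(100, key=lambda x: -x[1])` returns the top 100 without a full sort. The same bounded-heap idea in pure Python (illustrative data, no Spark needed):

```python
# Top-N without a full sort: heapq.nlargest keeps a bounded heap of
# size N while scanning the data once, the same idea takeOrdered uses.
import heapq

# Example (id, score) pairs; scores cycle through 0..6.
data = [("u%d" % i, i % 7) for i in range(1000)]

top100 = heapq.nlargest(100, data, key=lambda x: x[1])

print(len(top100))    # 100
print(top100[0][1])   # 6 (the highest score)
```

For 1000 rows the difference is invisible, but on a large RDD avoiding the global sort saves a full shuffle.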

Spark ganglia ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink

2015-10-08 Thread gtanguy
I built Spark with Ganglia: $SPARK_HOME/build/sbt -Pspark-ganglia-lgpl -Phadoop-1 -Phive -Phive-thriftserver assembly ... [info] Including from cache: metrics-ganglia-3.1.0.jar ... In the master log: ERROR actor.OneForOneStrategy: org.apache.spark.metrics.sink.GangliaSink

SPARKQL Join partitioner

2015-03-12 Thread gtanguy
Hello, I am wondering how "join" works in SparkQL. Does it co-partition the two tables, or does it use a wide dependency? I have two big tables to join; the query creates more than 150 GB of temporary data, so it stops because I have no space left on my disk. I guess I could use a HashParti
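For context (not from the truncated post): SparkSQL joins in the 1.x line generally shuffle both sides; with the RDD API you can instead `partitionBy` both sides with the same `HashPartitioner` and persist them, so the join becomes a narrow dependency. A pure-Python sketch of why co-partitioning makes the join local (all names and data here are illustrative):

```python
# Co-partitioning sketch: hash both datasets into partitions with the
# same function, then join each pair of partitions locally. No record
# ever needs to move to a different partition.
NUM_PARTS = 4

def partition(pairs, n=NUM_PARTS):
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))   # same hash on both sides
    return parts

left  = [(k, "L%d" % k) for k in range(10)]
right = [(k, "R%d" % k) for k in range(5, 15)]

joined = []
for lp, rp in zip(partition(left), partition(right)):
    rindex = {}
    for k, v in rp:                      # build a local hash index
        rindex.setdefault(k, []).append(v)
    for k, v in lp:                      # probe it with the left side
        for rv in rindex.get(k, []):
            joined.append((k, v, rv))

# Only the overlapping keys 5..9 match.
print(sorted(joined))
```

Because each partition pair is joined independently, the temporary data per task is bounded by the partition size rather than by the whole table, which is what helps when disk space for shuffle spill is the bottleneck.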

DataFrame GroupBy

2015-03-26 Thread gtanguy
Hello everybody, I am trying to do a simple groupBy: *Code:* val df = hiveContext.sql("SELECT * FROM table1") df.printSchema() df.groupBy("customer_id").count().show(5) *Stacktrace*: root |-- customer_id: string (nullable = true) |-- rank: string (nullable = true) |-- reco_material_id:

RDD creation on HDFS

2014-04-08 Thread gtanguy
I read in the RDD paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf): "For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on". And http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html says: "To minimi

How does Spark handle RDD via HDFS ?

2014-04-09 Thread gtanguy
Hello everybody, I am wondering how Spark handles its RDDs via HDFS: what happens if, during a map phase, I need data that is not present locally? What I am working on: a recommendation algorithm, Matrix Factorization (MF) using stochastic gradient descent as the optimizer. For now my algorithm wo
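The algorithm the post describes, matrix factorization trained by stochastic gradient descent, can be sketched on a single machine in a few lines. This is a minimal illustrative version (all ratings, dimensions and hyperparameters below are made up), not the poster's distributed implementation:

```python
# Minimal matrix-factorization SGD: approximate a ratings matrix R by
# U . V^T, updating one (user, item, rating) observation at a time.
import random

random.seed(0)
K, LR, REG, EPOCHS = 2, 0.05, 0.01, 200        # rank, step, L2, passes
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (2, 1, 1.0), (2, 2, 2.0)]           # (user, item, rating)
n_users, n_items = 3, 3

U = [[random.uniform(0, 0.5) for _ in range(K)] for _ in range(n_users)]
V = [[random.uniform(0, 0.5) for _ in range(K)] for _ in range(n_items)]

def predict(u, i):
    return sum(U[u][k] * V[i][k] for k in range(K))

def loss():
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)

before = loss()
for _ in range(EPOCHS):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for k in range(K):
            uk, vk = U[u][k], V[i][k]
            U[u][k] += LR * (err * vk - REG * uk)   # gradient step on U
            V[i][k] += LR * (err * uk - REG * vk)   # gradient step on V
after = loss()

print(after < before)   # True: training error decreases
```

The locality question in the post arises because each SGD update touches one user row and one item row: when the partition holding U and the partition holding V for a given rating live on different machines, every update implies network traffic, which is the concern discussed in the reply below.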

Re: How does Spark handle RDD via HDFS ?

2014-04-10 Thread gtanguy
Yes, that helps me understand better how Spark works. But it is also what I was afraid of: I think the network communications will take too much time for my job. I will continue to look for a trick to avoid network communications. I saw on the Hadoop website that: "To minimize global ba