Hello everybody, I have two questions in one. I upgraded from Spark 1.1 to 1.3 and some parts of my code using groupBy became really slow.
1/ Why is groupBy on an RDD so slow compared to groupBy on a DataFrame?

    // DataFrame: runs in a few seconds
    val result = table.groupBy("col1").count

    // RDD: takes hours, with a lot of "spilling in-memory"
    val schemaOriginel = table.schema
    val result = table.rdd.groupBy { r =>
      // RowSchema/getValueByName are our own helpers for reading a column by name
      val rs = RowSchema(r, schemaOriginel)
      rs.getValueByName("col1")
    }.map(l => (l._1, l._2.size)).count()

2/ My goal is to groupBy on a key, then order each group over a column, and finally add the row number within each group. I had this code running fine before moving to Spark 1.3, but since the change to DataFrames it has become really slow.

    val schemaOriginel = table.schema
    val result = table.rdd.groupBy { r =>
      val rs = RowSchema(r, schemaOriginel)
      rs.getValueByName("col1")
    }.flatMap { l =>
      l._2.toList
        .sortBy { u =>
          val rs = RowSchema(u, schemaOriginel)
          val col1 = rs.getValueByName("col1")
          val col2 = rs.getValueByName("col2")
          (col1, col2)  // col1 is constant within a group, so this effectively sorts by col2
        }
        .zipWithIndex
    }

I think the SQL equivalent of what I am trying to do is:

    SELECT a, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS num FROM table

I don't think I can do this with a GroupedData (the result of df.groupBy). Any ideas on how I can speed this up?
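For 1/, my understanding (please correct me) is that rdd.groupBy shuffles every complete row and buffers whole groups in memory, while the DataFrame's groupBy("col1").count is planned with partial aggregation, so only (key, count) pairs cross the network. A minimal sketch of an RDD version with a map-side combine, reusing my RowSchema helper from above (untested):

    val schemaOriginel = table.schema
    val counts = table.rdd
      .map { r =>
        val rs = RowSchema(r, schemaOriginel)  // same helper as in the snippets above
        (rs.getValueByName("col1"), 1L)
      }
      .reduceByKey(_ + _)  // partial sums are combined per partition before the shuffle
    val result = counts.count()  // number of distinct keys, as in my groupBy version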
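For 2/, window functions don't seem to be available on DataFrames in 1.3 as far as I can tell, but for reference this is what I understand the DataFrame version would look like in later releases (1.4+; the exact function name may differ by version), numbering rows within each col1 group ordered by col2:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    // Partition by the grouping key, order inside each group by col2,
    // then assign 1-based row numbers per group.
    val w = Window.partitionBy("col1").orderBy("col2")
    val numbered = table.withColumn("num", row_number().over(w))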