Hello everybody,

I have two questions in one. I upgraded from Spark 1.1 to 1.3, and some parts
of my code using groupBy became really slow.

*1/* Why is groupBy on an RDD so slow compared to groupBy on a
DataFrame?

// DataFrame: runs in a few seconds
val result = table.groupBy("col1").count

// RDD: takes hours, with a lot of "spilling in-memory" in the logs
val schemaOriginel = table.schema
val result = table.rdd.groupBy { r =>
    val rs = RowSchema(r, schemaOriginel)
    rs.getValueByName("col1")
  }.map(l => (l._1, l._2.size)).count()
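
For comparison, the usual RDD idiom for a per-key count is reduceByKey,
which does map-side partial aggregation instead of materializing every
group on the reducers. A minimal sketch, assuming "col1" can be looked up
by index in the schema:

// Sketch: count rows per key without building the groups.
val idx = table.schema.fieldNames.indexOf("col1")
val counts = table.rdd
  .map(r => (r.get(idx), 1L))  // emit (key, 1) pairs
  .reduceByKey(_ + _)          // combines locally before the shuffle
counts.count()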


*2/* My goal is to group by a key, then order each group on a column,
and finally add a row number within each group. This code ran fine before
I moved to Spark 1.3, but since the table became a DataFrame it is really
slow.

 val schemaOriginel = table.schema
 val result = table.rdd.groupBy { r =>
     val rs = RowSchema(r, schemaOriginel)
     rs.getValueByName("col1")
   }.flatMap { l =>
     l._2.toList
       .sortBy { u =>
         val rs = RowSchema(u, schemaOriginel)
         val col1 = rs.getValueByName("col1")
         val col2 = rs.getValueByName("col2")
         (col1, col2)
       }
       .zipWithIndex
   }
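
A cheaper variant would extract the grouping and sort keys once per row,
so the RowSchema helper is not rebuilt inside the sort comparator on
every comparison. A minimal sketch, assuming col2 can be compared as a
String (adjust to the real column type):

// Sketch: precompute (groupKey, (sortKey, row)) pairs up front.
val idx1 = table.schema.fieldNames.indexOf("col1")
val idx2 = table.schema.fieldNames.indexOf("col2")
val result = table.rdd
  .map(r => (r.get(idx1), (String.valueOf(r.get(idx2)), r)))
  .groupByKey()
  .flatMap { case (_, rows) =>
    // sort each group by the precomputed key, then number the rows
    rows.toList.sortBy(_._1).map(_._2).zipWithIndex
  }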

I think the SQL equivalent of what I am trying to do is something like
this (with b standing for the column each group is ordered on):

SELECT a,
       ROW_NUMBER() OVER (PARTITION BY a ORDER BY b) AS num
FROM table


 I don't think I can do this with a GroupedData (the result of
df.groupBy). Any ideas on how I can speed this up?
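
For reference, I understand Spark 1.4 is adding window functions to the
DataFrame API (through HiveContext), which would express this directly.
A minimal sketch, assuming Spark 1.4+ (rowNumber was later renamed
row_number):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

// Partition by the grouping key, order within each group by col2
val w = Window.partitionBy("col1").orderBy("col2")
val ranked = table.withColumn("num", rowNumber().over(w))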



