I have reported the issue on JIRA: https://issues.apache.org/jira/browse/SPARK-7276
On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement <a.p.clem...@gmail.com> wrote:
> Hi all,
>
> I'm experiencing a serious performance problem when using withColumn on a
> dataset with a large number of columns. It is very slow: on a dataset with
> 100 columns it takes a few seconds.
>
> The following code snippet demonstrates the problem:
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> // Values are ordered to match the schema below: id, a, b, target
> val custs = Seq(
>   Row(1, 21, "Bob", 80.5),
>   Row(2, 21, "Bobby", 80.5),
>   Row(3, 21, "Jean", 80.5),
>   Row(4, 21, "Fatime", 80.5)
> )
>
> val fields = List(
>   StructField("id", IntegerType, true),
>   StructField("a", IntegerType, true),
>   StructField("b", StringType, true),
>   StructField("target", DoubleType, false))
> val schema = StructType(fields)
>
> val rdd = sc.parallelize(custs)
> var df = sqlContext.createDataFrame(rdd, schema)
>
> // Add 200 columns one at a time and print how long each call takes (ms)
> for (i <- 1 to 200) {
>   val now = System.currentTimeMillis
>   df = df.withColumn("a_new_col_" + i, df("a") + i)
>   println(s"$i -> " + (System.currentTimeMillis - now))
> }
>
> df.show()
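
A possible workaround, sketched under an assumption that is not confirmed in this thread: if the cost comes from each withColumn call re-processing an ever larger plan, building all the derived columns first and adding them in a single select may avoid the repeated work. The snippet reuses df as created by createDataFrame in the quoted code (before the loop); newCols, allCols, and wide are illustrative names only.

```scala
// Assumes `df` is the 4-column DataFrame from the quoted snippet,
// taken right after createDataFrame and before the withColumn loop.

// Build the 200 derived column expressions without touching the DataFrame yet.
val newCols = (1 to 200).map(i => (df("a") + i).as("a_new_col_" + i))

// Keep the existing columns and append the new ones in one projection.
val allCols = df.columns.map(df(_)) ++ newCols

val now = System.currentTimeMillis
val wide = df.select(allCols: _*)
println("single select -> " + (System.currentTimeMillis - now))

wide.show()
```

This should produce the same column names as the loop, so the two timings can be compared directly.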