I have reported the issue on JIRA:
https://issues.apache.org/jira/browse/SPARK-7276

On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement <a.p.clem...@gmail.com>
wrote:

> Hi all,
>
>
> I'm experiencing a serious performance problem when using withColumn on a
> dataset with a large number of columns. It is very slow: on a dataset with
> 100 columns, it takes a few seconds.
>
>
> The following code snippet demonstrates the problem.
>
>
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
>
> val custs = Seq(
>   Row(1, "Bob", 21, 80.5),
>   Row(2, "Bobby", 21, 80.5),
>   Row(3, "Jean", 21, 80.5),
>   Row(4, "Fatime", 21, 80.5)
> )
>
> // Field order matches the Row values: (Int, String, Int, Double)
> val fields = List(
>   StructField("id", IntegerType, true),
>   StructField("b", StringType, true),
>   StructField("a", IntegerType, true),
>   StructField("target", DoubleType, false))
> val schema = StructType(fields)
>
> val rdd = sc.parallelize(custs)
> var df = sqlContext.createDataFrame(rdd, schema)
>
> // Add 200 columns one at a time, timing each withColumn call
> for (i <- 1 to 200) {
>   val now = System.currentTimeMillis
>   df = df.withColumn("a_new_col_" + i, df("a") + i)
>   println(s"$i -> " + (System.currentTimeMillis - now))
> }
>
> df.show()
>
