Ran into this same issue. The only solution seems to be to coerce the DataFrame's
schema back into the right state. It looks like you have to convert the DF to
an RDD, which adds some overhead, but otherwise this worked for me:
// (the original snippet was cut off here; reusing the existing fields is the minimal completion)
val newDF = sqlContext.createDataFrame(origDF.rdd,
  new StructType(origDF.schema.fields))
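If the goal is to end up with an extra column (as in the thread title), the same rebuild pattern can carry an extended schema. A minimal sketch, where the new column's name, type, and value are placeholders I've made up:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Append a placeholder value to every row, then extend the schema to match.
val extendedRows = origDF.rdd.map(row => Row.fromSeq(row.toSeq :+ "someValue"))
val extendedSchema = StructType(origDF.schema.fields :+ StructField("newCol", StringType, nullable = true))
val extendedDF = sqlContext.createDataFrame(extendedRows, extendedSchema)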
Just use select() to create a new DataFrame with only the columns you want.
It's sort of the opposite of what you asked for -- you select all of the
columns minus the one you don't want. You could even use a filter to drop
just that one column on the fly:
// "colToDrop" stands in for the column to remove; the original line was cut off
myDF.select(myDF.columns.filter(_ != "colToDrop").map(myDF.col): _*)
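If you prefer the name-based overload of select (handy on older API versions), the same filter feeds it just as well; a minimal sketch with the same made-up column name:

val keep = myDF.columns.filter(_ != "colToDrop")
val trimmedDF = myDF.select(keep.head, keep.tail: _*)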
// Append the product of fields 1 and 199 (as a string) to each row, assuming rows are arrays of strings.
val newRdd = myRdd.map(row =>
  row ++ Array((row(1).toLong * row(199).toLong).toString))
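For completeness, a tiny self-contained version of that; the sample rows here are made up and use 3 fields instead of 200, so the indices are 1 and 2:

val toyRdd = sc.parallelize(Seq(Array("a", "3", "4"), Array("b", "5", "6")))
val toyWithProduct = toyRdd.map(row => row ++ Array((row(1).toLong * row(2).toLong).toString))
// Rows come out as Array("a", "3", "4", "12") and Array("b", "5", "6", "30").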
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-a-column-to-a-spark-RDD-with-many-columns-tp22729p22735.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Test it out, but I would be willing to bet the join is going to be a good
deal faster.
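The code being compared isn't quoted in this reply, so purely as a generic illustration of the two shapes (every name below is made up), a filter against a collected key set versus a keyed join looks roughly like:

import org.apache.spark.SparkContext._

// Filter approach: collect the small RDD's keys to the driver and test membership.
val wantedKeys = smallRdd.map(_._1).collect().toSet
val filtered = bigRdd.filter { case (k, _) => wantedKeys.contains(k) }

// Join approach: shuffle both keyed RDDs and keep the big side's values.
val joined = bigRdd.join(smallRdd).map { case (k, (bigValue, _)) => (k, bigValue) }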
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-filter-vs-RDD-join-advice-please-tp22612p22614.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.