In Spark 1.6
if I do (column name has dot in it, but is not a nested column):
df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))scala> df =
df.withColumn("raw.hourOfDay",
df.col("`raw.hourOfDay`"))org.apache.spark.sql.AnalysisException: cannot
resolve 'raw.minOfDay' given input columns raw.hourOfDay_2, raw.dayOfWeek,
raw.sensor2, raw.hourOfDay, raw.minOfDay; at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
scala.collection.AbstractTraversable.map(Traversable.scala:105) at
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727) at
scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
but if I do:
df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`"))scala>
df.printSchema
root
|-- raw.hourOfDay: long (nullable = true)
|-- raw.minOfDay: long (nullable = true)
|-- raw.dayOfWeek: long (nullable = true)
|-- raw.sensor2: long (nullable = true)
|-- raw.hourOfDay_2: long (nullable = true)
it works fine (i.e. column is created).
The only difference is that the name "raw.hourOfDay_2" does not exist yet, and
is properly created as a colName with dot, not as a nested column.
The documentation however says that if the column exists it will replace it,
but it seems there is a miss-interpretation of the column name as a nested
column
defwithColumn(colName: String, col: Column): DataFrameReturns a new DataFrame
by adding a column or replacing the existing column that has the same name.
Any thoughts on why the different behavior when the column exists?
Thanks