Hello everyone,
after one month without any reply on stackoverflow (
https://stackoverflow.com/questions/47789265/inconsistency-in-handling-duplicate-names-in-dataframe-schema)
I try to pose the question here.

Context: I am refactoring some code of mine, transforming scala methods
with a signature of the form methodname(df: DataFrame, **params**):
DataFrame into a subclass of org.apache.spark.ml.Transformer, following
this document: custom transformer doc (my dev is targeting Spark 2.2.0, if
relevant).

I have been trying to clear my mind on how the transformSchema method
should be used, but I have realized that there is general agreement on how
to use them (SPARK-14760). I have decided to invoke such method from my
transform methods when there are chances to catch type errors in advance.

The transformSchema method of one of these classes reads as follows:

override def transformSchema(schema: StructType): StructType = {
  val newFields = schema.fields :+ new StructField($(outputCol),
                                                   DataTypes.StringType,
false)
  DataTypes.createStructType(newFields)
}

I have written some unit tests to test my custom transformers:

test("Use of ambiguous column names") {
  val df1 = sqlContext.createDataFrame(
    Seq(
      ("a", "1a", 2),
      ("b", "2a", 4),
      ("c", "3a", 5)
    )
  ).toDF("c1", "c2", "c3")

  val df2 = sqlContext.createDataFrame(
    Seq(
      ("a", "1b", 1),
      ("b", "2b", 3),
      ("c", "3b", 6)
    )
  ).toDF("c1", "c2", "c3")

  val df_join = df1.join(df2, "c1")

  val columnsToConcatenate = Array(df1("c2"), df2("c2"))
  [...transformer instantiation and invocation...]
}

To my surprise the unit test above failed by raising an exception:

java.lang.IllegalArgumentException: fields should have distinct names.
at org.apache.spark.sql.types.DataTypes.createStructType(DataTypes.java:219)

Analysis:

By inspecting the source code of createStructType method, I have realized
that duplicate names are not allowed.

Of course I can bypass the factory method and generate my StructType by
directly using its constructor, but I fail to see the point of this "check"
that fails with legal dataframes, and I generally tend to use factories
over constructors, if available.

There is no explanation for not supporting duplicate names in the method
documentation, and its semantics is, to my opinion, unexpected.

Question: Is this an inconsistency or there is a reason for the semantics
adopted by createStructType (that is, no dups)?

Best regards,
Alessandro

Reply via email to