With Spark 2.x, I construct a Dataframe from a sample libsvm file: scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm") higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]
Then, build a new dataframe that involves an except( ) scala> val train_df = higgsDF.sample(false, 0.7, 42) train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector] scala> val test_df = input_df.except(train_df) test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector] Now, most operations on the test_df fail with this exception: scala> test_df.show() java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179) at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117) at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110) . . Debugging this, I see that this is the schema of this dataframe: scala> test_df.schema res4: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) Looking a little deeper, the error occurs because the QueryPlanner ends up inside object ExtractEquiJoinKeys (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala) where it processes a LeftAnti Join. Then there is an attempt to generate a default Literal value for the org.apache.spark.ml.linalg.VectorUDT DataType which fails with the above exception. This is because there is no match for the VectorUDT in def default(dataType: DataType): Literal = {..} (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/literals.scala) Any processing on this dataframe that causes Spark to build a query plan (i.e. almost all productive uses of this dataframe) fails due to this exception. Is it a miss in the Literal implementation that it does not handle UserDefinedTypes or is it left out intentionally? Is there a way to get around this problem? This problem seems to be present in all 2.x version. Regards, Vinayak Joshi