Hi Vinayak,

Thanks for reporting this.
I don't think it was left out intentionally for UserDefinedType. If you already know how the UDT is represented in the internal format, you can explicitly convert the UDT column to other SQL types to get around this problem. It is a bit hacky, anyway. I submitted a PR to fix this, but I'm not sure whether it will get into master soon.

vijoshi wrote
> With Spark 2.x, I construct a DataFrame from a sample libsvm file:
>
> scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
> higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> Then, I build a new dataframe that involves an except():
>
> scala> val train_df = higgsDF.sample(false, 0.7, 42)
> train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
>
> scala> val test_df = higgsDF.except(train_df)
> test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
>
> Now, most operations on test_df fail with this exception:
>
> scala> test_df.show()
> java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
>   at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
>   ...
>
> Debugging this, I see that this is the schema of the dataframe:
>
> scala> test_df.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
>
> Looking a little deeper, the error occurs because the QueryPlanner ends up inside
>
> object ExtractEquiJoinKeys
> (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala)
>
> where it processes a LeftAnti join.
> Then there is an attempt to generate a default Literal value for the org.apache.spark.ml.linalg.VectorUDT DataType, which fails with the above exception. This is because there is no match for the VectorUDT in
>
> def default(dataType: DataType): Literal = {..}
> (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala)
>
> Any processing on this dataframe that causes Spark to build a query plan (i.e. almost all productive uses of this dataframe) fails due to this exception.
>
> Is it a miss in the Literal implementation that it does not handle UserDefinedTypes, or is it left out intentionally? Is there a way to get around this problem? This problem seems to be present in all 2.x versions.
>
> Regards,
> Vinayak Joshi

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Dataframe-resulting-from-an-except-is-unusable-tp20802p20812.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
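P.S. To illustrate why the planner blows up, here is a self-contained toy sketch of the dispatch problem in Literal.default. The types below are simplified stand-ins invented for this example, not the real Catalyst classes, and the UDT case shows the general shape of a fix (recursing into the UDT's underlying sqlType) rather than the exact contents of the PR:

```scala
// Toy stand-ins for Catalyst's DataType hierarchy (NOT the real Spark classes).
sealed trait DataType
case object DoubleType extends DataType
case object StringType extends DataType // a type our toy default() doesn't know
case class ArrayType(elem: DataType) extends DataType
// A UDT wraps an underlying SQL type, the way VectorUDT wraps a plain struct/array.
case class UserDefinedType(sqlType: DataType) extends DataType

def default(dataType: DataType): Any = dataType match {
  case DoubleType   => 0.0
  case ArrayType(_) => Seq.empty[Any]
  // The key case: delegate to the UDT's underlying SQL representation.
  // Without this line, a UDT falls through to the catch-all below and
  // throws, exactly like "no default for type VectorUDT" in the report.
  case UserDefinedType(underlying) => default(underlying)
  case other => throw new RuntimeException(s"no default for type $other")
}

// A VectorUDT-like type, stored internally as (roughly) an array of doubles.
val vectorUDT = UserDefinedType(ArrayType(DoubleType))
println(default(vectorUDT)) // recurses to the array default instead of throwing
```

The point being: a UDT already declares how it is stored (its sqlType), so a default literal for the underlying representation is a reasonable default for the UDT itself.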