With Spark 2.x, I construct a DataFrame from a sample libsvm file:

scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]


Then I build a new DataFrame using except():

scala> val train_df = higgsDF.sample(false, 0.7, 42)
train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

scala> val test_df = higgsDF.except(train_df)
test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]

Now most operations on test_df fail with this exception:

scala> test_df.show()
java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
  at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
  at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
   .
   .

Debugging this, I see that the DataFrame has this schema:

scala> test_df.schema
res4: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

Looking a little deeper, the error occurs because the QueryPlanner ends up inside

  object ExtractEquiJoinKeys
  (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala)

where it processes a LeftAnti join. There it attempts to generate a default
Literal value for the org.apache.spark.ml.linalg.VectorUDT DataType, which
fails with the above exception because there is no match for VectorUDT in

  def default(dataType: DataType): Literal = {..}
  (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala)
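
To illustrate the failure mode, here is a minimal standalone sketch. It does
not use Spark's real classes: DataType, DoubleType, StringType,
UserDefinedType and LiteralModel are hypothetical stand-ins, and nullSafeKey
only models my assumption that the planner pairs each null-safe join key with
a default literal for its type. It is meant to show the shape of the problem,
not the actual Catalyst code.

```scala
// Hypothetical stand-ins for Spark's type hierarchy; NOT the real Catalyst classes.
sealed trait DataType
case object DoubleType extends DataType
case object StringType extends DataType
// Models a user-defined type such as VectorUDT.
final case class UserDefinedType(name: String) extends DataType

object LiteralModel {
  // Built-in types have a default value; any other type falls through and throws.
  def default(dataType: DataType): Any = dataType match {
    case DoubleType => 0.0
    case StringType => ""
    case other      => throw new RuntimeException(s"no default for type $other")
  }

  // Assumed shape of the null-safe key rewrite: a null key is replaced by the
  // type's default. The default is computed eagerly (as it would be while
  // building a plan), so an unsupported type fails even when no key is null.
  def nullSafeKey(value: Option[Any], dataType: DataType): Any = {
    val fallback = default(dataType)
    value.getOrElse(fallback)
  }
}
```

In this model, nullSafeKey(Some(1.0), DoubleType) returns the key unchanged,
while any call with UserDefinedType("VectorUDT") throws a RuntimeException
before the key value is even inspected, which matches what I see with
test_df.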


Any processing on this DataFrame that causes Spark to build a query plan
(i.e. almost all productive uses of it) fails with this exception.

Is it an oversight in the Literal implementation that it does not handle
UserDefinedTypes, or were they left out intentionally? Is there a way to get
around this problem? It seems to be present in all 2.x versions.

Regards,
Vinayak Joshi

