Hi Vinayak,

Thanks for reporting this.
I don't think it was left out intentionally for UserDefinedType. If you already know how the UDT is represented in the internal format, you can explicitly convert the UDT column to other SQL types to get around this problem. It is a bit hacky, anyway. I submitted a PR to fix this, but I'm not sure whether it will get into master soon.

vijoshi wrote
> With Spark 2.x, I construct a DataFrame from a sample libsvm file:
>
> scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
> higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> Then, I build a new dataframe that involves an except():
>
> scala> val train_df = higgsDF.sample(false, 0.7, 42)
> train_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
>
> scala> val test_df = higgsDF.except(train_df)
> test_df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
>
> Now, most operations on test_df fail with this exception:
>
> scala> test_df.show()
> java.lang.RuntimeException: no default for type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
>   at org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:117)
>   at org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:110)
>   ...
>
> Debugging this, I see that this is the schema of the dataframe:
>
> scala> test_df.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(label,DoubleType,true), StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
>
> Looking a little deeper, the error occurs because the QueryPlanner ends up inside
>
> object ExtractEquiJoinKeys
> (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala)
>
> where it processes a LeftAnti join.
> Then there is an attempt to generate a default Literal value for the org.apache.spark.ml.linalg.VectorUDT DataType, which fails with the above exception. This is because there is no match for the VectorUDT in
>
> def default(dataType: DataType): Literal = {..}
> (/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala)
>
> Any processing on this dataframe that causes Spark to build a query plan (i.e. almost all productive uses of this dataframe) fails due to this exception.
>
> Is it a miss in the Literal implementation that it does not handle UserDefinedTypes, or is it left out intentionally? Is there a way to get around this problem? This problem seems to be present in all 2.x versions.
>
> Regards,
> Vinayak Joshi

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Dataframe-resulting-from-an-except-is-unusable-tp20802p20812.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
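P.S. To illustrate why the planner blows up, here is a self-contained toy sketch of the dispatch problem in Literal.default. The types below are simplified stand-ins invented for this example, not the real Catalyst classes, and the UDT case shows the general shape of a fix (recursing into the UDT's underlying sqlType) rather than the exact contents of the PR:

```scala
// Toy stand-ins for Catalyst's DataType hierarchy (NOT the real Spark classes).
sealed trait DataType
case object DoubleType extends DataType
case object StringType extends DataType // a type our toy default() doesn't know
case class ArrayType(elem: DataType) extends DataType
// A UDT wraps an underlying SQL type, the way VectorUDT wraps a plain struct/array.
case class UserDefinedType(sqlType: DataType) extends DataType

def default(dataType: DataType): Any = dataType match {
  case DoubleType   => 0.0
  case ArrayType(_) => Seq.empty[Any]
  // The key case: delegate to the UDT's underlying SQL representation.
  // Without this line, a UDT falls through to the catch-all below and
  // throws, exactly like "no default for type VectorUDT" in the report.
  case UserDefinedType(underlying) => default(underlying)
  case other => throw new RuntimeException(s"no default for type $other")
}

// A VectorUDT-like type, stored internally as (roughly) an array of doubles.
val vectorUDT = UserDefinedType(ArrayType(DoubleType))
println(default(vectorUDT)) // recurses to the array default instead of throwing
```

The point being: a UDT already declares how it is stored (its sqlType), so a default literal for the underlying representation is a reasonable default for the UDT itself.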