Hi,
Consider the following code using spark.ml to extract the probability column from a data set:
model.transform(dataSet)
.selectExpr("probability.values")
.printSchema()
Note that "probability" is of `VectorUDT` type, a user-defined type (UDT) with the following implementation:
class VectorUDT extends UserDefinedType[Vector] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
    // vectors. The "values" field is nullable because we might want to add binary vectors later,
    // which uses "size" and "indices", but not "values".
    StructType(Seq(
      StructField("type", ByteType, nullable = false),
      StructField("size", IntegerType, nullable = true),
      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
  }

  // ...
}
`values` is one of its attributes; however, it cannot be extracted.
The first code snippet fails with an AnalysisException raised from complexTypeExtractors:
org.apache.spark.sql.AnalysisException: Can't extract value from probability#743;
at ...
at ...
at ...
...
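In the meantime, a possible workaround (a sketch only, not tested against a live cluster; note that depending on the Spark version the class may be org.apache.spark.mllib.linalg.Vector instead of org.apache.spark.ml.linalg.Vector) is to unwrap the vector with a UDF rather than field extraction:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the UDT-typed column to a plain array<double> column,
// bypassing the extractor that cannot see through VectorUDT.
val vectorToArray = udf { v: Vector => v.toArray }

model.transform(dataSet)
  .select(vectorToArray(col("probability")).as("values"))
  .printSchema()
```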
Here is the code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L49
It seems that the pattern matching there does not take UDTs into account.
Is this intended behavior? If not, I would like to open a PR to fix it.
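For illustration, here is a self-contained toy sketch of the fix I have in mind (these are simplified stand-in types, not Spark's actual classes): when the extractor meets a UDT, it could recurse into the UDT's sqlType instead of failing:

```scala
// Toy data-type hierarchy mimicking the relevant shape of Catalyst types.
sealed trait DataType
case object DoubleType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType
abstract class UserDefinedType extends DataType { def sqlType: DataType }

// Simplified field-extraction resolution: struct fields resolve by name;
// the proposed extra case unwraps a UDT and matches on its sqlType.
def extractFieldType(dt: DataType, name: String): Option[DataType] = dt match {
  case StructType(fields)     => fields.find(_.name == name).map(_.dataType)
  case udt: UserDefinedType   => extractFieldType(udt.sqlType, name) // proposed case
  case _                      => None
}

// Toy stand-in for VectorUDT: its sqlType exposes a "values" field.
object VectorLikeUDT extends UserDefinedType {
  val sqlType: DataType = StructType(Seq(StructField("values", ArrayType(DoubleType))))
}
```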
--
Hao Ren
Data Engineer @ leboncoin
Paris, France