Hi,
Consider the following code using spark.ml to extract the probability column from a data set:
model.transform(dataSet)
.selectExpr("probability.values")
.printSchema()
Note that "probability" is of `VectorUDT` type, a user-defined type (UDT) with the following implementation:
class VectorUDT extends UserDefinedType[Vector] {

  override def sqlType: StructType = {
    // type: 0 = sparse, 1 = dense
    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
    // vectors. The "values" field is nullable because we might want to add binary vectors later,
    // which uses "size" and "indices", but not "values".
    StructType(Seq(
      StructField("type", ByteType, nullable = false),
      StructField("size", IntegerType, nullable = true),
      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
  }

  // ...
}
`values` is one of its attributes; however, it cannot be extracted.
The first code snippet fails with an AnalysisException raised from complexTypeExtractors:
org.apache.spark.sql.AnalysisException: Can't extract value from probability#743;
at ...
at ...
at ...
...
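In the meantime, a possible workaround (a sketch only, not tested against a live cluster; note that depending on the Spark version the class may be org.apache.spark.mllib.linalg.Vector instead of org.apache.spark.ml.linalg.Vector) is to unwrap the vector with a UDF rather than field extraction:

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert the UDT-typed column to a plain array<double> column,
// bypassing the extractor that cannot see through VectorUDT.
val vectorToArray = udf { v: Vector => v.toArray }

model.transform(dataSet)
  .select(vectorToArray(col("probability")).as("values"))
  .printSchema()
```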
Here is the code:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L49
It seems that the pattern matching there does not take UDTs into account.
Is this intended behavior? If not, I would like to open a PR to fix it.
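For illustration, here is a self-contained toy sketch of the fix I have in mind (these are simplified stand-in types, not Spark's actual classes): when the extractor meets a UDT, it could recurse into the UDT's sqlType instead of failing:

```scala
// Toy data-type hierarchy mimicking the relevant shape of Catalyst types.
sealed trait DataType
case object DoubleType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType
abstract class UserDefinedType extends DataType { def sqlType: DataType }

// Simplified field-extraction resolution: struct fields resolve by name;
// the proposed extra case unwraps a UDT and matches on its sqlType.
def extractFieldType(dt: DataType, name: String): Option[DataType] = dt match {
  case StructType(fields)     => fields.find(_.name == name).map(_.dataType)
  case udt: UserDefinedType   => extractFieldType(udt.sqlType, name) // proposed case
  case _                      => None
}

// Toy stand-in for VectorUDT: its sqlType exposes a "values" field.
object VectorLikeUDT extends UserDefinedType {
  val sqlType: DataType = StructType(Seq(StructField("values", ArrayType(DoubleType))))
}
```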
--
Hao Ren
Data Engineer @ leboncoin
Paris, France