DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

Aleksander Eskilson Thu, 13 Oct 2016 12:35:31 -0700

Hi there,

Working in the space of custom Encoders/ExpressionEncoders, I've noticed
that the StructType schema set when creating an ExpressionEncoder[T] [1]
is not the schema actually used to type the columns of a Dataset created
by calling the .as(encoder) method [2] on read data. Instead, the schema
is either inferred through analysis of the data or provided through the
.schema(structType) method [3] of the DataFrameReader. However, using the
.schema(..) method of DataFrameReader has a potentially undesirable
effect: while the DataSource is being resolved, every StructField of the
StructType schema has its nullability set to *true* (via the asNullable
function of StructType) [4] when the data is read from a non-streaming,
file-based source.
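
To make the report easier to reproduce, here is a minimal sketch of the
behaviour described above. It assumes a local Spark 2.x session; the Person
case class, the temp JSON file, and the object name are hypothetical and
exist only for illustration. The sketch prints the schemas rather than
asserting a result, so the outcome can be checked against your own Spark
version.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object NullableSchemaDemo {
  // Schema handed to the reader: every field is declared non-nullable.
  val strictSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", IntegerType, nullable = false)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullable-schema-demo")
      .getOrCreate()
    import spark.implicits._

    // Write a one-record JSON-lines file so there is a file-based source to read.
    val path = Files.createTempFile("people", ".json")
    Files.write(path, """{"name":"Ada","age":36}""".getBytes("UTF-8"))

    // Non-streaming, file-based read with an explicit schema, then .as[T].
    val ds = spark.read.schema(strictSchema).json(path.toString).as[Person]

    // Schema carried by the encoder vs. the schema the Dataset reports.
    println(Encoders.product[Person].schema.treeString)
    println(ds.schema.treeString)

    // If the behaviour described above holds, these fields are reported as
    // nullable even though strictSchema declared them non-nullable.
    println(ds.schema.fields.map(f => s"${f.name}: nullable=${f.nullable}").mkString(", "))

    spark.stop()
  }
}
```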


Of course, allowing null values where they shouldn't exist can weaken the
type guarantees for Datasets over certain types of encoded data.

Thinking about how this might be resolved: first, if this is a legitimate
bug, I'm not sure why non-streaming, file-based data sources need to have
all of their StructFields rendered nullable, and simply removing the call
to asNullable would fix the issue. Second, if it's actually necessary for
most filesystem-read data sources to have their StructFields potentially
nullable in this manner, we could instead let the StructType schema
provided to the Encoder have the final say in the Dataset's schema.

This latter option seems sensible to me: if a client is willing to provide
a custom Encoder via the .as(..) method on the reader, then presumably, in
setting the schema field of the encoder, they have some legitimate notion
of how their object's types should be mapped to Dataset column types. Any
failure when resolving their data to a Dataset by means of their Encoder
can then be traced to their Encoder for their own debugging.
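
For comparison, the effect of letting a chosen schema have the final say
can be approximated today at the user level. This is only a rough sketch,
not the proposed change to Spark itself: withSchema and EncoderSchemaWins
are hypothetical names, and the approach assumes that
SparkSession.createDataFrame(rowRDD, schema) keeps the supplied StructType
as given. I would not assume Spark validates the data against the declared
nullability up front, so a null in a non-nullable field may only surface
later as a runtime failure.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.types.StructType

object EncoderSchemaWins {
  // Hypothetical helper: rebuild the Dataset from its rows so that it
  // reports `schema` instead of the schema produced by the reader.
  def withSchema[T: Encoder](spark: SparkSession,
                             ds: Dataset[T],
                             schema: StructType): Dataset[T] =
    spark.createDataFrame(ds.toDF().rdd, schema).as[T]
}
```

A caller would pass whichever schema should win, for example the encoder's
own schema, and could then inspect the schema of the returned Dataset to
confirm that the nullability flags were preserved.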

Thoughts? Thanks,
Alek Eskilson

[1] -
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L213
[2] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L374
[3] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L62
[4] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L426
