Hi there,

Working with custom Encoders/ExpressionEncoders, I've noticed that the StructType schema set when constructing an ExpressionEncoder[T] [1] is not the schema actually used to set the column types of a Dataset created by calling the .as(encoder) method [2] on read data. Instead, the schema is either inferred by analysing the data, or provided through the .schema(structType) method [3] of the DataFrameReader.

However, using the .schema(..) method of DataFrameReader leads to potentially undesirable behaviour: while the DataSource is being resolved, every StructField of the StructType schema has its nullability set to *true* (via the asNullable function of StructType) [4] when the data is read from a non-streaming, file-based source.
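
To make the behaviour concrete, here is a minimal sketch, assuming Spark 2.x with a local SparkSession. The object name, the temp file, and the field names are made up for illustration; only the .schema(..) call and the schema comparison matter.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical demo object, not part of Spark.
object NullableSchemaDemo {

  // The schema we would like the Dataset to carry: no field may be null.
  val declared: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = false)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nullable-schema-demo")
      .getOrCreate()

    // Write a one-record JSON file so the example is self-contained.
    val path = Files.createTempFile("people", ".json")
    Files.write(path, """{"id": 1, "name": "a"}""".getBytes(StandardCharsets.UTF_8))

    // Read through a file-based, non-streaming source with an explicit schema.
    val df = spark.read.schema(declared).json(path.toString)

    println("Declared schema:")
    declared.printTreeString()

    println("Schema reported by the DataFrame:")
    df.schema.printTreeString()

    spark.stop()
  }
}
```

If the behaviour described above holds for your Spark version, the second tree should report nullable = true for both fields even though the declared schema says otherwise.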

Of course, allowing null values where they shouldn't exist can weaken the type guarantees for Datasets over certain kinds of encoded data.

Thinking about how this might be resolved: first, if it's a legitimate bug, I'm not sure why "non-streaming file based" datasources need to have all their StructFields rendered nullable. Simply removing the call to asNullable would fix the issue. Second, if it's actually necessary for most filesystem-read data sources to have their StructFields potentially nullable in this manner, we could instead let the StructType schema provided to the Encoder have the final say in the Dataset's schema.

This latter option seems sensible to me: if a client is willing to provide a custom Encoder via the .as(..) method on the read data, presumably in setting the schema field of the encoder they have some legitimate notion of how their object's types should be mapped to Dataset column types. Any failure when resolving their data to a Dataset by means of their Encoder can then be traced to their Encoder for their own debugging.

Thoughts?

Thanks,
Alek Eskilson

[1] - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L213
[2] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L374
[3] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L62
[4] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L426