If I create a DataFrame in Spark with non-nullable columns and save it to disk as a Parquet file, the columns are correctly marked as non-nullable in the Parquet schema; I confirmed this with parquet-tools. But when I load the file back, Spark forces nullable back to true on every column.
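
A minimal sketch of the round trip I'm describing (Spark 2.x API; the path is just a placeholder):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Two columns explicitly declared non-nullable.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = false)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))),
      schema)

    df.printSchema()  // both fields show nullable = false

    df.write.mode("overwrite").parquet("/tmp/non_nullable_example")

    // Reading it back: every field now reports nullable = true, even though
    // parquet-tools shows the columns as "required" in the file footer.
    spark.read.parquet("/tmp/non_nullable_example").printSchema()
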
The schema is made nullable here:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` call, Spark behaves exactly as I'd like by default, reading the data with the schema from the Parquet file or the one I provide. That line of code goes back about a year now, and I've seen a variety of discussions about this issue, in particular this one with Michael: https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those threads were about writing rather than reading, though, and writing is already supported.

Is this forced-nullable behavior on read still desirable? Is it potentially not applicable to all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to add an option to DataFrameReader to disable it?
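
For reference, the same thing appears to happen when I supply the schema explicitly on read (again just a sketch, same placeholder path as above):

    // Even with a user-specified schema, the fields come back nullable = true
    // once .asNullable is applied on the read path.
    val readBack = spark.read.schema(schema).parquet("/tmp/non_nullable_example")
    readBack.printSchema()
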