If I create a DataFrame in Spark with non-nullable columns and save it to disk as a Parquet file, the columns are correctly marked as non-nullable in the Parquet schema; I confirmed this with parquet-tools. But when I load the file back, Spark forces nullable back to true on every column.
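
A minimal sketch of the round trip I'm describing (Spark 2.x API; the path is just a placeholder):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Two columns explicitly declared non-nullable.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = false)))

    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b"))),
      schema)

    df.printSchema()  // both fields show nullable = false

    df.write.mode("overwrite").parquet("/tmp/non_nullable_example")

    // Reading it back: every field now reports nullable = true, even though
    // parquet-tools shows the columns as "required" in the file footer.
    spark.read.parquet("/tmp/non_nullable_example").printSchema()
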
The schema is made nullable here:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L378

If I remove the `.asNullable` call, Spark behaves exactly as I'd like by default, reading the data with the schema from the Parquet file or the one I provide. That line of code goes back about a year now, and I've seen a variety of discussions about this issue, in particular this one with Michael: https://www.mail-archive.com/user@spark.apache.org/msg39230.html. Those threads were about writing rather than reading, though, and writing is already supported.

Is this forced-nullable behavior on read still desirable? Is it potentially not applicable to all file formats and situations (e.g. HDFS/Parquet)? Would it be suitable to add an option to DataFrameReader to disable it?
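
For reference, the same thing appears to happen when I supply the schema explicitly on read (again just a sketch, same placeholder path as above):

    // Even with a user-specified schema, the fields come back nullable = true
    // once .asNullable is applied on the read path.
    val readBack = spark.read.schema(schema).parquet("/tmp/non_nullable_example")
    readBack.printSchema()
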