[
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575599#comment-15575599
]
Aleksander Eskilson commented on SPARK-17939:
---------------------------------------------
[~marmbrus] suggested the opening of this issue after a bit of discussion in
the email list [1].
I'd like to clarify what I proposed in this newer context. First, a
clarification of what nullability means in the current API, as in the GitHub
issue linked above, would be great. Second, defaulting to nullable in the
reader makes sense in many instances, as it supports more loosely-typed data
sources like JSON and CSV.
However, apart from an analysis-time hint to the Catalyst optimizer, there are
instances where a (potentially separate?) enforcement-level notion of
nullability would be quite useful. With custom encoders now a possibility,
other kinds of more strongly-typed data might be read into Datasets, e.g.
Avro. Avro's UNION type with NULL gives us a harder notion of a truly nullable
vs. non-nullable type. It was suggested in the other Jira issue linked above
that the current contract is for users to make sure they do not pass bad data
into the reader (which currently performs conversions that might surprise the
user, like from null to 0).
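To make the contrast concrete, here is a minimal pure-Scala sketch (no Spark;
all names here are illustrative, not actual Spark API) of the difference
between the current silent-coercion behavior and the fail-fast,
enforcement-level nullability being proposed:

```scala
// Hypothetical field-level nullability contract. A non-nullable field
// rejects null at read time instead of silently coercing it.
case class FieldSpec(name: String, nullable: Boolean)

def validate(spec: FieldSpec, value: Any): Any = {
  if (value == null && !spec.nullable)
    throw new IllegalArgumentException(
      s"Field '${spec.name}' is non-nullable but got null")
  value
}

// By contrast, the current reader behavior can silently coerce a null
// into a primitive default, which is the surprise described above.
def coerceToInt(value: Any): Int = value match {
  case null   => 0 // null silently becomes 0
  case i: Int => i
}
```

Under a schema derived from a strongly-typed source like Avro, the
`validate`-style check would let a bad record fail at the offending field
rather than propagate a fabricated 0 downstream.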
What I mean to suggest is that a type-level notion of nullability could help
us fail faster and abide by our own data contracts when reading data into
Datasets from more strongly-typed sources with known schemas.
Thoughts on this?
> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> ------------------------------------------------------------------
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Aleksander Eskilson
> Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets
> creates some confusion. As has been pointed out previously [1], Nullability
> is a hint to the Catalyst optimizer and is not meant to be a type-level
> enforcement. Allowing null fields can also help the reader successfully parse
> certain types of more loosely-typed data, like JSON and CSV, where null
> values are common, rather than just failing.
> There's already been some movement to clarify the meaning of Nullable in the
> API, but also some requests for a (perhaps completely separate) type-level
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] -
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]