[
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575599#comment-15575599
]
Aleksander Eskilson commented on SPARK-17939:
---------------------------------------------
[~marmbrus] suggested the opening of this issue after a bit of discussion in
the email list [1].
I'd like to clarify what I proposed in this newer context. First, a
clarification of what nullability means in the current API, as in the GitHub
issue linked above, would be great. Second, defaulting to nullable in the
reader makes sense in many instances, as it supports more loosely-typed data
sources like JSON and CSV.
However, apart from an analysis-time hint to the Catalyst optimizer, there are
instances where a (potentially separate?) enforcement-level notion of
nullability would be quite useful. With custom encoders now a possibility,
other kinds of more strongly-typed data might be read into Datasets, e.g.
Avro. Avro's UNION type with NULL gives us a harder notion of a truly nullable
vs. non-nullable type. It was suggested in the other Jira issue linked above
that the current contract is for users to make sure they do not pass bad data
into the reader (which currently performs conversions that might surprise the
user, like from null to 0).
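To make the contrast concrete, here is a minimal pure-Scala sketch (no Spark;
all names here are illustrative, not actual Spark API) of the difference
between the current silent-coercion behavior and the fail-fast,
enforcement-level nullability being proposed:

```scala
// Hypothetical field-level nullability contract. A non-nullable field
// rejects null at read time instead of silently coercing it.
case class FieldSpec(name: String, nullable: Boolean)

def validate(spec: FieldSpec, value: Any): Any = {
  if (value == null && !spec.nullable)
    throw new IllegalArgumentException(
      s"Field '${spec.name}' is non-nullable but got null")
  value
}

// By contrast, the current reader behavior can silently coerce a null
// into a primitive default, which is the surprise described above.
def coerceToInt(value: Any): Int = value match {
  case null   => 0 // null silently becomes 0
  case i: Int => i
}
```

Under a schema derived from a strongly-typed source like Avro, the
`validate`-style check would let a bad record fail at the offending field
rather than propagate a fabricated 0 downstream.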
What I mean to suggest is that a type-level notion of nullability could help
us fail faster and abide by our own data contracts when reading data into
Datasets from more strongly-typed sources with known schemas.
Thoughts on this?
> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> ------------------------------------------------------------------
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Aleksander Eskilson
> Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets
> creates some confusion. As has been pointed out previously [1], Nullability
> is a hint to the Catalyst optimizer and is not meant to be a type-level
> enforcement. Allowing null fields can also help the reader successfully parse
> certain types of more loosely-typed data, like JSON and CSV, where null
> values are common, rather than just failing.
> There's already been some movement to clarify the meaning of Nullable in the
> API, but also some requests for a (perhaps completely separate) type-level
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] -
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]