Re: The null in Flink

Aljoscha Krettek Fri, 27 Nov 2015 02:41:41 -0800

Oh, this is probably the Jira for what I mentioned: 
https://issues.apache.org/jira/browse/FLINK-2988


> On 27 Nov 2015, at 11:02, Aljoscha Krettek <aljos...@apache.org> wrote:
> 
> Hi,
> just some information. The Table API code generator already has preliminary 
> support for generating code that is NULL-aware. So for example if you have 
> expressions like 1 + NULL the result would also be null.
> 
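A minimal sketch of the NULL-aware evaluation described above, written in Python rather than as generated Table API code. The function name null_aware_add is hypothetical and None stands in for NULL; the sketch only illustrates the rule that an arithmetic expression with a NULL operand yields NULL.

```python
from typing import Optional


def null_aware_add(left: Optional[int], right: Optional[int]) -> Optional[int]:
    # Check both operands for NULL before touching the values,
    # so the addition itself never sees a None.
    if left is None or right is None:
        return None
    return left + right


print(null_aware_add(1, 2))     # 3
print(null_aware_add(1, None))  # None
```

A generated expression would presumably inline such a check per operand instead of calling a helper, but the observable result is the same.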
> I think one of the missing pieces is a way to get data that contains null 
> values into the system. For example, right now the expected way to read CSV 
> files is via tuples and they don’t support null values. I think we need a way 
> to directly read CSV files into a Row DataSet (or Table).
> 
> Cheers,
> Aljoscha
>> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote:
>> 
>> Hi!
>> 
>> Thanks for the good discussion! Here are some thoughts from my side:
>> 
>> 1)
>> I would agree with Chengxiang that it helps to have as much NULL handling
>> in the Table API as possible, since most SQL constructs will be permitted
>> there as well.
>> 
>> 2)
>> A question that I have is whether we want to actually follow the SQL
>> standard exactly. There is a lot of criticism of NULL in the SQL standard,
>> and there have been many good proposals for more meaningful semantics (for
>> example differentiate between the meanings "value missing", "value
>> unknown", "value not applicable", etc).
>> 
>> Going with the SQL way is easiest and makes adding SQL on top of the
>> Table API much easier. Also, there is only one type of NULL, meaning that
>> null values can be encoded efficiently in bitmaps. Furthermore, since
>> Table API users have the power of a programming language at hand
>> (rather than the limited set of SQL operators), they should be able to
>> easily define their own constants for special meanings like "value not
>> applicable" or so.
>> 
>> Just curious if anyone has experience with some of the other null-semantics
>> proposals that have been around.
>> 
>> 3)
>> One comment concerning the three-value-logic for boolean expressions:
>> 
>> A while back, I worked on a SQL engine, and we were able to avoid
>> implementing three-valued logic with a trick. If I recall correctly, it
>> was like this:
>> 
>> If the boolean expressions are monotone (have no NOT), then the UNKNOWN
>> value can be treated the same as FALSE. So the query planner had to rewrite
>> all expression trees to have no NOT, which means pushing the NOT down into
>> the leaf comparison operations (for example, push NOT into == to become !=).
>> These leaf comparison operators needed to be NULL-aware and return FALSE on
>> comparisons with NULL.
>> 
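The rewrite described above can be sketched as follows. This is an illustration under assumptions, not the engine Stephan refers to: expressions are hypothetical nested tuples, rows are dicts, None stands in for NULL, and all function names are made up for the example. push_not removes every NOT via De Morgan's laws and operator negation, eval_two_valued evaluates the rewritten tree with plain booleans, and eval_three_valued is a reference implementation of SQL-style three-valued logic used only to check the result.

```python
import itertools
import operator

# Each comparison maps to its negation, so a NOT can be absorbed by a leaf.
NEGATED = {"==": "!=", "!=": "==", "<": ">=", ">=": "<", ">": "<=", "<=": ">"}
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">=": operator.ge, ">": operator.gt, "<=": operator.le}


def push_not(expr, negate=False):
    """Return an equivalent tree without NOT nodes."""
    kind = expr[0]
    if kind == "not":
        return push_not(expr[1], not negate)
    if kind in ("and", "or"):
        if negate:  # De Morgan: NOT (a AND b) == (NOT a) OR (NOT b)
            kind = "or" if kind == "and" else "and"
        return (kind, push_not(expr[1], negate), push_not(expr[2], negate))
    _, op, left, right = expr  # leaf: ("cmp", op, field, field)
    return ("cmp", NEGATED[op] if negate else op, left, right)


def eval_two_valued(expr, row):
    """Evaluate a NOT-free tree; comparisons with NULL are simply False."""
    kind = expr[0]
    if kind == "not":
        raise ValueError("rewrite the expression with push_not first")
    if kind == "and":
        return eval_two_valued(expr[1], row) and eval_two_valued(expr[2], row)
    if kind == "or":
        return eval_two_valued(expr[1], row) or eval_two_valued(expr[2], row)
    _, op, left, right = expr
    a, b = row[left], row[right]
    if a is None or b is None:
        return False
    return OPS[op](a, b)


def eval_three_valued(expr, row):
    """Reference semantics: returns True, False, or None for UNKNOWN."""
    kind = expr[0]
    if kind == "not":
        value = eval_three_valued(expr[1], row)
        return None if value is None else not value
    if kind in ("and", "or"):
        left = eval_three_valued(expr[1], row)
        right = eval_three_valued(expr[2], row)
        decisive = kind == "or"  # True decides OR, False decides AND
        if left is decisive or right is decisive:
            return decisive
        if left is None or right is None:
            return None
        return not decisive
    _, op, a_name, b_name = expr
    a, b = row[a_name], row[b_name]
    if a is None or b is None:
        return None
    return OPS[op](a, b)


# NOT (a == b OR NOT (a < c)) becomes (a != b AND a < c).
expr = ("not", ("or", ("cmp", "==", "a", "b"), ("not", ("cmp", "<", "a", "c"))))
rewritten = push_not(expr)
for a, b, c in itertools.product([None, 1, 2], repeat=3):
    row = {"a": a, "b": b, "c": c}
    # A filter keeps a row only when the predicate is TRUE, so UNKNOWN and
    # FALSE are indistinguishable at the top of the tree.
    assert eval_two_valued(rewritten, row) == (eval_three_valued(expr, row) is True)
```

The equivalence only holds for the accept/reject decision of a filter. An engine that has to return the predicate value itself, for example as a projected column, would still need to distinguish UNKNOWN from FALSE.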
>> 
>> Greetings,
>> Stephan
>> 
>> 
>> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <chengxiang...@intel.com>
>> wrote:
>> 
>>> Thanks, Timo.
>>> We may put the NULL-related function support into the SQL API, but scalar
>>> expressions and Boolean expressions are already supported in the Table API.
>>> Without NULL value handling, queries with scalar or Boolean expressions
>>> would fail when they encounter a NULL value.
>>> 
>>> Thanks
>>> Chengxiang
>>> 
>>> -----Original Message-----
>>> From: Timo Walther [mailto:twal...@apache.org]
>>> Sent: Wednesday, November 25, 2015 7:33 PM
>>> To: dev@flink.apache.org
>>> Subject: Re: The null in Flink
>>> 
>>> Hi Chengxiang,
>>> 
>>> I totally agree that the Table API should fully support NULL values. The
>>> Table API is a logical API and therefore we should be as close to ANSI SQL
>>> as possible. Rows need to be nullable in the near future.
>>> 
>>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound too much
>>> like SQL magic. I think all other SQL magic (DBMS specific corner cases)
>>> should be handled by the SQL API on top of the Table API.
>>> 
>>> Regards,
>>> Timo
>>> 
>>> 
>>> On 25.11.2015 11:31, Li, Chengxiang wrote:
>>>> Hi
>>>> On this mailing list there have been some discussions about null value
>>>> handling in Flink, and I saw several related JIRAs as well (like
>>>> FLINK-2203 and FLINK-2210), but unfortunately they got reverted due to
>>>> immature design, and there has been no further action since then. I would
>>>> like to pick this topic up here, as it's quite an important part of data
>>>> analysis and many features depend on it. Hopefully, through a thorough
>>>> discussion, we can arrive at an acceptable solution and move forward.
>>>> Stephan has explained very clearly how and why Flink handles "Null values
>>>> in the Programming Language APIs", so I will mainly talk about the second
>>>> part, "Null values in the high-level (logical) APIs".
>>>> 
>>>> 1. Why should Flink support Null value handling in the Table API?
>>>>     i.  Data sources may be missing column values in many cases. Without
>>>>         Null value handling in the Table API, users need to write an extra
>>>>         ETL step to handle missing values manually.
>>>>     ii. Some Table API operators generate Null values on their own, like
>>>>         Outer Join/Cube/Rollup/Grouping Set, and so on. Null value handling
>>>>         in the Table API is a prerequisite for these features.
>>>> 
>>>> 2. The semantics of Null value handling in the Table API.
>>>> Fortunately, there are already mature DBMS standards we can follow for
>>>> Null value handling. I list several semantics of Null value handling
>>>> here. Note that this may not cover all the cases, and the semantics may
>>>> vary between DBMSs, so it is totally open to discussion.
>>>>     i.   NULL comparison. In ascending order, NULL is smaller than any
>>>>          other value, and NULL == NULL returns false.
>>>>     ii.  NULL in a GroupBy key: all NULL values are grouped into a single
>>>>          group.
>>>>     iii. NULL in aggregate columns: aggregation functions ignore NULL.
>>>>     iv.  NULL in the Join key on both sides: refer to #i, NULL == NULL
>>>>          returns false, so there is no output for a NULL Join key.
>>>>     v.   NULL in a scalar expression: an expression containing NULL
>>>>          (e.g. 1 + NULL) returns NULL.
>>>>     vi.  NULL in a Boolean expression: add an extra result, UNKNOWN. More
>>>>          semantics for Boolean expressions are in reference #1.
>>>>     vii. More related function support, like COALESCE, NVL, NANVL, and
>>>>          so on.
>>>> 
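The grouping, aggregation and join rules from the list above can be sketched in a few lines of Python. This is only an illustration of the proposed semantics, not Table API code: the function names are hypothetical, rows are plain tuples, and None stands in for NULL. Returning NULL for a group whose values are all NULL is an assumption borrowed from the usual SQL behaviour of SUM.

```python
from collections import defaultdict


def group_sum(rows):
    """rows: iterable of (key, value). Sum values per key, ignoring NULLs."""
    groups = defaultdict(list)
    for key, value in rows:
        # None is hashable, so all NULL keys land in one group.
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        non_null = [v for v in values if v is not None]
        # Assumption: a group with only NULL values sums to NULL.
        result[key] = sum(non_null) if non_null else None
    return result


def inner_join(left, right):
    """left, right: lists of (key, value). NULL keys never match anything."""
    return [
        (lk, lv, rv)
        for lk, lv in left
        for rk, rv in right
        if lk is not None and rk is not None and lk == rk
    ]


rows = [("a", 1), (None, 2), ("a", None), (None, 3), ("b", None)]
print(group_sum(rows))
print(inner_join([(1, "x"), (None, "y")], [(1, "p"), (None, "q")]))
```

Note the asymmetry this makes visible: NULL keys are equal to each other for grouping, but not for joining.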
>>>> 3. NULL value storage in the Table API.
>>>> Just set the Row field value to null. To mark a NULL value in serialized
>>>> binary record data, one normally uses an extra flag for each field to mark
>>>> whether its value is NULL, which would change the data layout of the Row
>>>> object. So any logic that accesses serialized Row data directly should be
>>>> updated to match the new data layout, for example, many methods in
>>>> RowComparator.
>>>> 
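One possible layout for the per-field NULL flag is a bitmask in front of the field data. The sketch below is an assumption for illustration and not Flink's actual Row serialization format: the function names are hypothetical, every field is taken to be a 64-bit signed integer, and NULL fields occupy no payload bytes.

```python
import struct


def serialize_row(fields):
    """fields: list of int or None. Layout: null bitmask, then non-null values."""
    mask = bytearray((len(fields) + 7) // 8)  # one bit per field
    payload = bytearray()
    for i, value in enumerate(fields):
        if value is None:
            mask[i // 8] |= 1 << (i % 8)      # set bit i: field i is NULL
        else:
            payload += struct.pack(">q", value)  # big-endian 64-bit integer
    return bytes(mask) + bytes(payload)


def deserialize_row(data, field_count):
    mask_len = (field_count + 7) // 8
    offset = mask_len
    fields = []
    for i in range(field_count):
        if data[i // 8] & (1 << (i % 8)):
            fields.append(None)               # NULL fields carry no payload
        else:
            (value,) = struct.unpack_from(">q", data, offset)
            fields.append(value)
            offset += 8
    return fields


row = [1, None, -5, None]
encoded = serialize_row(row)
print(len(encoded))                        # 1 mask byte + 2 * 8 payload bytes
print(deserialize_row(encoded, len(row)))
```

With such a layout the offset of a field depends on the flags before it, which is why code that reads serialized rows directly, such as a comparator, has to know about the mask.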
>>>> Reference:
>>>> 1. Nulls: Nothing to worry about:
>>>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html
>>>> 2. Null related functions:
>>>> https://oracle-base.com/articles/misc/null-related-functions
>>>> 
>>>> -----Original Message-----
>>>> From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf
>>>> Of Stephan Ewen
>>>> Sent: Thursday, June 18, 2015 8:43 AM
>>>> To: dev@flink.apache.org
>>>> Subject: Re: The null in Flink
>>>> 
>>>> Hi!
>>>> 
>>>> I think we actually have two discussions here, both of them important:
>>>> 
>>>> --------------------------------------------------------------
>>>> 1) Null values in the Programming Language APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> Fields in composite types may simply be null pointers.
>>>> 
>>>> In object types:
>>>>  - primitive members are naturally non-nullable
>>>>  - all other members are nullable
>>>> 
>>>> => If you want to avoid the overhead of nullability, go with primitive
>>>> types.
>>>> 
>>>> In Tuples, and derived types (Scala case classes):
>>>>  - Fields are non-nullable.
>>>> 
>>>> => The reason here is that we initially decided to keep tuples as a very
>>>> fast data type. Because tuples cannot hold primitives in Java/Scala, we
>>>> would not have a way to make fast non-nullable fields. The performance of
>>>> nullable fields affects the key operations, especially on normalized keys.
>>>> We can work around that with some effort, but have not done it so far.
>>>> 
>>>> => In Scala, the Option type is a natural way of elegantly working
>>>> around that.
>>>> 
>>>> 
>>>> --------------------------------------------------------------
>>>> 2) Null values in the high-level (logical) APIs
>>>> --------------------------------------------------------------
>>>> 
>>>> This is mainly what Ted was referring to, if I understood him correctly.
>>>> 
>>>> Here, we need to figure out what form of semantic null values to support
>>>> in the Table API and, later, in SQL.
>>>> 
>>>> Besides deciding what semantics to follow here in the logical APIs, we
>>>> need to decide what these values convert to/from when switching between
>>>> logical/physical APIs.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <ted.dunn...@gmail.com>
>>>> wrote:
>>>> 
>>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <m...@apache.org>
>>>>> wrote:
>>>>> 
>>>>>> Just to give an idea what null values could cause in Flink:
>>>>>> DataSet.count() returns the number of elements of all values in a
>>>>>> DataSet (null or not) while #834 would ignore null values and
>>>>>> aggregate the DataSet without them.
>>>>> 
>>>>> Compare R's na.action.
>>>>> 
>>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm
>>>>> 
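The difference between counting all elements and skipping nulls can be made explicit with a policy parameter. The sketch below is hypothetical and not part of any Flink API: the function count and its na_action parameter are made up here, with the "omit" and "fail" options loosely modeled on R's na.omit and na.fail, and None standing in for null.

```python
def count(values, na_action="keep"):
    """Count elements under an explicit null policy."""
    if na_action == "keep":
        return len(values)                       # nulls count like any value
    if na_action == "omit":
        return sum(1 for v in values if v is not None)
    if na_action == "fail":
        if any(v is None for v in values):
            raise ValueError("null value encountered")
        return len(values)
    raise ValueError(f"unknown na_action: {na_action}")


data = [1, None, 3]
print(count(data))           # 3
print(count(data, "omit"))   # 2
```

Making the policy a visible argument avoids the situation described above, where two operations on the same data silently disagree about whether nulls are included.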
>>> 
>>> 
>
