Oh, this is probably the Jira for what I mentioned: https://issues.apache.org/jira/browse/FLINK-2988
> On 27 Nov 2015, at 11:02, Aljoscha Krettek <aljos...@apache.org> wrote: > > Hi, > just some information. The Table API code generator already has preliminary > support for generating code that is NULL-aware. So for example if you have > expressions like 1 + NULL the result would also be null. > > I think one of the missing pieces is a way to get data that contains null > values into the system. For example, right now the expected way to read csv > files is via tuples and they don’t support null values. I think we need a way > to directly read CSV files into a Row DataSet (or Table). > > Cheers, > Aljoscha >> On 26 Nov 2015, at 12:31, Stephan Ewen <se...@apache.org> wrote: >> >> Hi! >> >> Thanks for the good discussion! Here are some thoughts from my side: >> >> 1) >> I would agree with Chengxiang that it helps to have as much NULL handling >> in the table API as possible, since most SQL constructs will be permitted >> there are well. >> >> 2) >> A question that I have is whether we want to actually follow the SQL >> standard exactly. There is a lot of criticism on NULL in the SQL standard, >> and there have been many good proposals for more meaningful semantics (for >> example differentiate between the meanings "value missing", "value >> unknown", "value not applicable", etc). >> >> Going with the SQL way is easiest and makes SQL addition on top of the >> table API much easier. Also, there is only one type of NULL, meaning that >> null-values can be encoded efficiently in bitmaps. Further more, the fact >> that the Table API users have the power of a programming language at hand >> (rather than the limited set of SQL operators), they should be able to >> easily define their own constants for special meanings like "value not >> applicable" or so. >> >> Just curious if anyone has experience with some of the other null-sematic >> proposals that have been around. >> >> 3) >> One comment concerning the three-value-logic for boolean expressions: >> >> A while back, I worked on a SQL engine, and we were able to not implement >> three-value logic with trick. If I recall correctly, it was like this: >> >> If the boolean expressions are monotonous (have no NOT), then the UNKNOWN >> value can be the same as FALSE. So the query planner had to rewrite all >> expression trees to have no NOT, which means pushing the NOT down into the >> leaf comparison operations (for example push NOT into == to become !=). >> These leaf comparison operators needed to be NULL aware to return FALSE on >> comparisons with NULL. >> >> >> Greetings, >> Stephan >> >> >> On Thu, Nov 26, 2015 at 6:41 AM, Li, Chengxiang <chengxiang...@intel.com> >> wrote: >> >>> Thanks, Timo. >>> We may put the NULL related function support to SQL API, but for Scalar >>> expression and Boolean expression, it already been supported in Table API, >>> without NULL value handling support, query with Scalar expression and >>> Boolean expression would fail while encounter NULL value. >>> >>> Thanks >>> Chengxiang >>> >>> -----Original Message----- >>> From: Timo Walther [mailto:twal...@apache.org] >>> Sent: Wednesday, November 25, 2015 7:33 PM >>> To: dev@flink.apache.org >>> Subject: Re: The null in Flink >>> >>> Hi Chengxiang, >>> >>> I totally agree that the Table API should fully support NULL values. The >>> Table API is a logical API and therefore we should be as close to ANSI SQL >>> as possible. Rows need to be nullable in the near future. >>> >>> 2. i, ii, iii and iv sound reasonable. But v, vi and vii sound to much >>> like SQL magic. I think all other SQL magic (DBMS specific corner cases) >>> should be handled by the SQL API on top of the Table API. >>> >>> Regards, >>> Timo >>> >>> >>> On 25.11.2015 11:31, Li, Chengxiang wrote: >>>> Hi >>>> In this mail list, there are some discussions about null value handling >>> in Flink, and I saw several related JIRAs as well(like FLINK-2203, >>> FLINK-2210), but unfortunately, got reverted due to immature design, and no >>> further action since then. I would like to pick this topic up here, as it's >>> quite an important part of data analysis and many features depend on it. >>> Hopefully, through a plenary discussion, we can generate an acceptable >>> solution and move forward. Stephan has explained very clearly about how and >>> why Flink handle "Null values in the Programming Language APIs", so I >>> mainly talk about the second part of "Null values in the high-level >>> (logical) APIs ". >>>> >>>> 1. Why should Flink support Null values handling in Table API? >>>> i. Data source may miss column value in many cases, if no Null >>> values handling in Table API, user need to write an extra ETL to handle >>> missing values manually. >>>> ii. Some Table API operators generate Null values on their own, >>> like Outer Join/Cube/Rollup/Grouping Set, and so on. Null values handling >>> in Table API is the prerequisite of these features. >>>> >>>> 2. The semantic of Null value handling in Table API. >>>> Fortunately, there are already mature DBMS standards we can follow for >>> Null value handling, I list several semantic of Null value handling here. >>> To be noted that, this may not cover all the cases, and the semantics may >>> vary in different DBMSs, so it should totally open to discuss. >>>> I, NULL compare. In ascending order, NULL is smaller than any >>> other value, and NULL == NULL return false. >>>> ii. NULL exists in GroupBy Key, all NULL values are grouped as a >>> single group. >>>> iii. NULL exists in Aggregate columns, ignore NULL in aggregation >>> function. >>>> iv. NULL exists in both side Join key, refer to #i, >>> NULL == NULL return false, no output for NULL Join key. >>>> v. NULL in Scalar expression, expression within >>> NULL(eg. 1 + NULL) return NULL. >>>> vi. NULL in Boolean expression, add an extra result: >>> UNKNOWN, more semantic for Boolean expression in reference #1. >>>> vii. More related function support, like COALESCE, NVL, >>> NANVL, and so on. >>>> >>>> 3. NULL value storage in Table API. >>>> Just set null to Row field value. To mark NULL value in serialized >>> binary record data, normally it use extra flag for each field to mark >>> whether its value is NULL, which would change the data layout of Row >>> object. So any logic that access serialized Row data directly should >>> updated to sync with new data layout, for example, many methods in >>> RowComparator. >>>> >>>> Reference: >>>> 1. Nulls: Nothing to worry about: >>> http://www.oracle.com/technetwork/issue-archive/2005/05-jul/o45sql-097727.html >>> . >>>> 2. Null related functions: >>>> https://oracle-base.com/articles/misc/null-related-functions >>>> >>>> -----Original Message----- >>>> From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On Behalf >>>> Of Stephan Ewen >>>> Sent: Thursday, June 18, 2015 8:43 AM >>>> To: dev@flink.apache.org >>>> Subject: Re: The null in Flink >>>> >>>> Hi! >>>> >>>> I think we actually have two discussions here, both of them important: >>>> >>>> -------------------------------------------------------------- >>>> 1) Null values in the Programming Language APIs >>>> -------------------------------------------------------------- >>>> >>>> Fields in composite types may simply be null pointers. >>>> >>>> In object types: >>>> - primitives members are naturally non-nullable >>>> - all other members are nullable >>>> >>>> => If you want to avoid the overhead of nullability, go with primitive >>> types. >>>> >>>> In Tuples, and derives types (Scala case classes): >>>> - Fields are non-nullable. >>>> >>>> => The reason here is that we initially decided to keep tuples as a very >>> fast data type. Because tuples cannot hold primitives in Java/Scala, we >>> would not have a way to make fast non-nullable fields. The performance of >>> nullable fields affects the key-operations, especially on normalized keys. >>>> We can work around that with some effort, but have not one it so far. >>>> >>>> => In Scala, the Option types is a natural way of elegantly working >>> around that. >>>> >>>> >>>> -------------------------------------------------------------- >>>> 2) Null values in the high-level (logial) APIs >>>> -------------------------------------------------------------- >>>> >>>> This is mainly what Ted was referring to, if I understood him correctly. >>>> >>>> Here, we need to figure out what form of semantical null values in the >>> Table API and later, in SQL. >>>> >>>> Besides deciding what semantics to follow here in the logical APIs, we >>> need to decide what these values confert to/from when switching between >>> logical/physical APIs. >>>> >>>> >>>> >>>> >>>> >>>> >>>> On Mon, Jun 15, 2015 at 10:07 AM, Ted Dunning <ted.dunn...@gmail.com> >>> wrote: >>>> >>>>> On Mon, Jun 15, 2015 at 8:45 AM, Maximilian Michels <m...@apache.org> >>>>> wrote: >>>>> >>>>>> Just to give an idea what null values could cause in Flink: >>>>> DataSet.count() >>>>>> returns the number of elements of all values in a Dataset (null or >>>>>> not) while #834 would ignore null values and aggregate the DataSet >>>>>> without >>>>> them. >>>>> Compare R's na.action. >>>>> >>>>> http://www.ats.ucla.edu/stat/r/faq/missing.htm >>>>> >>> >>> >