The “system columns” feature was working when the Calcite code base was used in a previous project several years ago, though it may have atrophied since.
In that project, we actually had ROWTIME as a system column. It was partially successful, but raised some problems. Consider a streaming join. The source relations have ROWTIME columns, but the join would create an additional ROWTIME column. (1) Should the original ROWTIME columns just disappear? It wasn’t clear. (2) What if you wanted to assign your own ROWTIME from an expression. Is it sufficient to just write ‘expression AS ROWTIME’ in the SELECT clause? This a more profound operation than simply renaming a column. Being a system column implies certain data type, not null, and maybe ordering. Would you also have to apply an ORDER BY? (3) Having system columns in streams conflicts with the goal of using the same SQL for streaming and historic (regular SQL) queries. (4) We found ourselves relying on these columns under the covers in certain operators. The system had become non-relational. So, after that experience I concluded that ROWTIME should be just a regular column. Just a column that we know a lot about: it is a timestamp (usually; although we allow other data types); it is conventionally called ROWTIME but it doesn’t have to be; it is not null; it is usually a sort column (but it may not be - the stream might be k-sorted, e.g. within 10 rows or 10 minutes of being sorted, or sorted within a particular warehouseId; or “sortable"). The “sortable” concept is really powerful. Consider the Orders stream and the Orders table. The stream is infinite, so the only sortable columns are those that are already sorted or are within N rows or T seconds of being sorted, or have some kind of guarantee in terms of watermarks. The table is finite, so everything is sortable. If we want to do an operation such as “GROUP BY FLOOR(ROWTIME TO HOUR)” it is sufficient that ROWTIME is sortable. You do not need ROWTIME to be a system column. My advice is, rather than requiring “blessed” system columns, you have a convention for the names of event-time and processing-time columns, and make your operators consume and produce on those columns explicitly. Julian > On Feb 15, 2017, at 12:56 AM, Timo Walther <[email protected]> wrote: > > Hi everyone, > > we (from Flink) are currently discussing how we can express time-semantics > (event-time or processing-time) in a SQL query. The optimal solution would be > to have two system attributes that are part of every table schema/every row > data type. We could then access it like `SELECT * FROM MyTable ORDER BY > rowtime`. However, it should not be part of the result in an expansion (`*`) > and the user should not modify those attributes (no aliasing, read-only). I > had a look into SqlValidator and there are several lines that contain things > like `includeSystemVars` or `isSystemField` but nothing concrete. Am I right > that this feature is not entirely implemented yet? Which parts would you > touch/override to implement this feature? > > Thanks in advance. > > Regards, > Timo > > >
